
Statement of Purpose Essay - Brown

Program: PhD, NLP, ML
Type: PhD
License: CC BY-NC-SA 4.0
Source: Public Success Story (Yong Zheng-Xin)

My research interests lie at the intersection of computational semantics and multilinguality. In particular, I aim to explore knowledge-efficient natural language processing (NLP) techniques, such as multi-task learning and weakly supervised learning, for representation learning and dataset creation. My impetus for pursuing this direction stems from the recent challenge of generalizing state-of-the-art NLP systems to low-resource languages, where data scarcity is the main bottleneck for transfer learning with large-scale pre-trained language models. I believe the general solution combines improvements to NLP algorithms so that they better utilize additional supervision (the top-down approach) with techniques for the automatic creation of linguistic resources (the bottom-up approach). Over the past three years, I have been fortunate to conduct initial research in both directions in the field of semantics.

Top-down Approach. I believe that we can overcome the annotation bottleneck caused by low-resource scenarios by making NLP models more knowledge-efficient. This can be achieved by using neural networks with different inductive biases and by leveraging additional supervision. I first observed the annotation bottleneck when I researched the transfer of semantic frame annotations across parallel texts. While BERTology studies have demonstrated that transformer-based NLP models capture high-level semantic information in sentences, my research this summer found that, after fine-tuning, these models still fail to learn cross-linguistic divergences in frame labels when frame-semantic annotations are projected from English to Brazilian Portuguese and German. I identified the primary reason: semantic frames are, to an extent, language-specific. Lexical units and their translation equivalents in parallel texts do not necessarily evoke the same frame, and multilingual frame-semantic resources are too scarce for models to learn generalizable semantic differences between frames. The failure of NLP systems in this low-annotation setting motivated my research on linguistically informed NLP models. I proposed creating lexico-semantic representations of frames using graph neural networks with strong relational inductive biases to take advantage of the rich frame-to-frame relations in FrameNet. My experiments demonstrated that the approach successfully clusters frames that characterize the same scene even when they are not connected in FrameNet; in other words, integrating linguistic structure into the model architecture helps it learn frame representations under low-resource constraints. A primary challenge of low-resource scenarios is that overparameterized neural networks, such as transformer-based language models and graph neural networks, easily overfit small training datasets. In my paper recently submitted to NAACL 2021, I proposed a data augmentation technique that randomly perturbs existing annotation labels to boost the sample size, and the model learns to reconstruct the original labels through auxiliary learning. This technique is useful for training neural networks under limited resources because it creates synthetic in-domain data, and I also found that the weak supervision from the auxiliary task facilitates learning cross-linguistic frame-semantic information.
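To give a concrete sense of that label-perturbation idea, the following is a minimal sketch, not the code from my submission: the GRU encoder, model sizes, names, and loss weighting are illustrative placeholders. Synthetic examples are created by corrupting gold frame labels, and an auxiliary head is trained to recover the originals alongside the main frame-labeling task.

    # Illustrative sketch only: label-perturbation augmentation with an
    # auxiliary reconstruction objective. All sizes/names are placeholders.
    import torch
    import torch.nn as nn

    VOCAB_SIZE, NUM_FRAMES, HIDDEN = 30000, 1200, 256   # placeholder sizes
    PERTURB_PROB = 0.15                                  # fraction of labels replaced

    def perturb(labels: torch.Tensor) -> torch.Tensor:
        """Randomly replace some gold frame labels to create synthetic examples."""
        mask = torch.rand(labels.shape) < PERTURB_PROB
        noise = torch.randint(NUM_FRAMES, labels.shape)
        return torch.where(mask, noise, labels)

    class FrameParser(nn.Module):
        """Toy encoder with a main frame-labeling head and an auxiliary
        label-reconstruction head sharing the same encoder."""
        def __init__(self):
            super().__init__()
            self.tok_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
            self.lab_emb = nn.Embedding(NUM_FRAMES, HIDDEN)
            self.encoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
            self.frame_head = nn.Linear(HIDDEN, NUM_FRAMES)      # main task
            self.recon_head = nn.Linear(2 * HIDDEN, NUM_FRAMES)  # auxiliary task

        def forward(self, tokens, perturbed_labels):
            states, _ = self.encoder(self.tok_emb(tokens))           # (B, T, H)
            main_logits = self.frame_head(states)                    # predict frames
            aux_in = torch.cat([states, self.lab_emb(perturbed_labels)], dim=-1)
            recon_logits = self.recon_head(aux_in)                   # recover gold labels
            return main_logits, recon_logits

    def loss_fn(model, tokens, gold_labels, aux_weight=0.5):
        """Main frame loss plus auxiliary loss for reconstructing unperturbed labels."""
        main_logits, recon_logits = model(tokens, perturb(gold_labels))
        ce = nn.CrossEntropyLoss()
        main = ce(main_logits.flatten(0, 1), gold_labels.flatten())
        aux = ce(recon_logits.flatten(0, 1), gold_labels.flatten())
        return main + aux_weight * aux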
This research experience motivates my interest in making models more knowledge-efficient. I intend to build on my initial work in multi-task learning, transfer learning, and adding structure and prior domain knowledge to models. For instance, I am excited to study how linguistic inductive biases affect multi-task learning so that we can create linguistically informed models that overcome the negative transfer that jeopardizes model inference in low-resource settings.

Bottom-up Approach. Deep learning models are data-hungry, and the bottom-up solution to the annotation bottleneck is to expand linguistic resources. My enthusiasm for the automatic creation of linguistic resources stems from my past research on augmenting the FrameNet lexicon. Berkeley FrameNet (BFN) is the largest lexico-semantic dataset embodying the theory of frame semantics, yet its lexical and frame coverage is still limited. In my efforts to automatically acquire lexical units, I challenged the assumption that the existing frame labels can characterize all the words currently absent from FrameNet. I proposed a representation learning method that uses an autoencoder to filter out ill-fitting lexical units before adding them to BFN. My method expanded the coverage of BFN at both the lexical and frame levels while ensuring higher quality in the automatically expanded lexicon, thereby making FrameNet's semantic resources more useful for downstream NLP tasks. This research raises the question of whether we can automatically create new fine-grained, task-specific labels without an active learning scheme involving linguists. Label acquisition is powerful because recent work has shown that diverse, fine-grained knowledge about labels helps pre-trained neural networks better adapt to a new domain. Hence, I am interested in learning representations of the cross-linguistic semantic relationships between words and their labels, with the goal of augmenting linguistic resources both horizontally (e.g., acquiring new lexical items) and vertically (e.g., acquiring new labels) to enable NLP systems for low-resource languages.

Future Plan. My ambition is to become a Principal Investigator leading an NLP research group in academia or industry. A Ph.D. will be a great stepping stone as I build up my research record and experience. As a mentor to students who are Masason Foundation scholarship holders, I find that I relish helping students gain clarity about specific topics and advising them on their machine learning research projects. I hope to explore more opportunities in teaching and advising during my graduate studies. The Ph.D. in Computer Science program at Brown University is a good match for me. Coming from Minerva Schools at KGI, a college with limited focus on and mentorship in NLP, I am excited about Brown's research groups, such as LUNAR, that bring together linguistics and computer science students. At Brown, I am especially interested in the work of Professors Ellie Pavlick and Stephen Bach. Under Prof. Pavlick's guidance, I plan to explore training methods (such as multi-task learning) that allow models to learn linguistic uncertainty, consistent with the full range of possible human inferences, from singly labeled examples. I also hope to extend Prof. Bach's work to semantic role labeling and explore methods to automatically generate Snorkel labeling functions for semantic resources.