Statement of Purpose Essay - MIT
Personal Background

In 2018, midway through my junior year, I began conducting NLP research with Professor Luke Zettlemoyer, leading two research projects: entity-to-entity sentiment analysis [1] and active learning for coreference resolution [2], the latter of which is currently in submission at ACL. In that project, we introduced a novel approach to active learning for coreference resolution called “discrete selection.” Previous approaches to annotation for coreference resolution predominantly used pairwise selection, in which annotators are presented with two mentions and asked to mark whether they corefer. Discrete selection augments this technique with one simple additional question: if the annotator deems the two mentions non-coreferring, they are asked to mark the first occurrence of one of the mentions. We also introduced the idea of aggregating scores of antecedents in the same coreference cluster before applying active learning selectors, and we showed through simulated experiments that discrete selection outperforms traditional pairwise selection given the same amount of annotation time.

I spent the summer after my junior year interning at Facebook, where I worked on detecting bots from mouse movements. After finishing the project in half the allotted time, I proposed and independently drove an additional machine-learning component of the project, and earned the highest intern rating and a return offer. Though I learned a great deal during my internship, I found research more intellectually engaging; the experience ultimately reinforced my desire to pursue a PhD. My TA experience has also informed my decision to pursue graduate school. Throughout my junior and senior years, I TA’d four upper-level CS courses, where I taught sections, held office hours, and helped students understand course material, which I enjoyed immensely. I would be excited to continue teaching in graduate school and beyond.

Since joining an applied research team at Facebook AI full-time in October, I have become much more immersed in the research of the field. I collaborated closely with an intern on my team on a fake news/misinformation project, which resulted in two papers [3, 4]. I have also been working on whole post integrity embeddings (WPIE), which our team plans to submit to KDD. In addition, I am leading a large-scale project on creating pretrained representations of all Facebook entities (e.g., posts, users, and pages). Specific research challenges of this project include building multi-modal pretrained representations (Facebook posts can contain both text and images) and working with data and domains that differ substantially from those of many current language models. While this work is enjoyable and has reaffirmed my interest in NLP research, it is still largely driven by current trends in NLP, namely pretrained models. I am interested in carving out future solutions.

Research Interests

Large-scale pretrained models have recently become ubiquitous in NLP, and my work at Facebook is almost exclusively devoted to these models. One ultimate goal of these models (and arguably of AI in general) is to construct a system that can extrapolate to arbitrary, possibly unseen, tasks. While pretrained models like BERT have made some headway on this problem, they still lag significantly behind humans.
These models can still tackle only a limited set of tasks (BERT, for example, is geared toward sequence-to-sequence and classification-style tasks, and tasks that fall outside those realms are typically handled by reframing them as one of the two). Moreover, these models still require a fine-tuning step and cannot extrapolate to unseen tasks in a zero-shot manner. Thus, my question is: can we create a truly cross-task generalizable model that requires no fine-tuning?

I believe one of the biggest inhibitors to progress in generalizability is the lack of model interpretability. The core idea is that if we know what a model is doing, we can ensure it is learning the “right thing”: that it is actually generalizing rather than simply overfitting to artifacts in the training data. One approach to achieving both interpretability and generalizability is to decompose our current, massive, “black box” models into a collection of modules, each with a specific function. This idea rests on two motivations: 1) the belief that fundamentally different types of tasks call for fundamentally different neural architectures, and 2) existing work in generalizability and interpretability. First, many current approaches to cross-task generalization apply the same, one-size-fits-all architecture across tasks, usually by first converting a new type of task into one the model has seen before. For example, T5 converts all tasks to seq2seq, and Levy et al. [6] and Obamuyide et al. [7] convert relation extraction and relation classification to QA and NLI, respectively. However, tasks are not fundamentally equivalent and should not all be approached in the same manner. Second, modularity is a theme that has emerged in work on both generalizability and interpretability. Multi-task models often employ different task-specific modules for different tasks (for example, our multi-task misinformation model [3] has distinct, task-specific layers on top of RoBERTa). On the interpretability side, Andreas et al. [5] introduced neural module networks, which dynamically compose neural modules for QA based on the linguistic structure of questions.

The question is whether we can extend the concept of modularity beyond various QA tasks to various types of tasks in general, perhaps by formalizing a wide range of tasks in terms of a single exhaustive set of primitive “reasoning” modules. These modules could be reused and reconfigured for many different types of tasks: having learned parameters for each module on one task, we could extrapolate to another task simply by recomposing the modules, without any additional training. Another question is how exactly we should decide on the function and composition of the modules. Ideally, we would like to avoid unnecessarily imposing structure, as doing so often trades off against accuracy: it is difficult for humans to intuit the optimal structural choices, since the abstractions humans use to define and categorize language do not necessarily reflect how a machine would best learn it. Much of the reason deep learning has worked so well recently is that many such structural choices are left for the model itself to learn, stashed away in learnable parameters. In light of this tradeoff, perhaps the compositional structure of the modules should also be a learned function of the input rather than follow from linguistic rules.
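To make this concrete, the following is a rough sketch (in PyTorch) of one form that input-conditioned module composition could take: a small controller reads the current representation and, at each reasoning step, predicts a soft mixture over a shared bank of reusable modules. The particular module bank, controller, and soft routing scheme shown here are purely illustrative assumptions, not a design drawn from the cited work.

```python
import torch
import torch.nn as nn

class ModularReasoner(nn.Module):
    """Illustrative modular network: a controller reads the current state and,
    at each reasoning step, predicts a soft mixture over a shared bank of
    reusable modules. Recomposing the same bank for a new task would, in
    principle, not require retraining the modules themselves."""

    def __init__(self, dim: int = 256, num_modules: int = 4, num_steps: int = 3):
        super().__init__()
        # Bank of reusable "reasoning" modules (here just small MLPs).
        self.module_bank = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_modules)]
        )
        # Controller: maps the current state to module-selection weights.
        self.controller = nn.Linear(dim, num_modules)
        self.num_steps = num_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) pooled encoding of the input, e.g. from a pretrained encoder.
        state = x
        for _ in range(self.num_steps):
            # Input-conditioned (soft) choice of which module(s) to apply next.
            weights = torch.softmax(self.controller(state), dim=-1)         # (batch, M)
            outputs = torch.stack([m(state) for m in self.module_bank], 1)  # (batch, M, dim)
            state = (weights.unsqueeze(-1) * outputs).sum(dim=1)            # (batch, dim)
        return state

# Example: compose modules over a batch of eight pooled encodings.
model = ModularReasoner()
print(model(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```

Natural variants of this sketch include a hard, discrete choice of modules or a layout predicted from a parse, as in neural module networks [5]; the choice between them is exactly the tradeoff described above between structure we impose and structure the model learns.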
Moreover, perhaps we should avoid a predefined, strict one-to-one mapping of module to function; the functionality of each module could also be learned dynamically. While these modifications may slightly sacrifice interpretability, such modular models would still be far more interpretable than the singular, massive, black-box models we have today, because modularity inherently confers interpretability: it is easier to analyze individual modules, or the interfaces between modules, than our current “conglomerate” of parameters.

Conclusion

That being said, I am open to exploring many types of problems in NLP. In the future, I would like to lead a research career in either academia or industry, and obtaining a PhD would facilitate this goal. I believe MIT’s NLP group would be a great fit because my research interests align closely with prior work by Prof. Andreas. At MIT, I would especially like to work with Professors Regina Barzilay, Tommi Jaakkola, or Jacob Andreas, but I am open to working with any advisor whose interests suit mine.

References

[1] B. Li, “Document-Level Entity-to-Entity Sentiment Analysis with LSTM-Based Models”, 2018. https://homes.cs.washington.edu/~lib49/papers/ent2ent_sentiment_2018.pdf

[2] B. Li, G. Stanovsky, and L. Zettlemoyer, “Active Learning for Coreference Resolution using Discrete Annotation”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. (Under review). https://homes.cs.washington.edu/~lib49/papers/Active_Learning_for_Coreference_Resolution_using_Discrete_Selection-protected.pdf. To access link, use password: belindaphdapps

[3] N. Lee, B. Li, S. Wang, S. Yih, H. Ma, and M. Khabsa, “If You Can’t Detect Them, Join Them: A Multitask Based Approach For Identifying Misinformation”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. (Under review). https://homes.cs.washington.edu/~lib49/papers/ACL2020___MT_Misinfo_protected.pdf. To access link, use password: belindaphdapps

[4] N. Lee, B. Li, S. Wang, S. Yih, H. Ma, and M. Khabsa, “Language Models for Factual Knowledge”, 2020. (In progress).

[5] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Deep compositional question answering with neural module networks,” CoRR, vol. abs/1511.02799, 2015.

[6] O. Levy, M. Seo, E. Choi, and L. Zettlemoyer, “Zero-shot relation extraction via reading comprehension,” CoRR, vol. abs/1706.04115, 2017.

[7] A. Obamuyide and A. Vlachos, “Zero-shot relation classification as textual entailment,” in Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Brussels, Belgium, pp. 72–78, Association for Computational Linguistics, Nov. 2018.

[8] D. A. Hudson and C. D. Manning, “Compositional attention networks for machine reasoning,” CoRR, vol. abs/1803.03067, 2018.

[9] D. A. Hudson and C. D. Manning, “Learning by abstraction: The neural state machine,” CoRR, vol. abs/1907.03950, 2019.