Statement of Purpose Essay - Stanford University
My research is in natural language processing (NLP) and, methodologically, I am interested in leveraging theoretical methods to improve empirical NLP. Core to my ethos as a researcher is an unrelenting willingness to apply, extend, and interlace theories from fields adjacent to NLP (e.g. linguistics, machine learning, algorithms). Recently, I have been eager to explore how rigorous theoretical approaches interplay with the societal aspects of language technologies through initial investigations into social bias and privacy. During my four years (three as an undergraduate, one as an M.S. student) at Cornell University, I have been exceedingly fortunate to be advised by Professor Claire Cardie. Below, I detail some of the work I have done under her guidance along with the future directions I foresee pursuing.

Motivation. NLP is an empirical science; judgments about ideas, approaches, or models are predicated on their observed performance on real language data. Principled and rigorous methods can serve as powerful tools for building ethical, robust, and performant NLP systems. That said, theoretical approaches are severely underutilized in modern NLP: linguistic knowledge is habitually cited, but heavy-handed linguistic supervision tends to be outperformed by unconstrained end-to-end learning. Relatedly, treatments that do not reduce linguistics to a discipline concerned merely with trees or with elementary phenomena such as entailment classification (NLI) are infrequent. Further, while the products of research in machine learning and algorithms are often used as black-box components, little is done to more richly integrate and develop ideas from these expansive domains. To me, NLP is blessed to sit at the intersection of multiple domains with mature theories. Thus far, I have been most excited in my research when I have discovered how to adapt theoretical methods to yield empirical improvements. Similarly, I have found the moments when I can unite distinct theoretical approaches to be especially interesting.

Theory and NLP. During my second year, I made the first of many attempts to interconnect theory and NLP by addressing a practical deficiency of long short-term memory (LSTM) networks. Despite the name, LSTMs regularly fail to model long-distance dependencies in language and are limited in their realization of certain types of linguistic generalization. LSTMs process inputs sequentially: if two interdependent words are k words apart, then an LSTM must retain this dependency for k timesteps, which is challenging for larger k. My simple yet effective idea was to permute the words in a sentence so that inter-word dependencies "shrink" because related words are placed closer together. As I showed in [1, 3], we can codify these relations using the linguistic formalism of dependency grammars and can thereafter compute permutations via approaches from combinatorial optimization first conceived in the algorithms literature. I specifically capitalized on the fact that the linguistic formalism yields a tree; the dependency length minimization problem is NP-hard for general graphs but admits polynomial-time solutions for trees. Most surprisingly, given that permuting words may destroy or obscure other linguistic phenomena, I demonstrated that this procedure can yield improved downstream performance when the sentences in the original training data are replaced with their permuted counterparts.
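To make the intuition concrete, the following toy sketch (in Python) simply measures the total dependency length of a sentence under a candidate permutation of its words; the example parse and the reordering are illustrative assumptions of mine, and the work in [1, 3] computes provably (approximately) optimal permutations via tree-based algorithms rather than this simple measurement.

    # Toy sketch: total dependency length of a sentence under a permutation.
    # heads[i] is the index of word i's syntactic head, or None for the root.

    def total_dependency_length(order, heads):
        position = {word: pos for pos, word in enumerate(order)}
        return sum(abs(position[w] - position[h])
                   for w, h in enumerate(heads) if h is not None)

    # Hypothetical dependency parse of "the cat that I saw left".
    heads = [1, 5, 4, 4, 1, None]

    original = [0, 1, 2, 3, 4, 5]   # the cat that I saw left
    permuted = [0, 1, 5, 2, 3, 4]   # the cat left that I saw

    print(total_dependency_length(original, heads))  # 11
    print(total_dependency_length(permuted, heads))  # 9: reordering shrinks dependencies

Here the subject "cat" and its verb "left" are pulled adjacent by the permutation, so the longest dependency an LSTM would have to carry becomes much shorter.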
In short, I found that provable optimization intertwining algorithmic and linguistic notions of optimality can translate into empirical benefits for NLP systems. I presented this work at ACL 2019 as part of the Student Research Workshop and will present additional findings at the NeurIPS 2019 Context and Compositionality Workshop.

My Master's thesis [7] builds upon this by generalizing the approach to broader classes of optimality. In particular, my initial method of permuting to compress dependencies is just one way of canonically specifying the order of a bag of words; this specific rule leverages the linear ordering of words to encode syntactic information implicitly. In general, most methods in NLP specify the input's ordering based on the ordering humans use or a closely related one (as in bidirectional models). Questioning this implicit assumption that human word orders should be the ones used by computational models exposes new avenues for research: I have initial findings suggesting that certain word orders improve compositional reasoning and long-distance dependency modeling. By relating the distant theories of psycholinguistics, which studies dependency length minimization, and algorithms, which characterizes computationally tractable objectives, I have specified a generalized framework that jointly exploits linguistic and algorithmic understanding to improve NLP. An unanticipated consequence I am further exploring is that this schema is language-agnostic yet linguistically specified, implying that models for different languages could process inputs with a standardized word order. Sadly, most current NLP research (including my own) is not genuinely NLP but only ELP (English Language Processing). My approach offers hope that we can find generalizable best practices for NLP system design across the thousands of natural languages used worldwide, despite their typological diversity, by normalizing some of the variation at the input level. Linguistically, the surface order of a sentence is often contrasted with the latent deep structure that specifies its meaning. In the future, I would like to consider order more dynamically, viewing it as a malleable construct that can be deliberately changed to promote improved linguistic generalization in computational models. I would especially like to consider the prospect of encoding information implicitly in word order via computational and information-theoretic formalisms.

Society and NLP. Language technologies are becoming increasingly pervasive; they will contribute to greater social good but also to increased social harm if ethics and fairness are not central to their design. As an intern in the DeepSpeech group at Mozilla, where I was advised by Dr. Kelly Davis, I began by studying popular pretrained representations (e.g. ELMo, BERT). We wanted to better understand these recent contextualized models and the social biases they encode. These models have risen to tremendous prominence specifically because they are contextualized: they compute vector representations for words conditioned on the surrounding sentential/phrasal context (at inference time), whereas prior word embeddings (e.g. Word2Vec, GloVe) do not. Consequently, methods for studying social bias developed for their context-agnostic predecessors cannot be applied directly (they do not type check). I therefore began by devising a distillation procedure that resolves this mismatch, making current models backwards compatible with prior bias and interpretability techniques.
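To give a flavor of such a distillation, one simple recipe is to pool a word's contextual vectors across its occurrences in a corpus to obtain a single static vector that context-agnostic bias and interpretability techniques can consume. The sketch below illustrates this general idea and is not necessarily the exact procedure in [4]; the contextual_vector function is a stand-in I introduce here for a real contextualized encoder such as ELMo or BERT.

    import numpy as np

    def contextual_vector(sentence, index, dim=8):
        # Placeholder for a real contextualized encoder: deterministically
        # fabricate a vector for the word at `index` in `sentence`.
        rng = np.random.default_rng(abs(hash((sentence, index))) % (2**32))
        return rng.normal(size=dim)

    def static_embedding(word, corpus, dim=8):
        # Mean-pool the word's contextual vectors across all of its
        # occurrences, yielding one context-agnostic vector usable by
        # techniques built for Word2Vec/GloVe-style embeddings.
        vectors = [contextual_vector(sentence, i, dim)
                   for sentence in corpus
                   for i, token in enumerate(sentence.split())
                   if token == word]
        return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

    corpus = ["the doctor treated the patient", "the nurse helped the doctor"]
    print(static_embedding("doctor", corpus))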
With this procedure in hand, I then applied the full gamut of existing social bias estimators. Shockingly, I uncovered that existing methods for estimating social bias in word embeddings are drastically inconsistent: some methods claim model M1 is more biased than M2, others indicate the reverse, and still others suggest the two are equally biased [4]. Galvanized by these troubling findings, I, along with three undergraduate researchers I advise, designed a new bias estimator that is explicitly grounded in observed human bias and rigorously verified for empirical robustness [5]. By drawing upon the mathematical generality of distributions, we disentangled whether embedding algorithms or their training data precipitate embedding biases. Further, we interrogated existing “debiasing” procedures to see if they truly debias; we found they generally do not and, in many cases, they exacerbate bias!

While social bias has been extensively studied in NLP, privacy has been comparatively understudied. Nonetheless, unified approaches to privacy and NLP are clearly necessary given the ever-increasing volume of sensitive text (e.g. medical reports, legal documents, personal messages). During my third year, under the guidance of Professors Xanda Schofield (Harvey Mudd College) and Steven Wu (University of Minnesota), I developed a flexible framework that combines privacy and NLP [2]. We observed that general methods for integrating theoretical privacy into machine learning models are poorly suited to the sparsity and distributional properties of textual data. Instead, we can better support pervasive privacy by synthetically generating text using a private mechanism and then training models on this synthetic data (exploiting the fact that theoretical privacy definitions, such as differential privacy, are preserved under composition and post-processing). We are currently working to build a comprehensive suite of generative models that are provably private and empirically well-suited to textual distributions [6].

In the future, I hope to develop provably robust bias estimators and to tackle challenging problems such as intersectional bias and causally understanding bias propagation. At present, disciplined approaches to fairness and bias from the algorithms and fair ML communities are rarely applied in NLP; during my PhD, I would explore whether such machinery can be tailored to bias in language. Similarly, many methods from robust learning may be complementary to work on privacy: I am particularly interested in whether theoretical privacy guarantees imply robustness guarantees and whether nascent robust methods for language could be adapted to permit privacy-preserving NLP.

Career Goals. I hope to serve as a professor at a research institution. Who I am, as a researcher, computer scientist, and human being, has been forever shaped by Professor Cardie’s mentoring. Because of her, I have observed that professorship affords unique opportunities for advising students. At a smaller scale, I have already found advising undergraduate researchers to be deeply rewarding. Further, teaching has been central to my experience at Cornell: I have served as a TA six times and received the Outstanding Teaching Assistant Award every semester. I currently co-teach the undergraduate NLP course with Professor Cardie and have found this tremendously fulfilling, despite it being trying at times.
Moreover, I am especially interested in fostering undergraduate research and hope to continue this throughout my PhD. Beyond directly advising undergraduates, I am a primary organizer and architect of Research Night, which exposed more than 250 undergraduates to computing research in its first two iterations (many of whom went on to become undergraduate researchers and REU participants), and I have created and facilitated undergraduate reading groups for several semesters as part of our ACM chapter.

Why Stanford. Professors Percy Liang, Chris Manning, Tatsu Hashimoto, and Dan Jurafsky are faculty I would especially like to work with. Stanford NLP particularly inspires me because of its commitment to studying language faithfully, grounded in linguistic approaches, and to uncompromisingly pursuing robust understanding of NLP systems; these complementary commitments resonate with me. In a similar spirit, I have found the influence of faculty from other subareas to be beneficial in my early NLP career, notably Professors Bobby Kleinberg (algorithms), who provided guidance in navigating the algorithms literature on linear layouts and precise advice on Lagrangian relaxations for certain dual objectives, and Marty van Schijndel (psycholinguistics), who introduced me to psycholinguistic theories that I further adapted. At Stanford, I would hope to learn analogously from Professors Chris Potts, Noah Goodman, and Tengyu Ma. Further, I would greatly benefit from learning from and growing alongside several PhD students, in particular Nelson Liu, John Hewitt, and Kawin Ethayarajh. All told, it is clear that Stanford NLP would be a great home for my PhD.

References

[1] Rishi Bommasani. Long-Distance Dependencies Don’t Have to Be Long: Simplifying through Provably (Approximately) Optimal Permutations. In Proc. of the ACL Student Research Workshop, 2019.
[2] Rishi Bommasani, Zhiwei Steven Wu, and Alexandra Schofield. Towards Private Synthetic Text Generation. NeurIPS Workshop on Machine Learning with Guarantees, 2019.
[3] Rishi Bommasani. Long-Distance Dependencies Don’t Have to Be Long: Simplifying through Provably (Approximately) Optimal Permutations. NeurIPS Workshop on Context and Compositionality in Biological and Artificial Neural Systems, 2019.
[4] Rishi Bommasani, Kelly Davis, and Claire Cardie. Interpreting Pretrained Contextualized Representations via Static Embedding Analysis. In Proc. of ACL, 2020. Under review.
[5] Rishi Bommasani, Albert Tsao, and Claire Cardie. Generalized Social Bias Estimators. In Proc. of ACL, 2020. Under review.
[6] Rishi Bommasani, Zhiwei Steven Wu, and Alexandra Schofield. Private Synthetic Text Generation. In Proc. of ICML, 2020. To be submitted.
[7] Rishi Bommasani. Generalized Optimal Linear Orders for Natural Language Processing. Master’s Thesis, Cornell University, 2020. In preparation.