
Statement of Purpose Essay - Carnegie Mellon University

Program: PhD, NLP
Type: PhD
License: CC BY-NC-SA 4.0
Source: Public Success Story (Anjali Kantharuban)

My research interests are primarily in the field of natural language processing. In particular, I will work to expand the scope of traditional language tasks to include low-resource languages and non-standard dialects. I speak a dying dialect. Although I was born in the United States, I am ethnically Sri Lankan Tamil. My parents were born into a conflict that was largely motivated by their government’s demand that all its citizens speak the same language. Desperate for a better life and to preserve their identity, they fled to California. While I speak Sri Lankan Tamil fluently, I am significantly more comfortable with English and speak it more frequently because of the opportunities it lends me. However, by preferring the language that economically and socially benefits me as an individual, I am contributing to the decline of my ancestral language. Because Sri Lankan Tamil is a nationless dialect, the chance of it surviving more than a few generations is slim. As it fades, it will inevitably take with it the culture and history of its people.

Globalized communication has had the side effect of intensifying the pressure to shift to dominant languages in exchange for economic reward. Most widespread sources of information, like Wikipedia, are primarily in English. Languages preferred by affluent institutions are disproportionately well served by technologies that allow their speakers to interact with these sources, while speakers of languages with less economic power are cut off from a wealth of opportunities. Natural language processing stands to make information and tools more broadly accessible by allowing interactions with technology to take place in human languages, but by focusing on only a small set of languages, the research community is contributing to the burdens felt by these socioeconomically disadvantaged groups. Fundamentally, the institutional push toward assimilation has been accelerated by the technology industry’s prioritization of languages linked to wealthier populations. My goal is to make natural language processing functional for all languages, in all their variations, to prevent a further loss of linguistic and cultural diversity.

As a double major in computer science and linguistics, I have both the technical background to conduct research and an appreciation for the unique features of language. In addition to spending one year researching the grammaticality of locative inversions in English syntax as part of the Linguistics Research Apprenticeship Program, I learned two understudied languages (Tamil and Telugu) under a Foreign Language and Area Studies Fellowship. I also started exploring natural language processing relatively early, taking a graduate-level natural language processing course in my second year.

That same year, I began conducting research in the Berkeley Natural Language Processing Group under Professors Dan Klein and Trevor Darrell. I spent my first year working at the intersection of multimodality, imitation learning, and instruction following. During this first project, I proposed ideas and implemented models under the guidance of two graduate students. The goal was to build instruction-following agents that could generalize to novel compositions of instructions (e.g., “pick up the ball and go to the key”) in task-oriented environments like BabyAI and ALFRED. The initial approach involved representations of language instructions and observation sequences, which could be composed through a sum operation to represent new tasks.
However, this approach was bottlenecked by the lack of granular information in language instructions, so I pivoted to developing a combinatorial data augmentation method. This method built on existing compositional data augmentation techniques but had broader applicability due to relaxed feasibility constraints on the generated demonstrations. Currently, this technique performs 24% better on unseen compositional tasks and 10% better on zero-shot generalization from synthetic training examples to natural language.

My time spent working on multimodality gave me an appreciation for tasks that involve structures outside of language, both perception-based (e.g., images) and knowledge-based (e.g., databases). I noticed that nearly all work in multimodality has been done in English, with no clear path to integrating low-resource languages. Developing techniques that are more inclusive of low-resource languages, as has been done in recent years with named entity recognition, is important for improving accessibility as these advancements are applied to mainstream technologies. Thus, I focused the remainder of my time as an undergraduate on low-resource languages and their associated challenges.

After my third year, I spent the summer working with MIT’s Language & Intelligence Group alongside Professor Jacob Andreas. There, I examined the problem of rare words in machine translation. Translation systems generally struggle to handle words that occur infrequently in the training data, a problem that disproportionately impacts low-resource languages because they have less data and therefore significantly more rare words. This has been partially addressed in the past using the lexical-translation mechanism, which applies a word-level translation to words the model cannot translate confidently and slots them into the right position based on context. This word-level translation is obtained from a lexicon, but current lexicon generation methods are costly in terms of human labor or data and struggle to incorporate words seen only once. My contribution was a neural model that could generate reasonably accurate lexicon entries for words given only one example sentence, improving one-shot translation. Part of this work was presented at the Widening Natural Language Processing Workshop in 2021.

While working on this project, I faced many of the challenges inherent to working with low-resource languages, including unreliable datasets, a lack of annotations, and poor tokenization. This reinforced my belief that there is a lot of work left to be done in making natural language processing more equitable. Most existing research on low-resource languages focuses on reducing the amount of data needed to perform a narrow set of tasks (e.g., machine translation and semantic parsing). Before this set of tasks can be expanded to include subfields like multimodality, the current understanding of how low-resourceness impacts performance must change. Low-resource languages do not experience lagging performance exclusively because they lack data; the core tools in widespread use are structurally unable to represent their typological features. For example, because standard forms of English and Mandarin contain minimal complex morphology, modern toolkits built around them generate ineffective token boundaries for morphologically rich languages.
While techniques for reducing the need for massive datasets are necessary for working with low-resource domains broadly, there is an orthogonal need to remove the inductive bias from the tools researchers use. Otherwise, languages and dialects with features that contrast with those seen in popular languages will consistently see worse performance. I want to address this disparity so that models for languages with understudied linguistic features can see improvements mirroring those of more widely studied languages.

In graduate school, I plan to address three overarching topics. First, I want to develop techniques that serve languages with understudied typological features. Some projects I have in mind include tokenization techniques that can handle rich morphology, data augmentation techniques that take advantage of free word ordering, and cross-linguistic models that bootstrap off of existing etymological knowledge collected by linguists. Second, I want to create benchmarks that target language variation, with an emphasis on non-standard forms in text. This will require collecting data, constructing evaluation metrics that analyze adaptability to different dialects, and ultimately building models for tasks like language normalization. Finally, I want to apply techniques developed by researchers focusing on low-resource languages or multilingualism to other subfields (e.g., multimodality and dialogue systems). For example, applying the pivot method currently used in cross-lingual named entity recognition to instruction following would bolster zero-shot performance on uncommon commands in low-resource languages. All three of these goals are vital to making existing technology accessible to all speakers.

At Carnegie Mellon, I aim to work with Professors Graham Neubig, Daniel Fried, and Maarten Sap. The Language Technologies Institute, especially NeuLab, appeals to me because of its long history of working with understudied languages. Low-resource languages lack support in many ways, so access to researchers experienced in this niche will be invaluable. Additionally, Carnegie Mellon’s strong undergraduate program in computer science will give me opportunities to practice vital soft skills like mentorship and teaching. Carnegie Mellon is the best place for me to use my skills and knowledge to open opportunities for disadvantaged speakers around the world.