
Statement of Purpose Essay - Carnegie Mellon University

Program: PhD, NLP, Speech
Type: PhD
License: CC BY-NC-SA 4.0
Source: Public Success Story (Shikhar Bharadwaj)

The last few years have seen incredible progress in text-based Large Language Models (LLMs). While the textual format is amenable to efficient processing, the information it provides remains incomplete, ungrounded, and ambiguous without other modalities such as audio, images, and video. Humans, in contrast, build a highly grounded world model by drawing on multiple modalities. For instance, a description of a tarsier feels incomplete without its image, and distinguishing sarcasm from excitement can be challenging without auditory context. Speech can also serve as an alternative source of training data for languages with limited text on the web. Driven by these observations, I aim to leverage multiple modalities alongside text to infuse LLMs with grounded world knowledge and to improve them in text-sparse domains. Below, I explain how my experience developing impactful multimodal (speech-text) models, task-aware representation learning methods, and classical NLP systems (semantic parsing) makes me a suitable candidate to pursue this research direction.

**Semantic Parsing and Source Code Understanding**

As a Master's student at the Indian Institute of Science (IISc), I got my first taste of the research process working on semantic parsing with Dr. Shirish Shevade. I developed explainable and efficient methods for the Natural Language to Bash command translation task. Besides serving as a pedagogical tool for inexperienced developers, the task is valuable for increasing developer productivity by automating repetitive work on the Linux command line. While exploring the literature, I realized that despite considerable work on semantic parsing in classical NLP, prior work lacked a trustworthy and explainable system. Motivated by this finding, I proposed an explainable model (published at CoNLL 2021) that grounds the generated Bash commands in structured information from Linux manual pages. Through this work, I came to understand the process of formulating a problem statement, studying previous literature to grasp ideas, and designing, testing, and analyzing solutions. I also learnt how to effectively communicate research through presentations and academic writing. Perhaps the greatest lesson of my Master's thesis work was developing the persistence and tenacity to tackle silent ML bugs! Motivated to make the semantic parsing system more accessible to people with less compute, I used constituency parse trees along with the problem structure to create a resource-efficient model (Oral at NAACL 2022), gaining skills in extracting insights from data and designing effective solutions based on those insights. As a graduate student, I will draw on these experiences in the research process.

Semantic parsing sparked my interest in source code modeling, particularly in exploring novel ways of programming. For instance, can we create a question-answering system to enhance codebase comprehension? As an initial step towards this objective, I developed a semantic querying benchmark over code (accepted at ISEC 2024). Simultaneously, I participated in NLP for Software Engineering competitions (earning second prize at the NLBSE workshop, ICSE 2022) to gain insights into the challenges within this domain. My Master's thesis aimed to make programming more accessible to everyone. As a graduate student, I am keen on leveraging grounded multimodal LLMs to lower the barriers to programming by building novel interfaces.
For example, one could employ UML diagrams, code graphs, and speech to create a conversational platform for writing code.

**Large Scale Multimodality Research**

Pursuing multimodal modeling, I transitioned from text-based NLP to multimodal, multilingual modeling as a Pre-Doctoral Researcher (a two-year residency program) at Google Research. My first project was to develop an ambitious spoken language identification (LangId) system for video covering more than 500 languages. Such a system is useful for mining speech to create more inclusive multimodal LLMs covering languages with limited text on the web. An accurate LangId system also allows for better tagging of data, leading to reliable multilingual training mixtures and credible error analyses of multimodal LLMs. Advised by Dr. Partha Talukdar, Dr. Sriram Ganapathy, and Mr. Ankur Bapna, I led the multimodal modeling for this project (under review at ICASSP 2024). Looking at the data, I realized that a video's title and description carry signal for identifying the language of its spoken content. By incorporating multiple modalities, namely text (from the title and description of a video), speech, and the video upload location, the multimodal LangId system achieved an 8% accuracy boost over the speech-only variant. We are currently pursuing active deployment of this model.

This was my first long-term project with multiple collaborators. From their inputs, I learnt how to design insightful experiments to gain a deeper understanding of the model. This project also reinforced the principle of Occam's razor in my thought process: simple multimodal modeling that can be easily scaled with data usually gives better overall performance. While working on this project, I had to quickly pick up internal Google tools for handling audio data at the scale of 13 million hours and over 500 languages. Besides model training, I was also involved in creating data pipelines for this project. The engineering skills I picked up here also earned me peer recognition for a time-critical contribution to the Google USM work, which shared the same data pipeline.

At Google, I also contributed towards developing a task-aware speech representation learning framework (accepted at INTERSPEECH 2023 and ASRU 2023). We realized that representation learning methods do not utilize the language information present in speech datasets. The proposed methods leverage this information to enhance speech representations, improving performance on spoken language identification without compromising baseline performance on Automatic Speech Recognition (ASR) and non-semantic tasks. Here, I contributed to the ideation of the project, debugging the code, and conducting a holistic evaluation of the models. This project taught me how simple techniques like contrastive learning can be modified in interesting ways to learn better representations. I also contributed to creating, benchmarking, and auditing multimodal evaluation sets for low-resource languages, such as the open-sourced Vaani dataset, a joint effort by Google and IISc. At Google, I learnt to make progress on multiple projects simultaneously and to work in a highly collaborative environment spanning multiple geographies and research teams. I not only refined my technical skills but also delved into more fundamental research problems.

**Future Interests and Graduate Studies**

Through these excursions into multilingual and multimodal learning, I realized that relying solely on text is insufficient for developing LLMs in certain settings.
Many of the world's languages do not have sufficient text on the web for building LLMs. An effective learning method utilizing speech could unlock a larger and more widely available mine of data. Additionally, speech, being a more natural medium for communication, can provide a more inclusive interface to LLMs, especially for the non-literate. Consequently, incorporating speech along with text is imperative for developing more accessible LLMs. More generally, non-text modalities can help in building LLMs for sparser data settings through cross-modality transfer. Therefore, during my graduate studies, a fundamental research direction I want to pursue is modality alignment, especially alignment with limited paired data and via resource-efficient training (in terms of additional parameters and compute). I want to explore how alignment in high-resource domains transfers to low-resource settings. For instance, can we align speech and text for languages that have sufficient paired data, and transfer this alignment to related low-resource languages where one of these modalities is missing? Another interesting question is whether grounding LLMs can help an intelligent agent interact with the real world. For instance, can grounding their semantic representations with images and videos improve their planning capabilities? Drawing on my semantic parsing work, I am also interested in leveraging multiple modalities to build novel interfaces, aiming to make computing more accessible to everyone.

**Why a PhD at CMU**

Multimodal LLMs can provide completely new capabilities as well as a unique opportunity to broaden the impact of text-only systems in text-constrained domains. I believe a PhD at CMU is the next logical step towards gaining the in-depth expertise and the connections necessary for such impactful research. The opportunity to do focused research on a dedicated topic will equip me with the necessary research depth and skills, and the chance to work with the brilliant faculty and inspirational peers at CMU will be crucial in building a collaborative network for my research. Given my experience in semantic parsing and multimodality, I would be excited to work with Prof. Graham Neubig on using multimodal LLMs to build autonomous agents. His work on WebArena is particularly inspiring because of its potential impact in making technology more accessible. I am also keen on contributing to foundational multimodal LLMs with Prof. Neubig, drawing on my experience handling large-scale datasets at Google. My background in speech representation learning, speech-text modeling, and multilinguality aligns with the work of Prof. Shinji Watanabe on foundational models for speech. During my tenure at Google, I was fortunate to work with datasets similar to YODAS, and I aspire to apply these skills to make significant contributions in academia. I am interested in working with Prof. Louis-Philippe Morency, since his work on multimodal alignment resonates with my proposed research direction. I am also drawn to working with Prof. Emma Strubell due to the alignment between SLAB's motivation and my own research drive centered on democratizing NLP; the lab's focus on efficient methods for NLP aligns well with my past work. My long-term goal is to build a career doing research that impacts people's lives in a positive manner.
The PhD program at CMU provides a unique opportunity to do impactful and exciting research, and, leveraging my experience and expertise, I am determined to make the most of it.