
Statement of Purpose Essay - Carnegie Mellon University

Program: PhD, NLP
Type: PhD
License: CC BY-NC-SA 4.0
Source: Public Success Story (Seungone Kim)

My primary research focus lies at the intersection of natural language generation (NLG) and establishing a science of language model behaviors. I am captivated by the versatility of general-purpose language models in tackling a wide range of generation tasks. However, it remains unclear how these models acquire high-level generation behaviors and capabilities from their training data and the dominant learning paradigms. This understanding is critical; as an old Korean saying goes, ‘A tree with deep roots is not shaken by the wind.’ I aim to build an analysis model that explains the factors enabling language models to learn specific capabilities. Concretely, my research interests include: (i) developing fine-grained evaluation frameworks that systematically identify which specific capabilities language models lack, and (ii) exploring the role of synthetic data in inducing desired abilities in language models for further improvement. As in complex systems science, building a solid foundation for understanding language model behaviors will provide insights into enhancing their ability to address more complex tasks.

1 HOLISTIC EVALUATION WITH FINE-GRAINED CRITERIA

Given two different language models, how can we systematically determine which one is better across various scenarios? The ability of language models to generate fluent long-form text makes marginal improvements on classification benchmarks less meaningful for detailed inspection. At the same time, evaluating long-form responses poses a unique challenge, as it is exceptionally difficult to assess the quality of a given text. Conventional methods, such as reference-based evaluation metrics (e.g., ROUGE, BERTScore) or language models acting as judges with coarse-grained criteria (e.g., helpfulness, harmlessness), often fail to capture the depth and granularity that human evaluation offers. I believe the difficulty in evaluating long-form outputs arises from the ambiguity in defining what constitutes a good output. In contrast, humans naturally discern key factors such as creativity, tone, and cultural sensitivity.

In two of my publications [1, 2], my colleagues and I developed an evaluation framework that defines fine-grained criteria for each instance and employs language models as judges. This approach was effective at pinpointing specific areas where language models fall short. For example, when assessing a response about a FinTech startup's strategy, it is more instructive to check for omitted details on crucial aspects such as regulation and compliance than to assign a simplistic harmlessness score. However, scaling up fine-grained evaluation poses a significant challenge, as it requires extensive effort to tailor criteria to specific instances. Furthermore, evaluation benchmarks tend to become saturated over time due to contamination issues, necessitating continuous development. One interesting direction I want to explore is creating a benchmark built around a compositional hierarchy of language model capabilities. The community has already identified several such capabilities (e.g., instruction following, chain-of-thought reasoning, tool usage, self-refinement, theory of mind). We could categorize each evaluation instance, which demands specific abilities, into related sub-tasks with similar criteria.
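To make instance-specific judging concrete, the following is a minimal Python sketch of the idea. The prompt format, the call_judge_model placeholder, and the rubric are illustrative assumptions, not the exact setup used in [1, 2].

```python
# Minimal sketch of instance-specific, fine-grained LLM-as-judge scoring.
# call_judge_model() is a hypothetical placeholder for any chat-completion
# API; the prompt format and rubric are illustrative, not the exact
# setup used in [1, 2].

JUDGE_PROMPT = """You are a strict evaluator. Judge the response only against
the instance-specific rubric below, then give a score from 1 to 5.

### Instruction:
{instruction}

### Response to evaluate:
{response}

### Criterion: {criterion}
### Rubric:
{rubric}

Write a short feedback paragraph, then end with "Score: <1-5>"."""


def call_judge_model(prompt: str) -> str:
    """Hypothetical hook: plug in a real judge model (e.g., an open evaluator LM)."""
    raise NotImplementedError


def fine_grained_evaluate(instruction: str, response: str,
                          criterion: str, rubric: str) -> tuple[str, int]:
    """Return (feedback, score) for one response under one fine-grained criterion."""
    prompt = JUDGE_PROMPT.format(instruction=instruction, response=response,
                                 criterion=criterion, rubric=rubric)
    output = call_judge_model(prompt)
    feedback, _, tail = output.rpartition("Score:")
    return feedback.strip(), int(tail.strip()[0])


# Instance-specific criterion for the FinTech example discussed above.
criterion = "Regulatory awareness"
rubric = ("5 = explicitly covers licensing, compliance, and data-protection "
          "obligations; 3 = mentions regulation only in passing; "
          "1 = ignores regulatory constraints entirely.")
# feedback, score = fine_grained_evaluate(instruction, model_response, criterion, rubric)
```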
Such a compositional hierarchy would also streamline community contributions by enabling the addition of new test prompts while still adhering to a unified, fine-grained evaluation framework. Moreover, it would be an interesting research direction to extend the fine-grained evaluation framework to different modalities. For instance, I am currently exploring how vision-language models could be used as judges to assess text outputs given an image input [3]. The approach of employing fine-grained criteria can also be applied to evaluate image and video outputs generated by diffusion models, as well as audio outputs. In general, evaluating AI-generated outputs using interpretable, detailed criteria will be crucial for gaining a sophisticated understanding of a model's behaviors and its ability to generalize.

2 INDUCING DESIRED CAPABILITIES WITH SYNTHETIC DATA

Another intriguing direction in my research is the use of generative models to produce high-quality data, thereby inducing specific generative behaviors in language models. The recent success of augmenting training with synthetic data to improve instruction-following abilities has been a promising advancement. This progress excites me, as it hints at the potential of synthetic data to unlock even more capabilities. I am keen to explore how synthetic data can be used to specifically address the deficiencies of language models identified through fine-grained evaluation. This is analogous to the human process of learning from mistakes, where the language model is iteratively improved by being exposed to data that targets its specific shortcomings. In particular, I am interested in exploring the concept of continual RLHF: progressively aligning models through user interactions and using these interactions as seed data to generate instances that simulate similar engagements.

One particular area I have explored is the induction of step-by-step reasoning abilities in language models. In my prior publications [4, 5, 6], I demonstrated that incorporating synthetic rationales and commonsense inferences significantly enhances a model's reasoning abilities in various settings. However, a notable finding was that although the trained model excelled at constructing logical flows, it often hallucinated and made simple calculation mistakes. Therefore, as a future direction during my doctoral studies, it would be interesting to identify which capabilities can be acquired through high-quality, machine-generated data and which skills are inherently unattainable, possibly due to constraints such as insufficient parameter size or a lack of essential knowledge.

Additionally, a key question in my research is how we can efficiently induce multiple capabilities in a language model without losing its pre-existing abilities. In some of my recent works [7, 8], my colleagues and I introduced a modular approach that utilizes specialized checkpoints for each task and user-defined criteria. Using post-hoc weight merging, this method not only improved performance on new, unseen tasks but also aligned effectively with diverse user preferences. A notable feature of this system is that new expert models can be added efficiently, without the risk of catastrophic forgetting, by maintaining the differential (‘diff’) weights relative to the original model's weights. This suggests that such a system can successfully integrate new experts and adapt to evolving user interests.
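As a simplified illustration, the sketch below shows post-hoc merging of expert ‘diff’ weights into a base model. The coefficients, parameter names, and toy tensors are assumptions for illustration and do not reproduce the exact recipe of [7, 8].

```python
# Minimal sketch of post-hoc merging of expert 'diff' weights into a base model.
# Assumes each expert checkpoint stores only (expert - base) deltas; the
# coefficients and parameter names are illustrative, not the exact recipe of [7, 8].
import torch


def merge_experts(base: dict[str, torch.Tensor],
                  diffs: list[dict[str, torch.Tensor]],
                  coefficients: list[float]) -> dict[str, torch.Tensor]:
    """Return base + sum_i coefficient_i * diff_i for every parameter tensor."""
    merged = {name: tensor.clone() for name, tensor in base.items()}
    for coeff, diff in zip(coefficients, diffs):
        for name, delta in diff.items():
            merged[name] += coeff * delta
    return merged


# Toy example: one 2x2 "parameter", two experts (e.g., tool use and summarization).
base = {"layer.weight": torch.zeros(2, 2)}
tool_use_diff = {"layer.weight": 0.5 * torch.ones(2, 2)}
summarization_diff = {"layer.weight": -0.2 * torch.ones(2, 2)}
merged = merge_experts(base, [tool_use_diff, summarization_diff], coefficients=[0.7, 0.3])
print(merged["layer.weight"])  # 0.7 * 0.5 + 0.3 * (-0.2) = 0.29 in every entry
```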
However, weight merging presents its own challenges, such as determining the appropriate coefficients for each expert and the substantial storage required to keep all the weights. Investigating a practical way to maintain a suite of desired behaviors through ‘diff’ weights, such as enhanced tool-usage abilities, presents an exciting research avenue.

3 GOALS DURING PH.D. STUDIES AT CMU LTI AND CAREER GOALS

I'm excited about the opportunity to join the CMU LTI program, a place where I can further mature as a researcher. I'm especially looking forward to working with Professors Graham Neubig and Sean Welleck. With Professor Graham Neubig, given our shared interest in comprehensively evaluating language models in a more sophisticated way, I would like to work on understanding how language models fluently generate long-form text. With Professor Sean Welleck, I would like to explore how we could induce creativity and logical abilities in language models so that they can prove theorems and discover new knowledge that is challenging even for humans. In particular, I am keen to explore establishing a predictive model that could explain how different training methods and high-quality data affect how language models acquire certain skills. Compared to scaling laws, which only predict the validation loss, I believe such a predictive model would give us a better understanding of how language models work and offer a clue toward making language models handle tasks that are challenging even for humans. I strongly believe in the power of collaboration for achieving significant research outcomes, and I'm excited to contribute to the dynamic and collaborative environment at CMU.

My long-term goal is to become a Principal Investigator in an academic research lab, focusing on topics with the potential for lasting impact. I'm excited about the independence that an academic career provides, and I look forward to leading research that positively shapes AI's role in society. Additionally, inspired by my mentors and advisors, I'm committed to mentoring future researchers, hoping to pass on the passion and joy for research that I've come to cherish.

REFERENCES

[1] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.

[2] Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets. NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.

[3] Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, and Minjoon Seo. Prometheus-Vision: Inducing Fine-grained Evaluation Capability in Vision-Language Models. To be submitted to ACL 2024.

[4] Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, and Minjoon Seo. The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning. EMNLP 2023.

[5] Seungone Kim, Se June Joo, Yul Jang, Hyungjoo Chae, and Jinyoung Yeo. CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation Verification. EACL 2023.

[6] Seungone Kim, Se June Joo, Hyungjoo Chae, Chaehyeong Kim, Seung-won Hwang, and Jinyoung Yeo. Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization. COLING 2022.

[7] Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging. To be submitted to ICML 2024.

[8] Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Exploring the Benefits of Training Expert Language Models over Instruction Tuning. ICML 2023.