Statement of Purpose Essay - University of Washington
Research Interests

My primary research interest is efficient machine learning, as it helps democratize access to research artifacts. Billion-parameter models have been relatively successful at modeling the massive data distributions scraped from the internet. However, several efficiency concerns are overlooked when scaling up these methods during training. After training, for any application, models need to be compressed before deployment, personalized toward end-users, and continuously updated to prevent distribution drift. Challenges within this pipeline, along with a growing trend away from open research, prevent artifacts from being easily accessible. Thus, I want to work on one of the following efficiency themes during my Ph.D.:

• model and sample efficiency
• model compression for inference efficiency
• efficient model updates for distribution drift and personalization

OLMo

I am fortunate to be a part of OLMo, an open science effort to train a large language model (LLM) at the Allen Institute for AI (AI2). Being actively involved in modeling discussions, running experiments, and writing the online zero-shot evaluation suite for the project has helped me learn about model architectures, training dynamics and stability, and the brittle nature of NLP evaluation, all of which are useful in asking the right research questions. Building on this, my efficiency goal is stable and widely adopted low-precision training. Some initial works [19, 12, 16] have already explored this direction with 8-bit training, but many open questions remain. One such question is why specific training components, like optimizer states or gradient synchronization, require higher-precision representations when stochasticity has been shown to benefit generalization [17]. To address this, I want to analyze the signal-to-noise ratio in a low-bit gradient synchronization setting and develop methods to compensate if noise levels reach the point of training divergence [7]. In the long term, I want to move from low-bit floating-point operations to low-precision training in the integer space; the goal here will be to design new optimizer update rules and gradient propagation methods. Once widely adopted, stable low-precision training will allow researchers to work on ideas currently not feasible in resource-limited settings, advancing model efficiency research.

Model compression

After pretraining, large models power applications via APIs and thus need to be deployed on inference servers with throughput and memory constraints. While the commonly available models today are autoregressive and trained on massive amounts of data, the model compression literature primarily focuses on BERT-style encoder models trained on a limited token budget. In Jha et al. [8], we propose one of the first comprehensive studies of model compression for autoregressive models trained on a large pretraining corpus, evaluated along a compute-versus-performance curve, and find new trends for existing compression methods in the setting of modern LLMs. Overall, the broad scope of our work opens up many new short- and long-term research questions on compression efficiency, sample efficiency, and inference efficiency. We analyze trade-offs between compression and inference efficiency across different model pruning strategies. Comparisons against state-of-the-art pruning strategies [18] identify unstructured pruning as sample-efficient and structured pruning as inference-efficient, as sketched below.
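To make this distinction concrete, the following is a minimal sketch of the two pruning styles using PyTorch's built-in pruning utilities; the layer size and the 50% sparsity level are illustrative assumptions, not the configuration from our paper.

    # Minimal sketch: unstructured vs. structured magnitude pruning with
    # torch.nn.utils.prune. Layer size and sparsity are illustrative only.
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    unstructured_layer = nn.Linear(4096, 4096)
    structured_layer = nn.Linear(4096, 4096)

    # Unstructured: zero the 50% of individual weights with the smallest
    # magnitude. The matrix keeps its shape, so speedups require sparse
    # kernels, but the surviving weights recover quality with relatively
    # few training tokens (sample-efficient).
    prune.l1_unstructured(unstructured_layer, name="weight", amount=0.5)

    # Structured: zero entire output rows (dim=0) ranked by their L2 norm.
    # The zeroed rows can be dropped to form a smaller dense layer that maps
    # directly onto existing hardware (inference-efficient), at the cost of
    # more retraining to recover quality.
    prune.ln_structured(structured_layer, name="weight", amount=0.5, n=2, dim=0)

    # Fold the pruning masks into the weights to make them permanent.
    prune.remove(unstructured_layer, "weight")
    prune.remove(structured_layer, "weight")

In our study the pruned models are full LLMs that we continue pretraining, but this structural difference is what drives the sample-efficiency versus inference-efficiency trade-off noted above.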
In addition, many recent results for compression in smaller data regimes, like combining distillation with pruning [10], do not extend to the large data regime and the evaluation setup of modern LLMs, which we define in our work. As a next step, I want to make structured pruning more sample-efficient via influence functions [6] or the Rho-loss [15]. In the long term, I want to understand from a theoretical perspective why learning from data is better than learning from a teacher via distillation when a vast amount of data is available. We continue training pruned models for an extended token budget and compare their loss curves to that of a randomly initialized model of the same size. This comparison reveals an optima shift for the pruned models: their loss initially decreases monotonically, then deviates for a while, and finally tracks the loss curve of the randomly initialized model. I want to study this behavior through the lens of the lottery ticket hypothesis [4], where we observe a non-winning ticket transitioning into a winning ticket with extended training. Furthermore, from the perspective of compression efficiency, I want to explore mode connectivity [5] to design pruning strategies that lead to a faster optima shift. Finally, our conclusions about the prunability of attention outputs suggest that we can further speed up FlashAttention [3] for inference via sparse kernels.

Retrieval models

Retrieval-augmented models have been shown to reduce hallucinations [1], improve performance on the long tail [11], and work with protected data [13]. However, the use of retrieval for efficient generation, continuously updated models, and personalization is under-explored. To this end, I want to work on models that can retrieve long phrases per timestep and retrieve from multiple data stores in parallel without explicitly training on their data distributions. I have recently started working on an efficient retrieval-only causal text generator, expanding on the ideas of retrieval-only infilling from NPM [14] and long-phrase retrieval-based generation from COG [9]. I plan to use a suffix automaton [2] to scale the contrastive loss and produce higher-quality long-phrase generations per timestep than previous methods. In the future, I want to pursue research on retrieving from multiple data stores in parallel, a direction that ties language models to ideas from classical information retrieval. Such a framework will enable efficiency by parallelizing retrieval, end-user personalization by providing an abstraction to re-rank retrieved candidates from different data sources, and continuous model updates that prevent distribution drift by hot-swapping data stores.

Why a Ph.D.?

Before AI2, I worked at PyTorch Lightning, a startup creating open-source libraries to simplify model training. At Lightning, I co-authored TorchMetrics, an API for implementing distributed metrics for research reproducibility. Until then, my contributions to open science came from an engineering perspective, making me a confident engineer. But the experience taught me that research and engineering go hand-in-hand in deep learning. My predoctoral stint at AI2 has given me an excellent platform to develop research ideas and build scientific rigor. Now, I am motivated to pursue a Ph.D. to develop my research skills further and continue contributing toward open and reproducible science from a research perspective.
Open science is critical because it gives a diverse group of people democratic control over the community's research direction, to everyone's benefit. However, more labs are moving toward proprietary research that benefits individual entities. Working on efficient machine learning helps academia remain at the forefront of research, which is one way to keep research democratized. A Ph.D. at UW is a step toward my goal of staying in academia and being one of the voices advocating to keep research open and democratic. During my Ph.D., I am interested in working with Noah Smith and Luke Zettlemoyer on model and sample efficiency. I am also interested in working with Hannaneh Hajishirzi on retrieval models and personalization. After my Ph.D., I plan to become a professor and lead a research group. As a PI, I want to bring diverse perspectives into the group, ask questions, and mentor people in a style that fits each of them. I want to establish a culture of scientific rigor and slow science, make claims only with proper evidence, and avoid putting out minimum publishable units.

References

[1] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ArXiv, abs/2310.11511, 2023.
[2] Anselm Blumer, J. Blumer, David Haussler, Andrzej Ehrenfeucht, M. T. Chen, and Joel I. Seiferas. The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci., 40:31–55, 1985.
[3] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. ArXiv, abs/2205.14135, 2022.
[4] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ArXiv, abs/1803.03635, 2018.
[5] T. Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P. Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. ArXiv, abs/1802.10026, 2018.
[6] Roger Baker Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, and Sam Bowman. Studying large language model generalization with influence functions. ArXiv, abs/2308.03296, 2023.
[7] Kanan Gupta, Jonathan W. Siegel, and Stephan Wojtowytsch. Achieving acceleration despite very noisy gradients. ArXiv, abs/2302.05515, 2023.
[8] Ananya Harsh Jha, Tom Sherborne, Evan Pete Walsh, Dirk Groeneveld, Emma Strubell, and Iz Beltagy. How to train your (compressed) large language model, 2023.
[9] Tian Lan, Deng Cai, Yan Wang, Heyan Huang, and Xian-Ling Mao. Copy is all you need. ArXiv, abs/2307.06962, 2023.
[10] Chen Liang, Haoming Jiang, Zheng Li, Xianfeng Tang, Bin Yin, and Tuo Zhao. HomoDistil: Homotopic task-agnostic distillation of pre-trained transformers. ArXiv, abs/2302.09632, 2023.
[11] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Annual Meeting of the Association for Computational Linguistics, 2022.
[12] Naveen Mellempudi, Sudarshan M. Srinivasan, Dipankar Das, and Bharat Kaul. Mixed precision training with 8-bit floating point. ArXiv, abs/1905.12334, 2019.
[13] Sewon Min, Suchin Gururangan, Eric Wallace, Hannaneh Hajishirzi, Noah A. Smith, and Luke Zettlemoyer. SILO language models: Isolating legal risk in a nonparametric datastore. ArXiv, abs/2308.04430, 2023.
[14] Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, and Luke Zettlemoyer. Nonparametric masked language modeling. In Annual Meeting of the Association for Computational Linguistics, 2022.
[15] Sören Mindermann, Muhammed Razzak, Winnie Xu, Andreas Kirsch, Mrinank Sharma, Adrien Morisot, Aidan N. Gomez, Sebastian Farquhar, Janina Brauner, and Yarin Gal. Prioritized training on points that are learnable, worth learning, and not yet learnt. ArXiv, abs/2206.07137, 2022.
[16] Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, and Peng Cheng. FP8-LM: Training FP8 large language models. ArXiv, abs/2310.18313, 2023.
[17] Samuel L. Smith, Erich Elsen, and Soham De. On the generalization benefit of noise in stochastic gradient descent. In International Conference on Machine Learning, 2020.
[18] Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. ArXiv, abs/2306.11695, 2023.
[19] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and K. Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. ArXiv, abs/1812.08011, 2018.