
Statement of Purpose Essay - University of Texas at Austin

Program: Ph.D., NLP, ML
Type: Ph.D.
License: CC BY-NC-SA 4.0
Source: Public Success Story (Zeyu Leo Liu)

Pretraining language models on large amounts of unsupervised text has driven substantial progress on natural language processing (NLP) benchmarks. While these new classes of models have closed the gap between human and machine performance on many tasks, they also raise important cautionary questions about how we accurately measure (evaluate and explain) progress in the field. Models have been shown to be brittle (Jia and Liang, 2017), high-quality datasets are expensive to construct (Marcus et al., 1993), and the behavior of pretrained language models is hard to explain. Furthermore, as NLP technologies become more useful to the public at large, practical questions must be addressed: modern NLP is expensive (Strubell et al., 2019), and current modeling choices (architecture, training procedure, etc.) are suboptimal. At a high level, I believe evaluation & explainability techniques walk hand in hand with modeling techniques, where evolution in one drives evolution in the other. Prospectively, I believe taking a causal perspective could be a unifying force that drives progress on both fronts. During my Ph.D., I am excited to explore new NLP techniques and problems related to evaluation & explainability and to modeling. I have had the fortune to do initial work in these areas, and I hope to pursue projects inspired by this causal angle during my Ph.D.

Better modeling helps us improve evaluation & explainability

Recent large-scale pretrained language models have drastically changed the landscape of NLP (Devlin et al., 2019; Brown et al., 2020). As practitioners deploy NLP models in more practical scenarios, new desiderata for evaluation & explainability techniques appear. One crucial problem is understanding the robustness of these models under a wide range of conditions; however, the cost of creating evaluation datasets is a critical obstacle. Gardner et al. (2020) posit that a model's decision boundary can be well understood by applying the model to contrastive examples. Inspired by this, I proposed to generate contrastive examples automatically by linguistically perturbing the semantic representations produced by an existing linguistic parser (DELPH-IN, 2011). We found that models fine-tuned on the textual entailment task are fragile even under simple linguistic variation, such as changes in English tense (Li* et al., 2020; as first author, appeared at BlackboxNLP @ EMNLP 2020).

This experience with downstream models shows the importance of understanding the pretraining procedure, a problem that was less pronounced with smaller-scale and usually weaker models. Since pretrained language models learn from a large corpus over a long series of parameter updates, learning how such models behave during this process would give future practitioners better ideas for "debugging" this fragility. Shortly before I began working on this problem, various lightweight methods (so-called "probes") were proposed to test different kinds of knowledge. I applied those techniques in Liu* et al. (2021) (as first author; appeared in Findings of EMNLP 2021) to probe the model across training time. We were the first to systematically reveal that RoBERTa follows a curriculum during pretraining: it learns different kinds of knowledge in sequence (linguistic > factual > commonsense) and has very limited reasoning ability.
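To make this checkpoint-probing setup concrete, the following is a minimal sketch rather than the exact pipeline of Liu* et al. (2021): it assumes frozen sentence representations have already been extracted from a handful of pretraining checkpoints, and it fits a lightweight logistic-regression probe on each checkpoint's features. The checkpoint names, feature matrices, and labels below are placeholders.

    # Minimal sketch: probe the same task on representations from several
    # pretraining checkpoints. Features/labels here are random placeholders;
    # in practice they would come from a frozen checkpoint and a probing dataset.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    def probe_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
        """Fit a lightweight linear probe on frozen features; report held-out accuracy."""
        x_train, x_test, y_train, y_test = train_test_split(
            features, labels, test_size=0.2, random_state=0
        )
        probe = LogisticRegression(max_iter=1000)
        probe.fit(x_train, y_train)
        return probe.score(x_test, y_test)

    # Placeholder data: 1,000 examples, 768-dim representations, binary labels
    # (e.g., whether a sentence exhibits some linguistic or factual property).
    labels = rng.integers(0, 2, size=1000)
    checkpoints = {
        "step_10k": rng.normal(size=(1000, 768)),
        "step_100k": rng.normal(size=(1000, 768)),
        "step_1m": rng.normal(size=(1000, 768)),
    }

    for name, feats in checkpoints.items():
        print(f"{name}: probe accuracy = {probe_accuracy(feats, labels):.3f}")

Tracking such probe accuracies over pretraining steps is what reveals the order in which different kinds of knowledge emerge.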
Even with these advances, current techniques for evaluation & explainability do not stand on unified theoretical ground. Specifically, during my Ph.D., I want to strengthen evaluation & explainability with a causal graph. Taking explainability as an example, Clark et al. (2019) previously found that certain attention heads correspond to certain features of the input (e.g., nouns). Although interesting, such a finding is too fine-grained to provide a high-level characterization of model-internal representations (e.g., neurons) and their roles in input/output behavior (Geiger et al., 2021). Imposing a causal graph on groups of neurons could also inspire new ways of modeling, e.g., model distillation (Wu et al., 2022). This leads to a natural follow-up to my work (Liu* et al., 2021): I hope to extend the techniques of Geiger et al. (2021) to construct a unified causal graph that explains how different types of knowledge (corresponding to different groups of nodes in the causal graph) interact in the final model and how this structure arises during pretraining.

Better ways to evaluate and explain models inspire better modeling techniques

Analysis results and interpretations offer exciting insights into language models. They also motivate researchers to propose more sample-efficient, robust, and transparent language models, which in turn drive the evolution of evaluation & explainability. I worked in this direction during my Meta AI Residency. The feed-forward module in the transformer can be explained as a key-value memory (Geva et al., 2021); a sketch of this view appears at the end of this section. Based on it, we investigated a unified view of efforts to scale up a feed-forward module's parameter count. As a result, with the same amount of data, we found a parameter-tying scheme that improves performance (and is thus more sample-efficient) at little additional compute cost and that also enables better interpretability (Liu et al., 2023; as first author, submitted to ICLR 2023, review scores 8;6;6;5). In the future, I am interested in developing efficient transformer designs, not only to give practitioners more manageable models but also to help understand their core underlying mechanisms by replacing inefficient parts of the current design. As a first step, a common problem with newly proposed efficient methods is that their promising results may not extrapolate to larger-scale training. I therefore hope to propose a representation metric or benchmark that is more indicative of the quality of the representations learned by efficient models.

Besides architectural inductive bias, I am also interested in other routes to improving model performance. I helped develop methods that insert emergent-communication training between pretraining and fine-tuning (Downey et al., 2022; Steinert-Threlkeld et al., 2022; as third author, awarded Runner-up Best Paper at EmeCom @ ICLR 2022). Such goal-oriented training improves model performance on unsupervised machine translation, especially in low-resource settings. In the future, I plan to extend my interest in causality to modeling techniques: causal representation learning for NLP. During my Ph.D., I want to improve representation learning algorithms to encode more causal and disentangled information (e.g., some dimensions of contextual representations control the topic of a word, while others govern other high-level, human-interpretable semantics). There are preliminary techniques for domain-invariant representation learning, and researchers have tried to specify desiderata for representation learning (Wang and Jordan, 2021), but adapting them to NLP requires further consideration of the choice of task, evaluation, and training procedure.
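As referenced above, here is a minimal sketch of the key-value-memory view of the transformer feed-forward module (Geva et al., 2021): the first projection's rows act as keys matched against the input token representation, and the second projection's rows act as values mixed according to those match scores. The dimensions and weights below are random placeholders, not those of any pretrained model.

    # Minimal sketch of a transformer FFN written in key-value form:
    # FFN(x) = f(x K^T) V, where each row of K is a key and each row of V a value.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff = 768, 3072

    K = rng.normal(scale=0.02, size=(d_ff, d_model))  # keys: one per hidden unit
    V = rng.normal(scale=0.02, size=(d_ff, d_model))  # values: one per hidden unit

    def ffn(x: np.ndarray) -> np.ndarray:
        """Apply the feed-forward block to token representations x of shape (n, d_model)."""
        scores = np.maximum(x @ K.T, 0.0)  # ReLU "memory coefficients" over the d_ff keys
        return scores @ V                  # weighted sum of the corresponding values

    x = rng.normal(size=(1, d_model))      # a single token representation
    out = ffn(x)

    # Which "memories" (keys) fire most strongly for this token?
    top_keys = np.argsort(-np.maximum(x @ K.T, 0.0)[0])[:5]
    print("output shape:", out.shape, "| top-activating keys:", top_keys)

One way to read the parameter tying mentioned above is as sharing such key/value tables (or parts of them) across feed-forward blocks; the sketch does not implement that, but it shows why inspecting the top-activating keys gives an interpretable handle on what a block retrieves.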
Future Plans

I am passionate about exploring fundamental, scientific questions. Meanwhile, I wish to build accessible and well-structured tools that bring more NLP practitioners to work on those problems. Joining a Ph.D. program gives me a structured way to hone my skills as a researcher and offers me a well-rounded research community to facilitate my growth. My industry experience at Meta AI has also confirmed that being a professor is the career path for me: an academic career gives me unique opportunities to help future generations learn and do research. Having worked as a TA for many terms, I would love to explore options for advising younger students throughout my graduate program.

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In BlackboxNLP@ACL.

DELPH-IN. 2011. ACE: the Answer Constraint Engine. http://sweaglesw.org/linguistics/ace/.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

C. M. Downey, Leo Z. Liu, Xuhui Zhou, and Shane Steinert-Threlkeld. 2022. Learning to translate by learning to communicate. CoRR, abs/2207.07025.

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, pages 1307–1323. Association for Computational Linguistics.

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. 2021. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 9574–9586.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November 2021, pages 5484–5495. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Chuanrong Li*, Lin Shengshuo*, Leo Z. Liu*, Xinyi Wu*, Xuhui Zhou*, and Shane Steinert-Threlkeld. 2020. Linguistically-informed transformations (LIT): A method for automatically generating contrast sets. In Proc. of the BlackboxNLP Workshop at EMNLP (poster). (*: authors sorted alphabetically.) http://arxiv.org/abs/2010.08580.

Leo Z. Liu, Tim Dettmers, Xi Victoria Lin, Veselin Stoyanov, and Xian Li. 2023. Towards a unified view of sparse feed-forward network in transformer. In submission to ICLR 2023 (review scores 8;6;6;5).

Leo Z. Liu*, Yizhong Wang*, Jungo Kasai, Hannaneh Hajishirzi, and Noah A. Smith. 2021. Probing across time: What does RoBERTa know and when? In Findings of EMNLP and Proc. of the BlackboxNLP Workshop at EMNLP (poster). (*: equal contribution.) https://aclanthology.org/2021.findings-emnlp.71.pdf.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Shane Steinert-Threlkeld, Xuhui Zhou, Leo Z. Liu, and C. M. Downey. 2022. Emergent communication fine-tuning (EC-FT) for pretrained language models. In Proc. of the EmeCom Workshop at ICLR (Runner-up Best Paper). https://openreview.net/pdf?id=SUqrM7WR7W5.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.

Yixin Wang and Michael I. Jordan. 2021. Desiderata for representation learning: A causal perspective. CoRR, abs/2109.03795.

Zhengxuan Wu, Atticus Geiger, Joshua Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, and Noah D. Goodman. 2022. Causal distillation for language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 4288–4295. Association for Computational Linguistics.