
Statement of Purpose Essay - University of Washington

Program: PhD, NLP/ML
Type: PhD
License: CC BY-NC-SA 4.0
Source: Public Success Story (Luiza Pozzobon)

I aim to perform fundamental research in both machine learning and natural language processing. I want to understand how learning takes place and how to leverage that understanding to build robust and efficient models through curiosity-driven research. In contrast to the mainstream direction, research that focuses on scaling down instead of up fascinates me: it is not only technically demanding, but it also pushes the field toward understanding how to control for a given behavior. My research taste was shaped mainly by my master's degree at UNICAMP and my time as a Research Scholar at Cohere For AI. My eight months as a Research Scholar yielded two first-author papers, both accepted to EMNLP 2023 [Pozzobon et al., 2023a,b], and I also collaborated on a colleague's paper accepted to the ATTRIB Workshop at NeurIPS 2023. Building on those projects, my current interests lie at the intersection of data-centric modeling, interpretability, and safety.

Data-centric modeling: building robust, efficient, and generalizable models.

What is the role of a datapoint in training dynamics? How can models trained on dissimilar subsets of data achieve such similar downstream performance? These questions captivated me during and after my collaboration on a data pruning project [Marion et al., 2023] as a Research Scholar at Cohere For AI. We found that, with only 30% of the pretraining dataset, a model can match or even surpass the performance of one trained on 100% of the data, and results are strikingly similar across most subsets. This holds only when the number of training steps is kept the same, which might indicate that, for a given data domain, model performance is tied more closely to the number of gradient updates than to the number of unique instances in the dataset. It also aligns with prior work showing that language datapoints can be seen up to four times before learning is impacted (a small sketch of this bookkeeping follows this section). This raises further questions: can we quantify the "bottleneck capacity" of a dataset a priori? How can we learn which data to learn from? How can we synthesize data reliably so that smaller models achieve performance similar to large ones? And, at the intersection of data and efficiency: how can we learn generalizable features from few datapoints, beating so-called scaling laws?

I first became interested in generalization during the first year of my master's degree, when I focused on the generalization of reinforcement learning algorithms in Contextual Markov Decision Processes (CMDPs), also referred to as the problem of zero-shot policy transfer. I did not have the opportunity to study these questions more deeply at the time; however, out-of-distribution generalization and uncertainty estimation still intrigue me, as they are challenging and ill-defined topics in many fields beyond RL. For language modeling, clear definitions, evaluations, and a taxonomy of types of generalization are only now surfacing, and it is unclear to me what generalization even means as datasets grow ever larger. I also believe uncertainty estimation is currently underexplored in NLP. We have recently seen early work on self-improvement and self-correction, and I believe this area will receive more attention in the coming years. Coupled with effective synthetic data, this could be the way forward to high-performing models that require fewer annotated datapoints and have a smaller impact on the world, as training LLMs is still extremely expensive.
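To make the data pruning reasoning concrete, here is a minimal, back-of-the-envelope sketch of the step-count bookkeeping, not a description of the actual experimental setup in [Marion et al., 2023]; the token budget, batch size, and keep fractions are assumptions chosen purely for illustration.

```python
# Back-of-the-envelope check: if the total number of gradient updates is held
# fixed, how many times is each surviving datapoint revisited after pruning?
# The token budget, batch size, and keep fractions are illustrative assumptions.

TOTAL_TOKENS = 100_000_000_000   # assumed pretraining budget, in tokens
TOKENS_PER_STEP = 2_000_000      # assumed effective batch size, in tokens per step

total_steps = TOTAL_TOKENS // TOKENS_PER_STEP

for keep_fraction in (1.0, 0.5, 0.3, 0.1):
    unique_tokens = int(TOTAL_TOKENS * keep_fraction)
    # Training for the same number of steps on a pruned subset means each
    # surviving datapoint is seen roughly 1 / keep_fraction times.
    repeats = TOTAL_TOKENS / unique_tokens
    note = "  <-- beyond the ~4-repeat regime noted in prior work" if repeats > 4 else ""
    print(f"keep {keep_fraction:>4.0%}: {total_steps:,} steps, "
          f"~{repeats:.1f} passes over each datapoint{note}")
```

Under this accounting, keeping 30% of the data with a fixed step budget corresponds to roughly 3.3 passes over each surviving datapoint, still within the regime where repetition is reported not to hurt learning.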
Interpretability & Safety.

Still on language modeling, a question that interests me is the property of in-context learning. It has been deemed an "emergent ability" of large language models that enables zero- and few-shot transfer. I wonder, though, whether it is truly an emergent property or whether a smaller model could exhibit the same properties as a larger one. This relates to my interest in efficiency, but at its core it is a question of model understanding. I find this type of research, investigating unexpected model behavior, to be among the most exciting.

With a few peers from my master's, I worked on understanding how generative methods for image synthesis behave when given datasets with varying domain distributions. Would they respect the frequency of each domain in the training data when generating images unconditionally? We found that diffusion models and certain flavors of GANs do follow the training distribution, but some models do not, which indicates how particular algorithmic choices can lead to these undesired behaviors.

Similarly, while in the Scholars Program I discovered unexpected behavior in black-box APIs backed by ML models [Pozzobon et al., 2023a]: the non-reproducibility these APIs cause in research when changes are not clearly communicated to users. This is a simple but highly consequential finding, as ML algorithms are increasingly coupled to everyday applications and unexpected behavior leads to systems that eventually fail. I discovered it while working on toxicity mitigation with retrieval-augmented models [Pozzobon et al., 2023b], and both works were accepted to EMNLP 2023. I am currently expanding this work to multilingual toxicity mitigation. Harms and bias in multilingual settings are severely underexplored, and the few works that pursue this path do so only for high-resource languages and classification tasks. I aim for this work to be a starting point for measuring harm before deploying multilingual models.
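As a minimal sketch of the kind of check that surfaces this non-reproducibility, one can re-score a fixed probe set against a black-box toxicity API at two points in time and compare the results. The `score_toxicity` callable below is a hypothetical stand-in for whichever hosted API is being audited, and the probe sentences, saved scores, and drift tolerance are illustrative assumptions.

```python
import statistics
from typing import Callable, Sequence


def compare_snapshots(
    texts: Sequence[str],
    old_scores: Sequence[float],
    score_toxicity: Callable[[str], float],
    tolerance: float = 0.05,
) -> None:
    """Re-score a fixed probe set and flag texts whose toxicity score drifted.

    `score_toxicity` is a hypothetical stand-in for a black-box API call;
    `old_scores` were recorded from an earlier run over the same probe set.
    """
    new_scores = [score_toxicity(t) for t in texts]
    drifts = [abs(new - old) for new, old in zip(new_scores, old_scores)]
    print(f"mean absolute drift: {statistics.mean(drifts):.3f}")
    for text, old, new, drift in zip(texts, old_scores, new_scores, drifts):
        if drift > tolerance:
            print(f"  drifted beyond tolerance: {old:.2f} -> {new:.2f} | {text!r}")


if __name__ == "__main__":
    # Illustrative usage with a fake scorer standing in for the real API.
    probes = ["you are wonderful", "you are an idiot", "get lost"]
    saved = [0.03, 0.91, 0.62]   # scores recorded at an earlier date

    def fake_api(text: str) -> float:   # placeholder scorer, not a real API
        return 0.88 if "idiot" in text else 0.10

    compare_snapshots(probes, saved, fake_api)
```

If the provider silently updates the model behind the API, a check like this is often the only way downstream researchers can notice that previously reported toxicity numbers are no longer comparable.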
A note on future directions and style.

As an underprivileged student, both LGBTQ+ and Latina, I have first-hand experience of how valuable mentors can be and how extremely fortunate I was to encounter mine. As they have changed my life, I would like to help unlock paths and ultimately change the life prospects of other students. I plan to do this actively by collaborating with more junior students and through TA positions. Richard Hamming once wrote that the most successful researchers he knew were the ones who left their office doors open. I envision these collaborations and mentorships as one of the primary ways "my doors" will stay open, aiming not specifically for success but to build an active, engaged, and caring research community around me.

At UW, I look forward to working with professors Luke Zettlemoyer, Ludwig Schmidt, and Noah Smith. With Prof. Luke, I aim to learn how to be an effective empiricist while pursuing the intersection of NLP and decision-making under uncertainty; I am especially excited about his recent work on making sense of large language model behavior. With Prof. Ludwig, I aim to demystify the role of data in the various stages of learning and to leverage that understanding to build reliable and generalizable models. Finally, with Prof. Noah, I look forward to working on socially aware research while unveiling how algorithms truly work and how they impact society. More broadly, I look forward to being part of the collaborative and friendly community UW is known for.

References

Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, and Sara Hooker. "When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale." arXiv preprint arXiv:2309.04564, 2023. URL https://arxiv.org/abs/2309.04564.

Luiza Pozzobon, Beyza Ermis, Patrick Lewis, and Sara Hooker. "On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research." arXiv preprint arXiv:2304.12397, 2023a. URL https://arxiv.org/abs/2304.12397.

Luiza Pozzobon, Beyza Ermis, Patrick Lewis, and Sara Hooker. "Goodtriever: Adaptive Toxicity Mitigation with Retrieval-Augmented Models." arXiv preprint arXiv:2310.07589, 2023b. URL https://arxiv.org/abs/2310.07589.