Statement of Purpose Essay - Columbia University

Program: Ph.D., NLP, Speech
Type: Ph.D.
License: CC BY-NC-SA 4.0
Source: Public Success Story (Siyan Li)

Statement of Purpose

I am passionate about uncovering the path toward a socially aware dialogue agent capable of human-like communication. My work lies at the intersection of Natural Language Processing (NLP), Machine Learning (ML), and Human-Computer Interaction (HCI). Ever since I watched Steven Spielberg's Artificial Intelligence, which features David, a robot boy who yearns to be human and to be loved, the dream of building a similar companion has driven me toward a career in Computer Science. David communicates, emotes, and comprehends as a picture-perfect embodied agent. My ideal dialogue agent should display creativity in long-form conversations, reciprocate in a socially appropriate manner, and handle real-life stochasticity with grace. Current virtual assistants remain far below this ideal, failing to maintain long-term conversations and behaving unnaturally in interactions. Over three years of research, I have made significant contributions to NLP through eight publications with 40+ citations at top-tier NLP conferences. In addition to research, I value the experience of mentoring students and teaching, and I aspire to become a faculty member one day. By pursuing a Ph.D. at Columbia University, I can push the frontier in conversational AI, informed by my experience in both narrative generation and dialogue systems.

My undergraduate research in automated narrative generation with Prof. Mark Riedl at the Georgia Institute of Technology can help me make agents creative in long-form dialogues. Long-form dialogues are challenging because they require understanding and memorizing current dialogue states to avoid conflicting responses, a capability that current agents lack. Over my 1.5 years at Georgia Tech, my research resulted in publications on reducing non-normative content in generations (INLG 2020 [1]), surveying narrative generation (NAACL 2021 [2]), and improving narrative logical coherence through COMET-based filtering (EMNLP 2022 [3]). Through conducting Mechanical Turk studies for these projects, I learned that human evaluation of natural language generation remains challenging, and that lower perplexity does not equal better creativity. Therefore, I believe that human evaluations should be more prominent in the research pipeline, especially for the user-facing systems I will build throughout my Ph.D. My recent first-author EMNLP 2022 [4] work with Prof. Christopher Potts further showcases limitations in language model generations by demonstrating a lack of evidence for GPT-3's structural knowledge of noun compounds. My research at Georgia Tech and Stanford University has taught me the challenges of narrative generation, namely that language models often fail to generate logically coherent long stories, and has only inspired me to study these problems further. After all, it is nice to have your voice agent crack a joke organically after a long day or make up short stories to teach you brilliant concepts.

Instilling long-term coherence into dialogue agents is but one step toward more natural dialogue agents. An equally important, parallel dimension revolves around imbuing agents with more human-like speech behaviors, a problem I tackled during my Master's at Stanford University. Humans tend to feel more comfortable communicating with agents that exhibit human-like behaviors; Siri, for instance, does not stop talking when a user interjects, something that rarely happens between humans. Therefore, it is crucial to make agents display such natural speech behaviors to improve user experience and realize the full potential of these dialogue agents. In the fall of 2021, I began collaborating with Prof. Christopher Manning and Dr. Ashwin Paranjape to provide spoken dialogue agents with better turn-taking capabilities. Current-day voice agents wait for a period of silence before responding to users, resulting in jagged, unnatural conversational flows that damage the user experience. To address this problem, we trained a model to continually predict the answer to "how long until my next utterance?". I researched and analyzed prior literature, wrote the code base, and proposed potential research directions. The findings of my research resulted in a first-author paper at SIGDIAL 2022 [5] and an article by Stanford HAI. Building a quick Flask demo for the system was one of my most cherished moments: I realized then that my truest fulfillment stems from implementing user-facing AI systems that handle real-world stochasticity with grace.

Having scratched the surface of natural speech behaviors, I am now shifting to integrating socially appropriate conversational behaviors into spoken dialogue agents. Inspired by Meta's work on modeling speech through acoustic "units," Prof. Manning and I intend to turn dialogue agent generations into natural response audio. This project will build upon Meta's unit-to-speech pipeline, reframing text-to-speech as text-to-unit-to-speech to generate naturalistic artifacts (e.g., hesitations and laughter) often captured by unit-based approaches. Next quarter, I will mentor an undergraduate to train a model that translates generated dialogue utterances into more disfluent and natural utterances (e.g., "I just had dinner" → "I, uh, I just had dinner"). Meanwhile, I joined the Amazon Alexa Prize team this year, giving me access to real, live users not typically accessible in academia. This opportunity will inform my future dialogue research on better conversational behaviors and more realistic modeling of speaker dynamics.

During the summer of 2022, I worked with stochastic, real-world data while studying face-to-face negotiations with Prof. Jonathan Gratch. We explored associating backchannels (short utterances like "yeah" and "uh-huh," as well as head nods and shakes) with negotiation outcomes. Existing tools were imperfect: the facial action unit analyzer faltered when participants moved beyond certain positions or wore glasses, and our speech recognizer missed and mistook words. To address these drawbacks, I supplied our model with more reliable features, including head positions and audio. Despite the obstacles, I trained a model to predict frame-by-frame backchannel opportunities, which we used as a diagnostic tool for negotiation conversations. In preliminary results, I found positive, statistically significant correlations between the model's F1 score and the rapport felt by partners. We are still developing this work and plan to submit it to the journal Computers in Human Behavior in the coming months. Overall, this project prepared me for a human-centric research career by exposing me to widely different human behaviors, with all their stochasticity, and by requiring extensive statistical analyses.

Future Agenda: My future research would tentatively continue my prior work, improving agent capabilities for coherent long-form conversations and the naturalness of agent utterances. I am interested in collaborating with Prof. Zhou Yu because of her work on dialogue generation and real-time interactive systems, and with Prof. Julia Hirschberg for her research on spoken dialogue systems. Given our overlapping research interests, I am confident that a Ph.D. at Columbia will enable me to contribute significantly to these research directions in pursuit of bringing a dialogue agent such as David to life.