Statement of Purpose Essay - Carnegie Mellon University
Paul Pu Liang (pliang@cs.cmu.edu) – Statement of Purpose

Intellectual Merit: I graduated with a Bachelor's degree with University Honors from the School of Computer Science at Carnegie Mellon University (CMU), completing the typical four-year program in three years. As an undergraduate, I earned the highest letter grade in PhD courses in Machine Learning (ML), Convex Optimization, Deep Reinforcement Learning and Advanced Multimodal ML. I am now a Master's student in Machine Learning at CMU with an interest in multimodal deep learning. I find this field fascinating because it immerses me in Natural Language Processing (NLP), Computer Vision (CV) and Speech Processing simultaneously. Human communication is inherently multimodal: humans express their intents through spoken language, facial gestures, body language and variations in tone of voice. Yet while Artificial Intelligence (AI) has achieved success in many domains, comprehending human communication remains a challenge. For these reasons, I joined the MultiComp Lab under the guidance of Prof. Louis-Philippe Morency, where I am actively involved in research that brings computers closer to understanding human communication.

I was a key contributor to a novel model for multimodal fusion, alignment and representation, the Multi-Attention Recurrent Network. The model discovers interactions between modalities through time (e.g. between spoken language and expressed gestures) to learn complex inter-modality and intra-modality dynamics. We proposed the Long-short Term Hybrid Memory, which allows recurrent units to carry joint multimodal representations. I performed all the experiments, and it was rewarding to see our model significantly outperform the state of the art on six benchmark datasets for sentiment analysis, emotion recognition and personality trait prediction, giving AI new tools to understand human communication. During this project, I also discovered a love for academic writing and visualization. Using multiple sets of attention units to discover co-occurring temporal dynamics across modalities, our visualizations showed how the model focuses on specific parts of gestures and language to internally "understand" face-to-face communication. Our paper received outstanding reviews and will be published at AAAI 2018, a top AI conference. I was honored to see the quality of my work recognized by the research community.

I also investigated models for multi-view sequential learning. We proposed the Memory Fusion Network, a novel multimodal neural memory that tackles misaligned temporal information across multiple views. This is directly applicable to human communication modeling, since humans communicate asynchronously: a smile can appear only after positive intent has already been conveyed through spoken language, voice or body language. By fusing view-specific and cross-view interactions, the model significantly outperforms current baselines on multi-view datasets. I was involved in the implementation, experiments, paper write-up and visualizations. This work will also be published at AAAI 2018.

In spring 2017, my collaborators and I developed a multimodal neural model with word-level alignment and a Gated Multimodal Embedding layer trained using reinforcement learning.
The novelty was two-fold: (1) alignment: word-level alignment improves on previous models that aggregate information across time and lose crucial temporal information, and (2) fusion: the Gated Multimodal Embedding layer selectively filters out noisy modalities. This model advanced the state of the art and opened a promising research direction in word-level alignment for multimodal datasets. The work was enthusiastically received by the community and was accepted as an oral presentation at ACM ICMI 2017, where it won an Honorable Mention award.

I have also contributed to the scientific community through datasets and software. I am a co-creator of the SQA (Social Question Answering) dataset. As AI is increasingly used in human interactions, it is imperative that it comprehends the subtleties of social behavior. This novel QA dataset benchmarks AI's comprehension of human social interactions, behaviors and nuances, and defines a new research direction: building socially intelligent machines that must infer the high-level multimodal semantics underlying human actions and motivations in order to understand complex social interactions. I am also a co-creator of the MOSEI (Multimodal Opinion-Level Sentiment and Emotion Intensity) dataset, the largest dataset in multimodal sentiment analysis and emotion recognition. For both datasets, I contributed to data analysis of YouTube videos before creating training videos for annotation, and to feature extraction, model implementation, experiments and writing. This was eye-opening: I experienced first-hand the difficulties of balancing datasets, ensuring bias-free experimental procedures and designing incentives to obtain accurate crowdsourced annotations. I currently manage more than 40 student annotators; no annotation effort of this scale has been undertaken in the department before. Both datasets will be released with submissions to CVPR 2018 and ECCV 2018. On the software side, I am a co-creator of the CMU Multimodal Data SDK, a valuable resource for encouraging multimodal research. Since dealing with complex multimodal data can be intimidating, the SDK provides accurate loading of benchmark multimodal datasets in sentiment analysis, emotion recognition and personality trait prediction. Building on the success of word-level alignment, the SDK also supports temporal alignment of modalities at multiple resolutions: frames, phonemes, words and videos.

Professional Activities: I am fortunate to co-advise three master's students in multimodal ML, who are building novel multimodal models and spatially invariant face embeddings. We meet twice a week to review implementation details, results and research directions. For the CMU Multimodal Data SDK, I am guiding two students in implementation, debugging and user testing on datasets. I take pride in guiding my students and hope to inspire the next generation of researchers to work in multimodal machine learning. Furthermore, I will be a co-organizer of the workshop "Advancing Artificial Intelligence Understanding of Human Multimodal Language" at ACL 2018. This workshop aims to expose the NLP, CV and speech processing communities to the fusion of their respective subfields for joint multimodal modeling of human communication. MOSEI and other multimodal datasets will be the centerpiece of the workshop challenge. I am truly honored to be a part of this promising and rewarding research direction.
Finally, I was privileged to be a Teaching Assistant (TA) for 10-601, a graduate Introduction to Machine Learning course, and for 15-213, a core Introduction to Computer Systems course for graduate and undergraduate students. I proposed candidate exam questions, which further cemented my knowledge of core concepts, and drew on my strength in explaining concepts clearly to assist students during office hours. These are two of the most popular computer science classes at CMU, and serving as a TA for both is a testament to my passion for teaching.

Broader Impacts: As AI proliferates into our daily lives, it becomes paramount for machines to perceive individual human behaviors and expressions as well as human social interactions and nuances. Multimodal ML systems have been applied in healthcare for the detection of depression, schizophrenia and PTSD; in education for intelligent tutoring systems; and in marketing and advertising. In all these applications, furthering AI's modeling of human interactions would benefit both researchers and end users. For researchers, intuitive visualizations and analyses of these models would improve the understanding of human social communication in medicine, psychology, human-computer interaction and sociology. For end users, more accurate judgments and perceptions by autonomous agents would improve quality of life and enable more natural human-computer interaction.

As a sophomore, intrigued by the potential of ML to forecast epidemics and save lives, I joined the Delphi research group under Profs. Roni Rosenfeld and Ryan Tibshirani. I helped develop a nowcasting framework that integrates multiple data sources to forecast the onset of influenza in US states. I worked with the Google Health Trends API for data collection and modeling using search terms that may be indicative of flu levels. My work will contribute to an accurate real-time nowcasting system at the level of municipalities. Other epidemics, such as dengue fever, chikungunya and the opioid crisis, are also being forecasted using novel data sources. During high-risk seasons, an early warning system would benefit hospitals and pharmacies, which could stock up on staff and supplies, as well as individuals, who could take appropriate flu-prevention measures.

Research Goals: A major step towards modeling human communication would be to learn joint multimodal representations for speech, gestures and language. My graduate research goal is to learn rich unimodal embeddings and perform multimodal fusion to learn joint embeddings that generalize across speakers and topics, in the following four directions: (1) Language: I aim to extend the boundaries of deep learning to fuse distributional, propositional and multimodal semantics. This involves learning word representations that marry distributional and propositional semantics from various data sources. I also aim to enrich language with multimodal semantics by leveraging information from multimodal data. (2) Vision: I aim to extend state-of-the-art face embeddings through a 3D and multimodal affective computing lens. I plan to develop spatially invariant face embeddings and 3D learning of face structure using 3D kernels on 2D input images. I also aim to investigate how human displays of affect change facial gestures, culminating in a joint generative model of faces conditioned on a person's identity, affect display and personality. This is challenging since facial gestures are extremely nuanced.
However, facial gestures are crucial to a multimodal embedding since they play an important role in human communication, particularly when humans prevaricate. (3) Acoustic: I plan to experiment with attention-based sequence-to-sequence models for audio reconstruction using both raw audio and acoustic features. (4) Multimodal Fusion: I will propose novel methods for multimodal spatial and temporal fusion. Multimodal temporal alignment is a further challenge, since humans communicate asynchronously and the joint multimodal embedding must capture the asynchrony across individual sequences.

Future Plans: With my interests in research and teaching, a PhD is my immediate goal. CMU's interdisciplinary departments and dynamic academic atmosphere make it my top choice to continue my research with Prof. Morency on multimodal machine learning. I would also like to work with Prof. Ruslan Salakhutdinov on multimodal deep learning and deep reinforcement learning to learn gated/conditional multimodal fusion methods. Additionally, I am interested in Prof. Alex Hauptmann's research on multimedia analysis and Prof. William Cohen's research on multimodal Bayesian embeddings. With my thirst for academic knowledge and passion for innovation, I strive to become a professor and a leading researcher in the theory and applications of multimodal machine learning. I am confident that I can contribute to the ML community through datasets, software, conference workshops, teaching and advising students. Beyond ML itself, I aspire to make a tangible impact on human communities by synthesizing multimodal signals for early identification of mental health disorders and by improving epidemiological forecasting for public health.