
Statement of Purpose Essay - University of Maryland, College Park

Program: Ph.D., CV
Type: Ph.D.
License: CC BY-NC-SA 4.0
Source: Public Success Story (Prateksha Udhayanan)

In the rapidly evolving landscape of computer vision, research in visual content synthesis, specifically image and video generation, excites me the most. While Generative Adversarial Networks (GANs) opened the door to creating new content, diffusion models have recently demonstrated the ability to generate images of greater variety and higher quality. As we advance into an era of AI-driven content creation, the need for controllable and personalizable visual content synthesis is becoming increasingly essential. My research interests lie in building visual content synthesis models that are both controllable and personalizable, with a focus on learning robust features that encapsulate multimodal constraints.

My introduction to the field of visual content synthesis began with a summer internship at Adobe Research, where I worked on synthesizing illustrative videos from procedural documents under the mentorship of Dr. Balaji Vasan Srinivasan and Dr. Stefano Petrangeli. We developed a method to understand the textual content of a procedure, retrieve relevant visuals (images and videos), and stitch them together into a cohesive video. We proposed re-ranking approaches and a Viterbi-inspired optimization framework to optimally select the combination of visuals within a frame and across the frames of the video. This work was published at WACV 2023 [1] and filed as a patent with the USPTO.

My interactions with senior researchers during the internship significantly shaped my understanding of and perspective on research. I was particularly struck by their ability to comprehend new projects, even those outside their areas of expertise, and offer valuable insights and suggestions. These interactions broadened my outlook on research and inspired me to pursue a Ph.D.

To gain more research experience and explore different areas of computer vision, I joined Adobe Research as a full-time Research Associate after graduating. Building on my internship experience with image and video retrieval and re-ranking methodologies, I worked on developing deep-learning retrieval models capable of learning constraints during training, eliminating the need for post-retrieval re-ranking. This led me to the task of composed image retrieval: performing image retrieval conditioned on multimodal inputs (text and image). Existing techniques for this problem relied on global features for retrieval, resulting in incorrect localization of regions of interest. Since the text usually describes local changes in an image, we proposed a multimodal gradient attention mechanism [2] to help the model learn local features and enable better localization and retrieval. This methodology has been filed as a patent with the USPTO. We observed a similar issue in vision-language models that use trainable prompts to improve generalizability: a lack of flow of contextual local information from the input images to the prompt vectors. To address this, we proposed Contextual Prompt Learning (CoPL) [3], which learns prompt weights dynamically and aligns the resulting prompt vectors with local image features. This work has been accepted at AAAI 2024 and filed as a patent with the USPTO. These experiences deepened my understanding of text-image alignment, enabling me to explore this concept in vision tasks beyond retrieval.
While retrieval is effective for creating content aligned with user intent, its effectiveness depends on the richness of the underlying corpus and the learned representations. A good creation workflow should iterate between generation and retrieval to overcome this shortcoming, which led to my subsequent explorations of image editing with multi-granular control. Building on my work with retrieval and Stable Diffusion models, we developed a training-free approach that manipulates and iterates in the latent space of pre-trained diffusion models to reduce noise accumulation and control the spatial extent of edits. This work has been accepted at WACV 2024 [4].

Expanding on this, I recognized the potential of applying constrained, multi-granular image editing to generating and editing graphic designs. Generating graphic designs is a complex process that requires simultaneous optimization across three dimensions: content, layout, and color. I am currently working on instruction-based graphic design editing, where we leverage large language models to understand the instruction and diffusion models to generate new design elements. We use FlexDM, a transformer-based model, to predict optimal positions and colors for the new elements, and we train the model with additional loss functions that enforce layout constraints to improve the layout of the updated designs.

My experiences and explorations in image generation sparked my interest in another realm of visual content synthesis: video generation. When I joined Adobe in 2022, a newly released feature called Moving Elements, which creates cinemagraphs of fluid elements from static images, caught my eye. Building on my learnings from image generation and driven by my innate interest in this project, I am currently collaborating with Dr. Kuldeep Kulkarni to animate other types of motion beyond fluids. After familiarizing myself with the current model, I found that most works focus on animating fluid elements by assuming a constant flow model. However, this approach falls short for motions involving larger displacements, such as pendulums and swings. To tackle this, I am working on methodologies to formulate a motion model for periodic motions with large displacements. This project has solidified my continued interest in tackling exciting challenges in visual content synthesis.

Over the last year and a half, my collaboration with researchers at Adobe Research has been both enriching and inspiring. I have greatly admired their ability to grasp new concepts, adapt to emerging technologies, and innovate across domains. While I have gained valuable exposure and developed innovative thinking, I believe a Ph.D. program is the next step for me to grow as an independent researcher. A focused research environment and formal training will equip me with the skills needed to confidently lead my own research initiatives and tackle challenges in the continuously evolving field of computer vision. My research experiences have exposed me to the advances in image and video synthesis models. What motivates me to pursue a Ph.D. in this area are the open challenges in building robust generative models that are controllable and personalizable, including learning good representations that encapsulate multimodal constraints and building efficient models with low inference time, among many other aspects.
My experience conducting research in both academic and industrial settings, complemented by the diverse insights I have derived from projects spanning publications, patents, and product integrations, has deepened my passion for research and instilled in me the confidence to pursue graduate studies. I am confident that I would be a great fit for the vibrant research community at the University of Maryland, College Park. Throughout my Ph.D., I hope to acquire deep theoretical knowledge and sharpen my skills in conducting research. I have a keen interest in working with Prof. XX, Prof. YY, and Prof. ZZ. Prof. XX’s work on AA is intriguing; it would be interesting to investigate whether we can harness the discriminative capabilities of diffusion models to generate images that are better grounded in complex text prompts. Another work of his that I enjoyed reading was on BB, which offers several ideas for studying the disentanglement capabilities of diffusion models. Prof. YY’s work on CC aligns closely with my prior experience and my future research goals; I am particularly interested in exploring controllable image generation, with a focus on understanding the latent space and studying the disentanglement capabilities of diffusion models. I am also interested in Prof. ZZ’s work on DD; it would be interesting to see whether we can leverage multi-view video datasets to better guide the model during training.

I am certain that my solid conceptual background, complemented by innovative ideas, will help me become a successful doctoral student. Pursuing a Ph.D. at the University of Maryland, College Park will provide me with the experience and expertise to achieve my long-term career goals of leading and managing my own research endeavors and contributing to the academic research community. I look forward to the opportunities, challenges, and discoveries that await me as a graduate student.

References

[1] Prateksha Udhayanan, Suryateja Bv, Parth Laturia, Dev Chauhan, Darshan Khandelwal, Stefano Petrangeli, and Balaji Vasan Srinivasan. Recipe2Video: Synthesizing personalized videos from recipe texts. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2268–2277, 2023.
[2] Prateksha Udhayanan, Srikrishna Karanam, and Balaji Vasan Srinivasan. Learning with multi-modal gradient attention for explainable composed image retrieval. 2023.
[3] Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, Balaji Vasan Srinivasan, et al. CoPL: Contextual prompt learning for vision-language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024.
[4] KJ Joseph, Prateksha Udhayanan, Tripti Shukla, Aishwarya Agarwal, Srikrishna Karanam, Koustava Goswami, and Balaji Vasan Srinivasan. Iterative multi-granular image editing using diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024.