
Statement of Purpose Essay - Carnegie Mellon University

Program: Master of Science in Computer Vision
Type: POST_GRAD
License: UNKNOWN
Source: Public success story (Vaishnavi Khindkar)

I still remember the moment I discovered an alternate proof for a trigonometric theorem in school. It wasn’t part of the curriculum, nor something my teachers had shown me; I had pieced it together myself. The realization that I could construct something entirely new from first principles was exhilarating. This childhood memory planted in me a fascination with reasoning, patterns, and the elegance of solving problems from the ground up. Years later, I felt that same spark when I read Dr. Fei-Fei Li’s Wired article, “Mind Reading: Brain Scans Reveal Secrets of Human Vision,” during my undergraduate studies. I found myself asking: could machines ever replicate the contextual reasoning and adaptability of humans? This question has shaped my research journey, inspiring me to explore the intersection of perception, reasoning, and scalability in computer vision. From designing intent-prediction frameworks for autonomous systems to exploring domain-adaptive learning, I have pursued a vision of building human-centered AI systems: systems that are interpretable, adaptable, and capable of solving real-world problems. At Carnegie Mellon University (CMU), I aim to take this vision further, leveraging CMU’s unparalleled faculty and research ecosystem to develop AI systems that reason, adapt, and collaborate seamlessly in unstructured environments.

My journey into computer vision began at India’s leading computer vision lab, CVIT at IIIT Hyderabad. Under the guidance of Dr. C V Jawahar, Dr. Vineeth Balasubramanian, and Dr. Chetan Arora, I tackled a critical challenge in autonomous driving: understanding why pedestrians intend to cross (or not cross) a road. Existing datasets like Stanford-TRI or PIE offered binary labels (“cross” or “no-cross”) but failed to capture the nuanced reasoning that influences pedestrian decisions. To address this gap, I designed PIE++, a dataset annotating 5,000 pedestrian interactions with multi-label semantic reasons such as “Approaching vehicle speed is high” or “Pedestrian signals vehicle to stop with a hand gesture.” Developing PIE++ required collaboration with experts in traffic safety and computer vision, which helped me refine annotation schemas that were contextually grounded and scalable. User surveys confirmed the utility of the dataset, with 98% of participants expressing trust in the reasoning annotations. This reinforced my belief that creating high-quality, context-driven datasets is foundational for advancing human-centered AI.

Building on PIE++, I realized that understanding intent requires integrating diverse modalities. This motivated the development of MindRead, a multi-modal, multi-task representation learning framework that advances human-intent prediction and reasoning by integrating visual, temporal, and semantic cues in dynamic settings. By leveraging transformers and graph neural networks, MindRead outperformed state-of-the-art methods, achieving a 5.6% improvement in accuracy and a 7% boost in F1-score. This work, which resulted in a first-author paper at IROS 2024, deepened my passion for creating systems that not only predict but also explain human intent in dynamic environments.

Earlier at CVIT, I worked on adaptive object detection to enable AI systems to generalize across domains. I developed a self-attention-based feature alignment framework that reduced domain misalignment errors by 26.4% and improved mAP by 4.3% on a synthetic-to-real adaptation task, earning a US patent and a first-author paper at WACV 2022.
These projects solidified my ability to design scalable, domain-robust systems capable of learning from real-world variability. Beyond my core research, I have independently sought opportunities to apply AI to socially relevant real-world challenges. Before joining CVIT, I led the Marine Debris Detection project, where I created a dataset of over 5,000 underwater images addressing variations in lighting, turbidity, and debris types. Using CycleGAN for data augmentation and an attention-based architecture, I improved debris detection performance by 11% mAP. This work and dataset, which I open-sourced for community benefit, led to an Indian patent.

After my foundational research at CVIT, I joined Product Labs to apply AI in socially impactful ways, optimizing systems that improve language accessibility for underserved communities. There, I contributed to the Indian Government’s Bhashini initiative, optimizing an OCR model for 22 Indian languages. By leveraging Triton optimization, I achieved a 74% reduction in inference latency, enabling efficient, real-time processing of diverse inputs. This experience highlighted the computational bottlenecks in scaling AI systems, reinforcing my belief that scalability is essential for building human-centered robotics capable of thriving in dynamic, resource-constrained environments.

At CMU, I aim to advance my work on multi-modal learning, context-driven datasets, and adaptive AI systems. The university’s faculty and interdisciplinary focus provide the ideal environment to pursue this vision:

(a) Advancing Context-Aware Autonomous Systems. Prof. Deva Ramanan’s work at the Argo AI Center addresses critical challenges in autonomous systems, particularly through efforts like the Argoverse 2.0 dataset, which tackles scalability and diversity. Building on my experience designing PIE++, I hope to collaborate with him to explore how reasoning-based annotations can enrich datasets like Argoverse, enabling autonomous vehicles to model nuanced human intent. Prof. Ramanan’s recent work on motion forecasting through social navigation inspires me to investigate how contextual affordances can further enhance prediction accuracy in dynamic environments.

(b) Affordance Perception and Task Generalization. Prof. Martial Hebert’s research at the Robotics Institute aligns closely with my long-term vision of creating systems that understand affordances in unstructured environments. His exploration of task-driven perception and affordance reasoning is foundational for enabling robots to anticipate and adapt to task-specific constraints. Building on my work in adaptive object detection and MindRead, I aim to investigate how multi-modal inputs can extend affordance reasoning to collaborative tasks, such as enabling robots to generalize across dynamic and unfamiliar environments.

(c) Real-Time Spatio-Temporal Understanding. Prof. Kris Kitani’s work on action anticipation and spatio-temporal reasoning aligns directly with my research on MindRead, which integrates temporal cues for intent prediction. His exploration of sequential modeling for human behavior prediction in video sequences opens intriguing possibilities for integrating uncertainty quantification into action forecasting models. I hope to collaborate with him to enhance the robustness of such systems, particularly for real-time applications.

(d) Embodied AI for Complex Reasoning. Prof. Deepak Pathak’s work on Open X-Embodiment unifies diverse datasets for robotic learning, addressing the challenge of generalization in embodied AI. His focus on intrinsic motivation to drive robotic learning resonates deeply with my goal of creating scalable, adaptive systems. I am particularly interested in exploring how multi-modal reasoning frameworks can improve task generalization in open-world settings, enabling robots to reason about complex environments and autonomously perform diverse tasks.

While my primary focus lies in advancing multi-modal learning, intent reasoning, and human-centered AI, I have also been exploring quantum-inspired approaches as a long-term direction for embodied AI systems. During my time at Product Labs, I observed the computational limitations of classical architectures. This experience led me to explore how quantum principles like superposition and entanglement could enable robots to efficiently integrate diverse inputs into unified representations. For instance, superposition could encode multiple contextual possibilities simultaneously, enhancing adaptability in uncertain environments, while entanglement-inspired models might better capture interdependencies between sensory modalities. Though still conceptual, this exploration reflects my commitment to pioneering unconventional methods that could redefine the boundaries of multi-modal robotics.

Looking back, I am reminded of a childhood memory that quietly fueled my fascination with intelligent systems. Growing up in a small town in India, I often accompanied my mother to help underprivileged students with their studies. I was struck by how some struggled to keep pace, not because of a lack of potential, but because they lacked tools that adapted to their needs. This experience shaped my enduring belief that technology should empower individuals, not alienate them.

At Carnegie Mellon University’s Computer Vision Group and Robotics Institute, I aim to build scalable, interpretable systems that reason and collaborate with humans in real-world environments. With access to CMU’s exceptional faculty, interdisciplinary culture, and unparalleled resources, I am confident that my work will bridge cutting-edge innovation and societal impact. By advancing human-centered AI, I hope to build technology that adapts seamlessly to human complexity and, in doing so, makes a meaningful difference in people’s lives.