Statement of Purpose Essay - UC Berkeley
Contemporary AI systems have made remarkable progress in learning to recognize patterns and extract computational representations from data across modalities. However, their ability to perform complex reasoning and planning remains primitive, limited to basic tabletop settings and a handful of simulated tasks. Pivotal to this shortcoming is the inability of machine learning algorithms to make inductive leaps from data and learn causal relationships between concepts, something humans do naturally. Motivated by these challenges, my research endeavor is to explore the computational underpinnings of realizing such capabilities in AI systems.

During my undergraduate research, in collaboration with my colleagues, I worked on the problem of inferring a manipulation program, a sequence of actions grounded in the scene, from a natural language instruction. Human cognition adopts a hierarchical approach to this task: it first uses an acquired understanding of visuo-spatial concepts to ground the objects referred to by the instruction in the environment, and then proposes tentative new locations for those objects based on an acquired understanding of the action semantics. Inspired by this, we model scene understanding and manipulation as the execution of a symbolic, hierarchical program extracted from the language instruction, building on the work of Mao et al. [1]. My main contributions were formulating a generalized, hierarchical symbolic-program space for manipulation actions, implementing the hierarchical parser for natural language instructions, and generating data using PyBullet. Our model demonstrates end-to-end learning and strong generalization to novel scene compositions and longer instructions. The full paper [2] is under review at the IEEE International Conference on Robotics and Automation (ICRA) 2023 and appeared at the Workshop on Neuro Causal and Symbolic AI at NeurIPS 2022.

Planning for embodied agents is often complicated by the multiplicity of assignments that satisfy the spatial and kinematic constraints. This motivates my current research, directed toward augmenting the above architecture with an action representation that captures this property. My approach is to learn an action's effect as a probability distribution over the task space, conditioned on the scene constraints. Initial experiments using a conditional GAN to parametrize the action space have given promising results. Garrett et al. [3] propose streams as a black-box interface for incorporating such samplers into a planner; my action representation is amenable to instantiation as a stream, so planning heuristics such as those proposed in [3] can be employed on top of it, enabling deeper reasoning toward more complex goals. I am excited to pursue this approach in the coming months.
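As a rough illustration of this parametrization, a conditional GAN over placement poses could look like the sketch below. This is a minimal, hypothetical rendering rather than my actual system: the network sizes, the (x, y, yaw) pose output, and the sample_poses helper are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoseGenerator(nn.Module):
    """Map (noise, scene encoding) to a candidate placement pose, e.g. (x, y, yaw)."""
    def __init__(self, noise_dim=16, scene_dim=64, pose_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + scene_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, pose_dim),
        )

    def forward(self, z, scene):
        # Condition the generator on the scene encoding, as in a standard cGAN.
        return self.net(torch.cat([z, scene], dim=-1))

class PoseDiscriminator(nn.Module):
    """Score whether a pose is a feasible action effect under the scene constraints."""
    def __init__(self, scene_dim=64, pose_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + scene_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, pose, scene):
        return self.net(torch.cat([pose, scene], dim=-1))

def sample_poses(generator, scene, n=10, noise_dim=16):
    """Draw n candidate placements for one scene encoding of shape (scene_dim,)."""
    z = torch.randn(n, noise_dim)
    return generator(z, scene.unsqueeze(0).expand(n, -1))
```

Repeated sampling from the trained generator enumerates the multiple satisfying assignments, which is exactly what a stream in the sense of [3] needs to expose to the planner.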
Motion plans in real-world settings often fail due to extraneous disturbances. Along with two of my colleagues, I am investigating approaches to make the above model robust to such failure modes of action execution. I proposed learning an action's effect as a transformation of an object-centric scene-graph representation of the environment, parametrized by a graph neural network as in [4]. Execution errors can then be identified by learning to predict differences between the expected and observed environment states from the scene-graph representation, and recovery from a failure state can be achieved by using forward planning to compute an adjusted action sequence that reaches the expected final state from the current one. Initial experiments have shown promising results in this direction.
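To make the idea concrete, here is a hedged sketch of such a transition model and failure detector; the node features, the single round of message passing, and the error tolerance are illustrative assumptions, not the actual model.

```python
import torch
import torch.nn as nn

class SceneGraphDynamics(nn.Module):
    """One round of message passing over an object-centric scene graph,
    predicting each object's post-action state from its current state."""
    def __init__(self, node_dim=8, action_dim=4, hidden=64):
        super().__init__()
        self.msg = nn.Sequential(          # message from object j to object i
            nn.Linear(2 * node_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.update = nn.Sequential(       # node update from aggregated messages
            nn.Linear(node_dim + hidden + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, node_dim),
        )

    def forward(self, nodes, edges, action):
        # nodes: (N, node_dim); edges: (E, 2) long tensor of (src, dst); action: (action_dim,)
        src, dst = edges[:, 0], edges[:, 1]
        messages = self.msg(torch.cat([nodes[src], nodes[dst]], dim=-1))
        agg = torch.zeros(nodes.size(0), messages.size(-1))
        agg.index_add_(0, dst, messages)   # sum incoming messages per node
        act = action.expand(nodes.size(0), -1)
        return nodes + self.update(torch.cat([nodes, agg, act], dim=-1))

def execution_failed(model, nodes, edges, action, observed, tol=0.05):
    """Flag a failure when any object's observed post-action state deviates
    from the prediction by more than a tolerance."""
    with torch.no_grad():
        predicted = model(nodes, edges, action)
    return (predicted - observed).norm(dim=-1).max().item() > tol
```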
In addition to my work in embodied AI, I have also explored how imposing symbolic structure on top of dense representations can help build more generalizable models in other areas of machine learning. In collaboration with Prof. Manik Verma, we are exploring graph-based approaches to learning better representations of the label space for multi-label extreme classification. Conventionally, the approximate nearest-neighbor search graphs used as fast inference engines for extreme classification are constructed on top of pre-trained label embeddings. We are investigating approaches to building nearest-neighbor search graphs conditioned on the query-label distribution, which can provide better hard negatives for training the language encoder.

I would be excited to work with Prof. Anca Dragan, Prof. Pieter Abbeel, and Prof. Sergey Levine during my PhD. My research so far has focused on learning disentangled representations of the state and action spaces and composing them in innovative ways. However, much more remains to be done in building AI systems that demonstrate long-horizon reasoning capabilities in real-world settings. (i) What are the cognitive biases that empower intelligent human behavior and learning, and can they be translated into inductive biases for learning algorithms? (ii) How can we build embodied AI agents that learn novel concepts from the sparse data points gathered through real-world exploration? (iii) Can large pre-trained (foundation) models reason and plan over long horizons for novel tasks? These are a few of the problems that motivate my long-term research goals. Aligned with them, I would love to explore causality, reinforcement learning, and out-of-distribution generalization, alongside task and motion planning and visual reasoning, in my future research.

In a developing country like India, the research ecosystem in AI is still in its infancy. My advisors are a source of inspiration for me: they guided me throughout my work, shared invaluable insights about research in general, and ensured, as far as possible, that resource constraints did not hinder progress. I believe that being part of academia gives an individual the opportunity to foster progress far into the future through interactions with the researchers of tomorrow, and I would love to be part of that community. To undertake these professional and personal endeavors, I seek a Ph.D. from UC Berkeley, which will give me a solid technical background and the chance to learn from pioneers in their fields, all of which will be immensely valuable throughout my career.

References

[1] J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu, "The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences from Natural Supervision," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=rJgMlhRctm

[2] N. Kalithasan, H. G. Singh, V. Bindal, et al., "Learning Neuro-Symbolic Programs for Language-Guided Robotic Manipulation," in NeurIPS 2022 Workshop on Neuro Causal and Symbolic AI (nCSI), 2022. [Online]. Available: http://arxiv.org/abs/2211.06652

[3] C. R. Garrett, T. Lozano-Perez, and L. P. Kaelbling, "PDDLStream: Integrating Symbolic Planners and Blackbox Samplers via Optimistic Adaptive Planning," in International Conference on Automated Planning and Scheduling, 2020.

[4] Y. Zhu, J. Tremblay, S. Birchfield, and Y. Zhu, "Hierarchical Planning for Long-Horizon Manipulation with Geometric and Symbolic Scene Graphs," in IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 6541-6548. doi: 10.1109/ICRA48506.2021.9561548.