Statement of Purpose Essay - New York University
I am interested in developing practical methods that make machine learning (ML) systems robust, especially to naturally occurring distribution shifts in the environment, so that they can be deployed reliably in the real world at large scale. I have contributed broadly towards this goal, first during my undergraduate studies at IIT Delhi and later during a productive tenure at Microsoft Research (MSR) as a Research Fellow. In that time, I have published three papers at conferences such as CVPR, ECML, and MLSys, and submitted one to ICLR, gaining invaluable skills in conducting research. Below, I describe how a few of my past research experiences, involving empirical approaches to learning robust classifiers and large-scale ML, shaped my interests over time and what I learned from them.

Robust training with missing labels. In medical and web domains, the observed training data distribution often does not match the distribution we want to learn about. This manifests as mislabeled data points, or more specifically as systematically missing labels, owing to inherent biases in the policy or user clicks from which the dataset is curated; for instance, a policy or user tends to predict or click only popular labels. In tasks such as eXtreme multi-label Classification (XC), where only a few labels from an enormous label space (reaching orders of millions) are relevant per data point, obtaining all the relevant labels for a data point is infeasible, so XC datasets suffer the same fate [1]. Below, I describe two approaches to learning robust extreme classifiers that I worked on at MSR.

1. Using inductive biases. These datasets usually have a long-tailed distribution, where the majority of labels have few training points, a problem further exacerbated by missing labels. This matters greatly in applications such as recommendation, where these tail or rare labels provide niche and informative results for users.
Extreme classifiers are prone to overfitting on these rare labels, leading to poor generalization over the majority of the label space. My objective, advised by Dr. Manik Varma, was to address this performance gap on rare labels. I observed that Siamese-style encoders, in contrast, perform well on rare labels: by aligning semantically similar items, they share information among labels, so that the representations of “car” and “vehicle”, for example, end up close together. However, their overall performance is poor due to underfitting on data-rich popular (head) labels. Looking for the best of both worlds, I noticed that past works had tried ensembles of Siamese encoders and extreme classifiers, but these suffered from misaligned prediction scores. I sidestepped this issue by combining the two during training itself: I came up with a simple distillation framework that supervises extreme classifiers using the Siamese encoder’s relevance probability scores between labels and data points. I empirically demonstrated large improvements (up to 5% absolute) in rare-label accuracy across benchmark datasets [3], without compromising the already excellent head-label accuracy. Classifiers trained with this framework were deployed on Microsoft’s Bing search engine for sponsored search, reinforcing its practical applicability at large scale. The work is under review at ICLR 2024 [4], where it has garnered positive feedback (5, 6, 6, 8 out of 10 points). Through this project, I learned how good inductive biases (keeping semantically similar items close), even in the form of supervision from lower-accuracy models such as Siamese encoders, can help generalization.

2. Using data-centric approaches. Unfortunately, the above approach is limited, since there are abundant relevant missing pairs that have no textual or semantic similarity; for instance, the data point “ferrari” is relevant to the categories “expensive cars” and “luxury vehicles”.
In fact, extreme classifiers succeed precisely because they are not limited to this semantic notion of relevance, thanks to independent parameters for every label [5]. Past works used heuristic approaches to mine pseudo-labels representative of missing labels from auxiliary sources of information (e.g., hyperlink graphs), but the mined labels are not task-specific. My ongoing research project, advised by Dr. Amit Sharma and Dr. Manik Varma, aims to solve this problem at its root. Initially, I used Large Language Models (LLMs) to generate relevant pseudo-labels for data points, but they fail to scale to even a few thousand points. Smaller, more efficient LMs, on the other hand, lack the parametric knowledge to generate diverse pseudo-labels; for instance, a data point like “m118dw printer” has the relevant category “hewlett packard”, which smaller LMs may simply not know [7]. However, I observed that data points usually come with abundant associated information in uncurated form: search queries have associated webpage titles (and webpage content), product names have product descriptions, and so on. I therefore proposed using a smaller LM, which can scale to millions of data points, to generate task-specific pseudo-labels for each data point from this uncurated information; the smaller LM is aligned and supervised for the specific task by an LLM such as GPT-4. The resulting unsupervised but task-specific pseudo-labels can then be used to supervise efficient models. This simple method scaled efficiently to 5 million data points and generated new relevant pairs, doubling the available training data for a Wiki-category dataset [3]. While the work is still in progress, this method can potentially learn robust representations that generalize to Out-Of-Distribution (OOD) labels, owing to the diversity of the generated pairs. Most importantly, I learned that simple data-centric methods can go a long way in learning good representations.
Reliable ML. During my undergrad at IIT Delhi, I worked with Prof. Chetan Arora on the problem of calibration in deep learning. Neural networks are calibrated when their predicted probability of an event aligns with the actual probability of that event occurring. This is extremely important in safety-critical applications such as medical diagnosis or self-driving cars, where uncertainty estimates (probabilistic predictions) allow ML systems to establish trust and to defer to humans when they are unsure. Training classifiers with hard one-hot labels leads to overfitting and miscalibration [8, 9]; conversely, training on soft, full label distributions that reflect human uncertainty yields robust and reliable classifiers [10]. Since humans cannot provide a full label distribution for every image, I realized that pretrained models could be one way to supply label distributions for training classifiers. Using this insight, I used a pretrained but calibrated classifier to provide accurate uncertainty estimates for an image, in the form of its probabilistic predictions, to supervise another classifier. This simple approach distills students that are not only more accurate but also calibrated, outperforming or matching current baselines. I show empirically that calibrated classifiers can be distilled only from calibrated teachers. Interestingly, calibrated classifiers can be distilled even from smaller teachers, as long as those teachers are calibrated. With the help of a fellow undergrad, I conducted exhaustive experiments across datasets and demonstrated superior calibration; we submitted this work to CVPR 2024. It followed naturally from my previous project, also with Prof. Chetan Arora, which was accepted to CVPR 2022 [11] as an oral presentation (top 4.2% of papers) and in which we devised a simple loss that acts as an engineered label distribution.
This reinforced the importance of high-quality labels, and showed that supervision from other ML models can be an effective way to approximate them.

Large-scale ML. Being in an industrial research lab offered me the unique opportunity to work with architectures and datasets at scales where even trivial operations mean hours (or days) of compute. I quickly adapted to these magnitudes and learned to write efficient code and algorithms that could scale, picking up valuable skills in distributed and parallel computing, especially on GPUs. End-to-end training of large-scale XC models, that is, training the encoder and the massive one-vs-all linear layer together, was previously thought infeasible in the literature [2, 12]. Advised by Dr. Ramachandran Ramjee and Dr. Manik Varma, I contributed to a research project that made this end-to-end training practical and efficient, reducing training time by 15x. I helped implement the project and proposed a simple augmentation fix to learn better classifiers, and I conducted exhaustive experiments demonstrating that vanilla end-to-end training outperforms complicated modular approaches [2, 5] by up to 5 absolute points. This second-author work was received warmly at MLSys 2023 [13]. I also collaborated with data scientists across different product groups within Microsoft to actively deploy my research, learning valuable engineering skills and good coding practices along the way. I applied the distillation framework described previously [4] to a real-world, large-scale sponsored-search dataset collected from Bing click logs, containing close to 80 million labels. This deployment led to a 25% increase in offline metrics and significant increases in click-through rates and user satisfaction online. It made me realize that simple ideas can work better than complicated ones on large-scale datasets.

Future directions.
My past experiences have informed my research interests and made me realize the importance of developing practical methods. During graduate studies, I would therefore like to use both theoretical and empirical approaches to develop practical methods for handling distribution changes in the wild, especially at large scale. One direction where my previous research experience can help is exploring the role of data in robustness. A question directly relevant to my current work: can we develop algorithms that learn task-specific representations that generalize to hard-to-characterize natural shifts, leveraging abundant uncurated or unlabeled data? Another direction is the role of foundation models in robustness: how can we effectively exploit their general-purpose representations, learned from naturally occurring diverse data, to train task-specific models that generalize when deployed in an ever-changing real world? On a slight tangent, I would also be interested in empirically understanding fundamental properties of ML models and using these learnings to develop better methods for handling distributional changes. Finally, I would like to complement my research with extensive theoretical coursework during graduate studies to acquire the tools for theoretical work.

At NYU. I am interested in working with Prof. Pavel Izmailov, since his interest in developing practical methods to make ML models robust aligns with mine. Prof. Izmailov’s recent work [14] on training robust models in the presence of spurious correlations, using simple data-centric finetuning of the last layer, falls close to one of the future directions I wish to pursue. His past work [15] on empirically understanding properties of neural networks, and on building practical methods from these learnings [16, 17] (a calibration method), is something I am excited to explore further. I would also like to work with Prof. He He.
Her interests in building ML models that are robust to biases in the training data and aware of “what they do not know” align with my current research project and my past work on calibration and OOD detection [11, 18]. Further, I would be interested in collaborating with Prof. Saining Xie. His recent work [19] on curating task-specific data during training falls close to my current research project, as well as to the future directions I want to pursue to learn useful and robust representations from abundant internet-scale data.

References

[1] Erik Schultheis, Marek Wydmuch, Rohit Babbar, and Krzysztof Dembczynski. On missing labels, long-tails and propensities in extreme multi-label classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1547–1557, 2022.

[2] K. Dahiya, N. Gupta, D. Saini, A. Soni, Y. Wang, K. Dave, J. Jiao, K. Gururaj, P. Dey, A. Singh, D. Hada, V. Jain, B. Paliwal, A. Mittal, S. Mehta, R. Ramjee, S. Agarwal, P. Kar, and M. Varma. NGAME: Negative mining-aware mini-batching for extreme classification. In WSDM, March 2023.

[3] Kush Bhatia, Kunal Dahiya, Himanshu Jain, Anshul Mittal, Yashoteja Prabhu, and Manik Varma. The extreme classification repository: Multi-label datasets and code. URL http://manikvarma.org/downloads/XC/XMLRepository.html, 2016.

[4] Anonymous. Enhancing tail performance in extreme classifiers by label variance reduction. Submitted to the Twelfth International Conference on Learning Representations, 2023. Under review.

[5] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, and M. Varma. DeepXML: A deep extreme multi-label learning framework applied to short text documents. In Proceedings of the ACM International Conference on Web Search and Data Mining, March 2021.

[6] A. Mittal, N. Sachdeva, S. Agrawal, S. Agarwal, P. Kar, and M. Varma. ECLARE: Extreme classification with label graph correlations. In Proceedings of the ACM International World Wide Web Conference, April 2021.

[7] Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, and Xin Luna Dong. Head-to-tail: How knowledgeable are large language models (LLMs)? A.k.a. will LLMs replace knowledge graphs? arXiv preprint arXiv:2308.10168, 2023.

[8] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.

[9] Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip Torr, and Puneet Dokania. Calibrating deep neural networks using focal loss. Advances in Neural Information Processing Systems, 33:15288–15299, 2020.

[10] Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9617–9626, 2019.

[11] Ramya Hebbalaguppe, Jatin Prakash, Neelabh Madan, and Chetan Arora. A stitch in time saves nine: A train-time regularizing loss for improved neural network calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16081–16090, 2022.

[12] Jiong Zhang, Wei-Cheng Chang, Hsiang-Fu Yu, and Inderjit Dhillon. Fast multi-resolution transformer fine-tuning for extreme multi-label text classification. Advances in Neural Information Processing Systems, 34:7267–7280, 2021.

[13] Vidit Jain, Jatin Prakash, Deepak Saini, Jian Jiao, Ramachandran Ramjee, and Manik Varma. Renee: End-to-end training of extreme classification models. Proceedings of Machine Learning and Systems, 5, 2023.

[14] Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. arXiv preprint arXiv:2204.02937, 2022.

[15] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P. Vetrov, and Andrew G. Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. Advances in Neural Information Processing Systems, 31, 2018.

[16] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.

[17] Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems, 32, 2019.

[18] Ramya Hebbalaguppe, Soumya Suvra Ghosal, Jatin Prakash, Harshad Khadilkar, and Chetan Arora. A novel data augmentation technique for out-of-distribution sample detection using compounded corruptions. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 529–545. Springer, 2022.

[19] Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. CiT: Curation in training for effective vision-language data. arXiv preprint arXiv:2301.02241, 2023.