Statement of Purpose Essay - University of Illinois Urbana-Champaign

Program: PhD, Systems
Type: PhD
License: CC BY-NC-SA 4.0
Source: Public Success Story (Sarthak Chakraborty)

The recent surge in demand for computing has led to rapid growth in dependence on large cloud-based systems. However, inefficient use of these complex systems can lead to high operational costs, costly disruptions, and hold-ups. My research interests lie in optimizing such systems using advances in machine learning, making them more robust, and developing components that guard against disruptions. Primarily, I want to conduct research toward developing autonomous and intelligent distributed computing systems capable of recovering from degradation. I aspire to realize a fully operational autonomous self-healing system. Through a Ph.D. program in Computer Science at the University of Illinois Urbana-Champaign (UIUC), I aim to take my first steps toward these goals.

Designing such autonomous systems involves multiple subtasks, ranging from intelligent ingestion and scheduling in a distributed cluster to workload-specific optimizations, building efficient storage systems, analyzing and recovering from faults, and maintaining end-to-end reliability. In college, I worked on distributed system design and on system optimizations for federated learning and streaming systems. Now, as a part of Adobe Research, I work on ensuring the reliability of cloud microservices by analyzing system data, focusing on automatically localizing faults and predicting outages in large-scale microservice deployments.

Explaining the root cause of a fault is challenging because of the dynamic inter-module correlation. A growing body of work aims to detect causal relationships among the performance metrics of the entire microservice system to recover the architecture of the underlying system, and then uses graph-based algorithms to locate the root cause.
However, prior approaches for discovering these causal dependencies between performance metrics either overlook the deployment strategy of microservices, which spawns multiple instances/pods for reliability and load balancing, or average the metric data monitored for each such instance, resulting in a loss of information. We address this gap through "CausIL" [1], which utilizes (i) metric variations for each instance deployed per service, treating these instances as conditionally independent of one another, and (ii) domain knowledge generic to any microservice system to identify the causal relationships. An added advantage is its ability to cope with the distinct and transient nature of instances, usually observed in a microservice deployment due to autoscaler configuration. These two properties allow CausIL to improve over prior work in identifying the causal structure of the system.

However, even though CausIL outperforms prior work, limitations may arise when domain knowledge is unavailable. Moreover, regular causal structure detection algorithms scale poorly and often rely on parametric assumptions. Thus, in collaboration with Prof. Saurabh Bagchi and Prof. Murat Kocaoglu from Purdue University, we proposed a hierarchical and localized causal algorithm to detect the root causes of a fault in the absence of domain knowledge, which was accepted at NeurIPS 2022 [2]. The key idea was to treat a fault as an intervention in the system, thereby leveraging causal techniques to find the interventional targets, and hence the root causes, by using an additional proxy node in the causal graph. The solution is scalable since predicting the interventional targets does not require causal identification of the entire system architecture. These research experiences have given me insight into the problems of a deployed system that warrant attention.
I am currently working on an end-to-end system that utilizes the alerts defined on these metrics, along with incident reports containing rich information on the root causes of previously seen outages, to propose potential root causes during a fault.

In addition to the above research experience, I was fortunate enough to deploy some of my ideas and prototypes in a practical setting at Adobe Research, which is often required in systems research. Significant effort is needed to handle the difficulties that arise in engineering domain-specific features, distributional shifts in data, misleading data, and continual learning. In this process, I worked on (i) explaining potential outages for a production service from service alerts and (ii) predicting time-to-live for streaming jobs without knowledge of the system configuration running them. First-hand experience in successfully integrating two research projects into the engineering stack has informed my research thinking when working with practical systems.

My research interest in improving system performance predates Adobe Research. My undergraduate thesis at the Indian Institute of Technology (IIT) Kharagpur, advised by Prof. Sandip Chakraborty, focused on building an adaptive 360-degree video streaming system to reduce the bandwidth consumed by streaming. We observed that streaming the entire frame of a 360-degree video wastes bandwidth, since the user's gaze (viewport) is confined to specific regions. This observation led us to design "PARIMA" [3], a novel low-latency intelligent streaming system that predicts viewports in advance via online regression on the viewer's localized region of interest and video characteristics. In addition, we allocated higher bandwidth only to the predicted viewport regions, without compromising the user's quality of experience (QoE).
The work led to my first publication at WWW 2021, which boosted my confidence and encouraged me to pursue further research on designing intelligent systems that solve real problems.

My master's thesis project, advised by Prof. Sandip Chakraborty, on using smart contracts to enable Federated Learning (FL) and applying it to ensure data privacy and accountability, further cemented my interest in system design. We developed a system optimized for ML model training across silos with federated networks by employing blockchain interoperability. It was motivated by the idea that cross-silo federated networks can enable enterprises to train a common superior model without sharing data. Our solution, published as a full paper at ICBC 2022 [4], proposed a relay mechanism native to individual federated networks that mediates the process of model exchange after appropriate authentication. In the process, we built a scalable and flexible federated system that allows heterogeneous clients, with Kafka as the communication network between the server and the clients. The experience of designing an end-to-end system proved crucial in building my interest in dedicated research on optimized system design.

These research involvements, paired with a practical perspective on problems, have given me the confidence to pursue graduate studies to meet my research goals. Such experiences, I believe, uniquely qualify me to tackle complex system design research problems in order to realize a self-healing, autonomous, reliable, and intelligent system.

While at UIUC, I plan to make large distributed systems more efficient and robust under dynamic workloads by incorporating learnable components to drive their performance and intelligence. In light of this, I would particularly like to work with Prof. Aishwarya Ganesan, Prof. Indranil Gupta, Prof. Ramnatthan Alagappan, and Prof. Yongjoo Park. Prof. Ganesan and Prof. Alagappan's interest in making storage systems more reliable and performant is critical to building an efficient distributed system and resonates with my research objectives. I was drawn to their work on learned indexes for storage systems in "Bourbon", a technique that can be adapted to make reads and writes for range queries more efficient. Distributed data centers often require tiered disaggregated storage systems; hence, the work in "Skyros" is relevant and can extend to geo-distributed systems with adaptive I/O loads. Similarly, Prof. Gupta and his group's work on building efficient cloud computing and distributed systems aligns with my aspirations, and his work on resource utilization and performance isolation for streaming systems and graph processing systems piqued my interest. In addition, query time prediction and scheduling hot over cold clusters for dynamic graph queries is a possible research avenue that is crucial to maintaining SLAs. I also believe I can contribute to Prof. Park's vision of building an adaptive and intelligent data-driven computing system; the AirDB project, which targets efficient storage for cloud systems, seems particularly enticing. Additional capabilities that make cloud systems more reliable on failure can be incorporated.

The graduate program at UIUC will provide me with ample exposure to the multi-faceted requirements of designing a reliable and intelligent system, which will help me meet my research objectives. A focused research environment, complemented by the coursework offered by the Ph.D. program, will be critical for me to become an independent researcher, generate well-balanced ideas that combine abstraction and practicality, and build a potentially beneficial system.

In conclusion, I believe I bring diverse research experiences, industry-honed practical thinking, and, most importantly, an insatiable desire and motivation to learn. I look forward to the next stage of my life as a graduate student.

References:
[1] Sarthak Chakraborty, Shaddy Garg, Shiv Kumar Saini, Shubham Agarwal, Ayush Chauhan. "CausIL: Causal Graph for Instance Level Microservice Data". Under review at The Web Conference 2023.
[2] Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Kumar Saini, Saurabh Bagchi, Murat Kocaoglu. "Root Cause Analysis of Failures in Microservices through Causal Discovery". In Advances in Neural Information Processing Systems (NeurIPS ’22), November 2022.
[3] Lovish Chopra*, Sarthak Chakraborty*, Abhijit Mondal, Sandip Chakraborty. "PARIMA: Viewport Adaptive 360-degree Video Streaming". In Proceedings of the Web Conference 2021 (WWW ’21), May 2021.
[4] Sarthak Chakraborty, Sandip Chakraborty. "Proof of Federated Training: Accountable Cross-Network Model Training and Inference". In 2022 IEEE International Conference on Blockchain and Cryptocurrency (ICBC ’22), May 2022.