Statement of Purpose Essay - University of Michigan
My desire to build software systems was inspired by my entrepreneurship experience, which showed me how new-generation systems could enable mass adoption of new technologies. It has been further fueled by my work at UIUC SysNet and MSR-Asia, where I explored reliability issues in large-scale systems, such as fault tolerance of distributed systems and correctness issues in cloud management programs. My research has led to 30+ bug fixes in production systems and has fueled further research [4]. Through graduate studies at the University of Michigan, Ann Arbor, I hope to become a researcher who builds scalable and reliable software systems that can handle increasingly complex and intensive data and processing needs.

Fail-slow fault tolerance of distributed databases

My first research project focused on the critical but largely overlooked problem [1, 2, 3] of fail-slow fault tolerance in distributed systems. Fail-slow faults, such as slow-running hardware, can lead to system-wide performance degradation but are difficult to reproduce and study in a controlled and consistent manner. Guided remotely by Prof. Tianyin Xu and Prof. Shuai Mu, I co-developed Slooo, a structured, reusable fault-injection framework that automates the evaluation of system behaviors under fail-slow faults. Slooo has been used to study fail-slow faults [4], incorporated into a graduate-level assignment at UIUC, and recently submitted to a top SE conference. This project gave me the technical skills and insights to pursue my interest in computer science as a non-CS student with no prior CS background.

Connecting research to real-world applications

Inspired by the need for fault-tolerant computing services, I developed the idea of "City Brain as Distributed HPC Containers" and led a team of four students to place in the top 3 out of 142 in a top business contest for college students in China.
Our idea was to build city-wide networks of wirelessly interconnected HPC containers to meet the need for both low latency and high performance in emerging applications, such as autonomous driving, smart cities, and cloud gaming. The maintenance cost of such container networks can be reduced by fault-tolerant software design. Our project was featured in ZJU-UIUC Institute news and considered by Ningxia Province’s government for its smart city plans. Seeing the potential impact of computer systems research, namely how improvements in systems enable new applications and make existing applications more reliable, keeps me excited and motivated to continue in this field.

Ensuring the correctness of cluster management programs in declarative systems

During the development of Slooo, I noticed that writing tests was a time-consuming and error-prone process, because the output of Slooo is reflected in the state of external programs. This inspired my second research project. Under the supervision of Prof. Tianyin Xu and Prof. Owolabi Legunsen, we created an automatic testing tool for operators, the cloud management programs that run on Kubernetes. Operators automate the full-lifecycle management of specific applications, and their correctness is critical due to their increasing popularity in managing large application clusters. Our tool automates the testing process with automatic input exploration and oracle generation, tackling two challenges: the combinatorial explosion of exhaustive input exploration and the difficulty of codifying application-specific oracles. Our approach has identified 50+ previously unknown bugs in 10 popular operators. I designed the input exploration strategies that test an operator’s error-handling logic, added further checkers for concrete bug descriptions, and co-implemented the core architecture, which supports parallel testing and a custom Kubernetes runtime. Besides technical contributions, I mentored remote students and co-presented the project at an undergraduate event.
For future work, we plan to understand the semantic correctness of the generated input by connecting the custom interface of operators with Kubernetes-native resources, enabling more efficient and strategic input generation.

Exploration

I believe that being a systems researcher requires strong system-building skills, a solid understanding of computer science fundamentals, and deep, holistic insights into academia and industry. As a junior researcher, I am still working towards this level of expertise. To gain a more comprehensive view of systems research and the various techniques involved in solving system problems, I recently joined Microsoft Research Asia as a cloud architecture research intern. Working with Dr. Shilin He, I am currently focused on identifying and sharing configuration-induced problems in workloads running on increasingly heterogeneous cloud environments, such as Azure Kubernetes Service and Azure Machine Learning. I have identified a challenge in applying current configuration management techniques to these environments and plan to continue exploring this area in the future.

Future Work

As the digital economy continues to expand, computing power and the abstractions built upon it play an increasingly crucial role in economic innovation. New abstractions and system designs are needed to address the growing heterogeneity and number of compute-intensive workloads. However, these solutions often introduce extra architectural complexity, which can pose challenges in terms of security, performance, and reliability. My undergraduate studies focused on the reliability challenges of large-scale computing systems, and I am also interested in the performance of those systems in the face of emerging heterogeneous hardware and workloads such as machine learning training and deployment.
I am excited about the opportunity to further my knowledge and skills at the University of Michigan, where I can engage in cutting-edge research, receive mentorship from top faculty, and take on challenging coursework. The school culture also nourishes entrepreneurial thinking and values diversity, interdisciplinary teamwork, and inventiveness, which contribute to the development of future leaders who understand how to direct technical advancements for social good and business impact. I am looking forward to pursuing my Ph.D. studies under the supervision of Prof. Ryan Huang. My undergraduate research experience has introduced me to some of his notable works, such as Violet, RESIN, and GrayFailure. I am extremely excited to work with him on making our computing infrastructure more reliable.

Teaching and Sharing

Limited educational resources while I was growing up pushed me to actively seek external resources, including courses and research opportunities, to support my educational goals. As a first-generation student from a low-income family, I want to share my experience and help underrepresented groups overcome challenges in accessing quality educational resources. During my undergraduate years, in addition to serving as an academic mentor to Chinese students and international students from the Middle East and Southeast Asia, I held workshops about career planning and undergraduate studies. Later, I became a Microsoft Learn Student Ambassador to expand my reach to larger communities. In my Ph.D. journey, I will strive to improve our community through inclusive teaching and sharing. My interest in computer systems is also rooted in the desire for equal access to STEM education. I believe that accessible and affordable computing power can narrow the economic and geographical gaps that underrepresented groups face, including in access to quality education.
References

[1] Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems - SoCC’13
[2] Fail-Slow Fault Tolerance Needs Programming Support - HotOS’21
[3] What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems - SoCC’14
[4] DepFast: Orchestrating Code of Quorum Systems