Statement of Purpose Essay - University of Washington
I am applying to your Ph.D. program to research programming languages and formal methods, driven by my involvement in the open-source community. In particular, I am interested in studying the structures that represent the semantics of Python libraries in data science, machine learning, artificial intelligence, and scientific computing. My experience working in this open-source community has prepared me to integrate theoretical research with practical applications. Researching the design and implementation of abstractions for domain-specific libraries can help preserve the independence and growth of decentralized multi-stakeholder Python libraries. I aim to explore how computer science and mathematics can offer a precise language to describe and manipulate the abstractions used in the field. Pursuing this research at the University of Washington would provide a collaborative environment in which to pursue academic research that meets the real-world needs of the open-source community.

I have experience combining research, community engagement, and attention to real-world user needs through my participation in open source, particularly with JupyterLab and Vega. Starting with a personal initiative to improve collaboration within my research group, I contributed to integrating the Vega-Lite visualization tools into JupyterLab, collaborating with project creators Dominik Moritz and Brian Granger. I relied on these relationships to build a general-purpose, open-source tool for Python data science users to optimize their interactive visualizations based on the needs of a database client. This work improved Vega's debugging capabilities and contributed to published research on visualization profiling. As a core JupyterLab developer, I helped clients integrate with the Jupyter ecosystem and shared the responsibility of maintaining and enhancing the project.
For example, I wrote the proposal for, and received, a Chan Zuckerberg Initiative grant to improve the collaborative experience in Jupyter, leading to an integration of conflict-free replicated data types into JupyterLab. Overall, I enjoyed the opportunity to integrate research and user needs while navigating complex multi-stakeholder projects.

My engagement in the open-source Python data science community motivates my research on the interrelated technical and social challenges it faces. The shift toward heterogeneous hardware has increased the complexity of developing numeric libraries and led to fragmentation in the ecosystem, centralizing control in single-stakeholder projects like JAX, PyTorch, and Mojo. Decentralized, democratic open-source projects based on mutual aid and cooperative relationships are "exilic spaces," as defined by O’Hearn and Grubačić, and a powerful living alternative to the market economy. By protecting them through tools that promote a bottom-up ecosystem to meet user needs and facilitate innovation, we can engage in the vital practice of navigating the difficulties of democratic governance. I hope my work can play a role in increasing the diversity and accessibility of the ecosystem by building tools that strengthen its independence.

My path towards exploring how programming language abstractions can aid the PyData community began at Quansight. Together with Travis Oliphant, creator of NumPy, I sought to research a new foundation for the array ecosystem in Python. I reached out to Lenore Mullin because of her 1980s work on the 'Mathematics of Arrays' formalizing APL with the lambda calculus, and we ended up working together to integrate her work into Python. This led to the development of a generic framework, metadsl, for representing and optimizing expressions in Python, but securing funding for this type of experimental technology proved challenging.
While at Quansight, I also assisted the Data APIs consortium with analyzing the usage of array API patterns in Python, giving me the opportunity to witness the difficulties in providing users with universal interfaces.

After leaving Quansight, I joined Linea, a startup working to bring database and programming language theory to Python data scientists. As one of its first engineers, I engaged with academic researchers while leading the development of Linea’s abstract analysis engine, which transformed Python code into a functional dataflow graph in order to support program slicing and lineage analysis.

After leaving Linea, I refocused on developing a system for abstract interpretation in Python. I learned about the work on e-graphs from the University of Washington, a technique that comes from equational reasoning. I wrote a Python library exposing the research software to a broader audience and presented it at the E-GRAPHS workshop at PLDI. I also developed an interactive visualizer, now integrated into the project, to make the work more accessible. To deepen my academic collaboration, I applied for and received an NSF fellowship to support pursuing a Ph.D. in computer science.

My ongoing work with e-graphs in Python gives me a jumping-off point for my overall research goals in pursuing a Ph.D. In egglog, the framework building e-graphs on top of Datalog, I am interested in studying how adding parameterized types and higher-order functions could improve expressivity, and in how the framework relates to existing compute engines, such as SQL and differential dataflow. I also want to explore how existing functional abstractions like the Mathematics of Arrays and the RVSDG can be implemented within the e-graph framework to provide a general-purpose compiler framework for numeric computation, AI, and data science.
More broadly, I am interested in studying, through the lens of human-computer interaction, how frameworks for type-driven programming can support cross-domain and cross-language translation to aid collaboration between groups of library developers, as well as how concepts from the foundations of mathematics can give us the language to reason about the semantics of programming languages.

More specifically, at the University of Washington, I am interested in working with Zachary Tatlock, Gilbert Bernstein, Dan Grossman, and Luis Ceze on studying the abstract interpretation and compilation of machine learning code in Python. Moreover, I want to connect the academic programming language community more closely with the Python ecosystem, a high-impact field that would benefit from a foundational approach to studying abstraction. By enrolling in your graduate program, I could develop longer-term working relationships with other researchers in an environment supportive of experimentation and thus connect abstract research with the needs of existing open-source communities.

Footnotes
Bruno Latour, "Down to Earth: Politics in the New Climatic Regime," p. 92