Statement of Purpose Essay - University of North Carolina at Chapel Hill

Graph Neural Networks (GNNs) have emerged as powerful tools for learning from graph-structured data, leading to significant advances in algorithm design. Despite these advances, the foundational assumptions, or inductive biases, built into GNNs remain limited, creating a noticeable gap in their application across scientific domains, each with its own set of challenges. Driven by this motivation, my research interests lie in two main directions: (1) addressing the inductive bias in GNNs [6, 2, 8, 3, 1] and (2) applying GNNs in medical and biological fields, specifically single-cell RNA-sequencing (scRNA-seq) [7, 4] and tumor biology [9], as well as in other diverse areas [5]. These focus areas are at the heart of what I aim to explore under the umbrella of AI4Science during my Ph.D. studies at the University of North Carolina at Chapel Hill. The following sections break AI4Science down into its two components: ‘AI’ and ‘4Science’.

AI: Graph Neural Networks

During my master’s at KAIST under Prof. Chanyoung Park, I specialized in AI, focusing on GNNs for their capability to handle non-Euclidean data and complex relationships. Despite GNNs’ popularity, I identified a significant oversight: the inductive bias in GNNs often presupposes balanced data, in contrast to real-world applications. I tackled the long-tail problem [6] in graph-structured data, where an imbalanced class distribution forms a long-tailed shape. However, I discovered that existing works overlooked the graph’s degree distribution, which is also long-tailed. This insight was especially crucial because tail-degree nodes (nodes with few neighbors) make up the majority of a graph. Driven by this motivation, I developed a GNN model that addresses both class and degree long-tailedness, featuring four expert models, one for each combination of class and degree long-tailedness.
This work was the first to address both class and degree long-tailedness and was published at CIKM 2022. We extended this joint view to sequential recommender systems [3], considering both user sequence length and item frequency, which resulted in a publication at SIGIR 2023.

More recently, since my time as a visiting researcher at Tokyo Tech under Prof. Tsuyoshi Murata, I have tackled the inductive bias of fully observed features [8] in GNNs. Efforts in the graph domain to address this bias, especially where missing features are common, have often relied on simulating missingness by manually masking parts of a feature matrix that typically had no missing features to begin with. This led me to ponder the implications if the feature matrix were inherently incomplete and the graph structure not provided, as in biomedical domains with severe missing-feature issues. I found that initially missing features often lead to suboptimal kNN graphs, which become a bottleneck for effective graph-based imputation. To investigate this, I evaluated graph-based imputation methods in the biomedical domain and found that they generalize poorly. Stepping aside from directly building kNN graphs, I trained a simple MLP to extract feature gradients and used them to create an initial graph structure. This approach, which incorporates feature propagation and graph regeneration, aims to enhance the efficacy of graph-based imputation on biomedical data. This research is under review at ICLR 2024.

4Science: Single-cell RNA-sequencing & Tumor Biology

Building on the AI perspective discussed above, my research direction is geared toward making significant contributions to scientific fields. In my view, the key is not to naively apply existing AI methods to longstanding scientific problems, but to identify and address significant bottlenecks within specific domains using AI. This approach involves tailoring AI solutions to the unique challenges of the target problem.
In the scRNA-seq domain, for instance, the prominent challenge is the dropout phenomenon [7], where many genes appear unexpressed in individual cells due to low and uneven mRNA distribution, leading to a high prevalence of zeros in the cell-gene count matrix. I observed that current graph-based imputation methods targeting the dropout issue do not adequately distinguish zeros from non-zeros: non-zero values are disproportionately altered during diffusion, and the methods rely too heavily on the initial kNN graph constructed from a sparse and noisy matrix. To mitigate this, I pioneered the use of feature propagation, which preserves non-zero values throughout the iterations and focuses on imputing zero values, and proposed refining the graph structure using the imputed matrix. This approach, well suited to the scRNA-seq domain, earned us the best paper award at the ICML Workshop on Computational Biology 2023. We have further expanded this research to incorporate not just cell-cell but also gene-gene interactions [4], in collaboration with Prof. Manolis Kellis at MIT, and have submitted our manuscript to Nature Methods.

Furthermore, during my internship with Dr. Tianlong Chen, an incoming professor at the University of North Carolina at Chapel Hill, I analyzed tumor-related multiplexed immunofluorescence tissue images from a GNN perspective. The primary challenge was ensuring scalability and accurate phenotype prediction for each image while accounting for cellular heterogeneity [9]. I noticed that existing works fell short in handling such heterogeneous environments, where cells are often connected to cells of various other types, deviating from the homophily assumption typical of GNNs. To address this, I proposed a novel multiplex network approach that adds a cell-type network layer on top of the geometric layer to foster cellular assortativity, and that scales by using a precomputing technique.
This work is currently under review at CVPR 2024.

Why Especially UNC-CH?

The University of North Carolina at Chapel Hill, known for its prestigious Computer Science program and its collaborative atmosphere, has distinguished units such as the Department of Biology and the School of Medicine that align closely with my research goal, AI4Science. I am confident that I can deepen my understanding of AI there and contribute positively to scientific areas. For example, working with the incoming Professor Tianlong Chen, I am prepared to contribute to AI by advancing the frontiers of GNNs and recent Large Language Models, as well as to scientific areas covering scRNA-seq, tissue phenotyping, and the analysis of protein sequences and structures. Moreover, collaborating with Professor Natalie Stanley will enable me to explore the multimodality of single-cell data, integrating scRNA-seq, scATAC-seq, and CITE-seq data in a scalable way.

[1] Tai Hasegawa, Sukwon Yun, Xin Liu, Yin Jun Phua, and Tsuyoshi Murata. DEGNN: Dual experts graph neural network handling both edge and node feature noise. Under Review at PAKDD 2024.
[2] Junghurn Kim*, Sukwon Yun*, and Chanyoung Park. S-mixup: Structural mixup for graph neural networks. CIKM 2023. (*: equal contribution).
[3] Kibum Kim, Dongmin Hyun, Sukwon Yun, and Chanyoung Park. MELT: Mutual enhancement of long-tailed user and item for sequential recommendation. SIGIR 2023.
[4] Junseok Lee*, Sukwon Yun*, Yeongmin Kim, Tianlong Chen, Manolis Kellis, and Chanyoung Park. Single-cell RNA sequencing data imputation using bi-level feature propagation. Under Review at Nature Methods. (*: equal contribution).
[5] Yunhak Oh*, Sukwon Yun*, Dongmin Hyun, Sein Kim, and Chanyoung Park. MUSE: Music recommender system with shuffle play recommendation enhancement. CIKM 2023. (*: equal contribution).
[6] Sukwon Yun, Kibum Kim, Kanghoon Yoon, and Chanyoung Park. LTE4G: Long-tail experts for graph neural networks. CIKM 2022.
[7] Sukwon Yun*, Junseok Lee*, and Chanyoung Park. Single-cell RNA-seq data imputation using feature propagation. ICML 2023 Workshop on Computational Biology. (*: equal contribution).
[8] Sukwon Yun, Yunhak Oh, Junseok Lee, Xin Liu, Tsuyoshi Murata, Dongmin Hyun, Sein Kim, Tianlong Chen, and Chanyoung Park. Toward generalizability of graph-based imputation on biomedical missing data. Under Review at ICLR 2024.
[9] Sukwon Yun, Jie Peng, Chanyoung Park, and Tianlong Chen. Multiplexed immunofluorescence image analysis through an efficient multiplex network. Under Review at CVPR 2024.