Abstract
Protein Gene Ontology (GO) annotation is crucial for linking gene products to their biological functions. However, many protein sequences have few or sparsely distributed annotations, which contributes to a 'core-periphery' imbalance in functional knowledge. Core proteins are extensively annotated with general GO terms, while peripheral proteins carry only a few, highly specific annotations that often represent under-explored biological functions. To address this challenge, we propose ProGO-PSL, a method built on a large protein sequence graph that leverages both explicit and implicit information. Explicit information is derived from annotated functional data, while implicit information is inferred from unlabeled sequence data and sequence similarities. ProGO-PSL employs a Gated Linear Unit-based Transformer to fuse these two types of information, addressing a limitation of traditional methods, whose GO term predictions are biased by the core-periphery imbalance in the data. The key contributions of ProGO-PSL are: i) a novel large graph model that improves prediction of both core and peripheral GO terms by leveraging explicit and implicit information, ii) a fusion strategy that improves performance for GO terms with varying levels of information under imbalanced distributions, and iii) interpretable representations that provide insight into the core-periphery distribution problem in protein gene ontology annotation. Our approach improves the balance between core and peripheral protein annotations, offering a more comprehensive and accurate understanding of protein functions, particularly in under-explored regions of the protein universe.
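To make the gating idea concrete, the following is a minimal sketch of a generic gated linear unit applied to two feature vectors. It is not the ProGO-PSL architecture: the function names (`glu_fuse`, `linear`), the concatenation of the two inputs, the dimensions, and the random weights are all assumptions chosen for illustration. The only element taken from the text is that a gated linear unit is used to fuse explicit and implicit information; the standard form of such a unit multiplies a linear projection by a sigmoid gate.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Logistic function, used here as the gate activation."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def glu_fuse(explicit: Vector, implicit: Vector,
             w_value: Matrix, b_value: Vector,
             w_gate: Matrix, b_gate: Vector) -> Vector:
    """Fuse two feature vectors as value * sigmoid(gate).

    Assumption: the two inputs are simply concatenated before the
    projections; the source does not specify how they are combined.
    """
    joint = list(explicit) + list(implicit)
    value = linear(joint, w_value, b_value)   # candidate fused features
    gate = linear(joint, w_gate, b_gate)      # per-feature gate logits
    return [v * sigmoid(g) for v, g in zip(value, gate)]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights, standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy inputs: a 3-dim "explicit" and a 3-dim "implicit" embedding.
rng = random.Random(0)
explicit_features = [0.2, -0.1, 0.4]
implicit_features = [0.7, 0.3, -0.5]
out_dim = 4
fused = glu_fuse(
    explicit_features,
    implicit_features,
    random_matrix(out_dim, 6, rng), [0.0] * out_dim,
    random_matrix(out_dim, 6, rng), [0.0] * out_dim,
)
print(fused)
```

Because the sigmoid gate lies strictly between 0 and 1, each output feature is a scaled-down copy of its projected value, so the gate can suppress a feature when one information source is uninformative for a given protein. How the gate is actually parameterized and trained in ProGO-PSL is not stated here.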