ProGO-PFL:

Revisiting Protein Function Prediction: Bridging Large Biological Networks and Ontological Knowledge via Weak Associations




Introduction

Abstract: Predicting protein function through Gene Ontology (GO) annotation remains a fundamental challenge in computational biology. Traditional methods rely primarily on strong associations, such as homology relationships and interaction networks, derived from static gene-encoded information. However, these approaches often fail to capture the dynamic and context-dependent nature of protein function. Inspired by the behaviour of intrinsically disordered proteins (IDPs), which engage in transient and flexible interactions crucial for biological processes, we propose ProGO-PFL, a framework that bridges large biological networks with ontological knowledge through weak association learning. Our approach introduces three key innovations: (i) construction of weak association data, using protein language models to predict association scores for protein-protein and protein-GO pairs; (ii) a protein function large graph model with a balanced sampling algorithm for effective learning from massive weak association data; and (iii) a dual learning framework that combines transductive learning for known proteins with inductive learning for novel proteins. Experimental evaluation demonstrates that ProGO-PFL outperforms state-of-the-art methods across all three GO domains: Biological Process (Fmax=0.3722), Cellular Component (Fmax=0.6073), and Molecular Function (Fmax=0.6144), with improvements of 6%-29.5% over previous methods. Further analyses show that our weak association learning approach annotates previously uncharacterized proteins with high accuracy and is particularly effective at predicting deep hierarchical GO terms that other methods fail to identify.
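The Fmax values quoted above can be read against the following minimal sketch of a protein-centric Fmax, assuming the CAFA-style definition (precision averaged over proteins with at least one prediction at the threshold, recall averaged over all annotated proteins). The page does not state which Fmax variant or threshold grid was used, so the function name `fmax`, the 0.01 threshold step, and the toy GO identifiers are illustrative assumptions, not the authors' evaluation code.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax (CAFA-style; assumed, not confirmed by the source).

    y_true  : list of sets, the true GO terms of each protein
    y_score : list of dicts mapping GO term -> predicted score in [0, 1]
    """
    if thresholds is None:
        # Assumed threshold grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    n_with_truth = sum(1 for truth in y_true if truth)
    best = 0.0
    for t in thresholds:
        precisions = []   # one entry per protein with >= 1 prediction at t
        recall_sum = 0.0  # summed over proteins that have true annotations
        for truth, scores in zip(y_true, y_score):
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:
                precisions.append(hits / len(predicted))
            if truth:
                recall_sum += hits / len(truth)
        if not precisions or n_with_truth == 0:
            continue
        p = sum(precisions) / len(precisions)
        r = recall_sum / n_with_truth
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy data with hypothetical GO identifiers.
toy_truth = [{"GO:1", "GO:2"}, {"GO:3"}]
toy_scores = [{"GO:1": 0.9, "GO:2": 0.4, "GO:4": 0.6}, {"GO:3": 0.8}]
toy_fmax = fmax(toy_truth, toy_scores)
```

Fmax takes the maximum over thresholds, so it summarizes the best achievable precision-recall trade-off rather than performance at one fixed cut-off.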
The flowchart of the ProGO-PFL model is shown in Fig. 1.

Fig. 1 The flowchart of ProGO-PFL.
(i) Graph Construction (Fig. 1a): We establish three distinct graph structures representing different relationships: a protein interaction graph, a GO relation graph, and a protein-GO annotation graph. These graphs serve as the foundational components of our heterogeneous network.
(ii) Heterogeneous Graph Merging and Sampling (Fig. 1b): The individual graphs from the first phase are merged into a single heterogeneous graph that encompasses all nodes and edges. This heterogeneous graph is then sampled so that subsequent stages can be computed efficiently.
(iii) Protein-GO Association Prediction (Fig. 1c): In this final phase, protein features are extracted using Evolutionary Scale Modeling (ESM), while GO terms are represented through one-hot encoding. These feature representations are attached to the sampled heterogeneous graph, and fully connected layers align the dimensions of the protein and GO term features. The model employs a Dual-Level Attention Mechanism within the Heterogeneous Graph Attention Network (HGAT), which leverages both the type information and the feature representations of neighboring nodes. The final node representations are obtained through an output fully connected layer, and the predicted protein-GO term association matrix is generated by taking the inner product of these representations.
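The dimensional alignment and inner-product scoring described in step (iii) can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the tiny 3-dimensional vectors stand in for real ESM embeddings, the weight matrices `W_p` and `W_g` are hand-picked rather than learned, and the HGAT attention layers between projection and scoring are omitted entirely. The helper names (`linear`, `one_hot`, `association_matrix`) are hypothetical and not taken from the ProGO-PFL code.

```python
def linear(x, weight):
    """Bias-free fully connected layer: (n x d_in) times (d_in x d_out)."""
    columns = list(zip(*weight))
    return [[sum(xi * w for xi, w in zip(row, col)) for col in columns]
            for row in x]


def one_hot(index, size):
    """One-hot encoding of a GO term by its index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def association_matrix(protein_repr, go_repr):
    """Score every protein-GO pair by the inner product of their vectors."""
    return [[sum(p * g for p, g in zip(p_row, g_row)) for g_row in go_repr]
            for p_row in protein_repr]


# Two proteins with 3-dimensional stand-in features (real ESM vectors are far larger).
protein_feats = [[1.0, 0.0, 2.0],
                 [0.0, 1.0, 1.0]]
# Three GO terms, one-hot encoded.
go_feats = [one_hot(i, 3) for i in range(3)]

# Hypothetical projection weights mapping both node types into a shared 2-d space.
W_p = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_g = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

protein_repr = linear(protein_feats, W_p)  # shape: 2 proteins x 2 dims
go_repr = linear(go_feats, W_g)            # shape: 3 GO terms x 2 dims

# In the described model, HGAT layers would refine both representations here.
scores = association_matrix(protein_repr, go_repr)  # shape: 2 x 3
```

Because the GO inputs are one-hot, the GO projection layer acts as a learnable embedding table: each row of `W_g` is the embedding of one GO term. Projecting both node types into the same dimension is what makes the final inner product well defined.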