Figure 1. The flowchart of ProFun-SOM.
ProFun-SOM: Protein Function Prediction for Specific Ontology based on Multiple Sequence Alignment Reconstruction
- Motivation: Protein function analysis is one of the most important problems in protein sequence analysis. Although current computational methods have advanced the study of protein function with the help of large-scale sequenced protein data, their prediction accuracy is limited by the complexity of protein function categories. For example, the Gene Ontology (GO) contains thousands of GO terms representing different functions, and it is divided into three ontologies: Cellular Component Ontology (CCO), Molecular Function Ontology (MFO), and Biological Process Ontology (BPO). It is therefore more reasonable to construct ontology-specific models that capture the different characteristics of these three ontologies.
- Results: The evolutionary information contained in multiple sequence alignments (MSAs) helps to differentiate proteins with different functions. However, noise in MSAs (e.g., non-homology errors) can easily bias predictions. To make full use of MSAs, we propose a novel protein function predictor, ProFun-SOM, which consists of an MSA reconstructor and a functional predictor. The MSA reconstructor first denoises the input MSAs by reconstructing them from a hidden space; the functional predictor then uses the denoised MSAs for function prediction. To capture the biological characteristics of each ontology, we trained three separate ProFun-SOM models for CCO, MFO, and BPO. Experimental results on the CAFA3 dataset showed that ProFun-SOM clearly outperformed other state-of-the-art methods in terms of Fmax and AuPRC scores on both single-ontology and mixed-ontology tasks (CCO, MFO, and BPO).
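
The Fmax score named above is not defined in this abstract. As a rough illustration, the sketch below follows the protein-centric Fmax as it is commonly defined in the CAFA challenges: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all annotated proteins, and Fmax is the best resulting F-measure. The function name `fmax_score`, the threshold grid, and the toy matrices are assumptions made for this example; this is not the authors' evaluation code.

```python
import numpy as np


def fmax_score(y_true, y_score, thresholds=None):
    """Protein-centric Fmax (hypothetical helper, CAFA-style definition).

    y_true:  (n_proteins, n_terms) binary matrix of true GO annotations.
    y_score: (n_proteins, n_terms) predicted scores in [0, 1].
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 1.00
        thresholds = np.linspace(0.01, 1.0, 100)

    n_true = y_true.sum(axis=1)
    has_truth = n_true > 0
    if not has_truth.any():
        return 0.0

    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        n_pred = pred.sum(axis=1)
        tp = (pred & y_true).sum(axis=1)

        # Precision is averaged only over proteins with >= 1 predicted term.
        covered = n_pred > 0
        if not covered.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()

        # Recall is averaged over all proteins that have true annotations.
        recall = (tp[has_truth] / n_true[has_truth]).mean()

        if precision + recall > 0:
            f = 2.0 * precision * recall / (precision + recall)
            best = max(best, f)
    return best


# Toy usage: two proteins, three GO terms (made-up values).
toy_true = [[1, 0, 1], [0, 1, 0]]
toy_score = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.3]]
toy_fmax = fmax_score(toy_true, toy_score)
```

Under the ontology-specific setup described above, a function like this would be applied separately to the term sets of CCO, MFO, and BPO.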