Motivation: Protein remote homology detection is one of the fundamental problems in computational biology, aiming to find protein sequences in a database of known structures that are evolutionarily related to a given query protein. Some computational methods treated this problem as a ranking problem, and achieved the state-of-the-art performance, such as PSI-BLAST, HHblits, and ProtEmbed. This raises the possibility to combine these methods to improve the predictive performance. In this regard, we are to propose a new computational method called ProtDec-LTR for protein remote homology detection, which is able to combine various ranking methods in a supervised manner via using the Learning to Rank (LTR) algorithm derived from natural language processing.
Results: Experimental results on a widely used benchmark dataset showed that ProtDec-LTR can achieve an ROC1 score of 0.8442, and an ROC50 score of 0.9023 outperforming all the individual predictors and some state-of-the-art methods. These results indicate that it is correct to treat protein remote homology detection as a ranking problem, and predictive performance improvement can be achieved by combining different ranking approaches in a supervised manner via using Learning to Rank.
Figure 1. The flowchart of ProtDec-LTR. There are three main phases in ProtDec-LTR, including feature representation, LTR training and testing phases. The proteins were first embedded into feature matrix constructed based on three basic ranking methods: Psi-BLAST, HHblits and ProtEmbed, and then they were fed into LTR for training the model. For an unseen protein, its homologous proteins can be detected by the trained model. Therefore, these three methods were combined in a supervised framework by using LTR.