ProtDec-LTR 2.0

An improved method for protein remote homology detection by combining pseudo protein and supervised learning to rank




Document of ProtDec-LTR 2.0

Content

1. The flow chart of ProtDec-LTR 2.0

2. Profile-based pseudo protein sequence

3. The pseudo protein-based predictors

4. Learning to rank

5. Dataset

6. The performance comparison

7. References


1. The flow chart of ProtDec-LTR 2.0 (back to content)

The flow chart of ProtDec-LTR 2.0 is shown in Fig. 1. It consists of three main modules for detecting remote homologous proteins.

The first module transforms raw protein sequences into profile-based pseudo protein sequences, which serve as the inputs for protein remote homology detection. In the previous version, a raw protein sequence was directly input as the query [1]. However, the sequence identity among remote homologous proteins is usually low (around or below 35%), so it is hard to achieve high sensitivity based only on raw protein sequences. This module is new in ProtDec-LTR 2.0 and allows the method to incorporate conservation information accumulated during evolution.

The second module searches for candidate remote homologous proteins in a large non-redundant database via basic ranking methods. In the previous version, the search was implemented by taking the raw protein sequence as the query and searching against a raw protein sequence database. In the updated ProtDec-LTR 2.0, three pseudo-protein predictors (Pse-PSI-BLAST, Pse-HHblits and Pse-Hmmer) are constructed instead.

The third module refines the three basic ranking lists to produce a more accurate result by using a supervised LTR algorithm. Three ranking lists are obtained with the three basic pseudo-protein predictors, and they are then embedded as a feature matrix and fed into the LTR framework. As a result, the three ranking predictors are combined in a supervised manner that exploits the advantages of all three individual predictors for more accurate protein remote homology detection.

Figure 1. The flow chart of ProtDec-LTR 2.0. It accepts raw proteins as inputs and returns a ranking list of homologous proteins. The raw proteins are first transformed into profile-based pseudo proteins and fed into the three pseudo-protein predictors. Finally, the three basic ranking lists are combined into one more accurate ranking list by the trained LTR model.

2. Profile-based pseudo protein sequence (back to content)

A profile-based pseudo protein sequence is not a real protein sequence; it is transformed from the profile of a real protein sequence. As demonstrated in previous studies [2-4], profile-based pseudo protein sequences extracted from profiles are useful for improving protein remote homology detection. The main steps of generating a profile-based pseudo protein sequence are shown in Fig. 2 and briefly described as follows.

First, a protein sequence P is searched against the NCBI nrdb90 database [5] by running PSI-BLAST [6] with the parameters (-num_iterations 3 -evalue 0.001) to generate a multiple sequence alignment (MSA). The frequency profile of sequence P, a matrix M of size 20×L (20 is the number of native amino acids and L is the length of sequence P), is then calculated from the frequency of each amino acid at each site of the generated MSA.
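A minimal sketch of this step is given below. It assumes BLAST+ is installed and that a formatted copy of nrdb90 is available locally (the file and database names are placeholders). The sketch lets PSI-BLAST write its ASCII PSSM, which holds per-position residue statistics; ProtDec-LTR 2.0 itself derives the frequency profile from the MSA column frequencies, so this is only one possible way to obtain such a profile.

import subprocess

def build_profile(query_fasta: str, db: str = "nrdb90", pssm_out: str = "query.pssm") -> str:
    """Run PSI-BLAST for 3 iterations (E-value 0.001) and write the ASCII PSSM of the query."""
    subprocess.run(
        [
            "psiblast",
            "-query", query_fasta,       # the raw protein sequence P in FASTA format
            "-db", db,                   # placeholder name of the formatted nrdb90 database
            "-num_iterations", "3",
            "-evalue", "0.001",
            "-out_ascii_pssm", pssm_out,
        ],
        check=True,
    )
    return pssm_out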

Second, for each column of M, the amino acids are sorted in descending order of their frequency values, and the amino acid with the maximal frequency in that column is selected. The selected amino acids are concatenated to form a new sequence P', which is called the profile-based pseudo protein sequence.
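Assuming the frequency profile is available as a 20×L NumPy array M whose rows follow a fixed amino-acid alphabet, the per-column selection can be sketched as follows (the names below are illustrative, not part of the ProtDec-LTR 2.0 code):

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order of the 20 native amino acids in M

def pseudo_protein(M: np.ndarray) -> str:
    """Return the profile-based pseudo protein P': the most frequent amino acid of each column."""
    assert M.shape[0] == 20, "expected a 20 x L frequency profile"
    return "".join(AMINO_ACIDS[i] for i in M.argmax(axis=0))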

Higher values in M correspond to more conserved sites of protein sequence P. Such a representation defined by frequency profiles is more sensitive than raw protein sequences for detecting remote homologs. The profile-based pseudo protein sequence P' is therefore used to replace the raw protein sequence P as the input for protein homology detection.

Figure 2. The transformation from a raw protein to a profile-based pseudo protein. A profile-based pseudo protein sequence is not a real protein sequence; it is derived from a profile, but it has the same length as the raw protein sequence.

3. The pseudo protein-based predictors (back to content)

In the updated ProtDec-LTR 2.0, we construct three pseudo-protein predictors (Pse-PSI-BLAST, Pse-HHblits and Pse-Hmmer) by combining the three state-of-the-art protein predictors (PSI-BLAST [6], HHblits [7] and Hmmer [8]) with profile-based pseudo protein sequences.

The original predictors directly search a protein query against a protein database. In contrast, each pseudo-protein predictor takes the profile-based pseudo protein sequences generated in the first step as input, and the search is performed against a pseudo-protein database in which the raw protein sequences have been transformed into profile-based pseudo proteins in advance. The searching process of a pseudo-protein predictor is as follows (a minimal sketch is given after the list):

  • Transform the protein query into a pseudo-protein query;
  • Transform the protein database into a pseudo-protein database;
  • Search the pseudo-protein query against the pseudo-protein database and generate a list of pseudo protein hits;
  • Map the pseudo protein hits back to the raw proteins.
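The four steps can be put together as in the sketch below. It reuses the pseudo_protein helper sketched in Section 2 and assumes a caller-supplied run_search function that wraps one of the basic predictors, so all names here are illustrative rather than the actual ProtDec-LTR 2.0 implementation.

from typing import Callable, Dict, List, Tuple
import numpy as np

def pseudo_protein(M: np.ndarray) -> str:      # as sketched in Section 2
    return "".join("ACDEFGHIKLMNPQRSTVWY"[i] for i in M.argmax(axis=0))

def pseudo_search(
    query_profile: np.ndarray,                 # 20 x L frequency profile of the query
    db_profiles: Dict[str, np.ndarray],        # protein id -> 20 x L frequency profile
    run_search: Callable[[str, Dict[str, str]], List[Tuple[str, float]]],
) -> List[Tuple[str, float]]:
    """Search a pseudo-protein query against a pseudo-protein database and return ranked hits."""
    # 1. Transform the protein query into a pseudo-protein query.
    pse_query = pseudo_protein(query_profile)
    # 2. Transform the protein database into a pseudo-protein database
    #    (in practice this is done once, offline).
    pse_db = {pid: pseudo_protein(prof) for pid, prof in db_profiles.items()}
    # 3. Search with a basic predictor (PSI-BLAST, HHblits or Hmmer), collecting (hit id, score) pairs.
    hits = run_search(pse_query, pse_db)
    # 4. The hit identifiers name the raw proteins, so the ranking list maps straight back to them.
    return sorted(hits, key=lambda h: h[1], reverse=True)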

The flow chart of the searching process of a pseudo-protein predictor is shown in Fig. 3. In this study, all the basic predictors were run with their default parameters.

Figure 3. The flow chart of the searching process of a pseudo-protein predictor. The raw protein query and the protein database are first transformed into a pseudo protein query and a pseudo protein database; the pseudo protein query is then searched against the pseudo protein database.

4. Learning to rank (back to content)

Learning to rank [9] is the application of machine learning to the construction of ranking models for information retrieval systems, and it has been successfully applied in well-known search engines such as Bing [10] and Google [11]. The training data of learning to rank consists of lists of items with some partial order specified between the items in each list. This order is typically induced by giving a numerical or ordinal score, or a binary judgment (e.g. "relevant" or "not relevant"), to each item. The purpose of the ranking model is to produce a permutation of the items in new, unseen lists that is similar, in some sense, to the rankings in the training data. The training and testing phases of learning to rank are shown in Fig. 4.

Similar to the application of LTR in information retrieval, for protein remote homology detection each protein sequence is treated as a "document". Three ranking lists are obtained with the three aforementioned ranking methods, and they are then embedded as a feature matrix to train the LTR model. Finally, for an unseen query, its homologous proteins are detected by the trained LTR model. As a result, the three ranking predictors are combined in a supervised manner that exploits the advantages of all three individual predictors for more accurate protein remote homology detection. For more information on LTR, please refer to [1].
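As an illustration only (a minimal pairwise ranking sketch, not the actual LTR algorithm trained in ProtDec-LTR 2.0), the example below represents each candidate protein by its three basic ranking scores and fits a linear model on score differences between homologous and non-homologous candidates:

import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_set(features: np.ndarray, labels: np.ndarray):
    """features: N x 3 matrix of (Pse-PSI-BLAST, Pse-HHblits, Pse-Hmmer) scores for one query;
    labels: 1 for true homologs of the query, 0 otherwise."""
    pos, neg = features[labels == 1], features[labels == 0]
    diffs = np.array([p - n for p in pos for n in neg])   # homolog minus non-homolog
    X = np.vstack([diffs, -diffs])                        # both pair orientations
    y = np.concatenate([np.ones(len(diffs)), np.zeros(len(diffs))])
    return X, y

# Training: stack the pairwise sets of all training queries and fit one model, e.g.
#   ranker = LogisticRegression().fit(X, y)
# Testing: score the candidates of an unseen query with the learnt function F(x) = w.x and sort:
#   order = np.argsort(-(candidate_features @ ranker.coef_.ravel()))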

Figure 4. The training and testing phases of learning to rank. In the training phase, the training samples are represented as feature matrices and used to train a ranking function F(∙). In the testing phase, the testing samples are re-ranked by the learnt ranking function F(∙) to detect their homologous proteins.

5. Dataset (back to content)

Two benchmark datasets were used to evaluate the performance of predictors: SCOP [12] and SCOPe [13].

The SCOP benchmark dataset was constructed based on SCOP v1.59 and contains 7,329 proteins with less than 95% sequence identity. It is a widely used dataset and provides good comparability with other related methods [1, 4]. It covers 1,073 superfamilies and 1,827 families.

The SCOPe benchmark dataset was constructed based on SCOPe v2.06, released on 06-April-2017 (the latest version at the time of writing), and contains 28,010 proteins with less than 95% sequence identity, grouped into 2,008 superfamilies and 4,851 families.

6. The performance comparison (back to content)

Two performance measures were employed to evaluate each method: the ROC1 score and the ROC50 score [14]. ROC1 and ROC50 represent the area under the ROC curve up to the first false positive and up to the 50th false positive, respectively. A score of 1 means perfect prediction, whereas a score of 0 means that none of the proteins is correctly identified. In this study, detected proteins from the same SCOP superfamily as the query are considered true positives; otherwise they are false positives. Jackknife validation is employed to evaluate the methods, because it is deemed the most objective cross-validation approach.
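For concreteness, a ROCn score can be computed from a ranked hit list as in the sketch below, a minimal implementation of the measure described above that assumes the hits are sorted by decreasing predictor score and labelled 1 for same-superfamily hits and 0 otherwise:

def roc_n(ranked_labels, n=50):
    """ROCn: normalized area under the ROC curve up to the n-th false positive.
    ranked_labels: 0/1 labels sorted by decreasing score (1 = same SCOP superfamily as the query)."""
    total_pos = sum(ranked_labels)
    if total_pos == 0:
        return 0.0
    tp = fp = area = 0
    for label in ranked_labels:
        if label == 1:
            tp += 1
        else:
            fp += 1
            area += tp                   # true positives ranked above this false positive
            if fp == n:
                break
    if fp < n:                           # fewer than n false positives appear in the list
        area += tp * (n - fp)
    return area / (n * total_pos)

For ROC1, set n = 1: the score is then simply the fraction of true homologs ranked above the first false positive.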

Table 1 shows the performance of various methods on SCOP v1.59, from which we can see that the performance of the three predictors (PSI-BLAST, HHblits and Hmmer) is improved by the pseudo protein approach. ProtDec-LTR2.0 clearly outperforms ProtDec-LTR in terms of ROC1, and is highly comparable with ProtDec-LTR in terms of ROC50.

Table 1. The performance comparison between ProtDec-LTR2.0 and other related methods on SCOP v1.59 via jackknife validation.

Methods ROC1 ROC50
ProtDec-LTR2.0 0.8911 0.8955
ProtDec-LTR 0.8510 0.8969
Pse-PSI-BLAST 0.7900 0.8127
Pse-HHblits 0.8246 0.8737
Pse-Hmmer 0.8016 0.8212
PSI-BLAST 0.7718 0.7794
HHblits 0.8187 0.8669
Hmmer 0.7796 0.7830
Coma 0.6989 0.7785
ProtEmbed 0.8136 0.8897
dRHP-PseRA 0.8314 0.8924

To further evaluate its performance, ProtDec-LTR2.0 was also evaluated on the updated benchmark dataset SCOPe v2.06. The results are shown in Fig. 5, from which we can see that ProtDec-LTR2.0 clearly outperforms the basic predictors in terms of ROC1 and ROC50.

Figure 5. Performance comparison of various methods on the SCOPe benchmark dataset via jackknife validation. The graph plots the percentage of sequences for which the method exceeds a given performance threshold; a higher curve means better performance. ROC1 and ROC50 are used as the performance measures in (A) and (B), respectively. ProtDec-LTR2.0 achieves the best performance with a ROC1 score of 0.969 and a ROC50 score of 0.981, clearly outperforming the other methods.

7. References (back to content)

1. Liu B, Chen J, Wang X. Application of Learning to Rank to protein remote homology detection, Bioinformatics 2015;31:3492-3498.
2. Liu B, Zhang D, Xu R et al. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection, Bioinformatics 2014;30:472-479.
3. Liu B, Wang X, Lin L et al. A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis, BMC bioinformatics 2008;9:510.
4. Chen J, Long R, Wang X-l et al. dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation, Scientific Reports 2016;6:32333.
5. Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections, Bioinformatics 1998;14:423-429.
6. Altschul SF, Madden TL, Schäffer AA et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic acids research 1997;25:3389-3402.
7. Remmert M, Biegert A, Hauser A et al. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment, Nature methods 2012;9:173-175.
8. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching, Nucleic acids research 2011;39:W29-37.
9. Liu T-Y. Learning to rank for information retrieval, Foundations and Trends in Information Retrieval 2009;3:225-331.
10. Liu T-Y, Xu J, Qin T et al. Letor: Benchmark dataset for research on learning to rank for information retrieval, Foundations and Trends in Information Retrieval 2009;3(3):225-331.
11. Sculley D. Large scale learning to rank, NIPS Workshop on Advances in Ranking 2009; 58-63.
12. Murzin AG, Brenner SE, Hubbard T et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures, Journal of molecular biology 1995;247:536-540.
13. Fox NK, Brenner SE, Chandonia J-M. SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures, Nucleic acids research 2014;42:D304-D309.
14. Gribskov M, Robinson NL. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching, Computers & chemistry 1996;20:25-33.


Harbin Institute of Technology, Shenzhen.
