Figure 1. The flowchart of SMI-BLAST.
SMI-BLAST: A novel supervised search framework based on PSI-BLAST for protein remote homology detection
- Motivation: PSI-BLAST is one of the most important and widely used iterative tools for protein sequence search, and an accurate Position-Specific Scoring Matrix (PSSM) is the key to its performance. However, in protein remote homology detection, PSI-BLAST tends to detect more non-homologous sequences, because its PSSMs are constructed mainly from sequences that are not homologous to the query protein.
- Results: To uncover the reasons, we identified three types of Incorrectly Selected Homology (ISH) errors in PSSMs. To correct these errors, we propose a new search tool based on PSI-BLAST, Supervised-Manner-based Iterative BLAST (SMI-BLAST). SMI-BLAST outperforms PSI-BLAST on the Structural Classification of Proteins-extended (SCOPe) dataset. On the ISH error subsets of the SCOPe dataset, SMI-BLAST detects 1.6- to 2.87-fold more remote homologous sequences than PSI-BLAST and outperforms it by 35.66% in terms of ROC1 scores. Furthermore, the framework was applied to JackHMMER, DELTA-BLAST and PSIBLASTexB; the resulting predictors also outperform the original search methods, demonstrating the generality of the proposed SMI-based framework.
About Incorrectly Selected Homology (ISH)
Figure 2. The five situations in which PSI-BLAST selects sequences to construct a PSSM profile for protein remote homology detection.
To analyse the problems of PSSMs on protein domain databases in protein remote homology detection, we summarize three situations observed in PSI-BLAST results as Incorrectly Selected Homology (ISH) errors. An ISH error means that true positives exist in the ranking list, but the selected list is empty or contains false positives. Figure 2 shows the three types of ISH errors together with the two other PSSM situations:
i) True-PSSM (Figure 2A). The PSSM is constructed entirely from true positives in the selected list. This is the ideal situation, in which the PSSM describes the correct evolutionary information of the query sequence;
ii) ISH-MIX error (Figure 2B). The selected list contains both false positives and true positives, so incorrect evolutionary information is added to the PSSM and more false positives are produced in later iterations;
iii) ISH-NULL error (Figure 2C). The selected list contains no sequence that can be used to construct the PSSM, although true positives exist in the candidate list. With a null PSSM, PSI-BLAST can produce almost no results in the next iteration;
iv) ISH-ALL error (Figure 2D). All sequences in the selected list are false positives, although true positives exist in the candidate list. The resulting erroneous PSSM can detect almost no true positives in the next iteration;
v) False-PSSM (Figure 2E). The ranking list contains no true positives, so there is no room left to adjust the PSSM.
To construct and maintain the ideal situation throughout the iteration process, the above PSSM errors must be rectified.
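
The five situations listed above can be expressed as a small decision rule. The following Python code is only an illustrative sketch and not the SMI-BLAST implementation. The function name `classify_pssm_situation`, its arguments, and the sequence identifiers are hypothetical. It assumes that the ranking list holds every hit of one iteration, that the selected list is the subset of those hits used to build the PSSM, that the "candidate list" can be treated as the same ranking list, and that the true homologs of the query are known (for example from SCOPe labels on a benchmark).

```python
from enum import Enum


class PssmSituation(Enum):
    TRUE_PSSM = "True-PSSM"    # Figure 2A
    ISH_MIX = "ISH-MIX"        # Figure 2B
    ISH_NULL = "ISH-NULL"      # Figure 2C
    ISH_ALL = "ISH-ALL"        # Figure 2D
    FALSE_PSSM = "False-PSSM"  # Figure 2E


def classify_pssm_situation(ranking_list, selected_ids, true_homologs):
    """Label one search iteration with one of the five situations.

    ranking_list  -- all hit IDs of the iteration (assumed to include selected hits)
    selected_ids  -- set of hit IDs chosen to build the PSSM
    true_homologs -- set of IDs known to be homologous to the query
    """
    # Keep only selected hits that actually appear in the ranking list.
    selected = [hit for hit in ranking_list if hit in selected_ids]

    # No true positive anywhere in the ranking list: nothing can be rectified.
    if not any(hit in true_homologs for hit in ranking_list):
        return PssmSituation.FALSE_PSSM

    # True positives exist, but nothing was selected.
    if not selected:
        return PssmSituation.ISH_NULL

    n_true = sum(hit in true_homologs for hit in selected)
    if n_true == len(selected):
        return PssmSituation.TRUE_PSSM
    if n_true == 0:
        return PssmSituation.ISH_ALL
    return PssmSituation.ISH_MIX


# Hypothetical identifiers: "d*" are true homologs, "x*" are not.
homologs = {"d1", "d2"}
ranking = ["d1", "x1", "d2", "x2"]
print(classify_pssm_situation(ranking, {"d1", "x1"}, homologs).value)
```

Checking for False-PSSM first mirrors the definition in item v: when the ranking list holds no true positive, the content of the selected list no longer matters for the label.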