Feature analysis

To explore why the proposed FoldRec-C2C predictor can correct the two errors discussed in section “Seq-to-cluster model”, the final predictive results for the test proteins in the LINDAHL dataset are visualized in Fig. 2(a), where test proteins and training proteins are shown as blue and green points, respectively. Test proteins in the same cluster are connected by blue lines. Test proteins in different clusters are connected by black lines, meaning that although their similarity can be detected by HHblits, the spectral clustering method did not assign them to the same cluster. If two clusters are connected by a red line, all the proteins in these two clusters belong to the same protein fold.

Two examples were selected to show how the proposed cluster-to-cluster model resolves these two errors. The first example is the prediction of the test proteins in fold 2_1 (SCOP ID). Fig. 2(b) shows the predictive results of FoldRec-C2C based on S2S, where the gray lines indicate the similarity scores between each test protein and each training protein calculated by S2S, and the predictive results are shown as red lines. Fig. 2(c) shows the results of FoldRec-C2C based on S2C, where the similarity scores between each test protein and each cluster in the training set calculated by S2C are shown as gray lines, and the predictive results are shown as red lines. Fig. 2(d) shows the results of FoldRec-C2C based on C2C, where the red lines represent the similarity scores between a cluster in the test set and a cluster in the training set, which can be considered the final predictive results of FoldRec-C2C. From Fig. 2(b-d) we can see the following: i) S2C is more accurate than S2S, and C2C is the most accurate model, correctly identifying all the test proteins in fold 2_1; ii) although the test proteins in fold 2_1 were split into two clusters by the spectral clustering method, both clusters are correctly connected to the cluster of fold 2_1 in the training set, indicating that even when the spectral clustering method fails to group all the test proteins of a fold together, the C2C model is able to correct this error.

The second example is the prediction of the test proteins in fold 4_50 (see Fig. 2(e-g)), from which we can see the following: i) the S2S model predicts the test proteins in fold 4_50 incorrectly; ii) the S2C model correctly predicts some of these proteins by considering the relationships among training proteins, but it still fails on others; iii) the C2C model correctly predicts all the proteins in fold 4_50 by considering both the relationships among test proteins and the relationships among training proteins. These two examples show that the proposed FoldRec-C2C predictor based on C2C can correct the errors caused by the S2S model and, therefore, outperforms the other existing methods.
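The progression from S2S to S2C to C2C observed in these examples can be illustrated with a toy sketch. It is not the FoldRec-C2C implementation: it assumes, purely for illustration, that cluster-level scores are obtained by averaging pairwise scores, whereas the actual models are those defined in section “Seq-to-cluster model”. All function names, protein identifiers, and similarity values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def s2s_predict(s2s, train_cluster):
    """Assign each test protein the cluster of its single best training hit."""
    best = {}
    for (test_id, train_id), score in s2s.items():
        if test_id not in best or score > best[test_id][0]:
            best[test_id] = (score, train_cluster[train_id])
    return {test_id: cluster for test_id, (_, cluster) in best.items()}


def s2c_scores(s2s, train_cluster):
    """Score each (test protein, training cluster) pair.

    ASSUMPTION: the score is the mean of the pairwise scores against the
    members of the training cluster; the paper's S2C model may differ.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for (test_id, train_id), score in s2s.items():
        grouped[test_id][train_cluster[train_id]].append(score)
    return {t: {c: mean(v) for c, v in per.items()} for t, per in grouped.items()}


def c2c_scores(s2c, test_cluster):
    """Score each (test cluster, training cluster) pair.

    ASSUMPTION: the score is the mean of the S2C scores of the test
    cluster's members; the paper's C2C model may differ.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for test_id, per in s2c.items():
        for cluster, score in per.items():
            grouped[test_cluster[test_id]][cluster].append(score)
    return {tc: {c: mean(v) for c, v in per.items()} for tc, per in grouped.items()}


def argmax_cluster(scores):
    """Pick the highest-scoring training cluster for every key."""
    return {key: max(per, key=per.get) for key, per in scores.items()}


# Hypothetical toy data: three test proteins that all belong to fold "A".
train_cluster = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
test_cluster = {"t1": "T", "t2": "T", "t3": "T"}
s2s = {
    ("t1", "a1"): 0.9, ("t1", "a2"): 0.8, ("t1", "b1"): 0.1, ("t1", "b2"): 0.1,
    ("t2", "a1"): 0.5, ("t2", "a2"): 0.5, ("t2", "b1"): 0.6, ("t2", "b2"): 0.1,
    ("t3", "a1"): 0.3, ("t3", "a2"): 0.3, ("t3", "b1"): 0.7, ("t3", "b2"): 0.5,
}

s2s_pred = s2s_predict(s2s, train_cluster)
s2c = s2c_scores(s2s, train_cluster)
s2c_pred = argmax_cluster(s2c)
c2c_by_cluster = argmax_cluster(c2c_scores(s2c, test_cluster))
# Every member of a test cluster inherits the cluster-level prediction.
c2c_pred = {t: c2c_by_cluster[tc] for t, tc in test_cluster.items()}

print("S2S:", s2s_pred)
print("S2C:", s2c_pred)
print("C2C:", c2c_pred)
```

With these invented scores, the S2S step assigns only t1 to fold A, the S2C step recovers t2 because its scores against the fold A cluster are consistently moderate, and the C2C step recovers t3 as well because it shares a test cluster with t1 and t2. This mirrors the pattern described for fold 4_50, under the averaging assumption stated above.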

Figure 2. Visualization of the predictive results of FoldRec-C2C. Subfigure (a) shows the overall predictive results for all the test proteins in the LINDAHL dataset. Subfigures (b-d) visualize the predictive results for the test proteins in fold 2_1 (SCOP ID) detected by S2S (b), S2C (c), and C2C (d), respectively. Subfigures (e-g) visualize the predictive results for the test proteins in fold 4_50 (SCOP ID) detected by S2S (e), S2C (f), and C2C (g), respectively. These results were visualized with the Gephi software tool [1].

Reference

1. Bastian M, Heymann S, Jacomy M. Gephi: an open source software for exploring and manipulating networks. In: Third international AAAI conference on weblogs and social media. 2009.