With the avalanche of biological sequences generated in the post-genomic age, one of the most challenging problems in computational biology today is how to represent a biological sequence (DNA, RNA, or protein) as a discrete model or vector that still reflects its sequence-pattern information and captures its key features. This matters because almost all existing machine-learning algorithms can only handle vectors, not sequence samples. With the sequential model, i.e., the model in which every sample is represented by its original sequence, it is hardly possible to train a machine-learning model that covers all the possible cases concerned, as elaborated in (1).
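To make the notion of a discrete model concrete, the following minimal sketch maps DNA sequences of different lengths onto vectors of one fixed length by counting k-tuples (k-mers). It is only an illustration of the general idea and not the code of any server cited here; the function name `kmer_composition` and its parameters are hypothetical.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper for illustration only; the output always has
    len(alphabet) ** k components, whatever the length of `seq`.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Slide a window of width k along the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing symbols such as N
            counts[window] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


# Two sequences of different length yield vectors of the same length (16).
short_vec = kmer_composition("ACGTAC")
long_vec = kmer_composition("ACGTACGGTTAACCGGA")
print(len(short_vec), len(long_vec))
```

Because every sample ends up with the same number of components, such vectors can be passed directly to a standard machine-learning program.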
However, a vector defined in a discrete model may completely lose the sequence-order information. To cope with such a dilemma, the idea of pseudo amino acid composition or PseAAC (2,3) was proposed. Ever since it was introduced in 2001, the concept of PseAAC has been widely used in almost all the areas of computational proteomics (see a long list of references cited in a recent paper (4)).
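The idea can be sketched as follows: the 20 amino acid frequencies are extended by a few sequence-order correlation factors, so that two sequences with identical composition but different residue order no longer map to the same vector. This is a simplified sketch under stated assumptions and not the implementation of any cited tool. The original formulation (2) combines several physicochemical properties, whereas the code below uses a single property (the Kyte-Doolittle hydropathy scale) for brevity; the names `pse_aac`, `lam`, and `w` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: one property only (Kyte-Doolittle hydropathy) for illustration.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standardize(values):
    """Rescale property values to zero mean and unit standard deviation."""
    mean = sum(values.values()) / len(values)
    sd = (sum((v - mean) ** 2 for v in values.values()) / len(values)) ** 0.5
    return {aa: (v - mean) / sd for aa, v in values.items()}


def pse_aac(seq, lam=2, w=0.05, prop=HYDROPATHY):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    seq = seq.upper()
    if any(aa not in prop for aa in seq):
        raise ValueError("sequence contains a non-standard residue")
    n = len(seq)
    if lam >= n:
        raise ValueError("lam must be smaller than the sequence length")
    h = _standardize(prop)
    # Conventional amino acid composition: order-independent.
    freqs = [seq.count(aa) / n for aa in AMINO_ACIDS]
    # Rank-k correlation factors: mean squared property difference between
    # residues k positions apart; these carry sequence-order information.
    thetas = [
        sum((h[seq[i]] - h[seq[i + k]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Same composition, different order: the vectors differ.
a = pse_aac("AAKK")
b = pse_aac("AKAK")
print(len(a), a != b)
```

Here `lam` sets how many correlation ranks are appended and `w` weights them against the composition terms; both are tunable choices in this sketch.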
Encouraged by the success of PseAAC in dealing with protein/peptide sequences, corresponding approaches have recently been proposed for DNA sequences (5-7) and RNA sequences (8).
Because pseudo component approaches of this kind have been widely and increasingly used in many areas of computational biology, a number of web servers and stand-alone programs have been developed for generating different pseudo components for DNA sequences (9), RNA sequences (8), and protein sequences (4,10-12).
However, these web servers and stand-alone programs have some major disadvantages: 1) lack of flexibility, i.e., each of them can handle only one type of biological sequence (DNA, RNA, or protein); 2) being out of date, i.e., they miss some pseudo component modes proposed very recently; 3) limited coverage, i.e., they cannot cover all the possible physicochemical properties, nor those defined by users.
Here, we propose a powerful web server, called Pse-in-One, with which users can generate all the possible pseudo components for DNA, RNA, and protein sequences. It covers a total of 28 different modes: 14 for DNA sequences (5-7,9,13-16), 6 for RNA sequences (8,17), and 8 for protein sequences (2,3,16,18,19,20). All these modes can be regarded as different pseudo components, and many prediction methods based on them have been developed in various areas of computational biology. With Pse-in-One, users only need to input DNA, RNA, or protein sequences together with their selected or self-defined features, and they immediately obtain the corresponding feature vectors, which are suitable for any existing machine-learning program. In particular, the feature vectors thus obtained can also be visualized intuitively via a graphical representation called a “heat map”.
To the best of our knowledge, Pse-in-One is so far the first web server that can generate all the possible pseudo components for DNA, RNA, and protein sequences, and even those defined by users themselves, and hence it is extremely flexible.
The Pse-in-One web server has been widely and increasingly used by scientists around the world to deal with a variety of problems in computational biology, as reflected by the citation numbers of some pioneering papers given in the References section.
REFERENCES
1.Chou, K.C. (2011) Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). J. Theor. Biol., 273, 236-247. (PMID: 21168420, cited by 870)
2.Chou, K.C. (2001) Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS: Structure, Function, and Genetics, 43, 246-255. (PMID: 11288174, cited by 1533)
3.Chou, K.C. (2005) Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21, 10-19. (PMID: 15308540, cited by 694)
4.Du, P., Gu, S. and Jiao, Y. (2014) PseAAC-General: Fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets. International Journal of Molecular Sciences, 15, 3495-3506. (PMID: 24577312, cited by 161)
5.Chen, W., Feng, P.M., Lin, H. and Chou, K.C. (2013) iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res., 41, e68. (PMID: 23303794, cited by 418)
6.Guo, S.H., Deng, E.Z., Xu, L.Q., Ding, H., Lin, H., Chen, W. and Chou, K.C. (2014) iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics, 30, 1522-1529. (PMID: 24504871, cited by 279)
7.Lin, H., Deng, E.Z., Ding, H., Chen, W. and Chou, K.C. (2014) iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res., 42, 12961-12972. (PMID: 25361964, cited by 300)
8.Chen, W., Zhang, X., Brooker, J., Lin, H., Zhang, L. and Chou, K.C. (2014) PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics, doi:10.1093/bioinformatics/btu602.
9.Chen, W., Lei, T.Y., Jin, D.C., Lin, H. and Chou, K.C. (2014) PseKNC: a flexible web-server for generating pseudo K-tuple nucleotide composition. Anal. Biochem., 456, 53-60. (PMID: 24732113, cited by 216)
10.Shen, H.B. and Chou, K.C. (2008) PseAAC: a flexible web-server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem., 373, 386-388. (PMID: 17976365, cited by 296)
11.Du, P., Wang, X., Xu, C. and Gao, Y. (2012) PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. Anal. Biochem., 425, 117-119. (PMID: 22459120, cited by 186)
12.Cao, D.S., Xu, Q.S. and Liang, Y.Z. (2013) propy: a tool to generate various modes of Chou's PseAAC. Bioinformatics, 29, 960-962. (PMID: 23426256, cited by 230)
13.Noble, W.S., Kuehn, S., Thurman, R., Yu, M. and Stamatoyannopoulos, J. (2005) Predicting the in vivo signature of human gene regulatory sequences. Bioinformatics, 21 Suppl 1, i338-343. (PMID: 15961476, cited by 64)
14.Friedel, M., Nikolajewa, S., Suhnel, J. and Wilhelm, T. (2009) DiProDB: a database for dinucleotide properties. Nucleic Acids Res., 37, D37-40. (PMID: 18805906, cited by 62)
15.Dong, Q., Zhou, S. and Guan, J. (2009) A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics, 25, 2655-2662. (PMID: 19706744, cited by 93)
16.Wei, L., Liao, M., Gao, Y., Ji, R., He, Z. and Zou, Q. (2013) Improved and Promising Identification of Human MicroRNAs by Incorporating a High-quality Negative Set. IEEE/ACM Trans Comput Biol Bioinform. (PMID: 24216114, cited by 116)
17.Guo, Y., Yu, L., Wen, Z. and Li, M. (2008) Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res., 36, 3025-3030. (PMID: 18390576, cited by 349)
18.Liu, B., Wang, X., Lin, L., Dong, Q. and Wang, X. (2008) A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis. BMC Bioinformatics, 9, 510. (PMID: 24216114, cited by 95)
19.Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T. and Kanehisa, M. (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res., 36, D202-205. (PMID: 17998252, cited by 628)