1. Datasets
We have developed a novel toxicity dataset for the prediction of toxic peptides (TXP). For the first-level binary classification task, we established a benchmark dataset comprising positive and negative sequences. The positive sequences, derived from our prior research[1], are used for the prediction of TXP and other therapeutic functional peptides. The negative sequences are extracted from the Swiss-Prot database[2]. The processed dataset contains 2345 TXP sequences and an equal number of non-TXP sequences. For the second-level multi-functional benchmark dataset, the positive TXP sequences are further categorized into nine functional classes: "TXP with mono-label", "Anti-Bacterial Peptide", "Anti-Cancer Peptide", "Antimicrobial Peptide", "Anti-Fungal peptide", "Cell-Penetrating peptide", "Anti-Parasitic peptide", "Drug Delivery Vehicle peptide", and "Anti-Viral peptide". The complete two-level benchmark dataset is available for download at the following link:
Benchmark dataset: Benchmark dataset.zip
2. Source Code
ToxPre-2L is a novel two-level predictor of TXP function. Its first-level subpredictor, AdaptMVTL, predicts whether a query peptide is a TXP. If it is, the second-level subpredictor, MLMVTLowRankBin, predicts the peptide's functional types. Detailed guidelines and the source code are available at the following link:
Source Code of ToxPre-2L: Code.zip
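The two-level control flow described above can be sketched as follows. This is only an illustration of the cascade, not the ToxPre-2L implementation: the function `predict_two_level` and the two toy predictors are hypothetical stand-ins with arbitrary rules, and the real AdaptMVTL and MLMVTLowRankBin models are provided in Code.zip.

```python
from typing import Callable, Dict, List


def predict_two_level(
    peptide: str,
    first_level: Callable[[str], bool],
    second_level: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Run a two-level cascade: binary TXP check, then functional typing.

    `first_level` and `second_level` are placeholders for trained models.
    """
    if not first_level(peptide):
        # Non-TXP peptides never reach the second level.
        return {"is_txp": False, "functions": []}
    return {"is_txp": True, "functions": list(second_level(peptide))}


# Hypothetical toy predictors with arbitrary rules, for illustration only.
def toy_first_level(peptide: str) -> bool:
    # Assumption: call a peptide "TXP" if more than half its residues are K.
    return peptide.count("K") > len(peptide) / 2


def toy_second_level(peptide: str) -> List[str]:
    # Assumption: pick a label by length; real labels come from a trained model.
    if len(peptide) < 10:
        return ["Anti-Bacterial Peptide"]
    return ["Anti-Cancer Peptide"]


if __name__ == "__main__":
    for query in ["KKKA", "AAAA"]:
        print(query, predict_two_level(query, toy_first_level, toy_second_level))
```

Because the second level can return several labels for one peptide, its output is kept as a list rather than a single class.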
3. Other Notes
We used four feature extraction methods to encode the sequences: k-mer, distance-based residue (DR), distance pair (DP), and pseudo amino acid composition (PseAAC). The k-mer[3, 4] feature captures local sequence information by calculating the composition of subsequences of a fixed length k. The DR[5] feature calculates the composition of amino acid pairs between the LG spaces in the peptide. The PseAAC[6] and DP[5, 7, 8] features incorporate physicochemical properties and sequence-order information to represent the peptide sequence. All four features were extracted with BioSeq-Analysis2.0[9] (http://bliulab.net/BioSeq-Analysis2.0) using default parameters.
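To make the k-mer encoding concrete, here is a minimal sketch of a k-mer composition vector over the 20 standard amino acids. It is not the BioSeq-Analysis2.0 code, and its defaults may differ from that tool's: the function name `kmer_composition`, the choice k=2, the frequency normalization, and the skipping of windows that contain non-standard residues are all assumptions made for illustration.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence: str, k: int = 2) -> List[float]:
    """Return the normalized frequency of every possible k-mer.

    The vector has 20**k entries, ordered lexicographically by AMINO_ACIDS.
    Windows containing a non-standard residue are skipped (an assumption).
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    seq = sequence.upper()
    # Slide a window of width k along the sequence.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1
            total += 1
    if total == 0:
        # Sequence shorter than k, or no valid windows.
        return [0.0] * len(kmers)
    return [c / total for c in counts]


if __name__ == "__main__":
    vector = kmer_composition("ACAC", k=2)
    print(len(vector))          # number of features: 20**2
    print(vector[1], vector[20])  # frequencies of "AC" and "CA"
```

For k=2 each peptide maps to a 400-dimensional vector regardless of its length, which is what lets peptides of different lengths share one feature space.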
4. References
[1] Lv H, Yan K, Liu B. TPpred-LE: therapeutic peptide function prediction based on label embedding. BMC Biology. 2023; 21(1): 238.
[2] Bairoch A, Apweiler R. The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucleic Acids Research. 1996; 24(1): 21-5. doi: 10.1093/nar/24.1.21. WOS:A1996TR97500005.
[3] Khatun MS, Hasan MM, Shoombuatong W, Kurata H. ProIn-Fuse: improved and robust prediction of proinflammatory peptides by fusing of multiple feature representations. Journal of Computer-Aided Molecular Design. 2020; 34(12): 1229-36. doi: 10.1007/s10822-020-00343-9. PubMed PMID: 32964284.
[4] Zulfiqar H, Guo Z, Ahmad RM, Ahmed Z, Cai P, Chen X, et al. Deep-STP: a deep learning-based approach to predict snake toxin proteins by using word embeddings. Frontiers in Medicine. 2024; 10. doi: 10.3389/fmed.2023.1291352.
[5] Liu B, Zhang D, Xu R, Xu J, Wang X, Chen Q, et al. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics. 2014; 30(4): 472-9.
[6] Shen H-B, Chou K-C. Ensemble classifier for protein fold pattern recognition. Bioinformatics. 2006; 22(14): 1717-22.
[7] Zou X, Ren L, Cai P, Zhang Y, Ding H, Deng K, et al. Accurately identifying hemagglutinin using sequence information and machine learning methods. Front Med (Lausanne). 2023; 10: 1281880. doi: 10.3389/fmed.2023.1281880. PubMed PMID: 38020152; PubMed Central PMCID: PMC10644030.
[8] Zhu W, Yuan SS, Li J, Huang CB, Lin H, Liao B. A First Computational Frame for Recognizing Heparin-Binding Protein. Diagnostics (Basel). 2023; 13(14). doi: 10.3390/diagnostics13142465. PubMed PMID: 37510209; PubMed Central PMCID: PMC10377868.
[9] Liu B, Gao X, Zhang H. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Research. 2019; 47(20): e127. Epub 2019/09/11. doi: 10.1093/nar/gkz740. PubMed PMID: 31504851.