FusionEncoder-Document

Document

1. Datasets

The benchmark dataset used in this study was constructed by Liu et al.^[1]. It contains 4229 disordered and 616 ordered proteins collected from the Protein Data Bank (PDB). Specifically, the disordered proteins were collected from (Zhang, et al., 2012) and share less than 25% sequence similarity. The ordered proteins satisfy the following criteria: (i) each protein's structure file contains only one chain, ensuring that no ordered regions are formed from intrinsically disordered regions (IDRs) through binding with other proteins; (ii) the resolution of each protein is less than or equal to 2 Å; (iii) the length of each protein is greater than or equal to 30 amino acids; (iv) the similarity between sequences is less than 25%; (v) each residue has atomic coordinates recorded in the PDB; (vi) nonstandard amino acids are removed. The datasets can be downloaded from the following links:

Training Dataset: TrainingDataset-DM4845

Independent Dataset: DISORDER723.fasta MXD494.fasta disorder_nox.fasta disorder_pdb.fasta
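The dataset files listed above are distributed as FASTA. A minimal sketch of reading such a file in Python follows. It assumes plain header-plus-sequence records; the exact layout of these files is not described here, so a file that carries an extra per-record annotation line (for example, disorder labels) would need a different parser. The function name `read_fasta` and the demo records are hypothetical.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence}.

    Assumes each record is a '>' header followed by one or more sequence lines.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            # Use the first whitespace-delimited token of the header as the ID.
            current_id = header.split()[0] if header else f"record_{len(records)}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines of one record are concatenated.
            records[current_id] += line
    return records


# Made-up demo records, not taken from the datasets above.
example = """>protA demo record
MKTAYIAK
QRQISFVK
>protB
GSHMSE
""".splitlines()

sequences = read_fasta(example)
```

For a downloaded file, an open file handle can be passed in place of `example`, since the function accepts any iterable of lines.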

We also list the training sets filtered using different thresholds: Training_data_filtered 2.zip
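Some of the ordered-protein selection criteria in this section can be expressed as a simple filter. The sketch below is illustrative only and is not the pipeline used to build the dataset. It assumes the resolution threshold is in ångströms, that the length cutoff of 30 residues is applied after nonstandard amino acids are removed, and that structure metadata is already available in a hypothetical `Candidate` record. The pairwise similarity and atomic-coordinate criteria are not covered, since they require external tools and structure files.

```python
from dataclasses import dataclass
from typing import List

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


@dataclass
class Candidate:
    pdb_id: str        # hypothetical identifier field
    n_chains: int      # number of chains in the structure file
    resolution: float  # assumed to be in angstroms
    sequence: str      # one-letter amino-acid sequence


def strip_nonstandard(sequence: str) -> str:
    """Drop every character that is not one of the 20 standard residues."""
    return "".join(aa for aa in sequence.upper() if aa in STANDARD_AA)


def passes_criteria(c: Candidate, max_resolution: float = 2.0,
                    min_length: int = 30) -> bool:
    """Single chain, resolution <= 2, and cleaned length >= 30."""
    cleaned = strip_nonstandard(c.sequence)
    return (c.n_chains == 1
            and c.resolution <= max_resolution
            and len(cleaned) >= min_length)


def select_ordered(candidates: List[Candidate]) -> List[Candidate]:
    """Keep passing candidates, storing their cleaned sequences."""
    return [
        Candidate(c.pdb_id, c.n_chains, c.resolution, strip_nonstandard(c.sequence))
        for c in candidates
        if passes_criteria(c)
    ]


# Made-up demo records.
candidates = [
    Candidate("demo1", 1, 1.8, "ACDEFGHIKLMNPQRSTVWY" * 2),  # passes
    Candidate("demo2", 2, 1.5, "A" * 50),                    # two chains
    Candidate("demo3", 1, 2.5, "A" * 50),                    # resolution too coarse
    Candidate("demo4", 1, 1.9, "ACDX" * 5),                  # too short
]
kept = [c.pdb_id for c in select_ordered(candidates)]
```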

2. Feature Extraction Tools

In FusionEncoder, we employed two types of residue feature extraction methods:

(1) traditional biological feature extraction, including PSSM (obtained by searching the NR90^[2] database with three iterations of PSI-BLAST^[3]), aaIndex, and Energy

(2) semantic features extracted from pre-trained protein language models (PPLMs), including ESM2^[4], Prot-T5^[5], DR-BERT^[6], and Onto-Protein^[7]
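The PSSM step in item (1) can be sketched as a command-line call assembled in Python. This assumes the BLAST+ `psiblast` executable and a locally formatted NR90 database. The exact invocation used by FusionEncoder is not given here, so the file names and database path below are hypothetical, and the e-value thresholds are left at the tool's defaults.

```python
import shlex
from typing import List


def build_psiblast_command(query_fasta: str, database: str, pssm_out: str,
                           iterations: int = 3) -> List[str]:
    """Assemble a BLAST+ psiblast call that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,               # single-sequence FASTA file
        "-db", database,                     # path prefix of the formatted database
        "-num_iterations", str(iterations),  # three iterations, as stated above
        "-out_ascii_pssm", pssm_out,         # where the PSSM is written
    ]


# Hypothetical file names and database path.
cmd = build_psiblast_command("query.fasta", "nrdb90", "query.pssm")
print(shlex.join(cmd))
```

With BLAST+ installed, the list can be passed to `subprocess.run(cmd, check=True)` to run the search.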

3. Other Notes

Additionally, FusionEncoder and its associated tools rely on several databases, which have been compiled for researchers' convenience. These databases can be downloaded directly via the links provided below:

blosum62: blosum62.txt

nrdb90: nrdb90.tar.gz
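How FusionEncoder uses `blosum62.txt` is not described here. As an illustration only, the sketch below reads a substitution matrix and looks up per-residue score rows. It assumes the common NCBI-style text layout (optional `#` comment lines, a header row of residue letters, then one labelled row per residue). The function names and the 3x3 excerpt are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_substitution_matrix(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse an NCBI-style matrix into {row_residue: {column_residue: score}}."""
    columns: List[str] = []
    matrix: Dict[str, Dict[str, int]] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split()
        if not columns:
            columns = fields  # first data line is the column header
            continue
        row_label, scores = fields[0], fields[1:]
        matrix[row_label] = {col: int(s) for col, s in zip(columns, scores)}
    return matrix


def encode_sequence(sequence: str,
                    matrix: Dict[str, Dict[str, int]]) -> List[List[int]]:
    """Represent each residue by its row of substitution scores."""
    return [list(matrix[aa].values()) for aa in sequence]


# Toy 3x3 excerpt in the assumed layout (BLOSUM62 scores for A, R, N).
sample = """# toy excerpt
   A  R  N
A  4 -1 -2
R -1  5  0
N -2  0  6
""".splitlines()

blosum = parse_substitution_matrix(sample)
encoded = encode_sequence("AN", blosum)
```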

4. References

[1] Liu, Y., Wang, X. and Liu, B. IDP–CRF: intrinsically disordered protein/region identification based on conditional random fields. International Journal of Molecular Sciences 2018;19(9):2483.

[2] Holm, L. and Sander, C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998;14(5):423-429.

[3] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997;25(17):3389-3402.

[4] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... and Rives, A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379(6637):1123-1130.

[5] Elnaggar, A. et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021;44:7112-7127.

[6] Nambiar, A., Forsyth, J. M., Liu, S. and Maslov, S. DR-BERT: a protein language model to annotate disordered regions. Structure 2024;32(8):1260-1268.

[7] Zhang, N., Bi, Z., Liang, X., Cheng, S., Hong, H., Deng, S., ... and Chen, H. OntoProtein: Protein pretraining with gene ontology embedding. arXiv preprint arXiv:2201.11147, 2022.

Sicen Liu, Shutao Chen, Tao Bai, Bin Liu*.
FusionEncoder: identification of intrinsically disordered regions based on multi-feature fusion. (Submitted)
