1. Datasets
In this study, we train our method on the benchmark dataset established by Zhang et al. [1]. Disordered proteins were first clustered at a 25% sequence identity cutoff using BLASTClust [2,3]. Within each cluster, a representative sequence was selected according to the following priorities, applied in order: (a) the greatest number of disordered residues; (b) the fewest disordered regions, which favors longer continuous disordered segments; and (c) the longest overall protein sequence. This procedure yielded 4,178 protein chains.
These chains were then combined with 91 fully disordered protein chains from DisProt v5.0 (Sickmeier, et al., 2007) and clustered a second time with BLASTClust at 25% sequence identity. For each resulting cluster, the representative was chosen as: (a) a fully disordered protein, if available; or (b) otherwise, the longest protein sequence.
The final dataset, referred to as DM4229, contains 4,229 non-redundant protein chains: 4,157 from the PDB and 72 from DisProt. In total, the dataset includes 1,036,634 residues, of which 103,252 (about 10%) are annotated as disordered.
The datasets can be downloaded from the following links:
Training Dataset: TrainingDataset
Independent Dataset: MXD494 SL329 DISORDER723.fasta CASP Disprot504
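The representative-selection priorities used in the first clustering round can be sketched as a sort key. This is a minimal illustration, not the original pipeline: the `Chain` record, its field names, the helper functions, and the per-residue annotation string (`1` for disordered, `0` for ordered) are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Chain:
    """Hypothetical record for one protein chain in a cluster."""
    name: str
    labels: str  # per-residue annotation: '1' = disordered, '0' = ordered


def disordered_residues(chain: Chain) -> int:
    return chain.labels.count("1")


def disordered_regions(chain: Chain) -> int:
    # A region is a maximal run of consecutive '1' labels.
    return sum(1 for label, _ in groupby(chain.labels) if label == "1")


def pick_representative(cluster: list[Chain]) -> Chain:
    # Priorities, in order: (a) most disordered residues,
    # (b) fewest disordered regions, (c) longest sequence.
    return max(
        cluster,
        key=lambda c: (disordered_residues(c), -disordered_regions(c), len(c.labels)),
    )


cluster = [
    Chain("A", "0011001100"),    # 4 disordered residues in 2 regions
    Chain("B", "0001111000"),    # 4 disordered residues in 1 region
    Chain("C", "000111100000"),  # 4 disordered residues in 1 region, longer chain
    Chain("D", "0000000111"),    # 3 disordered residues
]
representative = pick_representative(cluster)
print(representative.name)
```

In this toy cluster, chains A, B, and C tie on rule (a), rule (b) eliminates A, and rule (c) selects C over B.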
2. Feature Extraction Tools
In MoSE, we extract three residue-level features, which fall into two categories:
a. Traditional biological features: PSSM, obtained by searching the NR90 [4] database with three iterations of PSI-BLAST [2].
b. Semantic features extracted from pre-trained protein language models (PPLMs): ESM2 [5] and DR-BERT [6].
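The PSSM step can be sketched as follows. This is only an illustration under stated assumptions: it uses the `psiblast` program from NCBI BLAST+ (the original work may have used a different PSI-BLAST build or additional options), the file and database paths are placeholders, and `build_psiblast_command` and `parse_ascii_pssm` are hypothetical helper names. The parser assumes the usual ASCII PSSM layout, in which each residue row starts with a position index and a residue letter followed by 20 log-odds scores.

```python
import subprocess  # only needed if the command is actually executed


def build_psiblast_command(fasta_path, db_path, pssm_path, iterations=3):
    """Assemble a BLAST+ psiblast call that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", fasta_path,
        "-db", db_path,
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", pssm_path,
    ]


def parse_ascii_pssm(text):
    """Return (residue, [20 scores]) per row; header and footer lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Residue rows start with a numeric position and hold at least 20 scores.
        if len(parts) >= 22 and parts[0].isdigit():
            rows.append((parts[1], [int(v) for v in parts[2:22]]))
    return rows


# Placeholder paths; "nrdb90" stands for a locally formatted NR90 database.
cmd = build_psiblast_command("query.fasta", "nrdb90", "query.pssm")
# subprocess.run(cmd, check=True)  # uncomment once BLAST+ and the database are installed

# Tiny synthetic excerpt in the assumed layout (values are illustrative only).
sample = """
            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    1 M    -1  -2  -3  -4  -2  -1  -3  -3  -2   1   2  -2   7   0  -3  -2  -1  -2  -1   1
    2 K    -1   2   0  -1  -3   1   1  -2  -1  -3  -3   5  -2  -4  -1   0  -1  -3  -2  -3
"""
pssm = parse_ascii_pssm(sample)
```

Each parsed row yields a 20-dimensional score vector for one residue, which is the form in which a PSSM is typically used as a per-residue feature.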
3. Other Notes
Additionally, MoSE and its associated tools rely on several databases, which have been compiled for researchers' convenience. These databases can be downloaded directly via the links provided below:
blosum62: blosum62.txt
nrdb90: nrdb90.tar.gz
4. References
[1] Zhang, T., Faraggi, E., Xue, B., Dunker, A. K., Uversky, V. N., & Zhou, Y. SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. Journal of Biomolecular Structure and Dynamics 2012;29(4), 799-813.
[2] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997;25(17), 3389-3402.
[3] Zou, Q., Lin, G., Jiang, X., Liu, X., & Zeng, X. Sequence clustering in bioinformatics: an empirical study. Briefings in Bioinformatics 2020;21(1), 1-10.
[4] Holm, L., & Sander, C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998;14(5), 423-429.
[5] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... & Rives, A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379(6637), 1123-1130.
[6] Nambiar, A., Forsyth, J. M., Liu, S., & Maslov, S. DR-BERT: a protein language model to annotate disordered regions. Structure 2024;32(8), 1260-1268.