HyperIDR-Document

Document

1. Datasets

We trained our models on the benchmark dataset DM4845 ^[1]^[2]. DM4845 comprises 4,845 proteins: 4,229 protein sequences with pairwise sequence identity ≤25%, which reduces homology bias and encourages generalizable representations, and 616 ordered proteins. In total, it contains 1,177,148 residues, of which 103,252 are disordered and 1,073,896 are ordered. For robust performance estimation, we used five-fold cross-validation at the protein level: sequences were partitioned into five non-overlapping folds, and in each round four folds were used for training and the remaining fold for validation. All preprocessing and evaluation kept the folds strictly separate (no sequence appears in both the training and validation sets of a round), and non-standard or ambiguous residues were masked and excluded from loss computation via padding masks.
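The protein-level split and the residue masking described above can be sketched as follows. This is a minimal illustration, not the HyperIDR training code: the function names, the random shuffling with a fixed seed, and the use of the 20 standard amino-acid letters as the "standard" set are all assumptions made for the example.

```python
import random

# Assumption: "standard" means the 20 canonical amino-acid letters.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def protein_level_folds(protein_ids, k=5, seed=0):
    """Partition protein IDs into k non-overlapping folds.

    Whole proteins are assigned to folds, so no sequence can appear in
    both the training and validation data of the same round.
    """
    ids = sorted(set(protein_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def cv_rounds(folds):
    """Yield (train_ids, val_ids): k-1 folds for training, 1 for validation."""
    for i, val in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, val


def residue_mask(sequence):
    """Return 1 for standard residues and 0 for non-standard/ambiguous ones."""
    return [1 if aa in STANDARD_AA else 0 for aa in sequence.upper()]


def masked_mean_loss(per_residue_losses, mask):
    """Average the per-residue losses over unmasked positions only."""
    kept = [loss for loss, m in zip(per_residue_losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

In a real training loop the same mask would typically be applied to a tensor of per-residue losses; the list-based version here only shows the bookkeeping.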

To comprehensively assess generalization, we evaluated our models on five widely used, independent IDR test sets: MXD494 ^[3], SL329 ^[4], DISORDER723 ^[4], CASP ^[5], and CAID3-Disorder-PDB ^[6]. The MXD494 set contains 494 proteins with 152,414 ordered residues and 44,087 disordered residues. The SL329 set comprises 329 proteins and 39,544 disordered residues. The DISORDER723 set contains 723 proteins with 201,703 ordered residues and 13,526 disordered residues. The CASP set comprises 211 proteins with 46,344 ordered residues and 3,929 disordered residues. The CAID3-Disorder-PDB set (Disorder-PDB) contains 319 protein sequences and was reported as part of the third Critical Assessment of protein Intrinsic Disorder prediction (CAID3) ^[6].
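The residue counts reported above imply a strong class imbalance, which a few lines of arithmetic make explicit. The sketch below only reuses the numbers stated in this section; SL329 and Disorder-PDB are left out because their ordered-residue counts are not given here. The names `COUNTS`, `FRACTIONS`, and `disorder_fraction` are illustrative.

```python
# Residue counts copied from the dataset descriptions in this section.
# SL329 and Disorder-PDB are omitted: their ordered counts are not reported.
COUNTS = {
    "DM4845": {"ordered": 1_073_896, "disordered": 103_252},
    "MXD494": {"ordered": 152_414, "disordered": 44_087},
    "DISORDER723": {"ordered": 201_703, "disordered": 13_526},
    "CASP": {"ordered": 46_344, "disordered": 3_929},
}


def disorder_fraction(ordered, disordered):
    """Fraction of all residues that are labeled disordered."""
    return disordered / (ordered + disordered)


FRACTIONS = {name: disorder_fraction(**c) for name, c in COUNTS.items()}

for name, frac in FRACTIONS.items():
    print(f"{name}: {frac:.1%} of residues are disordered")
```

Disordered residues are the minority class in every set listed, which is worth keeping in mind when comparing per-residue metrics across benchmarks.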

Training Dataset: DM4845_training.fasta

The redundancy-reduced training datasets: Training dataset for MXD494, Training dataset for SL329, Training dataset for DISORDER723, Training dataset for CASP, Training dataset for Disorder-PDB

Independent Test Datasets: MXD494.fasta, SL329.fasta, DISORDER723.fasta, CASP.fasta, Disorder-PDB.fasta
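The downloaded files can be loaded with a small FASTA reader such as the sketch below. It assumes the files follow the plain FASTA layout (a `>` header line followed by one or more sequence lines). If the files also carry per-residue label lines, those lines would need separate handling, which this example does not attempt. The function name `read_fasta` is illustrative.

```python
def read_fasta(lines):
    """Parse FASTA text into a dict mapping record name to sequence.

    `lines` is any iterable of strings, e.g. an open file handle.
    The record name is the first whitespace-delimited token of the header.
    """
    records = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            # Fall back to a placeholder name if the header is empty.
            header = tokens[0] if tokens else f"unnamed_{len(records)}"
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Example usage (requires the file to be present locally):
# with open("DM4845_training.fasta") as handle:
#     sequences = read_fasta(handle)
```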

2. Feature Extraction Tools

In HyperIDR, we selected one feature extraction strategy:

Semantic features extracted from pre-trained protein language models (PPLMs): ESM2 ^[7].
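Per-residue ESM2 embeddings can be obtained with the `fair-esm` package roughly as sketched below. This section does not state which ESM2 checkpoint or layer HyperIDR uses, so the 650M-parameter model (`esm2_t33_650M_UR50D`) and its final layer (33) are assumptions chosen for illustration, as is the function name `esm2_residue_embeddings`. The imports sit inside the function so the module can be loaded without `torch` installed.

```python
def esm2_residue_embeddings(sequences, layer=33):
    """Return {name: tensor of shape (len(seq), embed_dim)} from ESM2.

    `sequences` maps record names to amino-acid strings.
    Assumption: the esm2_t33_650M_UR50D checkpoint, not necessarily the
    one used by HyperIDR.
    """
    import torch
    import esm  # provided by the `fair-esm` package

    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    batch_converter = alphabet.get_batch_converter()
    model.eval()  # disable dropout for deterministic features

    data = list(sequences.items())
    _, _, tokens = batch_converter(data)

    with torch.no_grad():
        out = model(tokens, repr_layers=[layer], return_contacts=False)
    reps = out["representations"][layer]

    # Token 0 is the beginning-of-sequence token, so residues occupy
    # positions 1..len(seq); padding and end tokens are dropped.
    return {
        name: reps[i, 1 : len(seq) + 1]
        for i, (name, seq) in enumerate(data)
    }


# Example usage (downloads the model weights on first call):
# feats = esm2_residue_embeddings({"demo": "MKTAYIAKQR"})
# feats["demo"].shape  # one embedding vector per residue
```

Long proteins may need to be processed one at a time or in chunks to fit in memory; that handling is left out of this sketch.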

3. References

[1] Liu, Y., Wang, X. and Liu, B. RFPR-IDP: reduce the false positive rates for intrinsically disordered protein and region prediction by incorporating both fully ordered proteins and disordered proteins. Briefings in Bioinformatics 2021;22(2):2000-2011.
[2] Zhang, T., et al. SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. Journal of Biomolecular Structure and Dynamics 2012;29(4):799-813.
[3] Peng, Z.-L. and Kurgan, L. Comprehensive comparative assessment of in-silico predictors of disordered regions. Current Protein and Peptide Science 2012;13(1):6-18.
[4] Sirota, F.L., et al. Parameterization of disorder predictors for large-scale applications requiring high specificity by using an extended benchmark dataset. BMC Genomics 2010;11:1-17.
[5] Wang, S., Ma, J. and Xu, J. AUCpreD: proteome-level protein disorder prediction by AUC-maximized deep convolutional neural fields. Bioinformatics 2016;32(17):i672-i679.
[6] Mehdiabadi, M., et al. Critical Assessment of Protein Intrinsic Disorder Round 3 - Predicting Disorder in the Era of Protein Language Models. Proteins: Structure, Function, and Bioinformatics 2025.
[7] Lin, Z., et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379(6637):1123-1130.

Sicen Liu, Hanghua Su, Jianyang Chi, Shutao Chen, Bin Liu*.
HyperIDR: a multi-scale semantic hypernetwork for identification of intrinsically disordered regions. (Submitted)

