UniDBind-Document

Document

1. Datasets

In this study, the benchmark datasets were constructed following the data preparation protocol of HybridDBRpred ^[1] to ensure consistency with prior studies. The datasets include training, validation, and test splits, and cover both structure-annotated and intrinsically disordered proteins, enabling evaluation across heterogeneous structural contexts.

Residue-level DNA-binding annotations were obtained from both structure-derived and disorder-derived sources. Structure-based annotations were collected from BioLiP ^[2], which curates protein–DNA interactions from experimentally resolved complexes in the Protein Data Bank (PDB) ^[3], while disorder-based annotations were obtained from DisProt ^[4]. To improve annotation completeness, residue-level binding sites were mapped onto full-length UniProt sequences using SIFTS ^[5], allowing annotations from multiple complexes of the same protein to be combined.

To reduce redundancy, all protein sequences were clustered at 25% sequence identity using BlastClust ^[6]. Representative sequences were selected from each cluster to ensure broad coverage of the sequence space. The test dataset was constructed from clusters that do not overlap with those used for training, ensuring that the training and validation datasets share less than 25% sequence identity with the test set. The resulting test dataset comprises 435 proteins with 201,154 residues, including 2,940 DNA-binding residues (DBRs) and 19,755 residues interacting with non-DNA ligands.

The training and validation datasets were derived from the remaining clusters while maintaining similar distributions of structure-annotated and disorder-annotated proteins. The training dataset contains 591 proteins (241,284 residues), and the validation dataset contains 267 proteins (116,244 residues), with comparable proportions of DNA-binding and non-DNA-binding residues across all splits.
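The cluster-based separation described above can be illustrated with a minimal sketch. It assumes the clustering output has already been parsed into a mapping from protein ID to cluster ID; the function name `split_by_cluster`, the split fractions, and the random seed are hypothetical and not taken from the UniDBind pipeline. The selection of representative sequences is not shown.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions=(0.5, 0.2, 0.3), seed=0):
    """Assign whole clusters to train/validation/test splits.

    cluster_of: dict mapping protein ID -> cluster ID (assumed to be
                parsed from the clustering output beforehand).
    fractions:  hypothetical shares of clusters for train and validation;
                the test split receives all remaining clusters.
    """
    # Group proteins by the cluster they belong to.
    members = defaultdict(list)
    for protein, cluster in cluster_of.items():
        members[cluster].append(protein)

    # Shuffle cluster IDs reproducibly, then cut the list into three parts.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)
    n_train = int(len(cluster_ids) * fractions[0])
    n_val = int(len(cluster_ids) * fractions[1])
    parts = {
        "train": cluster_ids[:n_train],
        "validation": cluster_ids[n_train:n_train + n_val],
        "test": cluster_ids[n_train + n_val:],
    }

    # A cluster is never divided, so no two splits share a cluster.
    return {
        name: [p for c in ids for p in members[c]]
        for name, ids in parts.items()
    }
```

Because clusters are assigned as units, any two proteins from the same cluster always end up in the same split, which is what keeps similar sequences from appearing on both sides of the train/test boundary.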

2. Feature Extraction Tools

In UniDBind, we employed three types of residue feature extraction methods:

(1) Evolutionary features: position-specific scoring matrix (PSSM), obtained by searching the NR90^[7] database with three iterations of PSI-BLAST^[8].

(2) Physicochemical descriptors: seven commonly used physicochemical properties, namely charge, hydrophobicity, polarity, flexibility, residue volume, molecular weight, and isoelectric point (consistent with the amino acid property indices compiled in AAindex^[9]).

(3) Semantic features: residue embeddings extracted from a pre-trained protein language model (PPLM), ESM2^[10].
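As an illustration of how per-residue physicochemical descriptors of type (2) can be built, the sketch below encodes a sequence with two of the seven properties. The specific scales are assumptions, since this document does not state which AAindex entries were used: hydrophobicity is taken from the Kyte-Doolittle hydropathy index, and charge is the formal side-chain charge near neutral pH with histidine treated as neutral. The function name `physchem_features` is hypothetical.

```python
# Kyte-Doolittle hydropathy index (assumed scale; not confirmed by the source).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Formal side-chain charge near pH 7; histidine is treated as neutral here
# (a simplifying assumption). Residues not listed have charge 0.
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}


def physchem_features(sequence):
    """Return one [charge, hydropathy] row per residue.

    Non-standard residues (e.g. 'X') fall back to 0.0 for both values.
    """
    rows = []
    for aa in sequence.upper():
        rows.append([CHARGE.get(aa, 0.0), KD_HYDROPATHY.get(aa, 0.0)])
    return rows


features = physchem_features("MKDA")
# One row per residue, two columns per row.
print(len(features), len(features[0]))
```

The remaining five properties would add further columns in the same way, and the resulting matrix has one row per residue so that it can be aligned with PSSM rows and language-model embeddings of the same sequence.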

3. References

[1] Zhang J, Basu S, Kurgan L. HybridDBRpred: improved sequence-based prediction of DNA-binding amino acids using annotations from structured complexes and disordered proteins[J]. Nucleic Acids Research, 2024, 52(2): e10-e10.

[2] Yang J, Roy A, Zhang Y. BioLiP: a semi-manually curated database for biologically relevant ligand–protein interactions[J]. Nucleic Acids Research, 2012, 41(D1): D1096-D1103.

[3] Burley S K, Berman H M, Kleywegt G J, et al. Protein Data Bank (PDB): the single global macromolecular structure archive[J]. Protein Crystallography: Methods and Protocols, 2017: 627-641.

[4] Aspromonte M C, Nugnes M V, Quaglia F, et al. DisProt in 2024: improving function annotation of intrinsically disordered proteins[J]. Nucleic Acids Research, 2024, 52(D1): D434-D441.

[5] Velankar S, Dana J M, Jacobsen J, et al. SIFTS: structure integration with function, taxonomy and sequences resource[J]. Nucleic Acids Research, 2012, 41(D1): D483-D489.

[6] Altschul S F, Madden T L, Schäffer A A, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs[J]. Nucleic Acids Research, 1997, 25(17): 3389-3402.

[7] Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections[J]. Bioinformatics, 1998, 14(5): 423-429.

[8] Altschul S F, Madden T L, Schäffer A A, Zhang J, Zhang Z, Miller W, Lipman D J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs[J]. Nucleic Acids Research, 1997, 25(17): 3389-3402.

[9] Kawashima S, et al. AAindex: amino acid index database, progress report 2008[J]. Nucleic Acids Research, 2007, 36(suppl_1): D202-D205.

[10] Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, ..., Rives A. Evolutionary-scale prediction of atomic-level protein structure with a language model[J]. Science, 2023, 379(6637): 1123-1130.

Cite

Sicen Liu, Hanghua Su, Bin Liu*. UniDBind: Unified sequence-based prediction of DNA-binding across structure and disordered protein. (Submitted)
