1. Datasets
The datasets used in this study are derived from the RCSB PDB[1]. Pairs with over 80% similarity between the training and independent test datasets were removed using CD-HIT[2]. Dataset labels were extracted using PDB-BRE[3] and PLIP[4], based on the 3D structures of the complexes. The processed datasets can be downloaded from the following links:
Training Dataset: TrainingDataset-KEIPA.pkl
Independent Datasets: LEADS-PEP.pkl, Test167.pkl, Test251.pkl
Note: The datasets above include labels for various types of non-covalent bonds and binding residues.
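The internal layout of the `.pkl` files is not described here, so a sensible first step after downloading is to load a file and inspect its top-level structure. The following is a minimal sketch, assuming the files are standard Python pickles; the helper names `load_dataset` and `describe` are hypothetical and not part of KEIPA. Only unpickle files from a trusted source, since unpickling can execute arbitrary code. If the files contain objects from third-party libraries, those libraries must be installed first.

```python
import os
import pickle


def load_dataset(path):
    """Load a pickled dataset file and return the unpickled object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(obj, max_keys=5):
    """Return a short summary that makes no assumption about the dataset layout."""
    summary = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    if isinstance(obj, dict):
        # Show a few keys so the labelling scheme can be inspected by hand.
        summary["sample_keys"] = list(obj)[:max_keys]
    elif isinstance(obj, (list, tuple)) and obj:
        summary["first_item_type"] = type(obj[0]).__name__
    return summary


if __name__ == "__main__":
    # Assumed local path: the training file saved in the working directory.
    dataset_path = "TrainingDataset-KEIPA.pkl"
    if os.path.exists(dataset_path):
        print(describe(load_dataset(dataset_path)))
```

The same two calls apply to the independent test files; comparing the summaries is a quick way to confirm that all files share one layout before writing any processing code.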
2. Tools
KEIPA uses several tools for peptide and protein feature extraction: SCRATCH-1D (v1.2)[5], IUPred2A[6], ncbi-blast (v2.13.0)[7], ProtT5[8], and trRosetta[9]. To run KEIPA locally, these tools must be installed and configured. Installation and configuration instructions are available at the following links:
SCRATCH-1D: https://download.igb.uci.edu/
IUPred2A: https://iupred2a.elte.hu/
ncbi-blast: https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html
ProtT5: https://zenodo.org/records/4644188
trRosettaX: https://yanglab.qd.sdu.edu.cn/trRosetta/
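Before running KEIPA locally, it can help to confirm that the command-line tools are reachable. The sketch below checks whether a set of executables is on `PATH` using the standard library. The executable names in `REQUIRED_EXECUTABLES` are assumptions based on typical installations of these tools and may differ on your system; how KEIPA itself locates the tools is not specified here. ProtT5 and trRosetta are not covered, since they are typically distributed as model weights and scripts rather than a single executable.

```python
import shutil

# Assumed executable names; adjust these to match the local installation.
REQUIRED_EXECUTABLES = {
    "ncbi-blast": "psiblast",
    "IUPred2A": "iupred2a.py",
    "SCRATCH-1D": "run_SCRATCH-1D_predictors.sh",
}


def find_missing_tools(executables):
    """Return the sorted tool names whose executable is not found on PATH."""
    return sorted(
        name for name, exe in executables.items() if shutil.which(exe) is None
    )


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_EXECUTABLES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed executables were found on PATH.")
```

A tool reported as missing may still be installed; it may simply live in a directory that is not on `PATH`.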
3. Other Notes
KEIPA and its associated tools also rely on several databases, which have been compiled for researchers' convenience and can be downloaded directly via the links below:
blosum62: blosum62.txt
nrdb90: nrdb90.tar.gz
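The format of `blosum62.txt` is not documented here. As a minimal sketch, assuming it follows the common whitespace-delimited layout (optional `#` comment lines, a header row of residue letters, then one row of integer scores per residue), the matrix can be read into a dictionary as follows. The function name `parse_substitution_matrix` is hypothetical, and the three-residue fragment is only an illustration.

```python
def parse_substitution_matrix(text):
    """Parse a whitespace-delimited substitution matrix into {(row, col): score}."""
    scores = {}
    columns = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if columns is None:
            columns = fields  # header row: the column residue letters
            continue
        row, values = fields[0], fields[1:]
        if len(values) != len(columns):
            raise ValueError(
                f"row {row!r} has {len(values)} scores, expected {len(columns)}"
            )
        for col, value in zip(columns, values):
            scores[(row, col)] = int(value)
    return scores


# A three-residue fragment in the assumed layout (values from BLOSUM62).
FRAGMENT = """\
# illustrative fragment
   A  R  N
A  4 -1 -2
R -1  5  0
N -2  0  6
"""

if __name__ == "__main__":
    matrix = parse_substitution_matrix(FRAGMENT)
    print(matrix[("A", "R")])  # prints -1
```

To parse the downloaded file, read it with `open("blosum62.txt").read()` and pass the text to the same function; a `ValueError` indicates that the file does not match the assumed layout.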
4. References
[1] Burley, S. K. et al. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res. 51, D488-D508 (2023).
[2] Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152 (2012).
[3] Chen, S., Yan, K. & Liu, B. PDB-BRE: A ligand-protein interaction binding residue extractor based on Protein Data Bank. Proteins Struct. Funct. Bioinf. 92, 145-153 (2024).
[4] Adasme, M. F. et al. PLIP 2021: Expanding the scope of the protein-ligand interaction profiler to DNA and RNA. Nucleic Acids Res. 49, W530-W534 (2021).
[5] Cheng, J., Randall, A. Z., Sweredoski, M. J. & Baldi, P. SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res. 33, W72-W76 (2005).
[6] Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 46, W329-W337 (2018).
[7] Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S. & Madden, T. L. NCBI BLAST: a better web interface. Nucleic Acids Res. 36, W5-W9 (2008).
[8] Elnaggar, A. et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112-7127 (2021).
[9] Du, Z. et al. The trRosetta server for fast and accurate protein structure prediction. Nat. Protoc. 16, 5634-5651 (2021).