Document
1. Datasets
The datasets used in this study are derived from the RCSB PDB[1]. Pairs with more than 80% similarity between the training and independent test datasets were removed using CD-HIT[2]. The processed datasets can be downloaded from the following links:
Training Dataset: Train-Sequences.fasta
Independent Datasets: LEADS-PEP.fasta, Test167.fasta, Test251.fasta, Test1440.fasta
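Once downloaded, the files can be loaded with a few lines of Python. The following is a minimal sketch that assumes the files follow the standard FASTA layout (a ">" header line followed by one or more sequence lines); the function name read_fasta is hypothetical and is not part of PepLM-GNN.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary.

    Assumes standard FASTA: each record starts with a '>' header line,
    followed by one or more lines of sequence. If two records share a
    header, the later record replaces the earlier one.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[header] += line
    return records
```

A file handle can be passed directly, for example `with open("Train-Sequences.fasta") as fh: train = read_fasta(fh)`, assuming the file has been saved in the working directory.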
2. Tools
PepLM-GNN uses ProtT5[3] to extract peptide and protein features. To run PepLM-GNN locally, ProtT5 must be installed and configured first. Installation and configuration instructions are available at the following link:
ProtT5: https://zenodo.org/records/4644188
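As an illustration of what feature extraction with ProtT5 can look like, the sketch below obtains per-residue embeddings through the Hugging Face transformers library. It makes several assumptions: the checkpoint name Rostlab/prot_t5_xl_uniref50 is the publicly hosted ProtT5 encoder and may differ from the files in the Zenodo record above, the helper names prepare_sequence and embed_sequences are hypothetical, and the exact way PepLM-GNN consumes these embeddings is not described here.

```python
import re
from typing import List


def prepare_sequence(seq: str) -> str:
    """Map rare residues (U, Z, O, B) to X and put a space between residues,
    which is the input format the ProtT5 tokenizer expects."""
    return " ".join(re.sub(r"[UZOB]", "X", seq.upper()))


def embed_sequences(seqs: List[str],
                    model_name: str = "Rostlab/prot_t5_xl_uniref50"):
    """Return one (length x hidden_size) tensor of embeddings per sequence.

    model_name is an assumption; point it at a local directory if the
    weights were obtained from another source.
    """
    # Heavy dependencies are imported lazily so the helper above stays usable
    # without torch/transformers installed.
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name).eval()

    batch = tokenizer([prepare_sequence(s) for s in seqs],
                      add_special_tokens=True,
                      padding="longest",
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"]).last_hidden_state

    # Keep only the residue positions, dropping padding and the end token.
    return [hidden[i, :len(s)] for i, s in enumerate(seqs)]
```

Calling embed_sequences downloads the model weights on first use and needs several gigabytes of memory, so running it on a GPU or on short batches is advisable.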
3. References
[1] Burley, S. K. et al. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res. 51, D488-D508 (2023).
[2] Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152 (2012).
[3] Elnaggar, A. et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112-7127 (2021).