PepLM-GNN

Document


1. Datasets

The datasets used in this study are derived from the RCSB PDB[1]. Pairs sharing more than 80% similarity between the training and independent test datasets were removed using CD-HIT[2]. The processed datasets can be downloaded from the following links:

Training Dataset:   Train-Sequences.fasta

Independent Datasets:   LEADS-PEP.fasta   Test167.fasta   Test251.fasta   Test1440.fasta
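
The dataset files above are distributed in FASTA format. The following is a minimal sketch of how such a file could be loaded in Python. It assumes the standard FASTA layout (a header line starting with ">" followed by one or more sequence lines). The exact header convention used in these files is not documented here, so the function names parse_fasta and read_fasta are illustrative and not part of PepLM-GNN.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping.

    Assumes standard FASTA: a '>' header line followed by sequence lines.
    If a header occurs twice, the later record replaces the earlier one.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Multi-line sequences are concatenated and upper-cased.
            records[header] += line.upper()
    return records


def read_fasta(path: str) -> Dict[str, str]:
    """Read a FASTA file from disk, e.g. read_fasta("Train-Sequences.fasta")."""
    with open(path) as handle:
        return parse_fasta(handle)
```

Keeping the parser separate from the file reader makes it possible to check the logic on a few in-memory lines before pointing it at the downloaded files.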


2. Tools

PepLM-GNN uses ProtT5[3] to extract peptide and protein features. To run PepLM-GNN locally, ProtT5 must be installed and configured first. Detailed installation and configuration instructions are available at the following link:

ProtT5:   https://zenodo.org/records/4644188
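
As an illustration of ProtT5 feature extraction, the sketch below computes per-residue embeddings with the Hugging Face transformers library. It rests on several assumptions: the checkpoint name "Rostlab/prot_t5_xl_uniref50" refers to the publicly hosted ProtT5 encoder and may differ from the files in the Zenodo record above, the packages torch, transformers, and sentencepiece are installed, and the helper names prepare_for_prott5 and embed_sequences are hypothetical. Whether PepLM-GNN consumes per-residue or pooled features is not stated here, so this is not necessarily the pipeline the authors use.

```python
import re
from typing import List


def prepare_for_prott5(sequence: str) -> str:
    """Format a sequence the way ProtT5 expects its input.

    Rare or ambiguous residues (U, Z, O, B) are mapped to X, and residues
    are separated by single spaces.
    """
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)


def embed_sequences(sequences: List[str],
                    model_name: str = "Rostlab/prot_t5_xl_uniref50"):
    """Return one (length x hidden_size) tensor per input sequence.

    Heavy dependencies are imported lazily so that the preprocessing helper
    above can be used without them. The checkpoint name is an assumption.
    """
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name).eval()

    batch = tokenizer([prepare_for_prott5(s) for s in sequences],
                      padding="longest", return_tensors="pt")
    with torch.no_grad():
        hidden = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"]).last_hidden_state

    # Each row holds one vector per residue, then an end-of-sequence token
    # and padding; keep only the residue positions.
    return [hidden[i, :len(seq)] for i, seq in enumerate(sequences)]
```

The first call to embed_sequences downloads the model weights, which are several gigabytes, so in practice the model would be loaded once and reused across batches.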


3. References

[1] Burley, S. K. et al. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res. 51, D488-D508 (2023).

[2] Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152 (2012).

[3] Elnaggar, A. et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112-7127 (2021).

Cite

If you use PepLM-GNN, please cite the following:

Ke Yan, Meijing Li, Shutao Chen, Tianyi Liu, and Bin Liu*.
PepLM-GNN: A Graph Neural Network Framework Leveraging Pre-trained Language Models for Peptide-Protein Binding Prediction. (Submitted)