PepLM-GNN-Document

Document

1. Datasets

The datasets used in this study are derived from the RCSB PDB^[1]. Pairs sharing more than 80% similarity between the training and independent test datasets were removed using CD-HIT^[2]. The processed datasets can be downloaded from the following links:

Training Dataset: Train-Sequences.fasta

Independent Datasets: LEADS-PEP.fasta, Test167.fasta, Test251.fasta, Test1440.fasta
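As a rough illustration of working with these files, the sketch below parses a FASTA file into a dictionary and builds a CD-HIT command line for an 80% cross-dataset filter. The helper names `read_fasta` and `cd_hit_2d_command` are hypothetical. The header layout of the PepLM-GNN files is not described here, so headers are kept verbatim. Using `cd-hit-2d` with word size 5 is an assumption about how such a filter could be run, not a record of the authors' exact procedure.

```python
from pathlib import Path
from typing import Dict, List, Optional


def read_fasta(path: str) -> Dict[str, str]:
    """Parse a FASTA file into {header: sequence}, joining wrapped sequence lines."""
    records: Dict[str, List[str]] = {}
    header: Optional[str] = None
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def cd_hit_2d_command(train_fasta: str, test_fasta: str,
                      out_fasta: str, identity: float = 0.8) -> List[str]:
    """Build (but do not run) a cd-hit-2d call that keeps the sequences of
    test_fasta that are NOT similar to train_fasta at the given identity."""
    # Assumption: word size 5 is the CD-HIT guideline for thresholds 0.7 to 1.0.
    return ["cd-hit-2d", "-i", train_fasta, "-i2", test_fasta,
            "-o", out_fasta, "-c", str(identity), "-n", "5"]
```

The command list can be passed to `subprocess.run` once CD-HIT is installed; the file names supplied to it are up to the user.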

2. Tools

PepLM-GNN uses ProtT5^[3] to extract peptide and protein features. To run PepLM-GNN locally, ProtT5 must be installed and configured. Detailed installation and configuration instructions are available at the following link:

ProtT5: https://zenodo.org/records/4644188
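For orientation, the following is a minimal sketch of extracting per-residue ProtT5 embeddings with the Hugging Face `transformers` library. It assumes the publicly hosted `Rostlab/prot_t5_xl_uniref50` checkpoint rather than the Zenodo archive linked above, and it assumes `torch`, `transformers`, and `sentencepiece` are installed. The function names are hypothetical, and the exact features PepLM-GNN consumes may differ.

```python
import re
from typing import List


def prepare_for_prott5(sequence: str) -> str:
    """Upper-case the sequence, map rare residues U, Z, O, B to X,
    and separate residues with spaces, as the ProtT5 tokenizer expects."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)


def embed_sequences(sequences: List[str],
                    model_name: str = "Rostlab/prot_t5_xl_uniref50"):
    """Return one tensor of per-residue embeddings (length x 1024) per sequence."""
    # Imported lazily so prepare_for_prott5 works without these packages.
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name).eval()

    batch = tokenizer([prepare_for_prott5(s) for s in sequences],
                      padding="longest", return_tensors="pt")
    with torch.no_grad():
        hidden = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"]).last_hidden_state
    # Keep the first len(s) positions: this drops padding and the trailing
    # end-of-sequence token appended by the tokenizer.
    return [hidden[i, : len(s)] for i, s in enumerate(sequences)]
```

Calling `embed_sequences(["MKTAYIAK"])` downloads the checkpoint on first use, which is several gigabytes, so a GPU and a local cache are advisable for larger datasets.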

3. References

[1] Burley, S. K. et al. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res. 51, D488-D508 (2023).

[2] Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152 (2012).

[3] Elnaggar, A. et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112-7127 (2021).

Ke Yan, Meijing Li, Shutao Chen, Tianyi Liu, and Bin Liu*.
PepLM-GNN: A Graph Neural Network Framework Leveraging Pre-trained Language Models for Peptide-Protein Binding Prediction. (Submitted)
