ProGO-PSL

Document


If you use ProGO-PSL for research, please cite this paper:

Jiangyi Shao, Shutao Chen, Bin Liu*.
Hybrid Information-driven Protein Gene Ontology Annotation via the Protein Sequence Large Graph (Submitted)



Benchmark dataset (SwissProt released April 2022)

Multiple Sequence Alignments (MSAs) of benchmark dataset

Independent test set (SwissProt newly added between May 2022 and March 2025)

Multiple Sequence Alignments (MSAs) of independent test set

Source code of ProGO-PSL:

Installation and Usage Guide

Requirements

  • Python 3.10+
  • Required Python libraries (install via requirements.txt):
    pip install -r requirements.txt
  • A GPU is recommended for training and evaluation

Usage Examples

Training Stage 1:
    python scripts/construct_gendis.py -c configs/training_msa-v1/bpo-7-26.yml \
        /path/to/dataset_state_dict.pkl \
        /path/to/MSAs/ \
        /path/to/save/model/
Training Stage 2:
    python scripts/construct_gendis.py -c configs/training_msa-v1/bpo-8-24.yml \
        /path/to/dataset_state_dict.pkl \
        /path/to/MSAs/ \
        /path/to/save/model/
Testing:
    python scripts/construct_gendis.py -c configs/evaluating_msa-v1/bpo-8-24.yml \
        /path/to/dataset_state_dict.pkl \
        /path/to/MSAs/ \
        /path/to/trained/model/
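
The three commands above share the same argument layout, so they can be assembled programmatically. The following is a minimal sketch, not part of the ProGO-PSL source: `build_command` and `build_pipeline` are hypothetical helper names, and the sketch assumes that the model directory written during training is the same one read during testing.

```python
import shlex

# Script and config paths are copied from the usage examples above.
SCRIPT = "scripts/construct_gendis.py"
CONFIGS = [
    "configs/training_msa-v1/bpo-7-26.yml",    # Training Stage 1
    "configs/training_msa-v1/bpo-8-24.yml",    # Training Stage 2
    "configs/evaluating_msa-v1/bpo-8-24.yml",  # Testing
]


def build_command(config, dataset, msa_dir, model_dir, python="python"):
    """Return the argv list for one run (hypothetical helper)."""
    return [python, SCRIPT, "-c", config, dataset, msa_dir, model_dir]


def build_pipeline(dataset, msa_dir, model_dir):
    """Return the commands for stage 1, stage 2 and testing, in order.

    Assumption: one model directory is reused by all three steps.
    """
    return [build_command(c, dataset, msa_dir, model_dir) for c in CONFIGS]


if __name__ == "__main__":
    commands = build_pipeline(
        "/path/to/dataset_state_dict.pkl", "/path/to/MSAs/", "/path/to/save/model/"
    )
    for cmd in commands:
        # Only print here; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Printing the commands before running them makes it easy to check the paths first.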

Configuration

Sample Configuration File (configs/training_netgo-v1/bp.yml):

    mode: train
    task: biological_process
    epochs: 100
    batch_size: 32
    lr: 0.0001
    top_k: 40
    max_len: 2000

Key Parameters

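  • Positional Arguments (in the order used in the Usage Examples):
    • file_address: Path to the dataset file (e.g. dataset_state_dict.pkl)
    • working_dir: Directory containing the MSA files
    • model_saving: Directory where the trained model is saved (training) or read from (testing)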
  • Training Parameters:
    • --mode: Operation mode (train, test)
    • --batch-size: Batch size (default: 32)
    • --epochs: Number of training epochs
    • --lr: Learning rate
  • Hardware Options:
    • --gpu-ids: GPU IDs to use
    • --amp: Use automatic mixed precision
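
To illustrate how the parameters listed above fit together on one command line, here is a hypothetical `argparse` sketch. It is not the parser used by scripts/construct_gendis.py. The long option `--config`, the space-separated integer format for `--gpu-ids`, and the `None` defaults (meaning "not given on the command line") are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build an illustrative parser mirroring the documented parameters."""
    parser = argparse.ArgumentParser(
        description="Illustrative ProGO-PSL-style command line (not the real parser)"
    )
    # -c appears in the usage examples; the long form --config is assumed.
    parser.add_argument("-c", "--config", help="YAML configuration file")

    # General (positional) arguments, in the order used in the examples.
    parser.add_argument("file_address", help="Path to the dataset file")
    parser.add_argument("working_dir", help="Directory for MSA files")
    parser.add_argument("model_saving", help="Directory to save trained model")

    # Training parameters; only the batch-size default (32) is documented.
    parser.add_argument("--mode", choices=["train", "test"], default=None)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=None)
    parser.add_argument("--lr", type=float, default=None)

    # Hardware options; the value format of --gpu-ids is assumed.
    parser.add_argument("--gpu-ids", type=int, nargs="*", default=[])
    parser.add_argument("--amp", action="store_true",
                        help="Use automatic mixed precision")
    return parser


if __name__ == "__main__":
    example = ["-c", "configs/training_msa-v1/bpo-7-26.yml",
               "/path/to/dataset_state_dict.pkl", "/path/to/MSAs/",
               "/path/to/save/model/", "--mode", "train", "--amp"]
    print(build_parser().parse_args(example))
```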

Evaluation Details

Evaluation reports metrics including:

  • Fmax: Maximum F-score over all prediction-score thresholds
  • AuPRC: Area under the precision-recall curve
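
For readers unfamiliar with these metrics, the sketch below shows one common way to compute them with NumPy. It is not taken from the ProGO-PSL code, and it makes three assumptions: Fmax is protein-centric (precision averaged over proteins with at least one prediction, recall over all proteins), thresholds run from 0.01 to 1.00 in steps of 0.01, and AuPRC is micro-averaged over all protein-term pairs using the average-precision estimate. The function names `fmax` and `micro_auprc` are hypothetical.

```python
import numpy as np


def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax for (n_proteins, n_terms) arrays (assumed definition)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.01, 1.0, 100)  # assumed threshold grid
    n_true = y_true.sum(axis=1)
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        tp = (pred & y_true).sum(axis=1)
        n_pred = pred.sum(axis=1)
        covered = n_pred > 0  # proteins with at least one prediction at t
        if not covered.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()
        # Proteins without any true term contribute a recall of 0.
        recall = (tp / np.maximum(n_true, 1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return float(best)


def micro_auprc(y_true, y_score):
    """Micro-averaged area under the PR curve (average-precision estimate)."""
    labels = np.asarray(y_true, dtype=bool).ravel()
    scores = np.asarray(y_score, dtype=float).ravel()
    order = np.argsort(-scores, kind="stable")
    labels, scores = labels[order], scores[order]
    tp = np.cumsum(labels)
    n_pred = np.arange(1, labels.size + 1)
    # Keep one point per distinct score so tied predictions enter together.
    last = np.r_[np.nonzero(np.diff(scores))[0], labels.size - 1]
    precision = tp[last] / n_pred[last]
    recall = tp[last] / max(tp[-1], 1)
    # Sum precision weighted by the recall gained at each threshold.
    return float(np.sum(np.diff(np.r_[0.0, recall]) * precision))


if __name__ == "__main__":
    truth = [[1, 0], [0, 1]]
    scores = [[0.9, 0.8], [0.6, 0.4]]
    print("Fmax:", fmax(truth, scores))
    print("AuPRC:", micro_auprc(truth, scores))
```

Other tools integrate the precision-recall curve with the trapezoidal rule or average per GO term instead, so values from this sketch may differ slightly from those reported by ProGO-PSL.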

License

This project is distributed under the MIT License. See LICENSE.md for more details.