ProGO-PSL

Document


If you use ProGO-PSL for research, please cite this paper:

Jiangyi Shao, Shutao Chen, Bin Liu*.
Hybrid Information-driven Protein Gene Ontology Annotation via the Protein Sequence Large Graph (Submitted)



Multiple Sequence Alignments (MSAs)

SwissProt Dataset

Source code of ProGO-PSL:

Installation and Usage Guide

Requirements

  • Python 3.10+
  • Required Python libraries (install via requirements.txt):
    pip install -r requirements.txt
  • A GPU is recommended for training and evaluation

Usage Examples

Training Stage 1:
    python scripts/construct_gendis.py -c configs/training_msa-v1/bpo-7-26.yml \
        /path/to/dataset_state_dict.pkl \
        /path/to/MSAs/ \
        /path/to/save/model/
Training Stage 2:
    python scripts/construct_gendis.py -c configs/training_msa-v1/bpo-8-24.yml \
        /path/to/dataset_state_dict.pkl \
        /path/to/MSAs/ \
        /path/to/save/model/
Testing:
    python scripts/construct_gendis.py -c configs/evaluating_msa-v1/bpo-8-24.yml \
        /path/to/dataset_state_dict.pkl \
        /path/to/MSAs/ \
        /path/to/trained/model/

Configuration

Sample Configuration File (configs/training_netgo-v1/bp.yml):

    mode: train
    task: biological_process
    epochs: 100
    batch_size: 32
    lr: 0.0001
    top_k: 40
    max_len: 2000
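
The sample above is a flat key/value mapping, so it can be checked before a long run starts. The following is a minimal sketch, not part of ProGO-PSL: the names RunConfig and config_from_mapping are hypothetical, the dictionary stands in for whatever a YAML loader would return, and the validation rules (positive integers, positive learning rate, mode restricted to train/test) are assumptions based only on the values shown in this document.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical typed view of the sample configuration keys."""
    mode: str
    task: str
    epochs: int
    batch_size: int
    lr: float
    top_k: int
    max_len: int

    def __post_init__(self):
        # Assumption: only the two modes listed under Key Parameters are valid.
        if self.mode not in ("train", "test"):
            raise ValueError(f"mode must be 'train' or 'test', got {self.mode!r}")
        # Assumption: every count-like setting must be a positive integer.
        for name in ("epochs", "batch_size", "top_k", "max_len"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
        if self.lr <= 0:
            raise ValueError("lr must be positive")


def config_from_mapping(raw):
    """Build a RunConfig from a dict, rejecting keys the sketch does not know."""
    known = {f.name for f in fields(RunConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown configuration keys: {sorted(unknown)}")
    return RunConfig(**raw)


# The same values as the sample configuration file, written as a dict.
sample = {
    "mode": "train",
    "task": "biological_process",
    "epochs": 100,
    "batch_size": 32,
    "lr": 0.0001,
    "top_k": 40,
    "max_len": 2000,
}
config = config_from_mapping(sample)
print(config)
```

Failing early on a mistyped key or a non-positive value is cheaper than discovering it partway through training.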

Key Parameters

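  • General Arguments (positional, in the order used in the usage examples):
    • file_address: Path to the dataset file
    • working_dir: Directory containing the MSA files
    • model_saving: Directory where the trained model is saved (for testing, the directory of the trained model)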
  • Training Parameters:
    • --mode: Operation mode (train or test)
    • --batch-size: Batch size (default: 32)
    • --epochs: Number of training epochs
    • --lr: Learning rate
  • Hardware Options:
    • --gpu-ids: GPU IDs to use
    • --amp: Use automatic mixed precision
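
To make the relationship between the positional arguments and the options concrete, the parameters listed above could be declared roughly as follows. This is a sketch using Python's standard argparse module, not the actual parser in scripts/construct_gendis.py: the long option name --config, the string type of --gpu-ids, and the absence of defaults other than the documented batch size of 32 are all assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        description="Sketch of a ProGO-PSL-style command line (illustrative only)"
    )
    # -c appears in the usage examples; the long name --config is an assumption.
    parser.add_argument("-c", "--config", help="Path to a YAML configuration file")

    # General (positional) arguments, in the order used in the usage examples.
    parser.add_argument("file_address", help="Path to the dataset file")
    parser.add_argument("working_dir", help="Directory for MSA files")
    parser.add_argument("model_saving", help="Directory to save trained model")

    # Training parameters.
    parser.add_argument("--mode", choices=["train", "test"], help="Operation mode")
    parser.add_argument("--batch-size", type=int, default=32, help="Batch size")
    parser.add_argument("--epochs", type=int, help="Number of training epochs")
    parser.add_argument("--lr", type=float, help="Learning rate")

    # Hardware options; the value format of --gpu-ids is an assumption.
    parser.add_argument("--gpu-ids", type=str, help="GPU IDs to use, e.g. '0,1'")
    parser.add_argument("--amp", action="store_true",
                        help="Use automatic mixed precision")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "-c", "configs/training_msa-v1/bpo-7-26.yml",
    "/path/to/dataset_state_dict.pkl",
    "/path/to/MSAs/",
    "/path/to/save/model/",
    "--batch-size", "16",
    "--amp",
])
print(args.file_address, args.batch_size, args.amp)
```

argparse converts --batch-size to the attribute batch_size, which matches the key spelling used in the sample configuration file.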

Evaluation Details

Evaluation reports metrics including:

  • Fmax: Maximum F-score over all prediction score thresholds
  • AuPRC: Area under the precision-recall curve
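
As an illustration of the Fmax metric, the sketch below follows the protein-centric definition commonly used in CAFA-style function prediction benchmarks: at each threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over proteins with at least one true term, and the best resulting F-score is kept. Whether ProGO-PSL computes Fmax in exactly this way is an assumption; the function name fmax and the threshold grid of 0.01 to 0.99 are illustrative choices.

```python
def fmax(y_true, y_score, thresholds=None):
    """Return (best F-score, threshold) for per-protein label/score lists.

    y_true:  list of lists of 0/1 labels, one inner list per protein
    y_score: list of lists of scores in [0, 1], same shape as y_true
    """
    if thresholds is None:
        # Assumption: a fixed grid of 99 thresholds, 0.01 .. 0.99.
        thresholds = [i / 100 for i in range(1, 100)]

    best_f, best_t = 0.0, None
    for t in thresholds:
        precisions = []   # one entry per protein with at least one prediction
        recalls = []      # one entry per protein with at least one true term
        for truth, scores in zip(y_true, y_score):
            predicted = [s >= t for s in scores]
            tp = sum(1 for p, g in zip(predicted, truth) if p and g)
            n_pred = sum(predicted)
            n_true = sum(truth)
            if n_pred > 0:
                precisions.append(tp / n_pred)
            if n_true > 0:
                recalls.append(tp / n_true)
        if not precisions or not recalls:
            continue  # nothing predicted (or nothing to recall) at this threshold
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            f = 2 * p * r / (p + r)
            if f > best_f:
                best_f, best_t = f, t
    return best_f, best_t


# Toy data: two proteins, three GO terms each (values invented for illustration).
y_true = [[1, 0, 1], [0, 1, 0]]
y_score = [[0.9, 0.7, 0.4], [0.2, 0.6, 0.1]]
best_f, best_t = fmax(y_true, y_score)
print(f"Fmax = {best_f:.4f} at threshold {best_t}")
```

On the toy data the best trade-off occurs where all three terms of the first protein and only the correct term of the second protein are predicted, giving precision 5/6 and recall 1.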

License

This project is distributed under the MIT License. See LICENSE.md for more details.