Introduction
AntigenLM Overview
A dedicated protein language model for microbial antigens, integrating lineage-guided transfer and structure-aware masking to boost immune-specific representation learning.
Large-scale computational analysis of microbial antigen immunogenicity is pivotal for vaccine development, enabling the rapid prioritization of key antigens and epitopes from vast candidate pools. Crucially, the efficiency of this process hinges on high-quality, transferable antigen representations. However, antigen representation learning is constrained by two major challenges. First, coupled data and domain biases: high-quality immune annotations are severely scarce, making immune-specific supervision difficult to achieve. Moreover, the severe underrepresentation of antigens in the pre-training corpora of general-purpose protein language models (PLMs) biases these models toward broadly conserved features and fails to yield representations sensitive to immune specificity. Second, intrinsic obstacles in feature extraction: immunological determinants are inherently weak and localized, making them difficult to capture, and the naive long-sequence truncation mechanisms of general PLMs readily sever critical structural fragments, exacerbating the loss of immune signals.
We therefore propose the field's first dedicated protein language model for microbial antigens: AntigenLM. To resolve the coupled data and domain biases, we developed a lineage-guided hierarchical transfer learning strategy that progressively specializes representations from microbes to microbial antigens, effectively mitigating biases induced by distribution shifts. To tackle the intrinsic feature extraction obstacles, the model integrates structure-aware masking and an overlapping sliding-window mechanism; together, they heighten sensitivity to localized immune determinants while safeguarding the structural integrity of key sequence fragments. Evaluated across a multi-level CD8⁺ T cell pathway framework spanning antigen prioritization, presentation, and TCR recognition, AntigenLM substantially outperforms general-purpose PLMs such as the ESM series.
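The overlapping sliding-window idea mentioned above can be sketched in a few lines: instead of truncating a long sequence at a fixed length, it is split into windows that share an overlap region, so no fragment is severed at a hard boundary. The window and overlap sizes below are illustrative defaults, not AntigenLM's actual configuration.

```python
def sliding_windows(seq, window=512, overlap=128):
    """Split a long sequence into overlapping windows.

    Consecutive windows share `overlap` residues, so a fragment that
    would straddle a naive truncation boundary appears intact in at
    least one window. Sizes here are illustrative assumptions.
    """
    if len(seq) <= window:
        return [seq]
    step = window - overlap
    chunks = []
    for start in range(0, len(seq) - overlap, step):
        chunks.append(seq[start:start + window])
    return chunks
```

Each window can then be embedded independently and the per-window representations pooled; the shared overlap keeps boundary-spanning fragments visible to the model.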
Crucially, it demonstrates remarkable generalization, achieving a PR-AUC of 0.944 under highly imbalanced (1:10) screening and driving zero-shot TCR recognition AUROC from 0.593 to 0.891. By overcoming the limits of generic pre-training, AntigenLM's strong cross-domain transferability provides a robust representational foundation for vaccine discovery and broader computational immunology tasks.
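PR-AUC (average precision) is the appropriate metric under 1:10 imbalance: a random ranker scores only about the positive prevalence (~0.09 here), so 0.944 reflects near-perfect ranking. A minimal, dependency-free sketch of average precision (the toy labels and scores are illustrative, not from our evaluation):

```python
def average_precision(y_true, scores):
    """Average precision: mean of precision@k taken at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # rank by score, descending
    tp, ap = 0, 0.0
    n_pos = sum(y_true)
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank  # precision at this positive's rank
    return ap / n_pos

# A perfect ranking (all positives scored above all negatives) gives AP = 1.0
print(average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0
```

In practice `sklearn.metrics.average_precision_score` computes the same quantity directly from labels and scores.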
Input an antigen sequence and optional TCR/HLA, then run predictions.
AntigenLM Interface
Enter an antigen sequence, view its AntigenLM representation, and run three downstream predictors in one place.
Code, datasets, and pretrained checkpoints for AntigenLM.
Resources
Grab the GitHub repository, curated datasets, and pretrained models in a single place.
Clone the full AntigenLM repository with training and downstream pipelines.
Five curated datasets for microbial antigen modeling and fine-tuning.
Pretrained checkpoints for immediate embedding and downstream evaluation.
If you have any questions, please contact us:
Bin Liu’s lab at Beijing Institute of Technology (BIT) focuses on developing techniques grounded in natural language processing (NLP) to uncover the meaning of the “book of life”. The research areas of Bin Liu’s lab include:
For more relevant research, please see http://bliulab.net.
We sincerely thank the laboratory members, partners, and reviewers for their dedicated efforts and valuable time invested in this project. Additionally, we thank the National Natural Science Foundation of China (No. 62325202, U22A2039) for its support of this work.
This project uses a free and open-source CSS framework for its UI.