Introduction

AntigenLM Overview

AntigenLM: Lineage-guided protein language modelling for microbial antigen representation and CD8+ immune recognition

A dedicated protein language model for microbial antigens, integrating lineage-guided transfer and structure-aware masking to boost immune-specific representation learning.

Large-scale computational analysis of microbial antigen immunogenicity is pivotal for vaccine development, enabling rapid prioritization of key antigens and epitopes from vast candidate pools. Crucially, the efficiency of this process hinges on high-quality, transferable antigen representations. However, antigen representation learning is constrained by two major challenges. First, coupled data and domain biases: high-quality immune annotations are severely scarce, making immune-specific supervision difficult to achieve, and the severe underrepresentation of antigens in the pre-training corpora of general-purpose protein language models (PLMs) biases these models toward broadly conserved features, failing to yield representations sensitive to immune specificity. Second, intrinsic obstacles in feature extraction: immunological determinants are inherently weak and localized, making them difficult to capture, and the naive long-sequence truncation used by general PLMs readily severs critical structural fragments, exacerbating the loss of immune signals. We therefore propose AntigenLM, the field's first dedicated protein language model for microbial antigens. To resolve the coupled data and domain biases, we developed a lineage-guided hierarchical transfer learning strategy that progressively specializes representations from microbes to microbial antigens, effectively mitigating biases induced by distribution shifts. To tackle the intrinsic feature-extraction obstacles, the model integrates structure-aware masking and overlapping sliding-window mechanisms, which together heighten sensitivity to localized immune determinants while safeguarding the structural integrity of key sequence fragments. Evaluated across a multi-level CD8+ T cell pathway framework spanning antigen prioritization, presentation, and TCR recognition, AntigenLM substantially outperforms general-purpose PLMs such as the ESM series. Crucially, it demonstrates remarkable generalization, achieving a PR-AUC of 0.944 in highly imbalanced (1:10) screening and driving zero-shot TCR recognition AUROC from 0.593 to 0.891. By overcoming the limits of generic pre-training, its strong cross-domain transferability provides a robust representational foundation for vaccine discovery and broad computational immunology tasks.
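
To illustrate the overlapping sliding-window idea, the following minimal Python sketch splits a long sequence into overlapping chunks so that fragments severed by naive truncation remain intact in at least one window. The window and stride sizes here are illustrative placeholders, not the values used by AntigenLM.

def overlapping_windows(seq, window=512, stride=256):
    """Split a sequence into overlapping windows; the tail is always covered."""
    if len(seq) <= window:
        return [seq]
    # Illustrative sketch only; AntigenLM's actual window/stride may differ.
    windows = [seq[i:i + window] for i in range(0, len(seq) - window, stride)]
    windows.append(seq[-window:])  # final window covers the sequence tail
    return windows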

Figure 1. Hierarchical training pipeline for AntigenLM with structure-aware fine-tuning. (a) Protein sequences were collected from NCBI and UniProt across bacteria, viruses, and fungi, and microbial antigenic peptides were curated from IEDB, followed by MMseqs2 redundancy reduction (50%). Secondary structures were predicted by PSIPRED and grouped into alpha-helix, beta-sheet, and coil. (b) We performed stepwise BERT-style MLM training along "microbes → pathogenic microbes → microbial antigens", yielding MicroLM, PathogLM, and AntigenLM, with sliding-window tokenization for variable-length sequences. (c) During fine-tuning, we applied secondary-structure-conditioned dynamic masking (e.g., 15% for helix and sheet and 40% for coil) and updated only the top encoder blocks while freezing lower layers. (d) The resulting representations were transferred to downstream tasks including protective antigen recognition, antigen-TCR recognition, and antigen-HLA-I recognition.
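
The secondary-structure-conditioned masking in panel (c) can be sketched as follows. This is a minimal illustration assuming per-residue structure labels from PSIPRED (H = helix, E = sheet, C = coil) and the 15%/40% rates from the caption; it is not the authors' exact implementation.

import random

MASK_RATE = {"H": 0.15, "E": 0.15, "C": 0.40}  # rates from Figure 1c

def structure_conditioned_mask(tokens, ss_labels, mask_token="<mask>"):
    """Mask each residue with a probability set by its secondary structure."""
    return [mask_token if random.random() < MASK_RATE.get(ss, 0.15) else tok
            for tok, ss in zip(tokens, ss_labels)]

# Example: a 10-residue fragment with helix, coil, and sheet regions
print(structure_conditioned_mask(list("MKTAYIAKQR"), list("HHHHCCCCEE")))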

Server

Each task uses its own antigen input together with the task-specific paired sequence when required.

AntigenLM Interface

AntigenLM Server

Enter the antigen sequence for each task card separately, then run protective, antigen-TCR, or antigen-HLA-I prediction.

Need high-throughput prediction?

For large-scale offline prediction, please download and install AntigenLM locally. Details are available on the GitHub page.

Model: AntigenLM 300M
Tasks: Protective Antigen Prediction, Antigen-TCR Binding, and Antigen-HLA-I Binding

Task 01

Protective Antigen Recognition

Use the antigen sequence entered in this card to score whether it is likely to be protective.

Task 02

Antigen-TCR Binding

Use the antigen sequence in this card together with the paired TCR sequence below.

Task 03

Antigen-HLA-I Binding

Use the antigen sequence in this card together with the paired HLA amino-acid sequence below.

Document

A user guide for running the AntigenLM web server.

AntigenLM Document

Guidance for task selection, sequence submission, result interpretation, and local installation of AntigenLM.

This section provides structured documentation for the AntigenLM web server, including task definitions, input requirements, score interpretation, and local environment setup. Links for datasets, pretrained checkpoints, and source code are centralized on the Download page rather than repeated here.

1. Overview

The Document page summarizes how to use the AntigenLM web interface for the downstream prediction tasks. Each task card on the Server page works independently and uses only the sequences entered within that panel, so different inputs can be evaluated without cross-interference. Every card also includes a Load Example option for quickly verifying the expected submission format before running custom sequences.

2. Task Scope and Output Interpretation

Each downstream task addresses a distinct biological question and therefore requires a different input configuration. Protective Antigen Prediction uses only an antigen amino-acid sequence; Antigen-HLA-I Binding requires an antigen sequence together with the paired HLA-I pseudo sequence; and Antigen-TCR Binding requires an antigen sequence together with the paired TCR sequence. In each case, the inputs must be entered in the corresponding task card before inference.

All returned scores range from 0 to 1, and higher values indicate stronger model confidence in the positive outcome: a higher Protective score suggests a more likely protective antigen, a higher Antigen-HLA-I score suggests stronger predicted binding or presentation potential, and a higher Antigen-TCR score suggests stronger predicted recognition or binding potential. These scores should be used for prioritization rather than treated as direct experimental confirmation.
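
For example, a simple way to use the returned scores for prioritization is to rank candidates and review the top of the list first; the scores below are made-up placeholders, not real server output.

# Hypothetical scores returned by the server for three candidates
scores = {"antigen_A": 0.91, "antigen_B": 0.47, "antigen_C": 0.78}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}\t{score:.2f}")  # antigen_A first, then antigen_C, antigen_B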

3. Installation

3.1 Create Conda Environment

conda create -n antigenlm python=3.10
conda activate antigenlm

3.2 Requirements

We recommend installing the local environment using the provided environment.yaml file to ensure compatibility across AntigenLM training and downstream evaluation workflows:

conda env update -f environment.yaml --prune

If this approach fails or Conda is not available, you can manually install the main dependencies listed below; pip commands for these pins are given after the list.

torch==2.1.0+cu121
torchvision==0.16.0+cu121
torchaudio==2.1.0+cu121
transformers==4.46.3
tokenizers==0.20.3
sentencepiece==0.2.0
safetensors==0.5.3
huggingface-hub==0.36.2
accelerate==1.0.1
deepspeed==0.10.3
peft==0.13.2
fair-esm==2.0.0
numpy==1.24.1
pandas==2.0.3
scipy==1.10.1
scikit-learn==1.3.2
tqdm==4.67.1
biopython==1.83
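
Note that the CUDA-tagged wheels (the +cu121 builds) are not hosted on PyPI; when installing with pip, point at the PyTorch wheel index, for example:

pip install torch==2.1.0+cu121 torchvision==0.16.0+cu121 torchaudio==2.1.0+cu121 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.46.3 tokenizers==0.20.3 sentencepiece==0.2.0 safetensors==0.5.3 huggingface-hub==0.36.2
pip install accelerate==1.0.1 deepspeed==0.10.3 peft==0.13.2 fair-esm==2.0.0
pip install numpy==1.24.1 pandas==2.0.3 scipy==1.10.1 scikit-learn==1.3.2 tqdm==4.67.1 biopython==1.83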

Before execution, make sure the dataset paths, pretrained model locations, and checkpoint directories in the configuration files point to valid local resources. After setup, the training and downstream pipelines can be launched with the provided entry points, such as DeepSpeed for MicroLM pretraining and Python or torchrun for the protective antigen, antigen-HLA-I, and antigen-TCR tasks.
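
As an illustration (the script and configuration names below are placeholders; substitute the entry points shipped in the repository):

deepspeed pretrain_microlm.py --deepspeed ds_config.json      # MicroLM pretraining (hypothetical script name)
torchrun --nproc_per_node=4 train_downstream.py --task hla    # multi-GPU downstream run (hypothetical)
python train_downstream.py --task protective                  # single-GPU downstream run (hypothetical)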

Download

Code, datasets, and pretrained checkpoints for AntigenLM.

Resources

Everything you need to start with AntigenLM.

Grab the GitHub repository, curated datasets, and pretrained models in a single place.

Hosted on Hugging Face, GitHub, and Zenodo.

Model Download

Pretrained checkpoints for immediate embedding and downstream evaluation; a minimal loading sketch follows the list below.

  • Final model for downstream embeddings
  • Intermediate model from LoRA fine-tuning
  • Base pretrained microbe model
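
As a minimal loading sketch, assuming the final checkpoint is distributed in Hugging Face Transformers format (the checkpoint path below is a placeholder):

import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder path; point this at the downloaded AntigenLM checkpoint directory
ckpt = "path/to/antigenlm-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt).eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, L, d) per-residue states
embedding = hidden.mean(dim=1)                  # mean-pooled sequence embedding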

Code Download

Clone the full AntigenLM repository with training and downstream pipelines.

Zenodo Download

All datasets and models can also be downloaded from Zenodo.

Contact

If you have any questions, please contact us:

Corresponding Author:
Bin Liu, Professor, Beijing Institute of Technology, E-mail: bliu@bliulab.net
Main Developer:
Kai Chen, PhD Candidate, Beijing Institute of Technology, E-mail: kchen@bliulab.net
Other Authors:
Ke Yan, Associate Researcher, Beijing Institute of Technology, E-mail: kyan@bliulab.net

About

Bin Liu's lab at Beijing Institute of Technology (BIT) focuses on developing techniques grounded in natural language processing (NLP) to uncover the meaning of the "book of life." The research areas of Bin Liu's lab include:

  1. Developing biological language models (BLMs);
  2. Studying natural language processing techniques;
  3. Applying BLMs to biological sequence analysis;
  4. Protein remote homology detection and fold recognition;
  5. Predicting DNA/RNA/Peptide/Ligand binding proteins and their binding residues;
  6. Disordered protein/region prediction based on sequence labelling models;
  7. Predicting noncoding RNA-disease associations;
  8. Identifying protein complexes;
  9. DNA/RNA sequence analysis.

For more relevant research, please see http://bliulab.net.

Acknowledgements

We sincerely thank the laboratory members, partners, and reviewers for their dedicated efforts and the valuable time they invested in this project. We also thank the National Natural Science Foundation of China (Nos. 62325202 and 62473049) and the Zhongguancun Academy (Project No. 20240101).

The project's user interface is built with a free and open-source CSS framework.