Introduction

AntigenLM Overview

AntigenLM: Lineage-guided protein language modelling for microbial antigen representation and CD8+ immune recognition

A dedicated protein language model for microbial antigens, integrating lineage-guided transfer and structure-aware masking to boost immune-specific representation learning.

Large-scale computational analysis of microbial antigen immunogenicity is pivotal for vaccine development, enabling rapid prioritization of key antigens and epitopes from vast candidate pools. Crucially, the efficiency of this process hinges on high-quality, transferable antigen representations. However, antigen representation learning is constrained by two major challenges. First, coupled data and domain biases: high-quality immune annotations are severely scarce, making immune-specific supervision difficult to achieve, and the severe underrepresentation of antigens in the pre-training corpora of general-purpose protein language models (PLMs) biases these models toward broadly conserved features, failing to yield representations sensitive to immune specificity. Second, intrinsic obstacles in feature extraction: immunological determinants are inherently weak and localized, making them difficult to capture, and the naive long-sequence truncation used by general PLMs readily severs critical structural fragments, exacerbating the loss of immune signals. We therefore propose AntigenLM, the field's first dedicated protein language model for microbial antigens. To resolve the coupled data and domain biases, we developed a lineage-guided hierarchical transfer learning strategy that progressively specializes representations from microbes to microbial antigens, effectively mitigating biases induced by distribution shifts. To tackle the intrinsic feature-extraction obstacles, the model integrates structure-aware masking and overlapping sliding-window mechanisms, which together heighten sensitivity to localized immune determinants while safeguarding the structural integrity of key sequence fragments. Evaluated across a multi-level CD8+ T cell pathway framework spanning antigen prioritization, presentation, and TCR recognition, AntigenLM substantially outperforms general-purpose PLMs such as the ESM series. Crucially, it demonstrates remarkable generalization, achieving a PR-AUC of 0.944 in highly imbalanced (1:10) screening and driving zero-shot TCR recognition AUROC from 0.593 to 0.891. By overcoming the limits of generic pre-training, its strong cross-domain transferability provides a robust representational foundation for vaccine discovery and broad computational immunology tasks.
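
To illustrate the overlapping sliding-window idea, the following minimal Python sketch splits a long sequence into overlapping chunks so that fragments severed by naive truncation remain intact in at least one window. The window and stride sizes here are illustrative placeholders, not the values used by AntigenLM.

def overlapping_windows(seq, window=512, stride=256):
    """Split a sequence into overlapping windows; the tail is always covered."""
    if len(seq) <= window:
        return [seq]
    # Illustrative sketch only; AntigenLM's actual window/stride may differ.
    windows = [seq[i:i + window] for i in range(0, len(seq) - window, stride)]
    windows.append(seq[-window:])  # final window covers the sequence tail
    return windows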

Figure 1. Hierarchical training pipeline for AntigenLM with structure-aware fine-tuning. (a) Protein sequences were collected from NCBI and UniProt across bacteria, viruses, and fungi, and microbial antigenic peptides were curated from IEDB, followed by MMseqs2 redundancy reduction (50%). Secondary structures were predicted by PSIPRED and grouped into alpha-helix, beta-sheet, and coil. (b) We performed stepwise BERT-style MLM training along "microbes → pathogenic microbes → microbial antigens", yielding MicroLM, PathogLM, and AntigenLM, with sliding-window tokenization for variable-length sequences. (c) During fine-tuning, we applied secondary-structure-conditioned dynamic masking (e.g., 15% for helix and sheet and 40% for coil) and updated only the top encoder blocks while freezing lower layers. (d) The resulting representations were transferred to downstream tasks including protective antigen recognition, antigen-TCR recognition, and antigen-HLA-I recognition.
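
The secondary-structure-conditioned masking in panel (c) can be sketched as follows. This is a minimal illustration assuming per-residue structure labels from PSIPRED (H = helix, E = sheet, C = coil) and the 15%/40% rates from the caption; it is not the authors' exact implementation.

import random

MASK_RATE = {"H": 0.15, "E": 0.15, "C": 0.40}  # rates from Figure 1c

def structure_conditioned_mask(tokens, ss_labels, mask_token="<mask>"):
    """Mask each residue with a probability set by its secondary structure."""
    return [mask_token if random.random() < MASK_RATE.get(ss, 0.15) else tok
            for tok, ss in zip(tokens, ss_labels)]

# Example: a 10-residue fragment with helix, coil, and sheet regions
print(structure_conditioned_mask(list("MKTAYIAKQR"), list("HHHHCCCCEE")))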

Server

Each task uses its own antigen input together with the task-specific paired sequence when required.

AntigenLM Interface

AntigenLM Server

Enter the antigen sequence for each task card separately, then run protective, antigen-TCR, or antigen-HLA-I prediction.

Need high-throughput prediction?

For large-scale offline prediction, please download and install AntigenLM locally. Details are available on the GitHub page.

Model: AntigenLM 300M
Tasks: Protective Antigen Prediction, Antigen-TCR Binding, and Antigen-HLA-I Binding

Task 01

Protective Antigen Recognition

Use the antigen sequence entered in this card to score whether it is likely to be protective.

Task 02

Antigen-TCR Binding

Use the antigen sequence in this card together with the paired TCR sequence below.

Task 03

Antigen-HLA-I Binding

Use the antigen sequence in this card together with the paired HLA amino-acid sequence below.

Document

A user guide for running the AntigenLM web server.

AntigenLM Document

Guidance for task selection, sequence submission, result interpretation, and local installation of AntigenLM.

This section provides structured documentation for the AntigenLM web server, including task definitions, input requirements, score interpretation, and local environment setup. Links for datasets, pretrained checkpoints, and source code are centralized on the Download page rather than repeated here.

1. Overview

The Document page summarizes how to use the AntigenLM web interface for the downstream prediction tasks. Each task card on the Server page works independently and uses only the sequences entered within that panel, so different inputs can be evaluated without cross-interference. Every card also includes a Load Example option for quickly verifying the expected submission format before running custom sequences.

2. Task Scope and Output Interpretation

Each downstream task addresses a distinct biological question and therefore requires a different input configuration. Protective Antigen Prediction uses only an antigen amino-acid sequence; Antigen-HLA-I Binding requires an antigen sequence together with the paired HLA-I pseudo sequence; and Antigen-TCR Binding requires an antigen sequence together with the paired TCR sequence. In each case, the inputs must be entered in the corresponding task card before inference.

All returned scores range from 0 to 1, and higher values indicate stronger model confidence in the positive outcome: a higher Protective score suggests a more likely protective antigen, a higher Antigen-HLA-I score suggests stronger predicted binding or presentation potential, and a higher Antigen-TCR score suggests stronger predicted recognition or binding potential. These scores should be used for prioritization rather than treated as direct experimental confirmation.
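
For example, a simple way to use the returned scores for prioritization is to rank candidates and review the top of the list first; the scores below are made-up placeholders, not real server output.

# Hypothetical scores returned by the server for three candidates
scores = {"antigen_A": 0.91, "antigen_B": 0.47, "antigen_C": 0.78}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}\t{score:.2f}")  # antigen_A first, then antigen_C, antigen_B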

3. Installation

3.1 Create Conda Environment

conda create -n antigenlm python=3.10
conda activate antigenlm

3.2 Requirements

We recommend installing the local environment using the provided environment.yaml file to ensure compatibility across AntigenLM training and downstream evaluation workflows:

conda env update -f environment.yaml --prune

If this approach fails or Conda is not available, you can manually install the main dependencies listed below; pip commands for these pins are given after the list.

torch==2.1.0+cu121
torchvision==0.16.0+cu121
torchaudio==2.1.0+cu121
transformers==4.46.3
tokenizers==0.20.3
sentencepiece==0.2.0
safetensors==0.5.3
huggingface-hub==0.36.2
accelerate==1.0.1
deepspeed==0.10.3
peft==0.13.2
fair-esm==2.0.0
numpy==1.24.1
pandas==2.0.3
scipy==1.10.1
scikit-learn==1.3.2
tqdm==4.67.1
biopython==1.83
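
Note that the CUDA-tagged wheels (the +cu121 builds) are not hosted on PyPI; when installing with pip, point at the PyTorch wheel index, for example:

pip install torch==2.1.0+cu121 torchvision==0.16.0+cu121 torchaudio==2.1.0+cu121 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.46.3 tokenizers==0.20.3 sentencepiece==0.2.0 safetensors==0.5.3 huggingface-hub==0.36.2
pip install accelerate==1.0.1 deepspeed==0.10.3 peft==0.13.2 fair-esm==2.0.0
pip install numpy==1.24.1 pandas==2.0.3 scipy==1.10.1 scikit-learn==1.3.2 tqdm==4.67.1 biopython==1.83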

Before execution, make sure the dataset paths, pretrained model locations, and checkpoint directories in the configuration files point to valid local resources. After setup, the training and downstream pipelines can be launched with the provided entry points, such as DeepSpeed for MicroLM pretraining and Python or torchrun for the protective antigen, antigen-HLA-I, and antigen-TCR tasks.
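
As an illustration (the script and configuration names below are placeholders; substitute the entry points shipped in the repository):

deepspeed pretrain_microlm.py --deepspeed ds_config.json      # MicroLM pretraining (hypothetical script name)
torchrun --nproc_per_node=4 train_downstream.py --task hla    # multi-GPU downstream run (hypothetical)
python train_downstream.py --task protective                  # single-GPU downstream run (hypothetical)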

Download

Code, datasets, and pretrained checkpoints for AntigenLM.

Resources

Everything you need to start with AntigenLM.

Grab the GitHub repository, curated datasets, and pretrained models in a single place.

Hosted on Hugging Face, GitHub, and Zenodo.

Model Download

Pretrained checkpoints for immediate embedding and downstream evaluation; a minimal loading sketch follows the list below.

  • Final model for downstream embeddings
  • Intermediate model from LoRA fine-tuning
  • Base pretrained microbe model
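
As a minimal loading sketch, assuming the final checkpoint is distributed in Hugging Face Transformers format (the checkpoint path below is a placeholder):

import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder path; point this at the downloaded AntigenLM checkpoint directory
ckpt = "path/to/antigenlm-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt).eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, L, d) per-residue states
embedding = hidden.mean(dim=1)                  # mean-pooled sequence embedding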

Code Download

Clone the full AntigenLM repository with training and downstream pipelines.

Zenodo Download

All datasets and models can also be downloaded from Zenodo.

Contact

If you have any questions, please contact us:

Corresponding Author:
Bin Liu, Professor, Beijing Institute of Technology, E-mail: bliu@bliulab.net
Main Developer:
Kai Chen, PhD Candidate, Beijing Institute of Technology, E-mail: kchen@bliulab.net
Other Authors:
Ke Yan, Associate Researcher, Beijing Institute of Technology, E-mail: kyan@bliulab.net

About

Bin Liu's lab at Beijing Institute of Technology (BIT) focuses on developing techniques grounded in natural language processing (NLP) to uncover the meaning of the "book of life." The research areas of Bin Liu's lab include:

  1. Developing biological language models (BLMs);
  2. Studying natural language processing techniques;
  3. Applying BLMs to biological sequence analysis;
  4. Protein remote homology detection and fold recognition;
  5. Predicting DNA/RNA/Peptide/Ligand binding proteins and their binding residues;
  6. Disordered protein/region prediction based on sequence labelling models;
  7. Predicting noncoding RNA-disease associations;
  8. Identifying protein complexes;
  9. DNA/RNA sequence analysis.

For more relevant research, please see http://bliulab.net.

Acknowledgements

We sincerely thank the laboratory members, partners, and reviewers for their dedicated efforts and the valuable time they invested in this project. We also thank the National Natural Science Foundation of China (Nos. 62325202 and 62473049) and the Zhongguancun Academy (Project No. 20240101).

The project's user interface is built with a free and open-source CSS framework.