Home

Introduction

AntigenLM Overview

AntigenLM: Lineage-guided protein language modelling for microbial antigen representation and CD8⁺ immune recognition

A dedicated protein language model for microbial antigens, integrating lineage-guided transfer and structure-aware masking to boost immune-specific representation learning.

Large-scale computational analysis of microbial antigen immunogenicity is pivotal for vaccine development, enabling rapid prioritization of key antigens and epitopes from vast candidate pools. The efficiency of this process hinges on high-quality, transferable antigen representations. However, antigen representation learning is constrained by two major challenges. First, coupled data and domain biases: high-quality immune annotations are severely scarce, making immune-specific supervision difficult to achieve. Moreover, the severe underrepresentation of antigens in the pre-training corpora of general-purpose protein language models (PLMs) biases these models toward broadly conserved features and fails to yield representations sensitive to immune specificity. Second, intrinsic obstacles in feature extraction: immunological determinants are inherently weak and localized, making them difficult to capture, and the naive long-sequence truncation mechanisms of general PLMs readily sever critical structural fragments, exacerbating the loss of immune signals.

We therefore propose the field's first dedicated protein language model for microbial antigens, AntigenLM. To resolve the coupled data and domain biases, we developed a lineage-guided hierarchical transfer learning strategy that progressively specializes representations from microbes to microbial antigens, effectively mitigating biases induced by distribution shifts. To tackle the intrinsic feature-extraction obstacles, the model integrates structure-aware masking and overlapping sliding-window mechanisms, which together heighten sensitivity to localized immune determinants while safeguarding the structural integrity of key sequence fragments.

Evaluated across a multi-level CD8⁺ T cell pathway framework spanning antigen prioritization, presentation, and TCR recognition, AntigenLM substantially outperforms general-purpose PLMs such as the ESM series. It also generalizes remarkably well, achieving a PR-AUC of 0.944 in highly imbalanced (1:10) screening and raising zero-shot TCR recognition AUROC from 0.593 to 0.891. By overcoming the limits of generic pre-training, its strong cross-domain transferability provides a robust representational foundation for vaccine discovery and broad computational immunology tasks.
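The overlapping sliding-window mechanism mentioned above can be sketched in a few lines. This is a minimal illustration, not the model's actual tokenizer: the function name and the window/overlap sizes are assumptions chosen for clarity. Because consecutive windows share an overlap region, a fragment that would be severed by naive truncation appears intact in at least one window.

```python
def sliding_windows(seq, window=512, overlap=128):
    """Split a long sequence into overlapping windows so that fragments
    near window boundaries appear intact in at least one window.
    Illustrative sketch: window/overlap sizes are assumed, not AntigenLM's."""
    if len(seq) <= window:
        return [seq]
    step = window - overlap  # stride between window starts
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + window])
        if start + window >= len(seq):
            break  # last window already reaches the end of the sequence
    return windows
```

For a 1000-residue sequence with these settings, this yields three windows, with each pair of neighbours sharing 128 residues.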

Figure 1: Hierarchical training pipeline for AntigenLM
Figure 1. Hierarchical training pipeline for AntigenLM with structure-aware fine-tuning. (a) Protein sequences were collected from NCBI and UniProt across bacteria, viruses and fungi, and microbial antigenic peptides were curated from IEDB, followed by MMseqs2 redundancy reduction (50%). Secondary structures were predicted by PSIPRED and grouped into α-helix, β-sheet and coil. (b) We performed stepwise BERT-style MLM training along “microbes → pathogenic microbes → microbial antigens”, yielding MicroLM, PathogLM and AntigenLM, with sliding-window tokenization for variable-length sequences. (c) During fine-tuning, we applied secondary-structure–conditioned dynamic masking (e.g., 15% for helix and sheet and 40% for coil) and updated only the top encoder blocks while freezing lower layers. (d) The resulting representations were transferred to downstream tasks including protective antigen recognition, antigen–TCR recognition and antigen–MHC recognition.
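The secondary-structure-conditioned dynamic masking of panel (c) can be sketched as follows. The 15% (helix/sheet) and 40% (coil) rates follow the caption; everything else, including the function name, the mask token, and the independent per-residue masking decision, is an illustrative assumption rather than the published implementation.

```python
import random

# Mask rates conditioned on predicted secondary structure, following the
# rates quoted in the caption: 15% for helix (H) and sheet (E), 40% for coil (C).
MASK_RATE = {"H": 0.15, "E": 0.15, "C": 0.40}
MASK_TOKEN = "<mask>"  # placeholder token name; an assumption


def structure_aware_mask(sequence, ss_labels, rng=random):
    """Return (masked_tokens, target_positions): each residue is masked
    independently with a probability set by its secondary-structure label."""
    assert len(sequence) == len(ss_labels)
    masked, targets = [], []
    for i, (aa, ss) in enumerate(zip(sequence, ss_labels)):
        if rng.random() < MASK_RATE[ss]:
            masked.append(MASK_TOKEN)
            targets.append(i)  # positions the MLM objective must reconstruct
        else:
            masked.append(aa)
    return masked, targets
```

Biasing the mask budget toward coil regions forces the model to spend more of its reconstruction effort on the flexible, loosely constrained segments where localized immune determinants often reside.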

Server

Input antigen sequence and optional TCR/HLA, then run predictions.

AntigenLM Interface

AntigenLM Server

Enter an antigen sequence, view its AntigenLM representation, and run three downstream predictors in one place.

Model AntigenLM 300M
Outputs Embedding + 3 Tasks

AntigenLM Representation

Peptide embedding: -
Sequence embedding: -

Protective antigen recognition

-

Antigen-TCR recognition

-

Antigen-MHC recognition

-
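The server reports both a peptide-level and a sequence-level embedding. The page does not state how per-window outputs are merged for long sequences, so the following is only a plausible sketch under an assumed scheme: residues covered by two overlapping windows are averaged, and the sequence embedding is the mean over residues. All names here are hypothetical.

```python
def pool_sequence_embedding(window_embs, window=512, overlap=128, seq_len=None):
    """Merge per-window, per-residue embeddings (lists of float vectors) into
    one sequence-level vector. Assumed scheme, not AntigenLM's documented one:
    overlap positions are averaged, then residues are mean-pooled."""
    step = window - overlap
    if seq_len is None:
        # Last window may be shorter than `window` at the end of the sequence.
        seq_len = step * (len(window_embs) - 1) + len(window_embs[-1])
    dim = len(window_embs[0][0])
    sums = [[0.0] * dim for _ in range(seq_len)]
    counts = [0] * seq_len
    for w, emb in enumerate(window_embs):
        start = w * step
        for j, vec in enumerate(emb):
            counts[start + j] += 1
            for d in range(dim):
                sums[start + j][d] += vec[d]
    per_residue = [[s / c for s in row] for row, c in zip(sums, counts)]
    # Sequence-level embedding: mean over residues, dimension by dimension.
    return [sum(col) / seq_len for col in zip(*per_residue)]
```

With two toy windows of constant vectors 1.0 and 3.0 (window 4, overlap 2), the overlap positions average to 2.0 and the pooled sequence embedding is [2.0].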

Download

Code, datasets, and pretrained checkpoints for AntigenLM.

Resources

Everything you need to start with AntigenLM.

Grab the GitHub repository, curated datasets, and pretrained models in a single place.

Updated Hugging Face + GitHub

Code Download

Clone the full AntigenLM repository with training and downstream pipelines.

Model Download

Pretrained checkpoints for immediate embedding and downstream evaluation.

  • Final model for downstream embeddings
  • Intermediate model from LoRA fine-tuning
  • Base pretrained microbe model

Contact

If you have any questions, please contact us:

Corresponding Author:
Bin Liu, Professor, Beijing Institute of Technology, E-mail: bliu@bliulab.net
Main Developer:
Kai Chen, PhD Candidate, Beijing Institute of Technology, E-mail: a18844151839@163.com
Other Authors:
Ke Yan, Associate Researcher, Beijing Institute of Technology, E-mail: kyan@bliulab.net

About

Bin Liu’s lab at Beijing Institute of Technology (BIT) focuses on developing techniques grounded in natural language processing (NLP) to uncover the meaning of the “book of life”. The research areas of Bin Liu’s lab include:

  1. Developing biological language models (BLMs);
  2. Studying natural language processing techniques;
  3. Applying BLMs to biological sequence analysis;
  4. Protein remote homology detection and fold recognition;
  5. Predicting DNA/RNA/Peptide/Ligand binding proteins and their binding residues;
  6. Disordered protein/region prediction based on sequence labelling models;
  7. Predicting noncoding RNA-disease associations;
  8. Identifying protein complexes;
  9. DNA/RNA sequence analysis.

For more relevant research, please see http://bliulab.net.

Acknowledgements

We sincerely thank the laboratory members, partners, and reviewers for their dedicated efforts and valuable time invested in this project. Additionally, we thank the National Natural Science Foundation of China (No. 62325202, U22A2039) for its support of this work.

This project uses a free and open-source CSS framework for its UI.