Home

Introduction

AntigenLM Overview

AntigenLM: Lineage-guided protein language modelling for microbial antigen representation and CD8⁺ immune recognition

A dedicated protein language model for microbial antigens, integrating lineage-guided transfer and structure-aware masking to boost immune-specific representation learning.

Large-scale computational analysis of microbial antigen immunogenicity is pivotal for vaccine development, enabling rapid prioritization of key antigens and epitopes from vast candidate pools. The efficiency of this process hinges on high-quality, transferable antigen representations. However, antigen representation learning is constrained by two major challenges. First, coupled data and domain biases: high-quality immune annotations are severely scarce, making immune-specific supervision difficult to achieve. Moreover, the severe underrepresentation of antigens in the pre-training corpora of general-purpose protein language models (PLMs) biases these models toward broadly conserved features and fails to yield representations sensitive to immune specificity. Second, intrinsic obstacles in feature extraction: immunological determinants are inherently weak and localized, making them difficult to capture, and the naive long-sequence truncation mechanisms of general PLMs readily sever critical structural fragments, exacerbating the loss of immune signals.

We therefore propose the field's first dedicated protein language model for microbial antigens, AntigenLM. To resolve the coupled data and domain biases, we developed a lineage-guided hierarchical transfer learning strategy that progressively specializes representations from microbes to microbial antigens, effectively mitigating biases induced by distribution shifts. To tackle the intrinsic feature-extraction obstacles, the model integrates structure-aware masking and overlapping sliding-window mechanisms, which together heighten sensitivity to localized immune determinants while safeguarding the structural integrity of key sequence fragments.

Evaluated across a multi-level CD8⁺ T cell pathway framework spanning antigen prioritization, presentation, and TCR recognition, AntigenLM substantially outperforms general-purpose PLMs such as the ESM series. It also generalizes remarkably well, achieving a PR-AUC of 0.944 in highly imbalanced (1:10) screening and raising zero-shot TCR recognition AUROC from 0.593 to 0.891. By overcoming the limits of generic pre-training, its strong cross-domain transferability provides a robust representational foundation for vaccine discovery and broad computational immunology tasks.
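The overlapping sliding-window mechanism mentioned above can be sketched in a few lines. This is a minimal illustration, not the model's actual tokenizer: the function name and the window/overlap sizes are assumptions chosen for clarity. Because consecutive windows share an overlap region, a fragment that would be severed by naive truncation appears intact in at least one window.

```python
def sliding_windows(seq, window=512, overlap=128):
    """Split a long sequence into overlapping windows so that fragments
    near window boundaries appear intact in at least one window.
    Illustrative sketch: window/overlap sizes are assumed, not AntigenLM's."""
    if len(seq) <= window:
        return [seq]
    step = window - overlap  # stride between window starts
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + window])
        if start + window >= len(seq):
            break  # last window already reaches the end of the sequence
    return windows
```

For a 1000-residue sequence with these settings, this yields three windows, with each pair of neighbours sharing 128 residues.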

Figure 1: Hierarchical training pipeline for AntigenLM
Figure 1. Hierarchical training pipeline for AntigenLM with structure-aware fine-tuning. (a) Protein sequences were collected from NCBI and UniProt across bacteria, viruses and fungi, and microbial antigenic peptides were curated from IEDB, followed by MMseqs2 redundancy reduction (50%). Secondary structures were predicted by PSIPRED and grouped into α-helix, β-sheet and coil. (b) We performed stepwise BERT-style MLM training along “microbes → pathogenic microbes → microbial antigens”, yielding MicroLM, PathogLM and AntigenLM, with sliding-window tokenization for variable-length sequences. (c) During fine-tuning, we applied secondary-structure–conditioned dynamic masking (e.g., 15% for helix and sheet and 40% for coil) and updated only the top encoder blocks while freezing lower layers. (d) The resulting representations were transferred to downstream tasks including protective antigen recognition, antigen–TCR recognition and antigen–MHC recognition.
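The secondary-structure-conditioned dynamic masking of panel (c) can be sketched as follows. The 15% (helix/sheet) and 40% (coil) rates follow the caption; everything else, including the function name, the mask token, and the independent per-residue masking decision, is an illustrative assumption rather than the published implementation.

```python
import random

# Mask rates conditioned on predicted secondary structure, following the
# rates quoted in the caption: 15% for helix (H) and sheet (E), 40% for coil (C).
MASK_RATE = {"H": 0.15, "E": 0.15, "C": 0.40}
MASK_TOKEN = "<mask>"  # placeholder token name; an assumption


def structure_aware_mask(sequence, ss_labels, rng=random):
    """Return (masked_tokens, target_positions): each residue is masked
    independently with a probability set by its secondary-structure label."""
    assert len(sequence) == len(ss_labels)
    masked, targets = [], []
    for i, (aa, ss) in enumerate(zip(sequence, ss_labels)):
        if rng.random() < MASK_RATE[ss]:
            masked.append(MASK_TOKEN)
            targets.append(i)  # positions the MLM objective must reconstruct
        else:
            masked.append(aa)
    return masked, targets
```

Biasing the mask budget toward coil regions forces the model to spend more of its reconstruction effort on the flexible, loosely constrained segments where localized immune determinants often reside.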

Server

Input antigen sequence and optional TCR/HLA, then run predictions.

AntigenLM Interface

AntigenLM Server

Enter an antigen sequence, view its AntigenLM representation, and run three downstream predictors in one place.

Model AntigenLM 300M
Outputs Embedding + 3 Tasks

AntigenLM Representation

Peptide embedding: -
Sequence embedding: -

Protective antigen recognition

-

Antigen-TCR recognition

-

Antigen-MHC recognition

-
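The server reports both a peptide-level and a sequence-level embedding. The page does not state how per-window outputs are merged for long sequences, so the following is only a plausible sketch under an assumed scheme: residues covered by two overlapping windows are averaged, and the sequence embedding is the mean over residues. All names here are hypothetical.

```python
def pool_sequence_embedding(window_embs, window=512, overlap=128, seq_len=None):
    """Merge per-window, per-residue embeddings (lists of float vectors) into
    one sequence-level vector. Assumed scheme, not AntigenLM's documented one:
    overlap positions are averaged, then residues are mean-pooled."""
    step = window - overlap
    if seq_len is None:
        # Last window may be shorter than `window` at the end of the sequence.
        seq_len = step * (len(window_embs) - 1) + len(window_embs[-1])
    dim = len(window_embs[0][0])
    sums = [[0.0] * dim for _ in range(seq_len)]
    counts = [0] * seq_len
    for w, emb in enumerate(window_embs):
        start = w * step
        for j, vec in enumerate(emb):
            counts[start + j] += 1
            for d in range(dim):
                sums[start + j][d] += vec[d]
    per_residue = [[s / c for s in row] for row, c in zip(sums, counts)]
    # Sequence-level embedding: mean over residues, dimension by dimension.
    return [sum(col) / seq_len for col in zip(*per_residue)]
```

With two toy windows of constant vectors 1.0 and 3.0 (window 4, overlap 2), the overlap positions average to 2.0 and the pooled sequence embedding is [2.0].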

Download

Code, datasets, and pretrained checkpoints for AntigenLM.

Resources

Everything you need to start with AntigenLM.

Grab the GitHub repository, curated datasets, and pretrained models in a single place.

Updated Hugging Face + GitHub

Code Download

Clone the full AntigenLM repository with training and downstream pipelines.

Model Download

Pretrained checkpoints for immediate embedding and downstream evaluation.

  • Final model for downstream embeddings
  • Intermediate model from LoRA fine-tuning
  • Base pretrained microbe model

Contact

If you have any questions, please contact us:

Corresponding Author:
Bin Liu, Professor, Beijing Institute of Technology, E-mail: bliu@bliulab.net
Main Developer:
Kai Chen, PhD Candidate, Beijing Institute of Technology, E-mail: a18844151839@163.com
Other Authors:
Ke Yan, Associate Researcher, Beijing Institute of Technology, E-mail: kyan@bliulab.net

About

Bin Liu’s lab at Beijing Institute of Technology (BIT) focuses on developing techniques grounded in natural language processing (NLP) to uncover the meaning of the “book of life”. The research areas of Bin Liu’s lab include:

  1. Developing biological language models (BLMs);
  2. Studying natural language processing techniques;
  3. Applying BLMs to biological sequence analysis;
  4. Protein remote homology detection and fold recognition;
  5. Predicting DNA/RNA/Peptide/Ligand binding proteins and their binding residues;
  6. Disordered protein/region prediction based on sequence labelling models;
  7. Predicting noncoding RNA-disease associations;
  8. Identifying protein complexes;
  9. DNA/RNA sequence analysis.

For more relevant research, please see http://bliulab.net.

Acknowledgements

We sincerely thank the laboratory members, partners, and reviewers for their dedicated efforts and valuable time invested in this project. Additionally, we thank the National Natural Science Foundation of China (No. 62325202, U22A2039) for its support of this work.

This project uses a free and open-source CSS framework for its UI.