BioSeq-BLM: a platform for analyzing DNA, RNA, and protein sequences based on biological language models



1 Biological language models

1.1 BGLMs based on word properties

Similar to sentences, biological sequences have their own words, whose diverse properties reflect evolutionary information, physicochemical values, structural information, etc. These properties are incorporated into BGLMs to represent biological sequences more comprehensively. There are 29 BGLMs based on word properties (see Table 1).

1.2 BGLMs based on syntax rules

The syntax rules reflect the relationships among residues, and 29 BGLMs based on syntax rules are summarized and listed in Table 2.

1.3 BSLMs based on BOW

The BOW model represents a sentence as a "bag" of words described by word occurrence frequencies, ignoring grammar and even word order (1). Applying this model to the words of biological sequences generates 12 BSLMs based on BOW (see Table 3).
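As a rough sketch of the idea (not BioSeq-BLM's own code), the snippet below splits toy DNA sequences into overlapping k-mer "words" and builds BOW count vectors with scikit-learn's CountVectorizer; the helper to_kmer_sentence is a hypothetical illustration.

```python
# Minimal k-mer BOW sketch, assuming scikit-learn >= 1.0 is available.
from sklearn.feature_extraction.text import CountVectorizer

def to_kmer_sentence(seq, k=3):
    """Split a sequence into overlapping k-mer 'words' separated by spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

seqs = ["ACGTACGTAC", "TTGACGTTGA"]            # toy DNA sequences
docs = [to_kmer_sentence(s, k=3) for s in seqs]

vectorizer = CountVectorizer(lowercase=False)  # keep k-mers case-sensitive
bow = vectorizer.fit_transform(docs)           # word-occurrence counts
print(vectorizer.get_feature_names_out())
print(bow.toarray())                           # one BOW vector per sequence
```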

1.4 BSLMs based on TF-IDF

The TF-IDF model (2) reflects the importance of each word to a biological sequence. Applying this model to the words of biological sequences generates 12 BSLMs based on TF-IDF (see Table 4).
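A hedged sketch of the same idea with TF-IDF weighting, again using scikit-learn rather than the platform's own implementation; the k-mer "documents" below are toy examples.

```python
# Minimal TF-IDF sketch over k-mer "documents";
# tf-idf(t, d) = tf(t, d) * idf(t), as computed by TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["ACG CGT GTA TAC ACG", "TTG TGA GAC ACG CGT"]  # k-mer sentences
tfidf = TfidfVectorizer(lowercase=False)
X = tfidf.fit_transform(docs)              # one weighted vector per sequence
print(X.toarray())
```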

1.5 BSLMs based on TextRank

TextRank (3), a graph-based ranking model, identifies key sentences by ranking their importance within a text and assigns higher weights to more influential words. Applying this model to the words of biological sequences generates 12 BSLMs based on TextRank (see Table 5).
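The following illustrative sketch ranks the k-mer words of one sequence by PageRank over a co-occurrence graph. It assumes the networkx package; the window size and damping factor are arbitrary choices, not the platform's settings.

```python
# Rough TextRank-style sketch: rank k-mer "words" by PageRank over a
# co-occurrence graph built with a sliding window of size 2.
import networkx as nx

words = ["ACG", "CGT", "GTA", "TAC", "ACG", "CGT"]   # k-mers of one sequence
G = nx.Graph()
window = 2
for i, w in enumerate(words):
    for j in range(i + 1, min(i + 1 + window, len(words))):
        G.add_edge(w, words[j])                      # co-occurrence edge

scores = nx.pagerank(G, alpha=0.85)   # higher score = more "influential" word
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```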

1.6 BSLMs based on topic models

The topic model discovers the abstract “topics” and the latent semantic structures of a “sequence document” by using Latent Semantic Analysis (LSA) (4), Probabilistic Latent Semantic Analysis (PLSA) (5), Latent Dirichlet Allocation (LDA) (6) and Labeled-Latent Dirichlet Allocation (Labeled-LDA) (7), leading to 12 BSLMs based on topic models (see Table 6).
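As an illustration of one of these topic models, the sketch below derives a per-sequence topic distribution from k-mer counts with scikit-learn's LatentDirichletAllocation; it is a toy example, not the platform's implementation, and the other topic models are applied analogously.

```python
# LDA sketch: turn k-mer counts into a low-dimensional topic-distribution
# feature vector for each sequence "document".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["ACG CGT GTA ACG", "TTG TGA GAC TTG", "ACG GAC TGA CGT"]
counts = CountVectorizer(lowercase=False).fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_vectors = lda.fit_transform(counts)   # one topic distribution per sequence
print(topic_vectors)
```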

1.7 BNLMs based on word embedding

Because linguistic objects with similar distributions have similar meanings (8), word embedding embeds each word into a continuous real-valued vector to represent the words. In this study, word2vec (9), GloVe (10) and fastText (11) are combined with the aforementioned words of biological sequences, and the corresponding 36 BNLMs based on word embedding are listed in Table 7.
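A minimal word2vec sketch on tokenized k-mer sentences is shown below (assuming gensim >= 4.0, hence the vector_size argument); GloVe and fastText are applied analogously to the same word sets.

```python
# word2vec sketch: each k-mer "word" is embedded into a continuous
# real-valued vector learned from its distributional context.
from gensim.models import Word2Vec

sentences = [["ACG", "CGT", "GTA", "TAC"],
             ["TTG", "TGA", "GAC", "ACG"]]   # tokenized toy sequences

model = Word2Vec(sentences, vector_size=8, window=2, min_count=1, sg=1)
vec = model.wv["ACG"]        # continuous real-valued vector for the word "ACG"
print(vec.shape)             # (8,)
```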

1.8 BNLMs based on automatic features

Deep learning techniques are able to automatically extract linguistic features independent of grammar rules and other prior knowledge. In this study, autoencoder (12), CNN-BiLSTM (13) and DCNN-BiLSTM (13) are used to model the dependencies among residues/words in biological sequences, and MotifCNN (14) and MotifDCNN (14) are used to capture motif-based features. The resulting 5 BNLMs based on automatic features are shown in Table 8.
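To illustrate the general CNN-BiLSTM idea (a sketch only, not BioSeq-BLM's exact network), the PyTorch module below extracts local patterns with a 1D convolution, models longer-range dependencies with a bidirectional LSTM, and pools the outputs into a fixed-length automatic feature vector.

```python
# Illustrative CNN-BiLSTM feature extractor; all sizes are arbitrary choices.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, vocab_size=64, embed_dim=32, conv_dim=64, lstm_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, conv_dim, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(conv_dim, lstm_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, tokens):                         # tokens: (batch, seq_len)
        x = self.embed(tokens)                         # (batch, seq_len, embed_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, conv_dim, seq_len)
        out, _ = self.bilstm(x.transpose(1, 2))        # (batch, seq_len, 2*lstm_dim)
        return out.mean(dim=1)                         # fixed-length feature vector

features = CNNBiLSTM()(torch.randint(0, 64, (2, 50)))
print(features.shape)                                  # torch.Size([2, 128])
```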

1.9 Biological semantic similarity language models

Calculating the similarity between biological sequences is one of the keys in biological sequence analysis and can be viewed as measuring the semantic similarity among sentences. The biological semantic similarity language models (BSSLMs) represent biological sequences based on these semantic similarities, which are computed from the feature vectors generated by the aforementioned three kinds of BLMs via Euclidean Distance (15-17), Manhattan Distance (18), Chebyshev Distance (19), Hamming Distance (20), Cosine Similarity (15-17), Pearson Correlation Coefficient (15-17), KL Divergence (relative entropy) (15-17), and Jaccard Similarity Coefficient (15-17). The resulting 8 BSSLMs are listed in Table 9.
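The sketch below computes several of the Table 9 measures between two toy feature vectors using numpy/scipy; the platform's exact normalization (and the remaining measures) may differ.

```python
# Semantic similarity / distance measures between two BLM feature vectors.
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr, entropy

a = np.array([0.2, 0.3, 0.5])
b = np.array([0.1, 0.4, 0.5])

print(distance.euclidean(a, b))        # Euclidean Distance
print(distance.cityblock(a, b))        # Manhattan Distance
print(distance.chebyshev(a, b))        # Chebyshev Distance
print(1 - distance.cosine(a, b))       # Cosine Similarity
r, _ = pearsonr(a, b)
print(r)                               # Pearson Correlation Coefficient
print(entropy(a, b))                   # KL Divergence (relative entropy)
# Hamming and Jaccard are defined analogously for discretized vectors.
```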

2 Predictor construction algorithms

2.1 Support Vector Machine

Support Vector Machine (SVM) is a supervised learning algorithm that performs data analysis for classification and regression (21, 22). Here, the scikit-learn (23) package was used as the implementation of the SVM algorithm, with a radial basis function (RBF) kernel.
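A minimal usage sketch matching this description, with toy feature vectors and hypothetical labels:

```python
# scikit-learn SVC with an RBF kernel, trained on BLM feature vectors X
# and labels y (toy data for illustration only).
from sklearn.svm import SVC

X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]   # feature vectors
y = [1, 1, 0, 0]                                        # class labels

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.predict([[0.15, 0.85]]))
```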

2.2 Random Forest

Random Forest (RF) is an ensemble learning method for classification, regression and other tasks. In BioSeq-BLM, the RF implementation in scikit-learn (23), a widely used Python machine learning package, is used.
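An analogous sketch for the RF predictor (toy data again; hyperparameters are arbitrary):

```python
# scikit-learn Random Forest classifier on toy BLM feature vectors.
from sklearn.ensemble import RandomForestClassifier

X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y = [1, 1, 0, 0]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
print(rf.predict([[0.15, 0.85]]))
```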

2.3 Conditional Random Field

In order to capture the global information of the residues along a long sequence, the sequence labelling algorithm Conditional Random Field (CRF) is provided for residue-level analysis. Compared with traditional classifiers such as SVM and RF, CRF models a biological sequence in a global fashion and considers the dependency information of all residues along the sequence (24).
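A hedged residue-level sketch using the sklearn-crfsuite package, which is one possible CRF implementation (BioSeq-BLM's backend may differ); each residue is described by a feature dictionary, and the whole sequence is labelled jointly:

```python
# CRF sequence labelling sketch with sklearn-crfsuite (illustrative only).
import sklearn_crfsuite

X_train = [[{"res": "A"}, {"res": "C"}, {"res": "G"}],   # one sequence per item
           [{"res": "T"}, {"res": "T"}, {"res": "A"}]]
y_train = [["1", "0", "0"],                              # per-residue labels
           ["0", "1", "1"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict([[{"res": "A"}, {"res": "T"}]]))       # labels whole sequence
```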

2.4 Convolution Neural Network

In natural language processing, convolutional neural networks (CNNs) (25) are most commonly applied to text classification problems owing to their high degree of parallelization. They are also known as shift-invariant or space-invariant artificial neural networks because of their shared-weights architecture and translation-invariance characteristics, which make them capable of capturing localized features.
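A minimal 1D-CNN sequence classifier sketch in PyTorch (illustrative only): shared-weight filters slide over residue embeddings to capture localized features, which are max-pooled before classification.

```python
# Toy 1D-CNN sequence classifier; vocabulary and layer sizes are arbitrary.
import torch
import torch.nn as nn

class SeqCNN(nn.Module):
    def __init__(self, vocab_size=25, embed_dim=16, n_filters=32, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=5, padding=2)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, tokens):                     # (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)     # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))               # localized features
        x = x.max(dim=2).values                    # global max pooling
        return self.fc(x)                          # class logits

logits = SeqCNN()(torch.randint(0, 25, (4, 60)))
print(logits.shape)                                # torch.Size([4, 2])
```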

2.5 Long Short-Term Memory

Long short-term memory (LSTM) (26) is an artificial recurrent neural network (RNN) architecture. A common LSTM unit is composed of an input gate, an output gate and a forget gate, which makes it better suited to capturing long-term dependencies than standard RNNs.
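A minimal PyTorch sketch of an LSTM layer over residue embeddings (illustrative, not the platform's configuration):

```python
# LSTM over a batch of embedded sequences; the gating mechanism lets the
# hidden state carry long-range information along the sequence.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 60, 16)            # (batch, seq_len, feature_dim)
out, (h_n, c_n) = lstm(x)
print(out.shape, h_n.shape)           # (4, 60, 32) and (1, 4, 32)
```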

2.6 Gated Recurrent Units

Gated recurrent units (GRUs) (27) are a gating mechanism in recurrent neural networks (RNNs). Unlike the LSTM, a GRU unit contains only an update gate and a reset gate, which reduces the number of parameters and alleviates the vanishing-gradient problem during backpropagation.
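The GRU counterpart to the LSTM sketch above; counting parameters shows the effect of having only two gates (illustrative PyTorch code):

```python
# A GRU layer of the same size as the LSTM above uses fewer parameters
# (3 gate blocks instead of 4, i.e. roughly 3/4 of the LSTM's weights).
import torch.nn as nn

gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(gru), "<", n_params(lstm))
```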

2.7 Transformer

Like recurrent neural networks (RNNs), the Transformer (28) is designed to handle sequential data, especially for natural language tasks such as translation and text summarization. Based on the self-attention mechanism and an encoder-decoder architecture, the Transformer models the association between any two units in a sequence and achieves state-of-the-art performance on many NLP tasks. Transformers have become the primary choice for tackling many NLP problems, replacing most recurrent neural network models such as the long short-term memory (LSTM) network.
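A self-attention encoder sketch using PyTorch's built-in TransformerEncoder (assuming PyTorch >= 1.9 for the batch_first argument; illustrative only):

```python
# Each position attends to every other position, modelling pairwise
# associations between sequence units.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(4, 60, 32)       # (batch, seq_len, d_model) residue embeddings
out = encoder(x)
print(out.shape)                 # torch.Size([4, 60, 32])
```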

2.8 Weighted Transformer

The Weighted Transformer, a Transformer with modified attention layers, replaces multi-head attention with multiple self-attention branches that learn to combine their outputs during training. Experimental results indicate that the Weighted Transformer not only outperforms the baseline network but also converges faster (29).

2.9 Reformer

Similar to the Weighted Transformer, the Reformer is an attention-based model that improves upon the Transformer. In the Reformer, dot-product attention is replaced by locality-sensitive hashing attention, and the standard residual layers are replaced by reversible residual layers. As a result, the Reformer performs comparably to Transformer models while being much more memory-efficient and much faster on long sequences (30).


For the above machine learning algorithms, the detailed methods, descriptions and applicable analysis levels are listed in Table 10. In addition, to handle imbalanced datasets, multiple sampling techniques are provided for constructing more powerful predictors; details are listed in Table 11.
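The sketch below shows the three sampling options with the imbalanced-learn package on toy imbalanced data (k_neighbors is reduced only because the toy minority class is tiny):

```python
# Over-sampling (SMOTE), under-sampling (Tomek links) and their combination.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8]]
y = [0, 0, 0, 0, 0, 0, 1, 1]                    # imbalanced labels

X_over, y_over = SMOTE(k_neighbors=1).fit_resample(X, y)
X_under, y_under = TomekLinks().fit_resample(X, y)
X_comb, y_comb = SMOTETomek(smote=SMOTE(k_neighbors=1)).fit_resample(X, y)
print(len(y), len(y_over), len(y_under), len(y_comb))
```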

3 Results analysis

3.1 Method for results analysis

We provide a result analysis framework to interpret the predictive results with four modules: normalization, clustering, feature selection and dimension reduction. The detailed methods and descriptions are listed in Table 12.
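An illustrative scikit-learn sketch of the four modules applied to a toy feature matrix (not the platform's exact settings):

```python
# Normalization, clustering, feature selection and dimension reduction.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA

X = np.random.rand(20, 10)                 # 20 sequences, 10-dim features
y = np.array([0] * 10 + [1] * 10)          # toy labels

X_norm = MinMaxScaler().fit_transform(X)                        # normalization
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_norm)    # clustering
X_sel = SelectKBest(chi2, k=5).fit_transform(X_norm, y)         # feature selection
X_red = PCA(n_components=2).fit_transform(X_norm)               # dimension reduction
print(X_sel.shape, X_red.shape)
```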

4 Table of BioSeq-BLM

Table 1. 29 BGLMs based on word properties.

Category | Method | Description
DNA sequence | One-hot | Basic one-hot (31)
 | DBE | Dinucleotide Binary Encoding (32)
 | Position-specific-2 | Position-specific of two nucleotides (33)
 | Position-specific-3 | Position-specific of three nucleotides (33)
 | Position-specific-4 | Position-specific of four nucleotides (33)
 | DPC | Dinucleotide physicochemical (34, 35)
 | TPC | Trinucleotide physicochemical (34, 35)
 | BLAST-matrix | BLAST-matrix (36)
RNA sequence | One-hot | Basic one-hot (31)
 | DBE | Dinucleotide Binary Encoding (32)
 | Position-specific-2 | Position-specific of two nucleotides (33)
 | Position-specific-3 | Position-specific of three nucleotides (33)
 | Position-specific-4 | Position-specific of four nucleotides (33)
 | NCP | Nucleotide Chemical Property (37)
 | DPC | Dinucleotide physicochemical (34, 35)
 | RSS | RNA secondary structure (38)
Protein sequence | One-hot | Basic one-hot (31)
 | One-hot (6-bit) | 6-dimensional one-hot method (39)
 | Binary (5-bit) | Five-bit binary encoding (40)
 | AESNN3 | Learned from alignments (41)
 | Position-specific-2 | Position-specific of two residues (33)
 | PP | Properties from AAindex (42)
 | SS | Secondary structure (43)
 | SASA | Solvent accessible surface area (44)
 | PAM250 | PAM250 matrix (45)
 | BLOSUM62 | BLOSUM62 matrix (46)
 | PSSM | PSSM matrix (47)
 | PSFM | Frequency profiles matrix (48)
 | CS | Conservation score (49)

Table 2. 29 BGLMs based on syntax rules.

Category | Method | Description
DNA sequence | DAC | Dinucleotide-based auto covariance (50)
 | DCC | Dinucleotide-based cross covariance (50)
 | DACC | Dinucleotide-based auto-cross covariance (50)
 | TAC | Trinucleotide-based auto covariance (50)
 | TCC | Trinucleotide-based cross covariance (50)
 | TACC | Trinucleotide-based auto-cross covariance (50)
 | MAC | Moran autocorrelation (51, 52)
 | GAC | Geary autocorrelation (51, 53)
 | NMBAC | Normalized Moreau-Broto autocorrelation (51, 54)
 | ZCPseKNC | Z-curve pseudo k-tuple nucleotide composition (55)
 | ND | Nucleotide Density (56)
RNA sequence | DAC | Dinucleotide-based auto covariance (50, 57)
 | DCC | Dinucleotide-based cross covariance (50, 57)
 | DACC | Dinucleotide-based auto-cross covariance (50, 57)
 | MAC | Moran autocorrelation (51, 52)
 | GAC | Geary autocorrelation (51, 53)
 | NMBAC | Normalized Moreau-Broto autocorrelation (51, 54)
 | ND | Nucleotide Density (56)
Protein sequence | AC | Auto covariance (50, 57)
 | CC | Cross covariance (50, 57)
 | ACC | Auto-cross covariance (50, 57)
 | PDT | Physicochemical distance transformation (58)
 | PDT-Profile | Profile-based physicochemical distance transformation (58)
 | AC-PSSM | Profile-based auto covariance (50)
 | CC-PSSM | Profile-based cross covariance (50)
 | ACC-PSSM | Profile-based auto-cross covariance (50)
 | PSSM-DT | PSSM distance transformation (58)
 | PSSM-RT | PSSM relation transformation (59)
 | Motif-PSSM | Motif-initialized convolution kernel (60)

Table 3. BSLMs based on BOW.

Category | Method | Description
DNA sequence | Kmer-BOW | Kmer-based BOW (35)
 | RevKmer-BOW | Reverse-complementary-kmer-based BOW (35, 61, 62)
 | Mismatch-BOW | Mismatch-based BOW (63-65)
 | Subsequence-BOW | Subsequence-based BOW (63, 65, 66)
RNA sequence | Kmer-BOW | Kmer-based BOW (67)
 | Mismatch-BOW | Mismatch-based BOW (63-65)
 | Subsequence-BOW | Subsequence-based BOW (63, 65, 66)
Protein sequence | Kmer-BOW | Kmer-based BOW (68)
 | Mismatch-BOW | Mismatch-based BOW (64)
 | DR-BOW | Distance-Residue-based BOW (69)
 | Top-n-gram-BOW | Top-n-gram-based BOW (70)
 | DT-BOW | Distance-Top-n-gram-based BOW (69)

Table 4. BSLMs based on TF-IDF.

Category | Method | Description
DNA sequence | Kmer-TF-IDF | Kmer-based TF-IDF (35, 71)
 | RevKmer-TF-IDF | Reverse-complementary-kmer-based TF-IDF (35, 61, 62, 71)
 | Mismatch-TF-IDF | Mismatch-based TF-IDF (63-65, 71)
 | Subsequence-TF-IDF | Subsequence-based TF-IDF (63, 65, 66, 71)
RNA sequence | Kmer-TF-IDF | Kmer-based TF-IDF (67, 71)
 | Mismatch-TF-IDF | Mismatch-based TF-IDF (63-65, 71)
 | Subsequence-TF-IDF | Subsequence-based TF-IDF (63, 65, 66, 71)
Protein sequence | Kmer-TF-IDF | Kmer-based TF-IDF (68, 71)
 | Mismatch-TF-IDF | Mismatch-based TF-IDF (64, 71)
 | DR-TF-IDF | Distance-Residue-based TF-IDF (69, 71)
 | Top-n-gram-TF-IDF | Top-n-gram-based TF-IDF (70, 71)
 | DT-TF-IDF | Distance-Top-n-gram-based TF-IDF (69, 71)

Table 5. BSLMs based on TextRank.

Category | Method | Description
DNA sequence | Kmer-TextRank | Kmer-based TextRank (3, 35)
 | RevKmer-TextRank | Reverse-complementary-kmer-based TextRank (3, 35, 61, 62)
 | Mismatch-TextRank | Mismatch-based TextRank (3, 63-65)
 | Subsequence-TextRank | Subsequence-based TextRank (3, 63, 65, 66)
RNA sequence | Kmer-TextRank | Kmer-based TextRank (3, 67)
 | Mismatch-TextRank | Mismatch-based TextRank (3, 63-65)
 | Subsequence-TextRank | Subsequence-based TextRank (3, 63, 65, 66)
Protein sequence | Kmer-TextRank | Kmer-based TextRank (3, 68)
 | Mismatch-TextRank | Mismatch-based TextRank (3, 64)
 | DR-TextRank | Distance-Residue-based TextRank (3, 69)
 | Top-n-gram-TextRank | Top-n-gram-based TextRank (3, 70)
 | DT-TextRank | Distance-Top-n-gram-based TextRank (3, 69)

Table 6. BSLMs based on topic models.

Algorithm | Method | Description
LSA | BOW-LSA | Latent Semantic Analysis (4)
 | TF-IDF-LSA |
 | TextRank-LSA |
LDA | BOW-LDA | Latent Dirichlet Allocation (6)
 | TF-IDF-LDA |
 | TextRank-LDA |
Labeled-LDA | BOW-Labeled-LDA | Labeled Latent Dirichlet Allocation (7)
 | TF-IDF-Labeled-LDA |
 | TextRank-Labeled-LDA |
PLSA | BOW-PLSA | Probabilistic Latent Semantic Analysis (5)
 | TF-IDF-PLSA |
 | TextRank-PLSA |

Table 7. BNLMs based on word embedding.

Category | Algorithm | Methods | Description
DNA sequence | word2vec | Kmer2vec, RevKmer2vec, Mismatch2vec, Subsequence2vec | Learn word representations via the word2vec model (9)
 | GloVe | Kmer-GloVe, RevKmer-GloVe, Mismatch-GloVe, Subsequence-GloVe | Learn word representations via the GloVe model (10)
 | fastText | Kmer-fastText, RevKmer-fastText, Mismatch-fastText, Subsequence-fastText | Learn word representations via the fastText model (11)
RNA sequence | word2vec | Kmer2vec, Mismatch2vec, Subsequence2vec | Learn word representations via the word2vec model (9)
 | GloVe | Kmer-GloVe, Mismatch-GloVe, Subsequence-GloVe | Learn word representations via the GloVe model (10)
 | fastText | Kmer-fastText, Mismatch-fastText, Subsequence-fastText | Learn word representations via the fastText model (11)
Protein sequence | word2vec | Kmer2vec, Mismatch2vec, DR2vec, Top-n-gram2vec, DT2vec | Learn word representations via the word2vec model (9)
 | GloVe | Kmer-GloVe, Mismatch-GloVe, DR-GloVe, Top-n-gram-GloVe, DT-GloVe | Learn word representations via the GloVe model (10)
 | fastText | Kmer-fastText, Mismatch-fastText, DR-fastText, Top-n-gram-fastText, DT-fastText | Learn word representations via the fastText model (11)

Table 8. BNLMs based on automatic features.

Model | Description
MotifCNN | CNN constructed with motif-initialized convolution kernels (14)
MotifDCNN | DCNN constructed with motif-initialized convolution kernels (14)
CNN-BiLSTM | Combination of CNN and BiLSTM (13)
DCNN-BiLSTM | Combination of DCNN and BiLSTM (13)
Autoencoder | Learning sequence representations based on autoencoders (12)

Table 9. BSSLMs.

Method | Description
ED | Euclidean Distance (15-17)
MD | Manhattan Distance (18)
CD | Chebyshev Distance (19)
HD | Hamming Distance (20)
CS | Cosine Similarity (15-17)
PCC | Pearson Correlation Coefficient (15-17)
KLD | KL Divergence (Relative Entropy) (15-17)
JSC | Jaccard Similarity Coefficient (15-17)

Table 10. Machine learning algorithms for constructing predictors.

Category | Method | Description | Analysis level
Classification algorithm | SVM | Support Vector Machine (72) | S, R
 | RF | Random Forest (73) | S, R
Sequence labelling algorithm | CRF | Conditional Random Field (74) | R
Deep learning algorithm | CNN | Convolutional Neural Network (25) | S, R
 | LSTM | Long Short-Term Memory (26) | S, R
 | GRU | Gated Recurrent Unit (27) | S, R
 | Transformer | Network based entirely on self-attention (28) | S, R
 | Weighted Transformer | Weighted Transformer network (29) | S, R
 | Reformer | Efficient Transformer (30) | S, R

Note: S, sequence-level analysis; R, residue-level analysis.

Table 11. Sampling techniques for constructing predictors.

Method | Description
over | Over-sampling based on the Synthetic Minority Oversampling Technique (SMOTE) (75)
under | Under-sampling based on the Tomek links method (76)
combine | Combined over- and under-sampling via 'SMOTETomek' in the imbalanced-learn package (77)

Table 12. Results analysis methods for biological sequences.

Algorithm | Method | Description
Standardization or normalization | min-max-scale | Normalization with scikit-learn's 'MinMaxScaler' (78)
 | standard-scale | Standardization with scikit-learn's 'StandardScaler' (78)
 | L1-regularization | Normalization based on L1 regularization (79)
 | L2-regularization | Normalization based on L2 regularization (80)
Clustering | AP | Clustering based on the Affinity Propagation algorithm (81)
 | DBSCAN | Density-Based Spatial Clustering of Applications with Noise (82)
 | GMM | Clustering based on the Gaussian Mixture Model (83)
 | AGNES | Clustering based on the agglomerative nesting algorithm (84)
 | Kmeans | Clustering based on the K-means algorithm (85)
Feature selection | chi2 | Univariate feature selection based on the chi-square test (86, 87)
 | F-value | Univariate feature selection based on the F-test (86, 87)
 | MIC | Univariate feature selection based on mutual information (86, 87)
 | RFE | Feature selection based on Recursive Feature Elimination (88)
 | Tree | Tree-based feature selection (89)
Dimension reduction | PCA | Dimension reduction based on principal component analysis (90)
 | KernelPCA | Dimension reduction based on kernel principal component analysis with an RBF kernel (91)
 | TSVD | Dimension reduction based on truncated singular value decomposition (92)

References


Fasta format example(DNA):