BioSeq-BLM: a platform for analyzing DNA, RNA, and protein sequences based on biological language models



1 Biological language models

1.1 BGLMs based on word properties

Similar to sentences, biological sequences have their own words, whose diverse properties reflect evolutionary information, physicochemical values, structural information, etc. These properties are incorporated into BGLMs to represent biological sequences more comprehensively. There are 29 BGLMs based on word properties (see Table 1).

1.2 BGLMs based on syntax rules

The syntax rules reflect the relationships among residues, and 29 BGLMs based on syntax rules are summarized and listed in Table 2.

1.3 BSLMs based on BOW

The BOW model represents a sentence as a "bag" of words described by word occurrence frequencies, ignoring grammar and even word order (1). Applying this model to the words of biological sequences generates 12 BSLMs based on BOW (see Table 3).
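As a rough sketch of the idea (not BioSeq-BLM's own code), the snippet below splits toy DNA sequences into overlapping k-mer "words" and builds BOW count vectors with scikit-learn's CountVectorizer; the helper to_kmer_sentence is a hypothetical illustration.

```python
# Minimal k-mer BOW sketch, assuming scikit-learn >= 1.0 is available.
from sklearn.feature_extraction.text import CountVectorizer

def to_kmer_sentence(seq, k=3):
    """Split a sequence into overlapping k-mer 'words' separated by spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

seqs = ["ACGTACGTAC", "TTGACGTTGA"]            # toy DNA sequences
docs = [to_kmer_sentence(s, k=3) for s in seqs]

vectorizer = CountVectorizer(lowercase=False)  # keep k-mers case-sensitive
bow = vectorizer.fit_transform(docs)           # word-occurrence counts
print(vectorizer.get_feature_names_out())
print(bow.toarray())                           # one BOW vector per sequence
```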

1.4 BSLMs based on TF-IDF

The TF-IDF model (2) reflects the importance of each word to a biological sequence. Applying this model to the words of biological sequences generates 12 BSLMs based on TF-IDF (see Table 4).
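A hedged sketch of the same idea with TF-IDF weighting, again using scikit-learn rather than the platform's own implementation; the k-mer "documents" below are toy examples.

```python
# Minimal TF-IDF sketch over k-mer "documents";
# tf-idf(t, d) = tf(t, d) * idf(t), as computed by TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["ACG CGT GTA TAC ACG", "TTG TGA GAC ACG CGT"]  # k-mer sentences
tfidf = TfidfVectorizer(lowercase=False)
X = tfidf.fit_transform(docs)              # one weighted vector per sequence
print(X.toarray())
```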

1.5 BSLMs based on TextRank

TextRank (3), a graph-based ranking model, identifies key sentences by ranking their importance within a text and assigns higher weights to more influential words. Applying this model to the words of biological sequences generates 12 BSLMs based on TextRank (see Table 5).
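The following illustrative sketch ranks the k-mer words of one sequence by PageRank over a co-occurrence graph. It assumes the networkx package; the window size and damping factor are arbitrary choices, not the platform's settings.

```python
# Rough TextRank-style sketch: rank k-mer "words" by PageRank over a
# co-occurrence graph built with a sliding window of size 2.
import networkx as nx

words = ["ACG", "CGT", "GTA", "TAC", "ACG", "CGT"]   # k-mers of one sequence
G = nx.Graph()
window = 2
for i, w in enumerate(words):
    for j in range(i + 1, min(i + 1 + window, len(words))):
        G.add_edge(w, words[j])                      # co-occurrence edge

scores = nx.pagerank(G, alpha=0.85)   # higher score = more "influential" word
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```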

1.6 BSLMs based on topic models

The topic model discovers the abstract “topics” and the latent semantic structures of a “sequence document” by using Latent Semantic Analysis (LSA) (4), Probabilistic Latent Semantic Analysis (PLSA) (5), Latent Dirichlet Allocation (LDA) (6) and Labeled-Latent Dirichlet Allocation (Labeled-LDA) (7), leading to 12 BSLMs based on topic models (see Table 6).
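As an illustration of one of these topic models, the sketch below derives a per-sequence topic distribution from k-mer counts with scikit-learn's LatentDirichletAllocation; it is a toy example, not the platform's implementation, and the other topic models are applied analogously.

```python
# LDA sketch: turn k-mer counts into a low-dimensional topic-distribution
# feature vector for each sequence "document".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["ACG CGT GTA ACG", "TTG TGA GAC TTG", "ACG GAC TGA CGT"]
counts = CountVectorizer(lowercase=False).fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_vectors = lda.fit_transform(counts)   # one topic distribution per sequence
print(topic_vectors)
```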

1.7 BNLMs based on word embedding

Because linguistic objects with similar distributions have similar meanings (8), word embedding embeds each word into a continuous real-valued vector to represent the words. In this study, word2vec (9), GloVe (10) and fastText (11) are combined with the aforementioned words of biological sequences, and the corresponding 36 BNLMs based on word embedding are listed in Table 7.
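A minimal word2vec sketch on tokenized k-mer sentences is shown below (assuming gensim >= 4.0, hence the vector_size argument); GloVe and fastText are applied analogously to the same word sets.

```python
# word2vec sketch: each k-mer "word" is embedded into a continuous
# real-valued vector learned from its distributional context.
from gensim.models import Word2Vec

sentences = [["ACG", "CGT", "GTA", "TAC"],
             ["TTG", "TGA", "GAC", "ACG"]]   # tokenized toy sequences

model = Word2Vec(sentences, vector_size=8, window=2, min_count=1, sg=1)
vec = model.wv["ACG"]        # continuous real-valued vector for the word "ACG"
print(vec.shape)             # (8,)
```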

1.8 BNLMs based on automatic features

Deep learning techniques are able to automatically extract linguistic features independent of grammar rules and other prior knowledge. In this study, autoencoder (12), CNN-BiLSTM (13) and DCNN-BiLSTM (13) are used to model the dependencies among residues/words in biological sequences, and MotifCNN (14) and MotifDCNN (14) are used to capture motif-based features. The resulting 5 BNLMs based on automatic features are shown in Table 8.
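To illustrate the general CNN-BiLSTM idea (a sketch only, not BioSeq-BLM's exact network), the PyTorch module below extracts local patterns with a 1D convolution, models longer-range dependencies with a bidirectional LSTM, and pools the outputs into a fixed-length automatic feature vector.

```python
# Illustrative CNN-BiLSTM feature extractor; all sizes are arbitrary choices.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, vocab_size=64, embed_dim=32, conv_dim=64, lstm_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, conv_dim, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(conv_dim, lstm_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, tokens):                         # tokens: (batch, seq_len)
        x = self.embed(tokens)                         # (batch, seq_len, embed_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, conv_dim, seq_len)
        out, _ = self.bilstm(x.transpose(1, 2))        # (batch, seq_len, 2*lstm_dim)
        return out.mean(dim=1)                         # fixed-length feature vector

features = CNNBiLSTM()(torch.randint(0, 64, (2, 50)))
print(features.shape)                                  # torch.Size([2, 128])
```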

1.9 Biological semantic similarity language models

Calculating the similarity between biological sequences is one of the keys in biological sequence analysis and can be viewed as measuring the semantic similarity among sentences. The biological semantic similarity language models (BSSLMs) represent biological sequences based on these semantic similarities, which are computed from the feature vectors generated by the aforementioned three kinds of BLMs via Euclidean Distance (15-17), Manhattan Distance (18), Chebyshev Distance (19), Hamming Distance (20), Cosine Similarity (15-17), Pearson Correlation Coefficient (15-17), KL Divergence (relative entropy) (15-17), and Jaccard Similarity Coefficient (15-17). The resulting 8 BSSLMs are listed in Table 9.
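The sketch below computes several of the Table 9 measures between two toy feature vectors using numpy/scipy; the platform's exact normalization (and the remaining measures) may differ.

```python
# Semantic similarity / distance measures between two BLM feature vectors.
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr, entropy

a = np.array([0.2, 0.3, 0.5])
b = np.array([0.1, 0.4, 0.5])

print(distance.euclidean(a, b))        # Euclidean Distance
print(distance.cityblock(a, b))        # Manhattan Distance
print(distance.chebyshev(a, b))        # Chebyshev Distance
print(1 - distance.cosine(a, b))       # Cosine Similarity
r, _ = pearsonr(a, b)
print(r)                               # Pearson Correlation Coefficient
print(entropy(a, b))                   # KL Divergence (relative entropy)
# Hamming and Jaccard are defined analogously for discretized vectors.
```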

2 Predictor construction algorithms

2.1 Support Vector Machine

Support Vector Machine (SVM) is a supervised learning algorithm that performs data analysis for classification and regression (21, 22). Here, the scikit-learn (23) package was used as the implementation of the SVM algorithm, with a radial basis function (RBF) kernel.
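A minimal usage sketch matching this description, with toy feature vectors and hypothetical labels:

```python
# scikit-learn SVC with an RBF kernel, trained on BLM feature vectors X
# and labels y (toy data for illustration only).
from sklearn.svm import SVC

X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]   # feature vectors
y = [1, 1, 0, 0]                                        # class labels

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.predict([[0.15, 0.85]]))
```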

2.2 Random Forest

Random Forest (RF) is an ensemble learning method for classification, regression and other tasks. In BioSeq-BLM, the RF implementation in scikit-learn (23), a widely used Python machine learning package, is used.
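An analogous sketch for the RF predictor (toy data again; hyperparameters are arbitrary):

```python
# scikit-learn Random Forest classifier on toy BLM feature vectors.
from sklearn.ensemble import RandomForestClassifier

X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y = [1, 1, 0, 0]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
print(rf.predict([[0.15, 0.85]]))
```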

2.3 Conditional Random Field

In order to capture the global information of the residues along a long sequence, the sequence labelling algorithm Conditional Random Field (CRF) is provided for residue-level analysis. Compared with traditional classifiers such as SVM and RF, CRF models a biological sequence in a global fashion and considers the dependency information of all residues along the sequence (24).
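A hedged residue-level sketch using the sklearn-crfsuite package, which is one possible CRF implementation (BioSeq-BLM's backend may differ); each residue is described by a feature dictionary, and the whole sequence is labelled jointly:

```python
# CRF sequence labelling sketch with sklearn-crfsuite (illustrative only).
import sklearn_crfsuite

X_train = [[{"res": "A"}, {"res": "C"}, {"res": "G"}],   # one sequence per item
           [{"res": "T"}, {"res": "T"}, {"res": "A"}]]
y_train = [["1", "0", "0"],                              # per-residue labels
           ["0", "1", "1"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict([[{"res": "A"}, {"res": "T"}]]))       # labels whole sequence
```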

2.4 Convolution Neural Network

In natural language processing, convolutional neural networks (CNNs) (25) are most commonly applied to text classification problems owing to their high degree of parallelization. They are also known as shift-invariant or space-invariant artificial neural networks because of their shared-weights architecture and translation-invariance characteristics, which make them capable of capturing localized features.
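A minimal 1D-CNN sequence classifier sketch in PyTorch (illustrative only): shared-weight filters slide over residue embeddings to capture localized features, which are max-pooled before classification.

```python
# Toy 1D-CNN sequence classifier; vocabulary and layer sizes are arbitrary.
import torch
import torch.nn as nn

class SeqCNN(nn.Module):
    def __init__(self, vocab_size=25, embed_dim=16, n_filters=32, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=5, padding=2)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, tokens):                     # (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)     # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))               # localized features
        x = x.max(dim=2).values                    # global max pooling
        return self.fc(x)                          # class logits

logits = SeqCNN()(torch.randint(0, 25, (4, 60)))
print(logits.shape)                                # torch.Size([4, 2])
```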

2.5 Long Short-Term Memory

Long short-term memory (LSTM) (26) is an artificial recurrent neural network (RNN) architecture. A common LSTM unit is composed of an input gate, an output gate and a forget gate, which makes it better suited to capturing long-term dependencies than standard RNNs.
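A minimal PyTorch sketch of an LSTM layer over residue embeddings (illustrative, not the platform's configuration):

```python
# LSTM over a batch of embedded sequences; the gating mechanism lets the
# hidden state carry long-range information along the sequence.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 60, 16)            # (batch, seq_len, feature_dim)
out, (h_n, c_n) = lstm(x)
print(out.shape, h_n.shape)           # (4, 60, 32) and (1, 4, 32)
```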

2.6 Gated Recurrent Units

Gated recurrent units (GRUs) (27) are a gating mechanism in recurrent neural networks (RNNs). Unlike the LSTM, a GRU unit contains only an update gate and a reset gate, which reduces the number of parameters and alleviates the vanishing-gradient problem during backpropagation.
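The GRU counterpart to the LSTM sketch above; counting parameters shows the effect of having only two gates (illustrative PyTorch code):

```python
# A GRU layer of the same size as the LSTM above uses fewer parameters
# (3 gate blocks instead of 4, i.e. roughly 3/4 of the LSTM's weights).
import torch.nn as nn

gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(gru), "<", n_params(lstm))
```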

2.7 Transformer

Like recurrent neural networks (RNNs), the Transformer (28) is designed to handle sequential data, especially for natural language tasks such as translation and text summarization. Based on the self-attention mechanism and an encoder-decoder architecture, the Transformer models the association between any two units in a sequence and achieves state-of-the-art performance on many NLP tasks. Transformers have become the primary choice for tackling many NLP problems, replacing most recurrent neural network models such as the long short-term memory (LSTM) network.
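A self-attention encoder sketch using PyTorch's built-in TransformerEncoder (assuming PyTorch >= 1.9 for the batch_first argument; illustrative only):

```python
# Each position attends to every other position, modelling pairwise
# associations between sequence units.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(4, 60, 32)       # (batch, seq_len, d_model) residue embeddings
out = encoder(x)
print(out.shape)                 # torch.Size([4, 60, 32])
```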

2.8 Weighted Transformer

The Weighted Transformer, a Transformer with modified attention layers, replaces multi-head attention with multiple self-attention branches that learn to combine their outputs during training. Experimental results indicate that the Weighted Transformer not only outperforms the baseline network but also converges faster (29).

2.9 Reformer

Similar to the Weighted Transformer, the Reformer is an attention-based model that improves upon the Transformer. In the Reformer, dot-product attention is replaced by locality-sensitive hashing attention, and the standard residual layers are replaced by reversible residual layers. As a result, the Reformer performs comparably to Transformer models while being much more memory-efficient and much faster on long sequences (30).


For the above machine learning algorithms, the detailed methods, descriptions and applicable analysis levels are listed in Table 10. In addition, to handle imbalanced datasets, multiple sampling techniques are provided for constructing more powerful predictors; details are listed in Table 11.
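The sketch below shows the three sampling options with the imbalanced-learn package on toy imbalanced data (k_neighbors is reduced only because the toy minority class is tiny):

```python
# Over-sampling (SMOTE), under-sampling (Tomek links) and their combination.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8]]
y = [0, 0, 0, 0, 0, 0, 1, 1]                    # imbalanced labels

X_over, y_over = SMOTE(k_neighbors=1).fit_resample(X, y)
X_under, y_under = TomekLinks().fit_resample(X, y)
X_comb, y_comb = SMOTETomek(smote=SMOTE(k_neighbors=1)).fit_resample(X, y)
print(len(y), len(y_over), len(y_under), len(y_comb))
```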

3 Results analysis

3.1 Method for results analysis

We provide a result analysis framework to interpret the predictive results with four modules: normalization, clustering, feature selection and dimension reduction. The detailed methods and descriptions are listed in Table 12.
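An illustrative scikit-learn sketch of the four modules applied to a toy feature matrix (not the platform's exact settings):

```python
# Normalization, clustering, feature selection and dimension reduction.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA

X = np.random.rand(20, 10)                 # 20 sequences, 10-dim features
y = np.array([0] * 10 + [1] * 10)          # toy labels

X_norm = MinMaxScaler().fit_transform(X)                        # normalization
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_norm)    # clustering
X_sel = SelectKBest(chi2, k=5).fit_transform(X_norm, y)         # feature selection
X_red = PCA(n_components=2).fit_transform(X_norm)               # dimension reduction
print(X_sel.shape, X_red.shape)
```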

4 Table of BioSeq-BLM

Table 1. 29 BGLMs based on word properties.

Category | Method | Description
DNA sequence | One-hot | Basic one-hot (31)
 | DBE | Dinucleotide Binary Encoding (32)
 | Position-specific-2 | Position-specific of two nucleotides (33)
 | Position-specific-3 | Position-specific of three nucleotides (33)
 | Position-specific-4 | Position-specific of four nucleotides (33)
 | DPC | Dinucleotide physicochemical (34, 35)
 | TPC | Trinucleotide physicochemical (34, 35)
 | BLAST-matrix | BLAST-matrix (36)
RNA sequence | One-hot | Basic one-hot (31)
 | DBE | Dinucleotide Binary Encoding (32)
 | Position-specific-2 | Position-specific of two nucleotides (33)
 | Position-specific-3 | Position-specific of three nucleotides (33)
 | Position-specific-4 | Position-specific of four nucleotides (33)
 | NCP | Nucleotide Chemical Property (37)
 | DPC | Dinucleotide physicochemical (34, 35)
 | RSS | RNA secondary structure (38)
Protein sequence | One-hot | Basic one-hot (31)
 | One-hot (6-bit) | 6-dimensional one-hot method (39)
 | Binary (5-bit) | Five-bit binary encoding (40)
 | AESNN3 | Learned from alignments (41)
 | Position-specific-2 | Position-specific of two residues (33)
 | PP | Properties from AAindex (42)
 | SS | Secondary structure (43)
 | SASA | Solvent accessible surface area (44)
 | PAM250 | PAM250 matrix (45)
 | BLOSUM62 | BLOSUM62 matrix (46)
 | PSSM | PSSM matrix (47)
 | PSFM | Frequency profiles matrix (48)
 | CS | Conservation score (49)

Table 2. 29 BGLMs based on syntax rules.

Category | Method | Description
DNA sequence | DAC | Dinucleotide-based auto covariance (50)
 | DCC | Dinucleotide-based cross covariance (50)
 | DACC | Dinucleotide-based auto-cross covariance (50)
 | TAC | Trinucleotide-based auto covariance (50)
 | TCC | Trinucleotide-based cross covariance (50)
 | TACC | Trinucleotide-based auto-cross covariance (50)
 | MAC | Moran autocorrelation (51, 52)
 | GAC | Geary autocorrelation (51, 53)
 | NMBAC | Normalized Moreau-Broto autocorrelation (51, 54)
 | ZCPseKNC | Z-curve pseudo k-tuple nucleotide composition (55)
 | ND | Nucleotide Density (56)
RNA sequence | DAC | Dinucleotide-based auto covariance (50, 57)
 | DCC | Dinucleotide-based cross covariance (50, 57)
 | DACC | Dinucleotide-based auto-cross covariance (50, 57)
 | MAC | Moran autocorrelation (51, 52)
 | GAC | Geary autocorrelation (51, 53)
 | NMBAC | Normalized Moreau-Broto autocorrelation (51, 54)
 | ND | Nucleotide Density (56)
Protein sequence | AC | Auto covariance (50, 57)
 | CC | Cross covariance (50, 57)
 | ACC | Auto-cross covariance (50, 57)
 | PDT | Physicochemical distance transformation (58)
 | PDT-Profile | Profile-based physicochemical distance transformation (58)
 | AC-PSSM | Profile-based auto covariance (50)
 | CC-PSSM | Profile-based cross covariance (50)
 | ACC-PSSM | Profile-based auto-cross covariance (50)
 | PSSM-DT | PSSM distance transformation (58)
 | PSSM-RT | PSSM relation transformation (59)
 | Motif-PSSM | Motif-initialized convolution kernel (60)

Table 3. BSLMs based on BOW.

Category | Method | Description
DNA sequence | Kmer-BOW | Kmer-based BOW (35)
 | RevKmer-BOW | Reverse-complementary-kmer-based BOW (35, 61, 62)
 | Mismatch-BOW | Mismatch-based BOW (63-65)
 | Subsequence-BOW | Subsequence-based BOW (63, 65, 66)
RNA sequence | Kmer-BOW | Kmer-based BOW (67)
 | Mismatch-BOW | Mismatch-based BOW (63-65)
 | Subsequence-BOW | Subsequence-based BOW (63, 65, 66)
Protein sequence | Kmer-BOW | Kmer-based BOW (68)
 | Mismatch-BOW | Mismatch-based BOW (64)
 | DR-BOW | Distance-Residue-based BOW (69)
 | Top-n-gram-BOW | Top-n-gram-based BOW (70)
 | DT-BOW | Distance-Top-n-gram-based BOW (69)

Table 4. BSLMs based on TF-IDF.

Category | Method | Description
DNA sequence | Kmer-TF-IDF | Kmer-based TF-IDF (35, 71)
 | RevKmer-TF-IDF | Reverse-complementary-kmer-based TF-IDF (35, 61, 62, 71)
 | Mismatch-TF-IDF | Mismatch-based TF-IDF (63-65, 71)
 | Subsequence-TF-IDF | Subsequence-based TF-IDF (63, 65, 66, 71)
RNA sequence | Kmer-TF-IDF | Kmer-based TF-IDF (67, 71)
 | Mismatch-TF-IDF | Mismatch-based TF-IDF (63-65, 71)
 | Subsequence-TF-IDF | Subsequence-based TF-IDF (63, 65, 66, 71)
Protein sequence | Kmer-TF-IDF | Kmer-based TF-IDF (68, 71)
 | Mismatch-TF-IDF | Mismatch-based TF-IDF (64, 71)
 | DR-TF-IDF | Distance-Residue-based TF-IDF (69, 71)
 | Top-n-gram-TF-IDF | Top-n-gram-based TF-IDF (70, 71)
 | DT-TF-IDF | Distance-Top-n-gram-based TF-IDF (69, 71)

Table 5. BSLMs based on TextRank.

Category | Method | Description
DNA sequence | Kmer-TextRank | Kmer-based TextRank (3, 35)
 | RevKmer-TextRank | Reverse-complementary-kmer-based TextRank (3, 35, 61, 62)
 | Mismatch-TextRank | Mismatch-based TextRank (3, 63-65)
 | Subsequence-TextRank | Subsequence-based TextRank (3, 63, 65, 66)
RNA sequence | Kmer-TextRank | Kmer-based TextRank (3, 67)
 | Mismatch-TextRank | Mismatch-based TextRank (3, 63-65)
 | Subsequence-TextRank | Subsequence-based TextRank (3, 63, 65, 66)
Protein sequence | Kmer-TextRank | Kmer-based TextRank (3, 68)
 | Mismatch-TextRank | Mismatch-based TextRank (3, 64)
 | DR-TextRank | Distance-Residue-based TextRank (3, 69)
 | Top-n-gram-TextRank | Top-n-gram-based TextRank (3, 70)
 | DT-TextRank | Distance-Top-n-gram-based TextRank (3, 69)

Table 6. BSLMs based on topic models.

Algorithm | Method | Description
LSA | BOW-LSA | Latent Semantic Analysis (4)
 | TF-IDF-LSA |
 | TextRank-LSA |
LDA | BOW-LDA | Latent Dirichlet Allocation (6)
 | TF-IDF-LDA |
 | TextRank-LDA |
Labeled-LDA | BOW-Labeled-LDA | Labeled Latent Dirichlet Allocation (7)
 | TF-IDF-Labeled-LDA |
 | TextRank-Labeled-LDA |
PLSA | BOW-PLSA | Probabilistic Latent Semantic Analysis (5)
 | TF-IDF-PLSA |
 | TextRank-PLSA |

Table 7. BNLMs based on word embedding.

Category | Algorithm | Methods | Description
DNA sequence | word2vec | Kmer2vec, RevKmer2vec, Mismatch2vec, Subsequence2vec | Learn word representations via the word2vec model (9)
 | GloVe | Kmer-GloVe, RevKmer-GloVe, Mismatch-GloVe, Subsequence-GloVe | Learn word representations via the GloVe model (10)
 | fastText | Kmer-fastText, RevKmer-fastText, Mismatch-fastText, Subsequence-fastText | Learn word representations via the fastText model (11)
RNA sequence | word2vec | Kmer2vec, Mismatch2vec, Subsequence2vec | Learn word representations via the word2vec model (9)
 | GloVe | Kmer-GloVe, Mismatch-GloVe, Subsequence-GloVe | Learn word representations via the GloVe model (10)
 | fastText | Kmer-fastText, Mismatch-fastText, Subsequence-fastText | Learn word representations via the fastText model (11)
Protein sequence | word2vec | Kmer2vec, Mismatch2vec, DR2vec, Top-n-gram2vec, DT2vec | Learn word representations via the word2vec model (9)
 | GloVe | Kmer-GloVe, Mismatch-GloVe, DR-GloVe, Top-n-gram-GloVe, DT-GloVe | Learn word representations via the GloVe model (10)
 | fastText | Kmer-fastText, Mismatch-fastText, DR-fastText, Top-n-gram-fastText, DT-fastText | Learn word representations via the fastText model (11)

Table 8. BNLMs based on automatic features.

Model | Description
MotifCNN | CNN constructed with motif-initialized convolution kernels (14)
MotifDCNN | DCNN constructed with motif-initialized convolution kernels (14)
CNN-BiLSTM | Combination of CNN and BiLSTM (13)
DCNN-BiLSTM | Combination of DCNN and BiLSTM (13)
Autoencoder | Learning sequence representations based on autoencoders (12)

Table 9. BSSLMs.

Method | Description
ED | Euclidean Distance (15-17)
MD | Manhattan Distance (18)
CD | Chebyshev Distance (19)
HD | Hamming Distance (20)
CS | Cosine Similarity (15-17)
PCC | Pearson Correlation Coefficient (15-17)
KLD | KL Divergence (Relative Entropy) (15-17)
JSC | Jaccard Similarity Coefficient (15-17)

Table 10. Machine learning algorithms for constructing predictors.

Category | Method | Description | Analysis level
Classification algorithm | SVM | Support Vector Machine (72) | S, R
 | RF | Random Forest (73) | S, R
Sequence labelling algorithm | CRF | Conditional Random Field (74) | R
Deep learning algorithm | CNN | Convolutional Neural Network (25) | S, R
 | LSTM | Long Short-Term Memory (26) | S, R
 | GRU | Gated Recurrent Unit (27) | S, R
 | Transformer | Network based entirely on self-attention (28) | S, R
 | Weighted Transformer | Weighted Transformer network (29) | S, R
 | Reformer | Efficient Transformer (30) | S, R

Note: S, sequence-level analysis; R, residue-level analysis.

Table 11. Sampling techniques for constructing predictors.

Method | Description
over | Over-sampling based on the Synthetic Minority Oversampling Technique (SMOTE) (75)
under | Under-sampling based on the Tomek links method (76)
combine | Combined over- and under-sampling via 'SMOTETomek' in the imbalanced-learn package (77)

Table 12. Results analysis methods for biological sequences.

Algorithm | Method | Description
Standardization or normalization | min-max-scale | Normalization with scikit-learn's 'MinMaxScaler' (78)
 | standard-scale | Standardization with scikit-learn's 'StandardScaler' (78)
 | L1-regularization | Normalization based on L1 regularization (79)
 | L2-regularization | Normalization based on L2 regularization (80)
Clustering | AP | Clustering based on the Affinity Propagation algorithm (81)
 | DBSCAN | Density-Based Spatial Clustering of Applications with Noise (82)
 | GMM | Clustering based on the Gaussian Mixture Model (83)
 | AGNES | Clustering based on the agglomerative nesting algorithm (84)
 | Kmeans | Clustering based on the K-means algorithm (85)
Feature selection | chi2 | Univariate feature selection based on the chi-square test (86, 87)
 | F-value | Univariate feature selection based on the F-test (86, 87)
 | MIC | Univariate feature selection based on mutual information (86, 87)
 | RFE | Feature selection based on Recursive Feature Elimination (88)
 | Tree | Tree-based feature selection (89)
Dimension reduction | PCA | Dimension reduction based on principal component analysis (90)
 | KernelPCA | Dimension reduction based on kernel principal component analysis with an RBF kernel (91)
 | TSVD | Dimension reduction based on truncated singular value decomposition (92)

References


Fasta format example(DNA):