1 Biological language models
1.1 BGLMs based on word properties
Similar to sentences, biological sequences have their own words, whose properties reflect evolutionary information, physicochemical values, structural information, etc. These properties are incorporated into BGLMs to represent biological sequences more comprehensively. The 29 BGLMs based on word properties are listed in Table 1.
1.2 BGLMs based on syntax rules
The syntax rules reflect the relationships among residues. The 29 BGLMs based on syntax rules are summarized in Table 2.
1.3 BSLMs based on BOW
The BOW model represents a sentence as a "bag" of words described by word occurrence frequencies, ignoring grammar and even word order (1). Applying this model to the words of biological sequences generates 12 BSLMs based on BOW (see Table 3).
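As an illustration, the sketch below builds a toy k-mer BOW vector for a DNA sequence; the function name, alphabet and frequency normalization are illustrative choices, not BioSeq-BLM's actual API.

```python
from collections import Counter
from itertools import product


def kmer_bow(sequence, k=3, alphabet="ACGT"):
    """Toy example: represent a DNA sequence as a bag-of-words vector
    of k-mer occurrence frequencies (word order is ignored)."""
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values()) or 1
    # One frequency per vocabulary word, in a fixed order.
    return [counts[word] / total for word in vocabulary]


vector = kmer_bow("ACGTACGTGGCA", k=3)
print(len(vector))  # 64 = 4^3 possible 3-mers
```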
1.4 BSLMs based on TF-IDF
The TF-IDF model (2) reflects the importance of words to the biological sequences. Applying this model to the words of biological sequences generates 12 BSLMs based on TF-IDF (see Table 4).
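A minimal sketch of the idea, assuming the sequences have already been split into whitespace-separated k-mer "words"; scikit-learn's TfidfVectorizer is used purely for illustration and may differ from the platform's own implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def to_kmer_sentence(sequence, k=3):
    """Split a sequence into overlapping k-mer 'words' separated by spaces."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


corpus = [to_kmer_sentence(s) for s in ["ACGTACGT", "TTTACGAA", "GGGACGTA"]]

# Each k-mer is weighted by its frequency in a sequence (TF) and
# down-weighted by how many sequences it occurs in (IDF).
vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (number of sequences, size of k-mer vocabulary)
```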
1.5 BSLMs based on TextRank
TextRank (3), a graph-based ranking model, identifies key text units by ranking their importance within the text and assigns higher weights to more influential words. Applying this model to the words of biological sequences generates 12 BSLMs based on TextRank (see Table 5).
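A toy sketch of the TextRank idea, scoring k-mer words with PageRank over a co-occurrence graph built with networkx; the window size and graph construction are simplifications, not the exact procedure used in BioSeq-BLM.

```python
import networkx as nx


def textrank_weights(words, window=2):
    """Toy TextRank: build a co-occurrence graph over the words of one
    sequence and score each word with PageRank."""
    graph = nx.Graph()
    graph.add_nodes_from(words)
    for i, w in enumerate(words):
        for other in words[i + 1:i + window + 1]:
            if other != w:
                graph.add_edge(w, other)
    return nx.pagerank(graph)  # {word: importance score}


kmers = ["ACG", "CGT", "GTA", "TAC", "ACG", "CGT"]
print(textrank_weights(kmers))
```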
1.6 BSLMs based on topic models
The topic model discovers the abstract “topics” and the latent semantic structures of a “sequence document” by using Latent Semantic Analysis (LSA) (4), Probabilistic Latent Semantic Analysis (PLSA) (5), Latent Dirichlet Allocation (LDA) (6) and Labeled-Latent Dirichlet Allocation (Labeled-LDA) (7), leading to 12 BSLMs based on topic models (see Table 6).
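The sketch below illustrates only the LDA variant, fitting scikit-learn's LatentDirichletAllocation on a k-mer count matrix; the toy corpus and number of topics are illustrative assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Sequences already tokenized into k-mer "words" (see the BOW example above).
corpus = ["ACG CGT GTA TAC", "TTT TTA TAC ACG", "GGG GGA GAC ACG"]

counts = CountVectorizer(lowercase=False, token_pattern=r"\S+").fit_transform(corpus)

# Each sequence is then described by its distribution over latent "topics".
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)
print(topic_features.shape)  # (number of sequences, number of topics)
```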
1.7 BNLMs based on word embedding
Because linguistic objects with similar distributions have similar meanings (8), word embedding maps each word into a continuous real-valued vector. In this study, word2vec (9), GloVe (10) and fastText (11) are applied to the aforementioned words of biological sequences, and the corresponding 36 BNLMs based on word embedding are listed in Table 7.
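A minimal sketch of the word2vec case using gensim (version 4 or later); GloVe and fastText are applied analogously, and all hyperparameters shown are illustrative rather than those used by the platform.

```python
from gensim.models import Word2Vec

# Each biological sequence becomes a "sentence" of overlapping k-mer words.
sentences = [
    ["ACG", "CGT", "GTA", "TAC"],
    ["TTT", "TTA", "TAC", "ACG"],
    ["GGG", "GGA", "GAC", "ACG"],
]

# Train a small word2vec model; each k-mer is embedded into a real-valued vector.
model = Word2Vec(sentences, vector_size=16, window=3, min_count=1, sg=1, epochs=50)
print(model.wv["ACG"].shape)  # (16,)

# A sequence-level representation can then be obtained, e.g., by averaging its word vectors.
```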
1.8 BNLMs based on automatic features
Deep learning techniques are able to automatically extract linguistic features independently of grammar rules and other expert knowledge. In this study, autoencoder (12), CNN-BiLSTM (13) and DCNN-BiLSTM (13) are used to model the dependencies among residues/words in biological sequences, while MotifCNN (14) and MotifDCNN (14) are used to capture motif-based features. The resulting 5 BNLMs based on automatic features are shown in Table 8.
1.9 Biological semantic similarity language models
Calculating the similarities among biological sequences is one of the key tasks in biological sequence analysis, and these similarities can be considered analogous to the semantic similarities among sentences. The biological semantic similarity language models (BSSLMs) represent biological sequences based on such semantic similarities, which are calculated from the feature vectors generated by the aforementioned three kinds of BLMs via Euclidean Distance (15-17), Manhattan Distance (18), Chebyshev Distance (19), Hamming Distance (20), Cosine Similarity (15-17), Pearson Correlation Coefficient (15-17), KL Divergence (Relative Entropy) (15-17), and Jaccard Similarity Coefficient (15-17). The resulting 8 BSSLMs are listed in Table 9.
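A sketch of how such pairwise measures could be computed from BLM feature vectors with NumPy/SciPy; only a subset of the measures in Table 9 is shown, and the toy feature matrix is illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Feature vectors produced by any of the BLMs above (rows = sequences).
features = np.array([[0.1, 0.4, 0.5],
                     [0.2, 0.3, 0.5],
                     [0.7, 0.2, 0.1]])

# A few of the measures listed in Table 9; each yields an n x n matrix that
# can itself serve as a similarity-based sequence representation.
euclidean = cdist(features, features, metric="euclidean")
manhattan = cdist(features, features, metric="cityblock")
chebyshev = cdist(features, features, metric="chebyshev")
cosine_sim = 1.0 - cdist(features, features, metric="cosine")
pearson = np.corrcoef(features)

print(cosine_sim.round(3))
```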
2 Predictor construction algorithms
2.1 Support Vector Machine
Support Vector Machine (SVM) is a supervised learning algorithm that conducts data analysis for classification and regression (21, 22). Here, the scikit-learn (23) package was used as the implementation of the SVM algorithm, with the radial basis function as the kernel.
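A minimal usage sketch consistent with this description; the toy data and hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy feature matrix (e.g., produced by one of the BLMs above) and binary labels.
X, y = make_classification(n_samples=100, n_features=64, random_state=0)

# SVC with the radial basis function kernel, as described above.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
print(cross_val_score(clf, X, y, cv=5).mean())
```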
2.2 Random Forest
Random Forest (RF) is an ensemble learning method for classification, regression and other tasks. In BioSeq-BLM, the RF implementation in scikit-learn (23), a widely used Python machine learning package, was adopted.
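The corresponding scikit-learn call, again with illustrative data and hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=64, random_state=0)

# Random Forest classifier from scikit-learn; n_estimators is an illustrative choice.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.predict(X[:5]))
```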
2.3 Conditional Random Field
In order to capture the global information of residues in a long sequence, the sequence labelling algorithm Conditional Random Field (CRF) is provided for residue-level analysis. Compared with traditional classifiers such as SVM and RF, CRF models biological sequences in a global fashion, considering the dependency information of all residues along the sequence (24).
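A sketch of residue-level sequence labelling using the third-party sklearn-crfsuite package; the per-residue features, labels and hyperparameters are illustrative assumptions and not necessarily those used by BioSeq-BLM.

```python
import sklearn_crfsuite  # third-party package; not part of scikit-learn itself


def residue_features(seq, i):
    """Simple per-residue features: the residue and its neighbours."""
    return {
        "res": seq[i],
        "prev": seq[i - 1] if i > 0 else "<s>",
        "next": seq[i + 1] if i < len(seq) - 1 else "</s>",
    }


sequences = ["ACGTACGT", "TTACGGAA"]
labels = [list("00111100"), list("01100011")]  # one label per residue

X = [[residue_features(s, i) for i in range(len(s))] for s in sequences]

# The CRF labels all residues of a sequence jointly, so the prediction for one
# residue depends on its neighbours along the whole sequence.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X[:1]))
```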
2.4 Convolution Neural Network
In natural language processing, convolutional neural networks (CNNs) (25) are most commonly applied to text classification problems, owing to their high degree of parallelization. They are also known as shift-invariant or space-invariant artificial neural networks because of their shared-weight architecture and translation-invariance characteristics, which make them capable of capturing localized features.
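A minimal 1D CNN sketch in PyTorch over one-hot-like sequence inputs; the architecture is illustrative rather than the exact network used in BioSeq-BLM.

```python
import torch
import torch.nn as nn


class SeqCNN(nn.Module):
    """Minimal 1D CNN for sequence classification: a shared-weight filter slides
    along the sequence, capturing localized patterns."""

    def __init__(self, n_symbols=4, n_filters=32, kernel_size=5, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(n_symbols, n_filters, kernel_size, padding=kernel_size // 2)
        self.pool = nn.AdaptiveMaxPool1d(1)   # keep the strongest response per filter
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, x):                      # x: (batch, n_symbols, seq_len)
        h = torch.relu(self.conv(x))
        return self.fc(self.pool(h).squeeze(-1))


x = torch.randn(8, 4, 100)                     # 8 one-hot-like sequences of length 100
print(SeqCNN()(x).shape)                       # torch.Size([8, 2])
```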
2.5 Long Short-Term Memory
Long short-term memory (LSTM) (26) is an artificial recurrent neural network (RNN) architecture. A common LSTM unit is composed of an input gate, an output gate and a forget gate, which makes it better suited to capturing long-term dependencies than standard RNNs.
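A corresponding minimal bidirectional LSTM sketch in PyTorch; the architecture and sizes are again illustrative.

```python
import torch
import torch.nn as nn


class SeqLSTM(nn.Module):
    """Minimal (bi)LSTM: its gates let the network keep or forget information,
    so long-range dependencies along the sequence can be captured."""

    def __init__(self, n_symbols=4, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_symbols, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):           # x: (batch, seq_len, n_symbols)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])  # use the last position's hidden state


x = torch.randn(8, 100, 4)
print(SeqLSTM()(x).shape)           # torch.Size([8, 2])
```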
2.6 Gated Recurrent Units
The gated recurrent unit (GRU) (27) is a gating mechanism in recurrent neural networks (RNNs). Unlike the LSTM, a GRU unit contains only an update gate and a reset gate, which reduces the number of parameters and alleviates the vanishing-gradient problem in back-propagation.
2.7 Transformer
Like recurrent neural networks (RNNs), the Transformer (28) is designed to handle sequential data, especially natural language tasks such as translation and text summarization. Based on the self-attention mechanism and an encoder-decoder architecture, the Transformer models the association between any two units in a sequence and achieves state-of-the-art performance in many NLP tasks. Transformers have become the primary choice for tackling many NLP problems, replacing most recurrent neural network models, such as the long short-term memory (LSTM).
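A minimal self-attention encoder sketch using PyTorch's nn.TransformerEncoder; positional encoding is omitted and all sizes are illustrative, so this is not the exact architecture used in BioSeq-BLM.

```python
import torch
import torch.nn as nn

d_model, seq_len, batch = 64, 100, 8

# Project one-hot residues to d_model and let self-attention relate any two positions.
embed = nn.Linear(4, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
classifier = nn.Linear(d_model, 2)

x = torch.randn(batch, seq_len, 4)             # one-hot-like input
h = encoder(embed(x))                          # (batch, seq_len, d_model)
logits = classifier(h.mean(dim=1))             # average pooling over positions
print(logits.shape)                            # torch.Size([8, 2])
```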
2.8 Weighted Transformer
The Weighted Transformer, a Transformer with modified attention layers, replaces multi-head attention with multiple self-attention branches that learn to combine during training. Experimental verification indicates that the Weighted Transformer not only outperforms the baseline network but also converges faster (29).
2.9 Reformer
Similar to the Weighted Transformer, the Reformer is an attention-based model that improves the Transformer. In the Reformer, dot-product attention is replaced by locality-sensitive hashing attention, and the standard residual layers are replaced by reversible residual layers. The Reformer matches the performance of Transformer models while being much more memory-efficient and much faster on long sequences (30).
3 Results analysis
3.1 Method for results analysis
We provide a result analysis framework to interpret the predictive results, comprising four modules: normalization, clustering, feature selection and dimension reduction. The detailed methods and descriptions are listed in Table 12.
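A sketch of the four modules using scikit-learn, with one illustrative method per module (MinMaxScaler, K-means, chi-square feature selection, PCA); the data are random placeholders, and the specific method choices are just examples from Table 12.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.random((50, 64))                   # placeholder feature vectors from a BLM
y = rng.integers(0, 2, size=50)            # placeholder labels (needed for feature selection)

X_norm = MinMaxScaler().fit_transform(X)                           # normalization
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X_norm)     # clustering
X_sel = SelectKBest(chi2, k=20).fit_transform(X_norm, y)           # feature selection
X_red = PCA(n_components=2).fit_transform(X_norm)                  # dimension reduction

print(clusters[:10], X_sel.shape, X_red.shape)
```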
4 Tables of BioSeq-BLM
Table 1. 29 BGLMs based on word properties.
| Category | Method | Description |
|---|---|---|
| DNA sequence | One-hot | Basic one-hot (31) |
| | DBE | Dinucleotide Binary Encoding (32) |
| | Position-specific-2 | Position-specific of two nucleotides (33) |
| | Position-specific-3 | Position-specific of three nucleotides (33) |
| | Position-specific-4 | Position-specific of four nucleotides (33) |
| | DPC | Dinucleotide physicochemical (34, 35) |
| | TPC | Trinucleotide physicochemical (34, 35) |
| | BLAST-matrix | BLAST-matrix (36) |
| RNA sequence | One-hot | Basic one-hot (31) |
| | DBE | Dinucleotide Binary Encoding (32) |
| | Position-specific-2 | Position-specific of two nucleotides (33) |
| | Position-specific-3 | Position-specific of three nucleotides (33) |
| | Position-specific-4 | Position-specific of four nucleotides (33) |
| | NCP | Nucleotide Chemical Property (37) |
| | DPC | Dinucleotide physicochemical (34, 35) |
| | RSS | RNA secondary structure (38) |
| Protein sequence | One-hot | Basic one-hot (31) |
| | One-hot(6-bit) | 6-dimension one-hot method (39) |
| | Binary(5-bit) | Use five binary bits to encode (40) |
| | AESNN3 | Learn from alignments (41) |
| | Position-specific-2 | Position-specific of two residues (33) |
| | PP | Properties from AAindex (42) |
| | SS | Secondary structure (43) |
| | SASA | Solvent accessible surface area (44) |
| | PAM250 | PAM250 matrix (45) |
| | BLOSUM62 | BLOSUM62 matrix (46) |
| | PSSM | PSSM matrix (47) |
| | PSFM | Frequency profiles matrix (48) |
| | CS | Conservation score (49) |
Table 2. 29 BGLMs based on syntax rules.
| Category | Method | Description |
|---|---|---|
| DNA sequence | DAC | Dinucleotide-based auto covariance (50) |
| | DCC | Dinucleotide-based cross covariance (50) |
| | DACC | Dinucleotide-based auto-cross covariance (50) |
| | TAC | Trinucleotide-based auto covariance (50) |
| | TCC | Trinucleotide-based cross covariance (50) |
| | TACC | Trinucleotide-based auto-cross covariance (50) |
| | MAC | Moran autocorrelation (51, 52) |
| | GAC | Geary autocorrelation (51, 53) |
| | NMBAC | Normalized Moreau-Broto autocorrelation (51, 54) |
| | ZCPseKNC | Z curve pseudo k-tuple nucleotide composition (55) |
| | ND | Nucleotide Density (56) |
| RNA sequence | DAC | Dinucleotide-based auto covariance (50, 57) |
| | DCC | Dinucleotide-based cross covariance (50, 57) |
| | DACC | Dinucleotide-based auto-cross covariance (50, 57) |
| | MAC | Moran autocorrelation (51, 52) |
| | GAC | Geary autocorrelation (51, 53) |
| | NMBAC | Normalized Moreau-Broto autocorrelation (51, 54) |
| | ND | Nucleotide Density (56) |
| Protein sequence | AC | Auto covariance (50, 57) |
| | CC | Cross covariance (50, 57) |
| | ACC | Auto-cross covariance (50, 57) |
| | PDT | Physicochemical distance transformation (58) |
| | PDT-Profile | Profile-based physicochemical distance transformation (58) |
| | AC-PSSM | Profile-based auto covariance (50) |
| | CC-PSSM | Profile-based cross covariance (50) |
| | ACC-PSSM | Profile-based auto-cross covariance (50) |
| | PSSM-DT | PSSM distance transformation (58) |
| | PSSM-RT | PSSM relation transformation (59) |
| | Motif-PSSM | Motifs initializing convolution kernel based (60) |
Table 3. BSLMs based on BOW.
| Category | Method | Description |
|---|---|---|
| DNA sequence | Kmer-BOW | Kmer-based BOW (35) |
| | RevKmer-BOW | Reverse-complementary-kmer-based BOW (35, 61, 62) |
| | Mismatch-BOW | Mismatch-based BOW (63-65) |
| | Subsequence-BOW | Subsequence-based BOW (63, 65, 66) |
| RNA sequence | Kmer-BOW | Kmer-based BOW (67) |
| | Mismatch-BOW | Mismatch-based BOW (63-65) |
| | Subsequence-BOW | Subsequence-based BOW (63, 65, 66) |
| Protein sequence | Kmer-BOW | Kmer-based BOW (68) |
| | Mismatch-BOW | Mismatch-based BOW (64) |
| | DR-BOW | Distance-Residue-based BOW (69) |
| | Top-n-gram-BOW | Top-n-gram-based BOW (70) |
| | DT-BOW | Distance-Top-n-gram-based BOW (69) |
Table 4. BSLMs based on TF-IDF.
| Category | Method | Description |
|---|---|---|
| DNA sequence | Kmer-TF-IDF | Kmer-based TF-IDF (35, 71) |
| | RevKmer-TF-IDF | Reverse-complementary-kmer-based TF-IDF (35, 61, 62, 71) |
| | Mismatch-TF-IDF | Mismatch-based TF-IDF (63-65, 71) |
| | Subsequence-TF-IDF | Subsequence-based TF-IDF (63, 65, 66, 71) |
| RNA sequence | Kmer-TF-IDF | Kmer-based TF-IDF (67, 71) |
| | Mismatch-TF-IDF | Mismatch-based TF-IDF (63-65, 71) |
| | Subsequence-TF-IDF | Subsequence-based TF-IDF (63, 65, 66, 71) |
| Protein sequence | Kmer-TF-IDF | Kmer-based TF-IDF (68, 71) |
| | Mismatch-TF-IDF | Mismatch-based TF-IDF (64, 71) |
| | DR-TF-IDF | Distance-Residue-based TF-IDF (69, 71) |
| | Top-n-gram-TF-IDF | Top-n-gram-based TF-IDF (70, 71) |
| | DT-TF-IDF | Distance-Top-n-gram-based TF-IDF (69, 71) |
Table 5. BSLMs based on TextRank.
| Category | Method | Description |
|---|---|---|
| DNA sequence | Kmer-TextRank | Kmer-based TextRank (3, 35) |
| | RevKmer-TextRank | Reverse-complementary-kmer-based TextRank (3, 35, 61, 62) |
| | Mismatch-TextRank | Mismatch-based TextRank (3, 63-65) |
| | Subsequence-TextRank | Subsequence-based TextRank (3, 63, 65, 66) |
| RNA sequence | Kmer-TextRank | Kmer-based TextRank (3, 67) |
| | Mismatch-TextRank | Mismatch-based TextRank (3, 63-65) |
| | Subsequence-TextRank | Subsequence-based TextRank (3, 63, 65, 66) |
| Protein sequence | Kmer-TextRank | Kmer-based TextRank (3, 68) |
| | Mismatch-TextRank | Mismatch-based TextRank (3, 64) |
| | DR-TextRank | Distance-Residue-based TextRank (3, 69) |
| | Top-n-gram-TextRank | Top-n-gram-based TextRank (3, 70) |
| | DT-TextRank | Distance-Top-n-gram-based TextRank (3, 69) |
Table 6. BSLMs based on topic models.
| Algorithm | Method | Description |
|---|---|---|
| LSA | BOW-LSA | Latent Semantic Analysis (4) |
| | TF-IDF-LSA | |
| | TextRank-LSA | |
| LDA | BOW-LDA | Latent Dirichlet Allocation (6) |
| | TF-IDF-LDA | |
| | TextRank-LDA | |
| Labeled-LDA | BOW-Labeled-LDA | Labeled Latent Dirichlet Allocation Model (7) |
| | TF-IDF-Labeled-LDA | |
| | TextRank-Labeled-LDA | |
| PLSA | BOW-PLSA | Probabilistic Latent Semantic Analysis (5) |
| | TF-IDF-PLSA | |
| | TextRank-PLSA | |
Table 7. BNLMs based on word embedding.
| Category | Algorithm | Method | Description |
|---|---|---|---|
| DNA sequence | word2vec | Kmer2vec | Learn word representations via the word2vec model (9) |
| | | RevKmer2vec | |
| | | Mismatch2vec | |
| | | Subsequence2vec | |
| | GloVe | Kmer-GloVe | Learn word representations via the GloVe model (10) |
| | | RevKmer-GloVe | |
| | | Mismatch-GloVe | |
| | | Subsequence-GloVe | |
| | fastText | Kmer-fastText | Learn word representations via the fastText model (11) |
| | | RevKmer-fastText | |
| | | Mismatch-fastText | |
| | | Subsequence-fastText | |
| RNA sequence | word2vec | Kmer2vec | Learn word representations via the word2vec model (9) |
| | | Mismatch2vec | |
| | | Subsequence2vec | |
| | GloVe | Kmer-GloVe | Learn word representations via the GloVe model (10) |
| | | Mismatch-GloVe | |
| | | Subsequence-GloVe | |
| | fastText | Kmer-fastText | Learn word representations via the fastText model (11) |
| | | Mismatch-fastText | |
| | | Subsequence-fastText | |
| Protein sequence | word2vec | Kmer2vec | Learn word representations via the word2vec model (9) |
| | | Mismatch2vec | |
| | | DR2vec | |
| | | Top-n-gram2vec | |
| | | DT2vec | |
| | GloVe | Kmer-GloVe | Learn word representations via the GloVe model (10) |
| | | Mismatch-GloVe | |
| | | DR-GloVe | |
| | | Top-n-gram-GloVe | |
| | | DT-GloVe | |
| | fastText | Kmer-fastText | Learn word representations via the fastText model (11) |
| | | Mismatch-fastText | |
| | | DR-fastText | |
| | | Top-n-gram-fastText | |
| | | DT-fastText | |
Table 8. BNLMs based on automatic features.
| Model | Description |
|---|---|
| MotifCNN | CNN construction with motifs initializing the convolution kernel (14) |
| MotifDCNN | DCNN construction with motifs initializing the convolution kernel (14) |
| CNN-BiLSTM | Combine CNN and BiLSTM (13) |
| DCNN-BiLSTM | Combine DCNN and BiLSTM (13) |
| Autoencoder | Learn sequence representations based on autoencoders (12) |
Table 9. BSSLMs.
| Method | Description |
|---|---|
| ED | Euclidean Distance (15-17) |
| MD | Manhattan Distance (18) |
| CD | Chebyshev Distance (19) |
| HD | Hamming Distance (20) |
| CS | Cosine Similarity (15-17) |
| PCC | Pearson Correlation Coefficient (15-17) |
| KLD | KL Divergence (Relative Entropy) (15-17) |
| JSC | Jaccard Similarity Coefficient (15-17) |
Table 10. Machine learning algorithm for constructing predictor.
| Category | Method | Description | Analysis Level |
|---|---|---|---|
| Classification algorithm | SVM | Support Vector Machine (72) | S1, R2 |
| | RF | Random Forest (73) | |
| Sequence labelling algorithm | CRF | Conditional Random Field (74) | R |
| Deep learning algorithm | CNN | Convolutional Neural Networks (25) | S, R |
| | LSTM | Long Short-Term Memory (26) | |
| | GRU | Gated Recurrent Unit (27) | |
| | Transformer | Network completely based on self-attention (28) | |
| | Weighted Transformer | Weighted Transformer network (29) | |
| | Reformer | Efficient Transformer (30) | |

Note: 1. S for sequence level; 2. R for residue level.
Table 11. Sampling technique for constructing predictor.
| Method | Description |
|---|---|
| over | Over-sampling based on the Synthetic Minority Oversampling Technique (SMOTE) (76) |
| under | Under-sampling based on the Tomek links method (77) |
| combine | Combine over-sampling and under-sampling via 'SMOTETomek' in the imbalanced-learn package (78) |
Table 12. Results analysis for Biological sequences.
| Algorithm | Method | Description |
|---|---|---|
| Standardization or normalization | min-max-scale | Normalization by scikit-learn (79) 'MinMaxScaler' |
| | standard-scale | Standardization by scikit-learn (79) 'StandardScaler' |
| | L1-regularization | Normalization based on L1 regularization (80) |
| | L2-regularization | Normalization based on L2 regularization (81) |
| Clustering | AP | Clustering based on the Affinity Propagation algorithm (82) |
| | DBSCAN | Clustering based on the Density-Based Spatial Clustering of Applications with Noise algorithm (83) |
| | GMM | Clustering based on the Gaussian Mixture Model (84) |
| | AGNES | Clustering based on the agglomerative nesting algorithm (85) |
| | Kmeans | Clustering based on the K-means algorithm (86) |
| Feature selection | chi2 | Univariate feature selection based on the Chi-square test (87, 88) |
| | F-value | Univariate feature selection based on the F-test (joint hypotheses test) (87, 88) |
| | MIC | Univariate feature selection with mutual information (87, 88) |
| | RFE | Feature selection based on Recursive Feature Elimination (89) |
| | Tree | Tree-based feature selection (90) |
| Dimension reduction | PCA | Reduce dimension based on principal component analysis (91) |
| | KernelPCA | Reduce dimension based on kernel principal component analysis with the 'rbf' kernel (92) |
| | TSVD | Reduce dimension based on truncated singular value decomposition (93) |
References
- 1. Harris, Z.S., Distributional structure. Word, 1954. 10(2-3): p. 146-162.
- 2. Ramos, J. Using tf-idf to determine word relevance in document queries. in Proceedings of the first instructional conference on machine learning. 2003. New Jersey, USA.
- 3. Mihalcea, R. and P. Tarau. Textrank: Bringing order into text. in Proceedings of the 2004 conference on Empirical Methods in Natural Language Processing. 2004. Barcelona, Spain: Association for Computational Linguistics.
- 4. Landauer, T.K., P.W. Foltz, and D. Laham, An introduction to latent semantic analysis. Discourse processes, 1998. 25(2-3): p. 259-284.
- 5. Blei, D.M., Probabilistic topic models. Communications of the ACM, 2012. 55(4): p. 77-84.
- 6. Blei, D.M., A.Y. Ng, and M.I. Jordan, Latent dirichlet allocation. Journal of machine Learning research, 2003. 3(Jan): p. 993-1022.
- 7. Ramage, D., et al. Labeled LDA: A supervised topic model for credit attribution in multilabeled corpora. in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 2009. Singapore: Association for Computational Linguistics.
- 8. Harris, Z., Distributional Structure. Word, 1954. 10(2-3): p. 142-146.
- 9. Mikolov, T., et al., Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781, 2013.
- 10. Pennington, J., R. Socher, and C.D. Manning. Glove: Global vectors for word representation. in Proceedings of the 2014 conference on Empirical Methods in Natural Language Processing. 2014. Association for Computational Linguistics.
- 11. Joulin, A., et al., Bag of Tricks for Efficient Text Classification, in Conference of the European Chapter of the Association for Computational Linguistics. 2017. p. 427-431.
- 12. Lebret, R. and R. Collobert, "The Sum of Its Parts": Joint Learning of Word and Phrase Representations with Autoencoders. Preprint at https://arxiv.org/abs/1506.05703, 2015.
- 13. Liu, B., C.C. Li, and K. Yan, DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks. Briefings in Bioinformatics, 2020. 21(5): p. 1733-1741.
- 14. Li, C.C. and B. Liu, MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks. Briefings in Bioinformatics, 2020. 21(6): p. 2133-2141.
- 15. Ye, X.G., G.L. Wang, and S.F. Altschul, An assessment of substitution scores for protein profile-profile comparison. Bioinformatics, 2011. 27(24): p. 3356-3363.
- 16. Rangwala, H. and G. Karypis, Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 2005. 21(23): p. 4239-4247.
- 17. Mittelman, D., R. Sadreyev, and N. Grishin, Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics, 2003. 19(12): p. 1531-1539.
- 18. Strauss, T. and M.J. von Maltitz, Generalising Ward's Method for Use with Manhattan Distances. PLoS One, 2017. 12(1): p. e0168288.
- 19. Weinberger, K.Q. and L.K. Saul, Distance Metric Learning for Large Margin Nearest Neighbor Classification. J. Mach. Learn. Res., 2009. 10: p. 207–244.
- 20. Laboulais, C., et al., Hamming distance geometry of a protein conformational space: Application to the clustering of a 4-ns molecular dynamics trajectory of the HIV-1 integrase catalytic core. Proteins-Structure Function and Genetics, 2002. 47(2): p. 169-179.
- 21. Suykens, J.A. and J. Vandewalle, Least squares support vector machine classifiers. Neural Processing Letters, 1999. 9(3): p. 293-300.
- 22. Liu, B., BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Briefings in Bioinformatics, 2019. 20(4): p. 1280-1294.
- 23. Pedregosa, F., et al., Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2011. 12: p. 2825-2830.
- 24. Liu, B., X. Gao, and H.Y. Zhang, BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Research, 2019. 47(20): p. e127-e127.
- 25. Zeng, M., et al., Protein–protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics, 2019. 36(4): p. 1114-1120.
- 26. Hochreiter, S. and J. Schmidhuber, Long short-term memory. Neural Computation, 1997. 9(8): p. 1735-1780.
- 27. Cho, K., et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing 2014. Association for Computational Linguistics.
- 28. Vaswani, A., et al., Attention is all you need, in Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, Curran Associates Inc.: Long Beach, California, USA. p. 6000–6010.
- 29. Ahmed, K., N.S. Keskar, and R. Socher, Weighted transformer network for machine translation. Preprint at https://arxiv.org/abs/1711.02132, 2017.
- 30. Kitaev, N., Ł. Kaiser, and A. Levskaya, Reformer: The efficient transformer. Preprint at https://arxiv.org/abs/2001.04451, 2020.
- 31. Yoo, P.D., B.B. Zhou, and A.Y. Zomaya, Machine learning techniques for protein secondary structure prediction: An overview and evaluation. Current Bioinformatics, 2008. 3(2): p. 74-86.
- 32. Qiang, X.L., et al., M6AMRFS: Robust Prediction of N6-Methyladenosine Sites With Sequence-Based Features in Multiple Species. Frontiers in Genetics, 2018. 9: p. 495.
- 33. Doench, J.G., et al., Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nature Biotechnology, 2016. 34(2): p. 184-191.
- 34. Friedel, M., et al., DiProDB: a database for dinucleotide properties. Nucleic Acids Research, 2009. 37: p. D37-D40.
- 35. Chen, W., et al., PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition. Analytical Biochemistry, 2014. 456: p. 53-60.
- 36. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 1997. 25(17): p. 3389-3402.
- 37. Chen, W., et al., iRNA-3typeA: Identifying Three Types of Modification at RNA's Adenosine Sites. Molecular Therapy-Nucleic Acids, 2018. 11: p. 468-474.
- 38. Hofacker, I.L., et al., Fast Folding and Comparison of Rna Secondary Structures. Monatshefte Fur Chemie, 1994. 125(2): p. 167-188.
- 39. Wang, J.T.L., et al., New techniques for extracting features from protein sequences. Ibm Systems Journal, 2001. 40(2): p. 426-441.
- 40. White, G. and W. Seffens, Using a neural network to backtranslate amino acid sequences. Electronic Journal of Biotechnology, 1998. 1: p. 196-201.
- 41. Lin, K., A.C.W. May, and W.R. Taylor, Amino acid encoding schemes from protein structure alignments: Multi-dimensional vectors to describe residue types. Journal of Theoretical Biology, 2002. 216(3): p. 361-365.
- 42. Kawashima, S., et al., AAindex: amino acid index database, progress report 2008. Nucleic Acids Research, 2008. 36: p. D202-D205.
- 43. Cuff, J.A. and G.J. Barton, Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins-Structure Function and Bioinformatics, 2000. 40(3): p. 502-511.
- 44. Heffernan, R., et al., Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Scientific Reports, 2015. 5.
- 45. Dayhoff, M., R. Schwartz, and B. Orcutt, A model of evolutionary change in proteins, in Atlas of protein sequence and structure. 1978, National Biomedical Research Foundation Silver Spring MD. p. 345-352.
- 46. Henikoff, S. and J.G. Henikoff, Amino-Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences of the United States of America, 1992. 89(22): p. 10915-10919.
- 47. Altschul, S.F. and E.V. Koonin, Iterated profile searches with PSI-BLAST - a tool for discovery in protein databases. Trends in Biochemical Sciences, 1998. 23(11): p. 444-447.
- 48. Liu, B., et al., Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics, 2014. 30(4): p. 472-479.
- 49. Glaser, F., et al., The ConSurf-HSSP database: The mapping of evolutionary conservation among homologs onto PDB structures. Proteins-Structure Function and Bioinformatics, 2005. 58(3): p. 610-617.
- 50. Dong, Q.W., S.G. Zhou, and J.H. Guan, A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics, 2009. 25(20): p. 2655-2662.
- 51. Chen, W., et al., PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics, 2015. 31(1): p. 119–120.
- 52. Horne, D.S., Prediction of Protein Helix Content from an Auto-Correlation Analysis of Sequence Hydrophobicities. Biopolymers, 1988. 27(3): p. 451-477.
- 53. Sokal, R.R. and B.A. Thomson, Population structure inferred by local spatial autocorrelation: An example from an Amerindian tribal population. American Journal of Physical Anthropology, 2006. 129(1): p. 121-131.
- 54. Feng, Z.P. and C.T. Zhang, Prediction of membrane protein types based on the hydrophobic index of amino acids. J Protein Chem, 2000. 19(4): p. 269-75.
- 55. Chen, J.H., et al., iEsGene-ZCPseKNC: Identify Essential Genes Based on Z Curve Pseudo k-Tuple Nucleotide Composition. IEEE Access, 2019. 7: p. 165241-165247.
- 56. Bari, A.T., et al. DNA Encoding for Splice Site Prediction in Large DNA Sequence. in Proceedings of the 18th International Conference on Database Systems for Advanced Applications. 2013. Berlin, Heidelberg: Springer Berlin Heidelberg.
- 57. Guo, Y.Z., et al., Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Research, 2008. 36(9): p. 3025-3030.
- 58. Liu, B., et al., Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection. PloS One, 2012. 7(9).
- 59. Zhou, J., et al., EL_PSSM-RT: DNA-binding residue prediction by integrating ensemble learning with PSSM Relation Transformation. BMC Bioinformatics, 2017. 18(1): p. 379.
- 60. Zhang, J., Q.C. Chen, and B. Liu, iDRBP_MMC: Identifying DNA-Binding Proteins and RNA-Binding Proteins Based on Multi-Label Learning Model and Motif-Based Convolutional Neural Network. Journal of Molecular Biology, 2020. 432(22): p. 5860-5875.
- 61. Gupta, S., et al., Predicting Human Nucleosome Occupancy from Primary Sequence. Plos Computational Biology, 2008. 4(8): p. e1000134.
- 62. Noble, W.S., et al., Predicting the in vivo signature of human gene regulatory sequences. Bioinformatics, 2005. 21: p. I338-I343.
- 63. El-Manzalawy, Y., D. Dobbs, and V. Honavar, Predicting flexible length linear B-cell epitopes. Comput Syst Bioinformatics Conf, 2008. 7: p. 121-32.
- 64. Leslie, C.S., et al., Mismatch string kernels for discriminative protein classification. Bioinformatics, 2004. 20(4): p. 467-476.
- 65. Luo, L.Q., et al., Accurate Prediction of Transposon-Derived piRNAs by Integrating Various Sequential and Physicochemical Features. PloS One, 2016. 11(4).
- 66. Lodhi, H., et al., Text classification using string kernels. Journal of Machine Learning Research, 2002. 2(3): p. 419-444.
- 67. Lin, H., et al., iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Research, 2014. 42(21): p. 12961-12972.
- 68. Liu, B., et al., Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research, 2015. 43(W1): p. W65-W71.
- 69. Liu, B., et al., Using distances between Top-n-gram and residue pairs for protein remote homology detection. BMC Bioinformatics, 2014. 15(2): p. S3.
- 70. Liu, B., et al., A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis. BMC Bioinformatics, 2008. 9: p. 510.
- 71. Zhang, W., T. Yoshida, and X. Tang, A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Systems with Applications, 2011. 38(3): p. 2758-2765.
- 72. Chang, C.C. and C.J. Lin, LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2011. 2(3): p. Article 27.
- 73. Breiman, L., Random Forests. Mach. Learn., 2001. 45(1): p. 5-32.
- 74. Sutton, C. and A. McCallum, An Introduction to Conditional Random Fields. Found. Trends Mach. Learn., 2012. 4(4): p. 267–373.
- 75. Hanson, J., et al., Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks. Bioinformatics, 2017. 33(5): p. 685-692.
- 76. Chawla, N.V., et al., SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002. 16(1): p. 321-357.
- 77. Farquad, M.A.H. and I. Bose, Preprocessing unbalanced data using support vector machine. Decision Support Systems, 2012. 53(1): p. 226-233.
- 78. Junsomboon, N. and T. Phienthrakul, Combining Over-Sampling and Under-Sampling Techniques for Imbalance Dataset, in Proceedings of the 9th International Conference on Machine Learning and Computing. 2017, Association for Computing Machinery: Singapore, Singapore. p. 243–247.
- 79. Pedregosa, F., et al., Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2011. 12(85): p. 2825-2830.
- 80. Schmidt, M., G. Fung, and R. Rosales, Fast Optimization Methods for L1 Regularization: A Comparative Study and Two New Approaches, in Proceedings of the 18th European conference on Machine Learning. 2007, Springer-Verlag: Warsaw, Poland. p. 286–297.
- 81. Bilgic, B., et al., Fast image reconstruction with L2-regularization. Journal of Magnetic Resonance Imaging, 2014. 40(1): p. 181-191.
- 82. Frey, B.J. and D. Dueck, Clustering by passing messages between data points. Science, 2007. 315(5814): p. 972-6.
- 83. Ester, M., et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. 1996, AAAI Press: Portland, Oregon. p. 226–231.
- 84. Kim, S.C. and T.J. Kang, Texture classification and segmentation using wavelet packet frame and Gaussian mixture model. Pattern Recogn., 2007. 40(4): p. 1207–1221.
- 85. Skarmeta, A.G., A. Bensaid, and N. Tazi, Data mining for text categorization with semi-supervised agglomerative hierarchical clustering. International Journal of Intelligent Systems, 2000. 15(7): p. 633-646.
- 86. Jain, A.K., M.N. Murty, and P.J. Flynn, Data clustering: a review. ACM computing surveys, 1999. 31(3): p. 264-323.
- 87. Chandrashekar, G. and F. Sahin, A survey on feature selection methods. Computers & Electrical Engineering, 2014. 40(1): p. 16-28.
- 88. Guyon, I. and A. Elisseeff, An introduction to variable and feature selection. Journal of machine learning research, 2003. 3: p. 1157-1182.
- 89. Darst, B.F., K.C. Malecki, and C.D. Engelman, Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genetics, 2018. 19: p. 353-363.
- 90. Sugumaran, V., V. Muralidharan, and K. Ramachandran, Feature selection using Decision Tree and classification through Proximal Support Vector Machine for fault diagnostics of roller bearing. Mechanical Systems and Signal Processing, 2007. 21(2): p. 930-942.
- 91. Yeung, K.Y. and W.L. Ruzzo, Principal component analysis for clustering gene expression data. Bioinformatics, 2001. 17(9): p. 763-774.
- 92. Schölkopf, B., A.J. Smola, and K.-R. Müller, Kernel Principal Component Analysis, in Proceedings of the 7th International Conference on Artificial Neural Networks. 1997, Springer-Verlag. p. 583–588.
- 93. Wei, J.-J., et al., ECG data compression using truncated singular value decomposition. Trans. Info. Tech. Biomed., 2001. 5(4): p. 290–299.