Biological data type
Homogeneous biological sequence similarities
For the homogeneous biological sequence similarities, the queries and the retrieved samples are homogeneous.
(left) Non-coding RNA similarity analysis, which is a homogeneous biological sequence analysis task. (right) Non-coding RNA and disease association identification, which is a heterogeneous biological sequence analysis task.
Heterogeneous biological sequence similarities
For the heterogeneous biological sequence similarities, the queries and the retrieved samples are heterogeneous.
(left) Text matching task, which is a homogeneous language analysis task. (right) Machine translation task, which is a heterogeneous language analysis task.
Biological sequence similarities calculation methods
Distribution methods
Distribution methods fully consider the spatial correlation of input pairs, and show good generalization ability for modelling different types of data. [1]
Representation methods
The representation methods employ the Siamese architecture to encode the sentences, which can be applied to analyse the biological sequence similarities. [1]
Interaction methods
Interaction methods employ the hierarchical deep architecture to learn the semantics from the local interaction matrix of query and retrieved documents, which is suitable for comprehensively learning the associations between biological sequences. [1]
Reference:
[1] Chandrasekaran D, Mago V. Evolution of Semantic Similarity—A Survey. ACM Comput Surv. 2021;54(2):41. doi: 10.1145/3440755.
The input data should be in the BLS format. Detailed information of the BLS format is introduced in the followings:
Required format of input biological sequences.
Please enter the biological sequences in FASTA format.
Example:
>5www_A
GHHHHHHMQAALLRRKSVNTTECVPVPSSEHVAEIVGRQLGMVLWIYKWFKPDGRLTDEQIADGMVGMLFPPFYIKTPVRGEEPIFVVTGRKEDVAMAKREILSAAEHFSMIRAS
Required format of the input vectors.
Please enter the feature vectors in following format (similar to FASTA format).
>vec_name1
vec_val1 vec_val2 vec_val3 ... vec_valn
Example:
>Data_A_ID:0
0.045 0.027 0.035 0.030 0.039 0.023 0.006 0.032 0.030 0.021 0.045 0.024 0.023 0.029 0.035 0.035 0.199 0.168 0.158
Required format of the input labels.
Please enter the associations in list format.
Example:
0 0
1 1
2 0
2 1
3 3
3 31
4 3
Interaction with BioSeq-BLM
Format conversion
If users want to generate feature vectors for input biological sequences, the pipeline software, BioSeq-BLM is recommended.
Based on biological language models, BioSeq-BLM can extract features representing linguistics attributes and biological attributes of biological sequences.
Users can serve the feature vectors generated by BioSeq-BLM as the input of BioSeq-Diabolo after a simple format conversion.