Computational identification of disordered flexible linkers (DFLs) is important for understanding the functions of intrinsically disordered regions (IDRs). Unlike other IDRs that can transition from disorder to order, such as molecular recognition features (MoRFs), DFLs are linkers or spacers between the domains of multi-domain proteins; they are highly flexible and lack a defined structure.
We propose a new predictor, iDFL-TL, which predicts DFLs by means of transfer learning, an approach derived from natural language processing (NLP). The sequence labelling model, a bidirectional long short-term memory network (Bi-LSTM), was pre-trained on a large IDR dataset to learn the characteristics shared by IDRs and DFLs, as well as the global interactions among residues along the whole protein. It was then transferred to DFL prediction by fine-tuning on DFL data to capture DFL-specific features.
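The two-stage procedure described above can be illustrated with a minimal sketch. It assumes PyTorch is available and is not the iDFL-TL implementation: the class `ResidueTagger`, the helpers `encode`, `label_region` and `train`, the layer sizes, the learning rates, and the toy sequences and labels are all hypothetical and chosen only to show pre-training on IDR labels followed by fine-tuning of the same weights on DFL labels.

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for residues outside the 20 standard amino acids.
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map a protein sequence to a (1, L) tensor of integer tokens."""
    return torch.tensor([[AA_TO_INDEX.get(aa, 0) for aa in sequence]])


def label_region(sequence, start, end):
    """Per-residue labels: 1 inside [start, end), 0 elsewhere."""
    return [1.0 if start <= i < end else 0.0 for i in range(len(sequence))]


class ResidueTagger(nn.Module):
    """Hypothetical Bi-LSTM that emits one logit per residue."""

    def __init__(self, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(len(AMINO_ACIDS) + 1, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, tokens):
        states, _ = self.bilstm(self.embed(tokens))  # (1, L, 2 * hidden_dim)
        return self.head(states).squeeze(-1)         # (1, L) logits


def train(model, examples, epochs, lr):
    """Train on (sequence, per-residue labels) pairs, one protein per step."""
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for sequence, labels in examples:
            optimizer.zero_grad()
            loss = loss_fn(model(encode(sequence)), torch.tensor([labels]))
            loss.backward()
            optimizer.step()
    return model


torch.manual_seed(0)

# Made-up toy data: the "large" source task (IDR labels) and the small
# target task (DFL labels). Real datasets would replace these lists.
seq_a = "MKTAYIAKQRSGSGSEEPPKK"
seq_b = "GSSGSSGMLVLAVKCDEFHIW"
idr_examples = [(seq_a, label_region(seq_a, 10, 21)),
                (seq_b, label_region(seq_b, 0, 7))]
dfl_examples = [(seq_a, label_region(seq_a, 10, 15))]

model = ResidueTagger()
train(model, idr_examples, epochs=5, lr=1e-3)  # stage 1: pre-train on IDRs
train(model, dfl_examples, epochs=5, lr=1e-4)  # stage 2: fine-tune on DFLs

# Per-residue propensity scores in [0, 1] for a query protein.
query = "MKTAYGSGSGSEEL"
model.eval()
with torch.no_grad():
    scores = torch.sigmoid(model(encode(query)))[0].tolist()
```

Fine-tuning all weights with a smaller learning rate is only one possible choice here; freezing the recurrent layer or replacing the output layer would be equally plausible variants, and the sketch makes no claim about which one iDFL-TL uses.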
Experimental results on the independent test dataset TEST60 showed that iDFL-TL outperformed DFLpred by 13.9% in terms of AUC. Further tests on updated proteins from the latest DisProt database (version 8.0.2) indicated that the performance of iDFL-TL can be improved by training with more DFLs, in which case it outperformed DFLpred by 21.7% in terms of AUC.
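For readers unfamiliar with the metric, the following sketch shows how an AUC can be computed from per-residue labels and predicted scores, assuming that AUC here refers to the area under the ROC curve. The function `roc_auc` and the toy labels and scores are hypothetical illustrations, not data or code from this study. It uses the pairwise (Mann-Whitney) formulation: the fraction of positive-negative residue pairs in which the positive residue receives the higher score, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 (1 = DFL residue), scores: predicted propensities.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for six residues.
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 pairs ranked correctly
```

The quadratic loop is kept for readability; for full-length datasets a rank-based implementation would be preferable.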