1. Datasets and Code
We constructed two independent datasets, one based on GDSC1 and one based on GDSC2. The GDSC1-based dataset comprises 116,966 response pairs (448 cell lines, 284 drugs) with an 8.0% missing value rate. The GDSC2-based dataset comprises 91,536 pairs (449 cell lines, 222 drugs) with an 8.1% missing value rate. Our case studies mainly use drugs from GDSC2, which contains more recent drug response data. Our dataset can be downloaded from Hugging Face.
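As a quick sanity check on the figures above, the missing value rate can be recomputed from the pair counts. This is a minimal sketch that assumes the rate is defined as one minus the number of observed response pairs divided by the size of the full cell line by drug matrix; that definition, and the helper name missing_rate, are assumptions made here for illustration and are not taken from the released code.

```python
def missing_rate(n_pairs: int, n_cell_lines: int, n_drugs: int) -> float:
    """Fraction of cell line-drug combinations with no recorded response.

    Assumes the full matrix has n_cell_lines * n_drugs entries and that
    n_pairs of them are observed (an illustrative definition).
    """
    total = n_cell_lines * n_drugs
    return 1.0 - n_pairs / total


# Counts as stated in the dataset description: (pairs, cell lines, drugs).
datasets = {
    "GDSC1": (116_966, 448, 284),
    "GDSC2": (91_536, 449, 222),
}

rates = {name: missing_rate(*counts) for name, counts in datasets.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2%} missing")
```

Under this definition, both rates land within about 0.1 percentage point of the reported 8.0% and 8.1%.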
Source code and data: GitHub link