The recent visibility of machine learning has resulted from dramatic improvements in compute power (hardware), novel algorithms (software), and the growing amount of data available (public or private knowledge). Recent advances in machine learning have produced newer model architectures, including transformers (the basis of large language models, LLMs), which have captured considerable attention for being trained on massive datasets to enable realistic text and image generation (e.g. ChatGPT), while at the opposite end of the scale, methods like few-shot learning (FSLC) models potentially offer some predictive power with very small datasets. As scientists in drug discovery, some of the areas where we apply machine learning algorithms include structure-activity relationship and structure-property relationship datasets. With more data, typically hundreds of molecules, we can generate machine learning models that enable us to perform computational searches for new molecules with the ideal bioactivity, or to score molecules for their predicted properties. This could help to dramatically shorten aspects of drug discovery, or decrease the use of expensive assays, or perhaps even animal models, in toxicity assessment.
One question that really motivated our recent work described in this paper is repeated and heard often: "How much data do we need to build a model?" along with the follow-up, "Which algorithm is the best to use?" You are tempted to offer "How long is a piece of string?" in response. Traditionally, scientists have stuck with a machine learning method that works well for them and then used it consistently across different datasets. We set out to explore several representative machine learning methods, including a classical method (support vector regression, SVR), FSLC, and transformer models, using datasets of various sizes. In the process we identified a 'Goldilocks zone' for each model type, and we discovered that dataset size and diversity may ultimately determine the optimal algorithm. In the same manner as Goldilocks in the children's story trying to find what is "just right": when datasets are small (<50 molecules), FSLC works best; when datasets are small-to-medium sized (50-240 molecules) and diverse, transformers perform best; and when datasets are larger (>240 molecules), classical methods like SVR perform best. We therefore suggest that the optimal machine learning method to choose depends on the dataset size and diversity.
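To make the "Goldilocks" rule above concrete, here is a minimal sketch in Python. The size thresholds (<50, 50-240, >240 molecules) come from our results; the function name, the boolean diversity flag, and the fallback for medium-sized but less diverse datasets are illustrative assumptions rather than part of the paper.

```python
def recommend_method(n_molecules: int, is_diverse: bool) -> str:
    """Suggest a model family from dataset size and diversity.

    Thresholds follow the 'Goldilocks zone' described above; how
    'diverse' is quantified (e.g. which chemical diversity metric and
    cutoff to use) is left to the reader.
    """
    if n_molecules < 50:
        return "few-shot learning (FSLC)"
    if n_molecules <= 240 and is_diverse:
        return "transformer"
    # Falling back to a classical method for medium-sized but less
    # diverse datasets is an assumption made for this sketch.
    return "classical method (e.g. SVR)"


# Example usage with hypothetical datasets
print(recommend_method(120, is_diverse=True))   # -> transformer
print(recommend_method(500, is_diverse=False))  # -> classical method (e.g. SVR)
```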
We demonstrated this with a large collection of kinase models. We also trained a machine learning model to predict which approach is likely to have the highest predictive power, using Fast Interpretable Greedy-Tree Sums (FIGS), a generalized classification and regression tree model that creates highly interpretable decision trees. We showed that relative model performance can be reliably predicted from dataset size and diversity alone. Earlier inspiration for this work came from collections of very large numbers of machine learning models curated and built from different public databases.
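For readers who want to experiment with this kind of meta-model, below is a minimal sketch, not the code used in the paper, assuming the open-source imodels implementation of FIGS; the toy descriptors, labels, and cutoff values are hypothetical placeholders.

```python
# A minimal sketch (not the paper's code) of a FIGS meta-model that
# predicts which algorithm family should win from dataset descriptors.
# Assumes the open-source `imodels` package (pip install imodels).
import numpy as np
from imodels import FIGSClassifier

# Hypothetical descriptors for a handful of training datasets:
# column 0 = number of molecules, column 1 = a diversity score in [0, 1]
X = np.array([
    [30, 0.40],
    [45, 0.70],
    [120, 0.80],
    [200, 0.65],
    [400, 0.55],
    [800, 0.30],
], dtype=float)

# Hypothetical binary labels: 1 if a classical method (e.g. SVR) was the
# top performer on that dataset, 0 if FSLC or a transformer won instead
y = np.array([0, 0, 0, 0, 1, 1])

# FIGS fits a small sum of shallow trees that stays easy to inspect
meta_model = FIGSClassifier(max_rules=4)
meta_model.fit(X, y, feature_names=["n_molecules", "diversity"])

print(meta_model)                        # shows the learned rules
print(meta_model.predict([[500, 0.5]]))  # e.g. -> [1], i.e. favour SVR
```

In practice the meta-model would be trained on many real datasets with the performance of each algorithm measured properly; the point of the sketch is simply that FIGS exposes the learned decision rules directly, which is why an interpretable model was attractive for this task.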
We demonstrated the utility of these different algorithms using a target for Alzheimer's disease, namely MARK1, for which FSLC performed the best in finding active molecules. Certainly, the concept we propose could be evaluated with other datasets beyond those used in this study (e.g. ADME/Tox or other molecular properties). These findings may also help to answer the perennial question when faced with a new dataset, and, like Goldilocks, we can find the machine learning method that is just right.