Which algorithm is just right for my data?

Artificial intelligence is having an impact on research and drug discovery, and such techniques have been developed and applied to structure-activity analysis for decades. Yet the question of which algorithm to use, and how many molecules are needed, remains open. We found that a Goldilocks zone may apply.

The recent visibility of machine learning has resulted from dramatic improvements in compute power (hardware), novel algorithms (software) and the growing amount of data available (public or private knowledge). Recent algorithmic advances have produced newer model architectures, including transformers (the basis of large language models, LLMs), which have captured considerable attention for being trained on massive datasets to enable realistic text and image generation (e.g. ChatGPT). At the opposite end of the scale, methods like few-shot learning (FSLC) models potentially offer some predictive power with very small datasets. As scientists in drug discovery, we apply machine learning algorithms to areas including structure-activity relationship and structure-property relationship datasets. With more data, normally hundreds of molecules, we can generate machine learning models that enable us to perform computational searches for new molecules with the ideal bioactivity, or to score molecules for their predicted properties. This could help to dramatically shorten aspects of drug discovery, or decrease the use of expensive assays, or perhaps even animal models, in toxicity assessment.
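
To make the classical workflow mentioned above concrete, here is a minimal sketch of fitting a support vector regressor to activity data using molecular fingerprints. It assumes RDKit and scikit-learn are installed; the SMILES strings, activity values and hyperparameters are placeholders for illustration, not the data or settings from our study.

```python
# Minimal sketch of a classical SAR workflow: Morgan fingerprints + SVR.
# Assumes RDKit and scikit-learn; molecules and pIC50 values are placeholders.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.svm import SVR

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC"]  # placeholder molecules
pic50 = np.array([5.1, 6.3, 7.0, 4.8])                            # placeholder activities

def featurize(smi, radius=2, n_bits=2048):
    """Convert a SMILES string to a Morgan fingerprint bit vector."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

X = np.array([featurize(s) for s in smiles])

# RBF-kernel SVR; hyperparameters here are illustrative defaults.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, pic50)
print(model.predict(X[:2]))  # predicted activities for the first two molecules
```

With hundreds of molecules rather than the four shown here, the same pipeline supports cross-validation and virtual screening of candidate structures.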

One question that really motivated the work described in our recent paper is heard repeatedly: "How much data do we need to build a model?", along with the follow-up, "Which algorithm is the best to use?". It is tempting to answer, "How long is a piece of string?". Traditionally, scientists have stuck with a machine learning method that works well for them and used it consistently across different datasets. We set out to explore several representative machine learning methods, including a classical approach (SVR), FSLC and transformer models, on datasets of various sizes. In the process we identified a 'Goldilocks zone' for each model type, discovering that dataset size and diversity may ultimately determine the optimal algorithm. Just as Goldilocks in the children's story searches for what is "just right": when datasets are small (<50 molecules), FSLC works best; when datasets are small-to-medium sized (50-240 molecules) and diverse, transformers perform best; and when datasets are larger (>240 molecules), classical methods like SVR perform best. We therefore suggest that the optimal machine learning method to choose depends on the dataset size and diversity, as sketched in the simple decision rule below.
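
As a back-of-the-envelope illustration, the rule of thumb above can be written as a simple dispatch function. The size thresholds come from our findings; the 0-1 diversity score and its 0.5 cutoff are placeholder assumptions of ours, not values from the paper.

```python
# Illustrative dispatch for the 'Goldilocks zone' rule of thumb.
# Size thresholds (<50, 50-240, >240 molecules) follow the post;
# the diversity measure and its 0.5 cutoff are placeholder assumptions.
def suggest_algorithm(n_molecules: int, diversity: float) -> str:
    """Suggest a model family from dataset size and a 0-1 diversity score."""
    if n_molecules < 50:
        return "few-shot learning (FSLC)"
    if n_molecules <= 240 and diversity >= 0.5:  # cutoff is a placeholder
        return "transformer"
    return "classical method (e.g. SVR)"

print(suggest_algorithm(30, 0.8))   # -> few-shot learning (FSLC)
print(suggest_algorithm(120, 0.7))  # -> transformer
print(suggest_algorithm(500, 0.4))  # -> classical method (e.g. SVR)
```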

We demonstrated this with a large collection of kinase models. We also trained a machine learning model to predict which approach is likely to have the highest predictive power, using Fast Interpretable Greedy-Tree Sums (FIGS), a generalized classification and regression tree method that creates highly interpretable decision trees. We showed that relative model performance can be reliably predicted from dataset size and diversity alone. Earlier inspiration for this work came from collections of very large numbers of machine learning models curated and built from different public databases.
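
For readers curious to experiment with this meta-modelling idea, a minimal sketch is possible with the open-source imodels package, which provides a FIGS implementation. The meta-features and labels below are synthetic placeholders encoding a toy version of the size rule; they are not the descriptors or training data from our study.

```python
# Sketch of a FIGS meta-model predicting which algorithm family should win,
# using the imodels package (https://github.com/csinva/imodels).
# Meta-features and labels are synthetic placeholders for illustration only.
import numpy as np
from imodels import FIGSClassifier

rng = np.random.default_rng(0)

# Hypothetical meta-features for 200 datasets: [dataset size, diversity score].
X = rng.uniform([10, 0.0], [1000, 1.0], size=(200, 2))

# Toy binary label: 1 if a classical method (e.g. SVR) is expected to win.
y = (X[:, 0] > 240).astype(int)

clf = FIGSClassifier(max_rules=8)  # a small rule budget keeps the trees readable
clf.fit(X, y)
print(clf)  # the fitted model prints as a human-readable sum of trees
```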

We demonstrated the utility of these different algorithms using a target for Alzheimer's disease, namely MARK1, for which FSLC performed best in finding active molecules. The concept we propose could certainly be evaluated with datasets beyond those used in this study (e.g. ADME/Tox or other molecular properties). These findings may also help to answer the perennial question faced with any new dataset, so that, like Goldilocks, we can find the machine learning method that is just right.
