When medicinal chemists work on drug discovery tasks like hit identification, hit-to-lead, and lead optimization, they need to predict the properties of the molecules they want to synthesize so they can optimize features like potency and binding affinity while avoiding toxicity across a number of different dimensions. The most efficient way to do this is with a trained AI model that can predict a molecule’s biological effects and propose structural changes.
A number of these chemical modeling efforts are underway; however, most public and private training sets are small compared to those in other fields – in the tens of thousands of labeled examples, versus the millions available for image recognition in public databases like ImageNet.
Many existing chemical modeling efforts also train on a molecular representation known as the simplified molecular-input line-entry system (SMILES). These sequences rely on a linguistic construct, with a series of characters representing a molecule’s atoms and bonds.
In our paper, we offer a new foundation model for chemistry called MolE that relies instead on molecular graphs – representations of molecules as networks of nodes (atoms) and edges (bonds), as opposed to a linear string of characters. MolE was first trained on over 842 million molecular graphs using a self-supervised approach – meaning it didn’t need experimental results, since it learns entirely from the chemical structure – and then further fine-tuned on a set of downstream absorption–distribution–metabolism–excretion–toxicity (ADMET) tasks. Ultimately, we showed that MolE outperforms earlier approaches, ranking first in 10 of the 22 ADMET tasks included in the Therapeutic Data Commons (TDC) leaderboard.
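To make the two representations concrete, here is a minimal sketch using the open-source RDKit toolkit (our own illustration, not MolE’s code), showing the same molecule written as a SMILES string and expressed as a graph of nodes and edges:

```python
from rdkit import Chem

# Aspirin written as a SMILES string: a linear sequence of characters
smiles = "CC(=O)Oc1ccccc1C(=O)O"
mol = Chem.MolFromSmiles(smiles)

# The same molecule as a graph: atoms are the nodes, bonds are the edges
nodes = [(atom.GetIdx(), atom.GetSymbol()) for atom in mol.GetAtoms()]
edges = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()) for bond in mol.GetBonds()]

print(nodes)  # [(0, 'C'), (1, 'C'), (2, 'O'), (3, 'O'), ...]
print(edges)  # [(0, 1), (1, 2), (1, 3), (3, 4), ...]
```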
Because MolE initially learns from unlabeled data, we can build models with fewer labeled data points, and those models generalize better and deliver stronger downstream benchmark performance than prior methods.
Building and Testing the MolE Foundation Model for Chemistry
MolE takes atom identifiers as input tokens and uses graph connectivity from the molecular graphs as positional information. We pretrained the model on this data – so that what it learned from large unlabeled datasets could transfer to smaller labeled ones – by randomly masking atoms and having the model predict the corresponding atom environment, comprising all neighboring atoms separated by no more than 2 bonds. We first used a self-supervised approach with approximately 842 million molecules, followed by supervised pre-training with about 456,000 molecules.
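As an illustration of this pretraining objective, the sketch below uses RDKit’s Morgan environment hashes as stand-in identifiers for the radius-2 atom environments; the masking fraction and the [MASK] token are illustrative assumptions, not MolE’s actual vocabulary or hyperparameters:

```python
import random
from rdkit import Chem
from rdkit.Chem import AllChem

MASK_TOKEN = "[MASK]"  # illustrative mask symbol, not MolE's real vocabulary

def atom_environment_labels(mol, radius=2):
    """Label each atom with the Morgan hash of its radius-2 environment."""
    info = {}
    AllChem.GetMorganFingerprint(mol, radius, bitInfo=info)
    labels = {}
    for env_hash, occurrences in info.items():
        for atom_idx, r in occurrences:
            if r == radius:  # keep only full 2-bond environments
                labels[atom_idx] = env_hash
    return labels

def make_masked_example(smiles, mask_fraction=0.15, radius=2):
    """Mask some atoms; the training target is each masked atom's environment."""
    mol = Chem.MolFromSmiles(smiles)
    tokens = [atom.GetSymbol() for atom in mol.GetAtoms()]
    targets = atom_environment_labels(mol, radius)
    n_mask = min(len(targets), max(1, int(mask_fraction * len(tokens))))
    masked = random.sample(sorted(targets), n_mask)
    for idx in masked:
        tokens[idx] = MASK_TOKEN
    return tokens, {idx: targets[idx] for idx in masked}

tokens, targets = make_masked_example("CC(=O)Oc1ccccc1C(=O)O")
print(tokens)   # atom tokens with a few replaced by [MASK]
print(targets)  # masked atom index -> radius-2 environment id to predict
```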
Next, we fine-tuned the model and assessed the quality of its predictions using the set of 22 ADMET tasks included in the TDC benchmark. The Therapeutic Data Commons is a resource, accessible via an open Python library, that allows researchers to evaluate AI capabilities across a variety of therapeutic modalities and stages of discovery. This benchmark gave us a standard way to compare MolE against other established models, such as those using precomputed fingerprints (like RDKit descriptors or Morgan fingerprints), convolutional neural networks trained on SMILES, and graph neural networks like ChemProp.
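For reference, the TDC benchmarks are directly accessible through that Python library; a minimal sketch of the standard evaluation loop looks like the following, where my_model is a placeholder for whatever model is being evaluated, not MolE’s API:

```python
from tdc.benchmark_group import admet_group

group = admet_group(path="data/")  # downloads the ADMET benchmark datasets
predictions_list = []

for seed in [1, 2, 3, 4, 5]:  # TDC reports mean and std over five seeds
    benchmark = group.get("Caco2_Wang")  # one of the 22 ADMET tasks
    train, valid = group.get_train_valid_split(
        benchmark=benchmark, split_type="default", seed=seed
    )
    test = benchmark["test"]

    model = my_model.fit(train, valid)    # placeholder training call
    y_pred = model.predict(test["Drug"])  # placeholder prediction on test SMILES
    predictions_list.append({benchmark["name"]: y_pred})

results = group.evaluate_many(predictions_list)  # official metric for the task
print(results)
```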
When compared with the best models on the TDC leaderboard, MolE achieved state-of-the-art performance on 10 of the 22 tasks and was the second-best model on another 4. The tasks where it performed best included 6 regression tasks and 4 classification tasks, primarily those related to CYP inhibition. CYP enzymes play a major role in metabolizing drugs, and CYP inhibition often leads to drug-drug interactions. While the CYP results benefit from larger training datasets, MolE also achieved top performance on tasks with just a few hundred training examples, including predicting half-life and CYP substrates.
The next-best model after MolE, ZairaChem, achieved top performance on only 5 of the 22 tasks.
What We Learned – And What’s Next
This paper demonstrates that a pre-trained transformer-based model, MolE, can predict chemical and biological properties directly from molecular graphs. We used a two-step pre-training approach – self-supervised followed by supervised – to train models that outperform earlier approaches.
So what could this mean for the future of AI drug discovery?
We think MolE will prove to be an important tool in our understanding of how new molecules will perform in the body – and in guiding the design of highly optimized molecules. By training the model to understand atom environments and their relationships to each other, it can also help avoid problems of classical fingerprints, such as sparsity and the bit collisions that occur when environments are hashed into fixed-length bit vectors.
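To see the collision problem concretely, here is a small RDKit sketch (our own illustration, not from the paper): when distinct atom environments are hashed onto a short bit vector, some inevitably land on the same bit and become indistinguishable:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# Unfolded Morgan fingerprint: one integer identifier per distinct environment
info = {}
AllChem.GetMorganFingerprint(mol, 2, bitInfo=info)
n_environments = len(info)

# Folded 64-bit vector: the same environments hashed onto a fixed-length vector
bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=64)
n_bits_on = bv.GetNumOnBits()

# Fewer on-bits than environments means distinct environments collided
print(f"{n_environments} environments -> {n_bits_on} bits set")
```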
And while we used only drug-like molecules in this research, we expect that larger and more diverse datasets can only improve the model’s performance. This work, we believe, represents an important first step towards establishing a foundation model for chemical property prediction.
Note to researchers: The code to use the model reported in this study is available under the Attribution-NonCommercial 4.0 International License (CC-BY-NC 4.0) at https://github.com/recursionpharma/mole_public.