Behind the Paper

Automated Extraction of Chemical Synthesis Actions from Experimental Procedures

How to extract information about the operations needed to reproduce chemical reactions in the lab from millions of experimental procedures published in the chemical literature.

Published in Chemistry

Jul 20, 2020

Alain Vaucher

Research Scientist, IBM Research Europe


In the past few years, the Future of Computing team at IBM Research Europe has developed machine learning models to assist organic chemists. We made the technology available worldwide through the "RXN for Chemistry" portal, catalyzing the growth of a vibrant community of more than 14,000 users who generated more than 700,000 machine learning predictions of chemical reactions in two years. The RXN for Chemistry platform provides pre-trained models to predict the products of chemical reactions [1] and suggest retrosynthetic pathways [2].

As a next step, we explored the possibility of enabling the machine learning algorithms to design and drive chemical reactions in a real laboratory with what we call RoboRXN. Its implementation entails learning how reactions are executed in the lab, i.e., the series of experimental actions needed for a chemical reaction to succeed, all the way from mixing compounds in a flask to the work-up of the product. So far, no database contains such information in an adequate format. Luckily, the chemical literature holds more than enough information about executing reactions: millions of experimental procedures are available in journal articles and in patents. However, they are reported in prose, which hampers straightforward analysis and interpretation. Therefore, we took on the challenge of designing an algorithm to extract this information and provide it in a structured and automation-friendly format, as illustrated in the following table.

Experimental procedure sentence	Associated actions
Then water was added and the mixture was extracted with EA three times, the combined organic layers were washed with brine and dried (anhydrous Na2SO4).	ADD water; EXTRACT with EA 3 x; COLLECTLAYER organic; WASH with brine; DRYSOLUTION over anhydrous Na2SO4
18.1 ml (18 mmol) of a 1-molar solution of boron tribromide in dichloromethane were added to a solution of 3.37 g (9 mmol) of 4-chloro-3-(2,3-dichloro-4-methoxybenzyl)-5-difluoromethoxy-1-methyl-1H-pyrazole in 45 ml of dichloromethane which had been cooled to (−78)° C.	MAKESOLUTION with 4-chloro-3-(2,3-dichloro-4-methoxybenzyl)-5-difluoromethoxy-1-methyl-1H-pyrazole (3.37 g, 9 mmol) and dichloromethane (45 ml); ADD SLN; SETTEMPERATURE (−78)° C; ADD 1-molar solution of boron tribromide in dichloromethane (18.1 ml, 18 mmol)
The resulting slurry was stirred for 30 minutes at 25° C. and the pH was adjusted to pH=9 by addition of 6M NaOH (0.135 L).	STIR for 30 minutes at 25° C; PH with 6M NaOH (0.135 L) to pH 9.
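
To make the "automation-friendly" idea concrete, here is a minimal Python sketch that turns one of the action strings from the table above into a list of structured records. It is only an illustration: the `Action` class, the `split_actions` helper and the `ACTION_NAMES` set are hypothetical names, and the set contains only the action types visible in the table, not the full vocabulary used by the model.

```python
from dataclasses import dataclass
from typing import List

# Assumption: only the action names that appear in the table above.
ACTION_NAMES = {
    "ADD", "COLLECTLAYER", "DRYSOLUTION", "EXTRACT", "MAKESOLUTION",
    "PH", "SETTEMPERATURE", "STIR", "WASH",
}


@dataclass
class Action:
    name: str      # action type, e.g. "ADD"
    details: str   # everything that follows the action name


def split_actions(text: str) -> List[Action]:
    """Split a flat action string into one record per action keyword."""
    actions: List[Action] = []
    for raw in text.rstrip(".").split():
        token = raw.rstrip(";")  # tolerate an optional ';' separator
        if token in ACTION_NAMES:
            actions.append(Action(token, ""))
        elif actions and token:
            last = actions[-1]
            last.details = f"{last.details} {token}".strip()
    return actions


row = ("ADD water EXTRACT with EA 3 x COLLECTLAYER organic "
       "WASH with brine DRYSOLUTION over anhydrous Na2SO4")
for action in split_actions(row):
    print(action.name, "|", action.details)
```

Matching on a fixed keyword set, rather than on any upper-case word, keeps abbreviations such as "EA" or "SLN" inside the details of the preceding action.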

To implement a computational approach that extracts actions as illustrated in the table above, we first turned our attention to so-called rule-based models. They use rules to analyze sentences and the relationships between their components to determine compounds, operations, or reaction conditions. We soon realized that this approach was not flexible or powerful enough to reach our goals: when sentences are complex and their meaning is highly context-dependent, it is no longer practical to specify robust rules that capture the sense of sentences fully and unambiguously.
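
The brittleness of hand-written rules can be illustrated with a toy example. The sketch below is not the rule-based model we studied; the two regular-expression rules and the `extract` function are hypothetical and exist only to show how quickly such rules miss information.

```python
import re

# Two hypothetical rules: a pattern and the action template it produces.
RULES = [
    (re.compile(r"\b(?P<what>\w+) was added", re.IGNORECASE), "ADD {what}"),
    (re.compile(r"stirred for (?P<duration>\d+ \w+)", re.IGNORECASE),
     "STIR for {duration}"),
]


def extract(sentence):
    """Apply every rule and return the actions in order of appearance."""
    found = []
    for pattern, template in RULES:
        for match in pattern.finditer(sentence):
            found.append((match.start(), template.format(**match.groupdict())))
    return [action for _, action in sorted(found)]


# A simple sentence is handled as intended.
print(extract("Then water was added and the mixture was stirred for 30 minutes."))

# The third sentence of the table: the temperature and the pH adjustment
# are silently dropped because no rule covers them.
print(extract("The resulting slurry was stirred for 30 minutes at 25° C. "
              "and the pH was adjusted to pH=9 by addition of 6M NaOH (0.135 L)."))
```

Each missed case calls for a new rule or a more complicated pattern, and the rules start to interact with one another as their number grows.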

Instead, we chose a purely data-driven approach: after seeing enough examples, a machine learning algorithm will be able to figure out on its own which words to pay attention to in order to extract sensible experimental steps. The major advantage of such a data-driven approach is that it relies only on data: to improve it, one simply needs more examples.

To provide the training data for the machine-learning model, we set up an annotation framework that enabled us to generate examples of experimental procedure sentences and corresponding operations.

In this way, we generated more than 1,700 pairs of sentences and associated action sequences. Although substantial, this number is too small to train a reliable machine-learning model from the ground up. Nevertheless, we figured that the rule-based model that we had been studying earlier would be able to provide millions of examples at virtually no cost, albeit of lower quality. By pre-training the machine-learning model on that inexpensive data first, we could then refine it on the manually annotated samples to obtain satisfactory accuracy. The model can be used for free on our online platform:

Extracting the action sequence from a paragraph.

What still amazes me is the ability of the model to learn a structured syntax on its own. No need to tell it beforehand what action types are allowed and what set of properties is associated with each of them!

We presented this approach in an article published in Nature Communications, available here. Since then, the model for extracting actions from experimental procedures has paved the way for the implementation of RoboRXN. For instance, we used a large corpus of action sequences extracted from millions of experimental procedures to train a machine learning model to predict the experimental steps for new chemical reactions. Having assimilated the knowledge corresponding to decades of bench experience, this new model will act as the brain of the synthesis robot. More to come soon!

[1] https://dx.doi.org/10.1039/C8SC02339E

[2] https://dx.doi.org/10.1039/C9SC05704H


Fun Man Fung

about 6 years ago

Congratulation, Alain for your good work! I am happy for your success.
