Behind the Paper

Teaching a machine to read raw spectroscopic data

This project introduces SECS, a system that, for the first time, can work on raw spectroscopic data: data that contains impurities, solvents, and experimental artefacts.

Published in Chemistry and Computational Sciences

Jun 06, 2026

Adrian Mirza

Doctoral Researcher, Helmholtz Zentrum Berlin


How did the idea emerge?

After managing to build a model that aligned different representations of molecules as part of a workshop paper, we started thinking about how such alignment models could be practically useful. Chemists typically interact with real-world molecules not as text representations, but as spectra.

At the beginning, however, the lack of curated experimental data was a big bottleneck, as no such datasets are openly available. So our first experiments used simulated data in raw form rather than post-processed peak data. In hindsight this was a good choice, because it left open an avenue for directly processing raw experimental spectra. Still, the gap between experiments and simulations remained large as long as we kept treating the data as-is: our initial experiments showed no ability to elucidate raw spectra. Adding simple artefacts such as noise was also insufficient, although it produced a first signal that gave us hope the approach would work.

How did we make it work?

The next months were spent testing different augmentation techniques: essentially trying to make the simulated spectra look more like the experimental ones. A plethora of options were available, ranging from simple noise addition to baseline distortion, peak broadening, and impurity addition. This was promising from the start, and we saw that using simulations alone was sufficient at the initial stage. The real performance upgrade came from using our in-house experimental dataset as a fine-tuning step.
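The four augmentations named above can be illustrated with a minimal sketch. This is not the SECS implementation: the function name `augment_spectrum`, all parameter names and default values, the Gaussian kernel, and the sinusoidal baseline are assumptions chosen for illustration, and the code uses only the Python standard library to stay self-contained.

```python
import math
import random


def augment_spectrum(spectrum, rng, noise_std=0.01, baseline_amp=0.02,
                     broaden_width=2.0, impurity_height=0.1):
    """Return a distorted copy of a 1D spectrum given as a list of intensities.

    All parameters are hypothetical knobs, not values from the paper.
    """
    n = len(spectrum)

    # Peak broadening: convolve with a normalised Gaussian kernel.
    half = int(3 * broaden_width)
    kernel = [math.exp(-0.5 * (k / broaden_width) ** 2)
              for k in range(-half, half + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    broadened = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < n:
                acc += w * spectrum[idx]
        broadened.append(acc)

    # Impurity addition: one extra Gaussian peak at a random position.
    centre = rng.randrange(n)
    with_impurity = [
        y + impurity_height * math.exp(-0.5 * ((i - centre) / broaden_width) ** 2)
        for i, y in enumerate(broadened)
    ]

    # Baseline distortion: a slow sinusoidal drift with a random phase.
    phase = rng.uniform(0.0, 2.0 * math.pi)
    drifted = [
        y + baseline_amp * math.sin(2.0 * math.pi * i / n + phase)
        for i, y in enumerate(with_impurity)
    ]

    # Noise addition: independent Gaussian noise on every point.
    return [y + rng.gauss(0.0, noise_std) for y in drifted]


# Toy "simulated" spectrum: a single sharp peak on a flat baseline.
clean = [0.0] * 200
clean[100] = 1.0
augmented = augment_spectrum(clean, random.Random(0))
```

Passing a seeded `random.Random` keeps each augmented copy reproducible, and drawing a fresh copy per training step would expose the model to a different distortion of the same simulated spectrum each time.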

To build a better understanding of our pipeline and its results, we worked with Luc Patiny, who had also built visualization platforms for other Jablonka Lab projects. Together we built secs.lamalab.org, version by version, with Luc's NMR expertise aiding the project tremendously as he scrutinised every output of the SECS (Structure Elucidation from Chemical Spectra) pipeline. On a less serious note, at least once a week someone in the lab makes a joke about this name.

Initially, the pipeline was served on graphics processing units (GPUs) in the cloud. This was not sustainable in the long run because of prohibitive costs, so we had to find a workaround for our most time-consuming step: converting all available isomers of a molecule into embeddings. The solution turned out to be hiding in vector databases, which can efficiently compress vectors at scale. We took all molecules from the PubChem database as our starting point, and from these we retrieve the molecules most similar to the spectrum we want to elucidate. The retrieved molecules seed the optimisation process, in which we iteratively modify the molecular structure to better match the representation of the spectrum.
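The retrieval step can be pictured with a small sketch. It is only an illustration under stated assumptions: the embeddings are invented three-dimensional toy vectors, cosine similarity is an assumed choice of metric, the names `cosine_similarity`, `retrieve_candidates`, and `toy_index` are hypothetical, and the brute-force scan stands in for the indexed search a real vector database would perform.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_candidates(spectrum_embedding, molecule_index, k=2):
    """Return the k molecules whose embeddings best match the spectrum.

    molecule_index maps a SMILES string to a precomputed embedding.
    A vector database would replace this linear scan with an index.
    """
    scored = [
        (cosine_similarity(spectrum_embedding, embedding), smiles)
        for smiles, embedding in molecule_index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [smiles for _, smiles in scored[:k]]


# Invented toy embeddings; real ones would come from the trained encoders.
toy_index = {
    "CCO": [0.9, 0.1, 0.0],
    "CC(=O)O": [0.1, 0.9, 0.1],
    "c1ccccc1": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]  # stand-in for the embedding of a measured spectrum
candidates = retrieve_candidates(query, toy_index, k=2)
print(candidates)  # ['CCO', 'CC(=O)O']
```

Because the molecule embeddings are computed once and stored, only the spectrum has to be embedded at query time, which is what removes the expensive per-query embedding of every isomer.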

How does it compare to chemists?

Thanks to three of my fellow PhD students with chemistry backgrounds and three NMR experts, all of whom accepted a challenge against our system, we were able to compare SECS with human chemists. On 20 randomly chosen molecules from our experimental evaluation suite, the system performed competitively and, to our surprise, even outperformed one of the experts on the chosen spectra.

For autonomous labs to be more than a demo, they need to interpret their own measurements. We don't think SECS is the final answer, but it is an important stepping stone, demonstrating promising results using perhaps the single most used spectroscopic method: proton NMR.


Nature Communications

