Behind the Paper

Unlocking Scientific Data Hidden in Charts: Behind the Development of ChartRecover

Published in Computational Sciences and Mechanical Engineering

Jun 09, 2026

Zongguo Wang

Prof., Computer Network Information Center, Chinese Academy of Sciences

Scientific charts are one of the most information-rich components of modern research papers. Every year, millions of charts are published across disciplines ranging from materials science and chemistry to biology and medicine. These charts often contain valuable experimental measurements that cannot be found anywhere else in the article.

However, despite the rapid development of artificial intelligence and scientific databases, most of these data remain effectively inaccessible to machines. Researchers can read a chart and immediately understand the trends it presents, but extracting the underlying numerical values often requires tedious manual work. Existing tools typically depend on human interaction and become impractical when processing thousands of charts at scale.

Our motivation for developing ChartRecover originated from this challenge. As we worked on large-scale scientific data collection and AI-ready database construction, we repeatedly encountered valuable experimental results that existed only as images embedded in publications. We realized that unlocking these hidden data resources could significantly accelerate data-driven scientific discovery.

The Challenge We Did Not Expect

When we first set out to automate chart extraction, we assumed that identifying the numbers on the axes would be the easy part. We thought that once an optical character recognition (OCR) system read the tick labels, we could simply map those text boxes to their corresponding numerical values.

However, we quickly hit an unexpected roadblock. We found that even small visual offsets between the text labels and the actual tick marks produced massive errors in the recovered data coordinates. Because of variations in font sizes, line spacing, and general layout, the visual center of a text label rarely aligns perfectly with the physical tick mark. If we directly used the text's center point as our anchor, those tiny pixel deviations amplified into significant systematic mapping errors when calculating the final scientific values. This seemingly minor detail—the slight misalignment of text—became one of the biggest obstacles to achieving high-fidelity data extraction.
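
To make the amplification concrete, here is a minimal sketch, assuming a linear axis and made-up pixel coordinates (none of these numbers or function names come from the paper). It calibrates a pixel-to-value mapping from two anchors, once using the tick marks themselves and once using label centers that sit 4 pixels away from the ticks.

```python
def linear_axis(p0, v0, p1, v1):
    """Build a pixel -> value mapping for a linear axis from two
    (pixel, value) anchors. Hypothetical helper for illustration."""
    scale = (v1 - v0) / (p1 - p0)  # data units per pixel

    def to_value(pixel):
        return v0 + (pixel - p0) * scale

    return to_value


# Assumed axis: tick marks at pixels 100 and 500, labelled 0 and 1000.
tick_map = linear_axis(100, 0.0, 500, 1000.0)

# Same labels, but anchored on label centers shifted 4 px from the ticks.
label_map = linear_axis(104, 0.0, 504, 1000.0)

point_px = 300  # pixel position of one data point
value_from_ticks = tick_map(point_px)
value_from_labels = label_map(point_px)
error = value_from_ticks - value_from_labels

print(value_from_ticks, value_from_labels, error)  # 500.0 490.0 10.0
```

In this toy case each pixel spans 2.5 data units, so a 4-pixel anchor offset shifts every recovered value by 10 units, or 1% of the axis range. The error is systematic: it affects every point read from the chart in the same way.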

Building ChartRecover

To solve these issues, we developed ChartRecover (https://www.nature.com/articles/s44172-026-00691-8), an end-to-end framework designed to interpret charts much like a human researcher does, but at machine speed. We built the system around three core capabilities:

Element Detection: Instead of relying on rigid, pre-programmed templates, our system uses an object detection architecture to identify common chart components. It detects axes, tick marks, legends, and the actual data points across a wide variety of visual styles and complex layouts.
Coordinate Recovery: To overcome the text deviation problem, we introduced a specialized algorithm that precisely associates the semantic meaning of the tick text with the exact physical pixel coordinates of the tick mark. By treating the physical tick mark as the true anchor, we successfully eliminated the impact of text rendering offsets, establishing a highly accurate mapping between the image pixels and the real-world numerical scales.
Adaptive Parsing: In the real world, charts are messy and diverse. Bar charts, for instance, can be horizontal or vertical, and their data can be independent or stacked. We engineered ChartRecover to systematically analyze the geometric overlap and boundaries of these structures. This allows the system to automatically distinguish between stacked and non-stacked data geometries, establishing accurate baselines regardless of the journal's unique plotting conventions.
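
The coordinate recovery and adaptive parsing ideas above can be illustrated with a small sketch. This is not the ChartRecover implementation: the function names, the nearest-tick matching rule, the overlap threshold, and all coordinates are assumptions chosen for illustration. The first part pairs each OCR label with its nearest detected tick mark and fits a linear axis from the tick positions. The second part flags vertical bars as stacked when two boxes share a horizontal extent and one ends where the other begins.

```python
def associate_labels_to_ticks(label_centers, label_values, tick_pixels):
    """Pair each label value with the pixel of the nearest tick mark,
    so the tick (not the text center) becomes the anchor."""
    anchors = []
    for center, value in zip(label_centers, label_values):
        nearest = min(tick_pixels, key=lambda t: abs(t - center))
        anchors.append((float(nearest), float(value)))
    return anchors


def fit_axis(anchors):
    """Least-squares fit of value = slope * pixel + intercept."""
    n = len(anchors)
    mean_p = sum(p for p, _ in anchors) / n
    mean_v = sum(v for _, v in anchors) / n
    num = sum((p - mean_p) * (v - mean_v) for p, v in anchors)
    den = sum((p - mean_p) ** 2 for p, _ in anchors)
    slope = num / den
    return slope, mean_v - slope * mean_p


def x_overlap_ratio(a, b):
    """Horizontal overlap of two boxes relative to the narrower one.
    Boxes are (x_left, x_right, y_top, y_bottom) in image pixels."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    width = min(a[1] - a[0], b[1] - b[0])
    return max(inter, 0.0) / width


def is_stacked(bars, tol=2.0, min_overlap=0.8):
    """Heuristic: vertical bars are stacked if two boxes overlap
    horizontally and one's bottom edge touches the other's top edge."""
    for i, a in enumerate(bars):
        for b in bars[i + 1:]:
            if x_overlap_ratio(a, b) < min_overlap:
                continue
            if abs(a[3] - b[2]) <= tol or abs(b[3] - a[2]) <= tol:
                return True
    return False


# Assumed detections: label centers are offset from the true ticks.
anchors = associate_labels_to_ticks(
    label_centers=[103.5, 204.0, 297.0],
    label_values=[0, 50, 100],
    tick_pixels=[100, 200, 300],
)
slope, intercept = fit_axis(anchors)

# Image y grows downward, so y_top < y_bottom for every box.
stacked = [(50, 70, 200, 300), (50, 70, 120, 200)]
grouped = [(50, 70, 100, 300), (75, 95, 150, 300)]
print(anchors, slope, intercept, is_stacked(stacked), is_stacked(grouped))
```

Here the label offsets of a few pixels disappear once the anchors snap to the tick marks, and the fitted axis maps pixel 260 to the value 80. A real system would also need to handle logarithmic axes, missing or misread labels, and horizontal bars, which this sketch deliberately leaves out.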

From Charts to Scientific Knowledge

By transforming static images into machine-readable numbers, ChartRecover opens entirely new possibilities for building large-scale, AI-ready scientific databases.

Historically, building databases for materials science, chemistry, or biomedical research required researchers to manually interact with extraction tools—a process completely unsuited for large-scale scenarios. Now, researchers can automatically recover absolute measurement data directly from visual plots. This structured data can immediately support advanced downstream applications, such as large-scale structure-property relationship mining. By making this data accessible, we can accelerate the screening of new catalysts, the discovery of novel battery materials, and the construction of comprehensive scientific knowledge graphs.

Looking Ahead

More broadly, we hope this work contributes to a future in which scientific charts become as searchable and reusable as scientific text. Unlocking the vast amount of empirical evidence currently trapped inside published charts could significantly expand the resources available for AI-driven scientific discovery.

As the shift toward automated, high-quality data accumulation continues, tools like ChartRecover will be at the forefront of facilitating the broader reuse of global scientific data. Ultimately, by making the data behind the charts accessible to everyone, we can enhance research transparency, improve automated scientific verification, and ensure that no valuable experimental result is left behind simply because it was published as an image.
