Behind the Paper

Where do the data go when nobody is looking?

404—Data not found.

Published in Microbiology

Aug 28, 2020

Stephanie Jurburg

Postdoctoral Researcher, German Centre for Integrative Biodiversity Research (iDiv)


In 2005, as high-throughput sequencers became commercially available, the largest nucleotide sequence databases in the world joined forces to create the International Nucleotide Sequence Database (INSD). Today, the European, American, and Japanese archives exchange information daily, and the INSD ensures that nucleotide sequence data and their corresponding metadata are preserved as part of the scientific record for future generations to reuse.

Since its creation, the number of nucleotide sequences in the INSD has grown exponentially. But what fraction of these data are really reusable? We began to ask ourselves this question when we set out to collect bacterial community data for meta-analyses. While the archiving of ecological data is the subject of ongoing discussion, we expected microbial data to be more accessible: their format is generally homogeneous (i.e., nucleotide sequence reads), data archiving has been centralized for over a decade, and journals in microbial ecology have implemented increasingly stringent and precise data deposition guidelines.

However, this was not the case: we encountered problems with approximately half of the datasets. We wanted to know why these issues arose, but this is hard to determine experimentally. In initial surveys, we had found that the authors of articles often deposited their nucleotide sequence data in the INSD, but minor errors downstream (i.e., in archiving and documenting the data and metadata) rendered the sequences either inaccessible or not reusable. So instead, we decided to ask: where in the data deposition process do nucleotide sequence data stop being reusable, and how frequently does each error occur?

We divided the data archiving process into four criteria that were necessary for data reuse (data location, deposition, formatting, and labeling). We examined datasets from articles published in microbial ecology-specific journals, as we had found that these journals had more precise requirements for data deposition. Then, we checked whether the data met these four criteria.
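To make the idea of checking four criteria concrete, here is a minimal sketch of how such an audit could be tallied. It assumes the criteria are checked in the order listed above and that a dataset is counted at the first criterion it fails; that ordering rule, the function name first_failure, and the four example records are all invented for illustration and are not the study's actual procedure or data.

```python
from collections import Counter

# The four criteria named in the post, in the order the post lists them.
CRITERIA = ["location", "deposition", "formatting", "labeling"]


def first_failure(checks):
    """Return the first criterion a dataset fails, or None if it passes all four.

    Assumption: criteria are evaluated in sequence, so a dataset is
    attributed to the earliest stage at which it stops being reusable.
    """
    for criterion in CRITERIA:
        if not checks.get(criterion, False):
            return criterion
    return None


# Hypothetical audit records (invented for illustration, not the study's data).
datasets = {
    "article_A": {"location": True, "deposition": True, "formatting": True, "labeling": True},
    "article_B": {"location": True, "deposition": False, "formatting": False, "labeling": False},
    "article_C": {"location": True, "deposition": True, "formatting": True, "labeling": False},
    "article_D": {"location": False, "deposition": False, "formatting": False, "labeling": False},
}

# Count how many datasets drop out at each stage, and how many survive.
tally = Counter(first_failure(checks) or "reusable" for checks in datasets.values())

for stage in CRITERIA + ["reusable"]:
    print(f"{stage:<11} {tally[stage]}")
```

Counting each dataset only once, at its earliest failure, is one possible design choice; it answers "where does reusability stop?" rather than "how many problems does each dataset have?".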

Our findings shed some light on why data get lost. One reason is the rapid pace at which sequencing technologies and best practices change, which makes it hard to preserve all the data that are necessary to reanalyze the sequences in the future. One example is the frequent lack of mapping files, which are required to demultiplex sequence files into individual samples.
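A toy sketch may help show why a missing mapping file is so damaging. It assumes a deliberately simplified layout in which each read begins with a short barcode and the mapping file links each barcode to a sample name; real protocols vary (barcodes may sit in separate index reads, for example), and the barcodes, sample names, and reads below are invented.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group reads by sample using the barcode at the start of each read.

    Reads whose barcode is not in the mapping end up under "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical mapping file contents: barcode -> sample name.
mapping = {"ACGT": "soil_1", "TGCA": "soil_2"}

# Hypothetical multiplexed reads (4-base barcode followed by the sequence).
reads = ["ACGTGGGATTAG", "TGCACCCTTAGA", "ACGTTTTGGGCA"]

with_mapping = demultiplex(reads, mapping, barcode_length=4)
without_mapping = demultiplex(reads, {}, barcode_length=4)

print(with_mapping)     # reads split into soil_1 and soil_2
print(without_mapping)  # every read lands in "unassigned"
```

With the mapping, the reads separate into their samples; without it, the sequences are all still present but can no longer be attributed to any sample, which is what makes the deposited file unusable for community analyses.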

Our study also highlights that the solution may require little additional effort, but entails data providers, databases, and journals working together more closely. Even when data providers had uploaded their sequences to an INSDC database and had provided accession numbers, these accession numbers were often incorrect, or the data had not been made public. Additional checking that the accession numbers are correct may be a simple and effective way to ensure that data do not get lost. Sending reminders to data providers to make their data public upon article publication may be another.
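As one illustration of what a lightweight accession check could look like, the sketch below tests whether a string has the shape of a common sequence-archive accession. The patterns reflect our understanding of the accession formats published by the archives and should be verified against the archives' current documentation before use; the function name classify_accession and the example strings are invented. A format check of this kind can catch typos, but it cannot tell whether a record exists or has been made public.

```python
import re

# Assumed accession shapes (to be verified against the archives' documentation).
ACCESSION_PATTERNS = {
    "project": re.compile(r"PRJ[EDN][A-Z][0-9]+"),
    "study": re.compile(r"[EDS]RP[0-9]{6,}"),
    "biosample": re.compile(r"SAM[EDN][A-Z]?[0-9]+"),
    "sample": re.compile(r"[EDS]RS[0-9]{6,}"),
    "experiment": re.compile(r"[EDS]RX[0-9]{6,}"),
    "run": re.compile(r"[EDS]RR[0-9]{6,}"),
}


def classify_accession(text):
    """Return the accession type if the text matches a known shape, else None."""
    cleaned = text.strip().upper()
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(cleaned):
            return kind
    return None


# Invented strings that only illustrate the formats; they are not meant
# to point at real records.
candidates = ["PRJNA123456", "SRR1234567", "SRR12345", "PRJNA12345O"]

for candidate in candidates:
    kind = classify_accession(candidate)
    status = kind if kind else "does not look like a valid accession"
    print(f"{candidate}: {status}")
```

In this example the third string is rejected because it has too few digits for a run accession, and the fourth because a letter O has replaced a zero, the sort of small transcription error that can make a deposited dataset impossible to find.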

As the popularity of nucleotide sequencing continues to grow, so will the databases where these data are archived. Ensuring that the archives remain full of reusable data is an investment in the future of our field.


