Behind the Paper

Metadata templates demystify the FAIR principles

Metadata templates serve as references for what constitutes FAIR data in given scientific communities. They embody standards for what the community considers to be FAIR—in a form that can be used by both people and computers to assist with data stewardship and data sharing.

Published in Research Data

Nov 12, 2022

Mark Musen

Professor, Stanford University

Metadata templates demystify the FAIR principles

Liked by Tristan Matthews and 2 others

Explore the Research

Most of us have an intuitive idea of what it means for data to be "FAIR." Still, there are still no universally recognized strategies for ensuring that research data abide by the FAIR Guiding Principles in a manner that makes the data truly findable, accessible, interoperable, and reusable. Our approach, described in this recently published paper, is motivated by an observation that many of us in the data-science community find a bit depressing: Investigators profess to want to make their data FAIR (often because their funders mandate it), but they feel overwhelmed by the FAIR Guiding Principles as articulated in the now-famous 2016 publication. There are simply too many principles to consider, and the principles all deal with things that most scientists never think about. Between this rock and hard place, investigators often feel they have no choice but to assert that their data are “FAIR”, although their assertion comes more from desperation than knowledge and skill.

CODATA: Committee on Data, International Science Council

But there are two “meta principles” that can help to demystify the confusion about how to make data FAIR.

First, half of the FAIR Guiding Principles define desirable properties of the repository where the data are placed: They are therefore not anything for scientists to worry about. Although the characteristics of the repository are important, they are almost never anything over which an investigator has direct control. Thus, for all practical purposes, scientists can ignore these archive-dependent determinants of FAIRness and leave them to the people who manage data repositories.

Second, the one thing over which investigators do have control is their metadata. The Guiding Principles make clear that datasets become FAIR when scientists annotate their data with “rich” metadata that adhere to the correct community standards. This matter naturally begs the question of how do investigators know if there even is a community standard to which their metadata should adhere, let alone how can they make sure that they are using such standards in the correct way?

Investigators can author metadata in a standards-adherent way by filling in a template that the community itself has created to capture the necessary standards—enumerating the things that scientists need to say about their data so that a third party can make sense of (1) the experiment that was conducted and (2) what the data represent. Because the relevant community standards are embedded in the metadata template, filling in the template ensures that the standards-dependent FAIR Guiding Principles are met.

Our work on metadata templates is taking place at a time when many groups of scientists—as well as funders and publishers—are searching for technology that they can simply point at an online repository to indicate whether the datasets are FAIR. There are now dozens of tools that purport to measure data FAIRness, and many more on the way. Unfortunately, inter-rater reliability among these tools is poor, and none of the tools has developed much of a following. The problem is often compounded by software engineers who believe that throwing even more technology at the automation of FAIR evaluation will correct all the limitations of previous approaches. Unfortunately, the whole enterprise of trying to programmatically evaluate datasets for FAIRness is doomed because no computer on its own can ever know (1) what community standards might be appropriate, (2) whether metadata are sufficiently “rich” to enable other investigators to find relevant datasets, (3) whether the data really are interoperable with other datasets, and (4) whether the experimental context is described sufficiently to re-use the data in an appropriate manner. The answers to all these questions are not a matter of computation, but a matter of how a scientific community advocates description of the standard types of experiments that it performs.

Nevertheless, still hoping to obtain a computational solution to the problem, the Research on Research Institute (RoRI) issued a request for proposals in 2020 hoping that someone could come up with a better tool. The RoRI RFP was unusual, however, because it encouraged proposers to suggest alternative approaches should they have what they think is a better idea. When my team applied for funding for what is now the FAIRware Workbench (described in our paper), we asserted that the world does not need yet another FAIR evaluator that attempts to address the problem purely through brute-force computation. Obviously, the reviewers agreed with us. (The FAIRware Workbench continues to be developed with support from the U.S. National Library of Medicine and other sources.)

The ultimate solution to the challenge of ensuring that data are FAIR rests with individual scientific communities, who need to take responsibility for the necessary metadata standards. The biologists and the earth scientists have made great strides in this area, but other groups of scientists need to follow their lead. More important, funders need to provide support for the development, promotion, and adoption of these kinds of standards, and for the infrastructure required to make it easy to put these standards to use in the course of routine data stewardship.

In the end, standards development is inherently time-consuming and tedious—and thus expensive—even in the setting of the Metadata for Machines workshops initiated by the GO FAIR Foundation to streamline this process. If the scientific community is serious about the need for FAIR data, then it has to expand its commitment to the work required to standardize the reporting guidelines and ontologies that allow data to comport with the FAIR Guiding Principles in the first place. More important, funders cannot simply demand that data be FAIR; they need to underwrite the requisite standards-development work, and to provide the tools that will make the use of community metadata standards a natural and inevitable component of everyday data management. It will take more than just funding to effect this change. It will take a major shift in the culture of science so that investigators can come to recognize that FAIR datasets are just as important an endpoint of research as are high-impact journal publications—and that putting a dataset online with inadequate metadata is worse than submitting a journal article with typos and missing references.

Mark Musen (He/Him)

Professor, Stanford University

Mark is Stanford Medicine Professor of Biomedical Informatics Research at Stanford University, where he is Director of the Stanford Center for Biomedical Informatics Research. Mark conducts research related to open science, intelligent systems, computational ontologies, and biomedical decision support. His group developed Protégé, the world’s most widely used technology for building and managing terminologies and ontologies. He served as principal investigator of the National Center for Biomedical Ontology, one of the original National Centers for Biomedical Computing created by the U.S. National Institutes of Heath. He directs the Center for Expanded Data Annotation and Retrieval (CEDAR), founded under the NIH Big Data to Knowledge Initiative. CEDAR develops semantic technology to ease the authoring of experimental metadata and the management of scientific datasets.

Please sign in or register for FREE

If you are a registered user on Research Communities by Springer Nature, please sign in

Follow the Topic

Research Data

Research Communities > Community > Research Data

Scientific Data

Scientific Data

A peer-reviewed, open-access journal for descriptions of datasets, and research that advances the sharing and reuse of scientific data.

More about the journal

Related Collections

With Collections, you can get published faster and increase your visibility.

Computer vision in plant science and agriculture

This Scientific Data Collection invites Data Descriptors documenting the generation, curation, and validation of datasets that underpin computer vision applications across plant biology, crop science, and agricultural systems.

Publishing Model: Open Access

Deadline: Oct 10, 2026

Explore this Collection

Wearable and Computer Vision Data for Health and Behaviour Research

This Scientific Data collection of articles focuses on data from wearable and non-wearable devices, including data from devices that monitor health and computer vision data.

Publishing Model: Open Access

Deadline: Aug 08, 2026

Explore this Collection

Latest Content

"Guide for longevity and flexibility in interior architecture"

News and Opinion, Life in Research, Empower Your Research

"A Special Equation between Visual Lightness and the Luxurious Impression of Interior Architecture through the New Classic"

"Sustainable Development Interaction between Humans and the Interior Architecture"

Can one target do both?

What can we image?

Cookies

We use cookies to ensure the functionality of our website, to personalize content and advertising, to provide social media features, and to analyze our traffic. If you allow us to do so, we also inform our social media, advertising and analysis partners about your use of our website. You can decide for yourself which categories you want to deny or allow. Please note that based on your settings not all functionalities of the site are available.

Further information can be found in our privacy policy.

Metadata templates demystify the FAIR principles

Share this post

Share with...

...or copy the link