Behind the Paper

Using Annotations to Link Metadata to Ontologies

Connecting ad hoc units with units in an ontology is a first step towards using metadata annotations to enhance the interoperability and reusability of environmental data.

Making environmental data FAIR (Findable, Accessible, Interoperable and Reusable; Wilkinson et al. 2016) has been a long-term challenge. The wide diversity of the environmental sciences, which encompass the physical, chemical, biological and ecological research domains, poses special challenges for Interoperability and Reusability. Interoperability is enhanced by standard metadata and vocabularies, and reusability depends on accurate descriptions of widely varying data attributes. Units of measurement are one aspect of metadata that is required to interpret and use data. But as discussed in Hanisch et al. (2022), there have been substantial challenges in making units consistent and machine readable.

A collaboration among the Environmental Data Initiative (EDI), Long-Term Ecological Research Network (LTER), National Ecological Observatory Network (NEON), and DataONE is linking the ad hoc units found in their respective data contributions to a community vocabulary (QUDT.org). Our article describes the process of linking ad hoc, diverse unit descriptions from existing environmental metadata. We were able to match 91% of 355,057 uses of units in metadata with units in QUDT, and produced a lookup table that maps ad hoc units to QUDT units. Here we do not recount that mapping process, but instead focus on the utility of unit annotations added to existing metadata.

In the metadata there were often many different ad hoc units that mapped to the same underlying concept. For example, “DEG_C” in QUDT was mapped to at least 17 different unit representations (ignoring letter case) in the metadata we examined. Most were listed as “celsius”, but thousands of others used other variants, including some common misspellings (e.g., “celcius”).
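The many-to-one mapping described above can be sketched as a case-insensitive dictionary lookup. This is only an illustration: the two variants shown are the ones named in the text, while the `LOOKUP` structure and the `to_qudt` function are hypothetical and do not reproduce the published lookup table.

```python
# Hypothetical excerpt of an ad hoc -> QUDT mapping. Only the two variants
# mentioned in the text are included; the real table has many more rows.
LOOKUP = {
    "celsius": "DEG_C",
    "celcius": "DEG_C",  # common misspelling
}


def to_qudt(ad_hoc_unit):
    """Return the QUDT unit code for an ad hoc unit string, or None if unmapped."""
    # Trim whitespace and ignore letter case before looking the unit up.
    key = ad_hoc_unit.strip().casefold()
    return LOOKUP.get(key)


for raw in ["Celsius", " CELCIUS ", "furlong"]:
    print(repr(raw), "->", to_qudt(raw))
```

Units that are not in the table come back as `None`, so unmatched uses can be counted or reviewed by hand rather than silently dropped.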

Much of the existing metadata in EDI, LTER, NEON, and DataONE uses the Ecological Metadata Language (EML) schema for metadata. This machine-readable schema facilitates multiple representations of the metadata from human-readable web pages to generated code for processing the data. The latest version of the schema supports “annotations,” with which metadata can include references in a computer-interpretable form, that can unambiguously identify terms or concepts  and capture relationships among metadata elements and external resources that provide more details on the entity.

For units, a typical annotation in EML metadata describing a column containing precipitation data in millimeters would be: 

<annotation>
<propertyURI label="has unit">http://qudt.org/schema/qudt/hasUnit</propertyURI>
<valueURI label="Millimeter">http://qudt.org/vocab/unit/MilliM</valueURI>
</annotation>

This flexible structure allows an annotation to express any sort of relationship (here, “has unit”) and value (here, “MilliM” in the QUDT ontology), and its placement within a metadata document associates it with a particular attribute or column of data. Together, the annotated attribute, the relationship and the value form a Resource Description Framework (RDF) triple, the basic building block of Linked Open Data and the semantic web. We (as humans) can also follow the URIs back to the QUDT ontology to view additional information about the dimensions, multipliers to SI units, and links to additional standards.

Top of the QUDT web page for MilliM
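To show how a program might read the annotation above as an RDF triple, here is a minimal sketch using Python's standard XML parser. The subject identifier `#precipitation_mm` and the function `annotation_to_triple` are hypothetical; in a real metadata document the subject comes from where the annotation is placed.

```python
import xml.etree.ElementTree as ET

# The unit annotation from the example above.
ANNOTATION = """<annotation>
<propertyURI label="has unit">http://qudt.org/schema/qudt/hasUnit</propertyURI>
<valueURI label="Millimeter">http://qudt.org/vocab/unit/MilliM</valueURI>
</annotation>"""


def annotation_to_triple(xml_text, subject):
    """Return (subject, predicate, object) for one annotation element."""
    root = ET.fromstring(xml_text)
    predicate = root.findtext("propertyURI").strip()
    obj = root.findtext("valueURI").strip()
    return (subject, predicate, obj)


# "#precipitation_mm" is a made-up identifier for the annotated data column.
triple = annotation_to_triple(ANNOTATION, "#precipitation_mm")
print(triple)
```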

Given the power of annotations, we are now working on getting more unit annotations integrated into ecological metadata in the EDI repository. The lookup table described in the article is publicly available in a dataset, along with the raw metadata and code (Porter et al. 2022). This table can easily be ingested by web and statistical programs to automate annotation of metadata.
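As a rough sketch of such automation, the snippet below reads a small lookup table and generates a unit annotation for an ad hoc unit. The column names (`ad_hoc_unit`, `qudt_code`, `qudt_label`), the sample rows and the function `build_annotation` are assumptions made for illustration; the published table may be organized differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical lookup table contents; column names and rows are assumptions.
TABLE_CSV = """ad_hoc_unit,qudt_code,qudt_label
millimeter,MilliM,Millimeter
mm,MilliM,Millimeter
"""

HAS_UNIT = "http://qudt.org/schema/qudt/hasUnit"
UNIT_BASE = "http://qudt.org/vocab/unit/"

# Index the rows by ad hoc unit, ignoring letter case.
rows = {r["ad_hoc_unit"].casefold(): r for r in csv.DictReader(io.StringIO(TABLE_CSV))}


def build_annotation(ad_hoc_unit):
    """Return annotation XML for an ad hoc unit, or None if it is unmapped."""
    row = rows.get(ad_hoc_unit.strip().casefold())
    if row is None:
        return None
    annotation = ET.Element("annotation")
    prop = ET.SubElement(annotation, "propertyURI", label="has unit")
    prop.text = HAS_UNIT
    value = ET.SubElement(annotation, "valueURI", label=row["qudt_label"])
    value.text = UNIT_BASE + row["qudt_code"]
    return ET.tostring(annotation, encoding="unicode")


print(build_annotation("MM"))
```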

In some cases, knowledge of the unit alone is sufficient to automate unit conversions. For example, if one dataset attribute has units in grams and another in kilograms, reference to QUDT will provide the multipliers needed to convert them to the same SI unit. However, clearly describing a measurement requires more than just the unit definition. In one dataset the unit might refer to “grams of carbon per square meter,” in another to “grams of nitrogen per square meter.” So a measurement consists of both the unit (grams per square meter) and the entity or entities (carbon, nitrogen) the unit applies to. In our article, we focused on the “unit” portion of the measurement, a necessary first step towards a more comprehensive set of annotations that more fully describe a measurement. In addition to describing the entities to which a unit applies, further annotations can capture important context, such as whether a carbon measurement was made in soil or water. Adding these annotations will allow us to strengthen the interoperability and reusability of data. Our goal is to reach a point where fully automated integration of diverse data becomes routine. Adding unit annotations is a solid and necessary first step, but it is only the start of that larger process.
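The gram and kilogram case mentioned above can be sketched in a few lines. The multipliers are hard-coded here as an assumption about what would be retrieved from QUDT for the units `GM` and `KiloGM`; a real workflow would look them up from the ontology rather than typing them in, and the function `convert` is hypothetical.

```python
# Multipliers to the SI base unit of mass (kilogram). These values are
# assumed for illustration; in practice they would be read from QUDT.
TO_SI = {
    "GM": 0.001,    # 1 gram = 0.001 kilogram
    "KiloGM": 1.0,  # the kilogram is the SI base unit
}


def convert(value, from_unit, to_unit):
    """Convert a value between two units that share the same SI base unit."""
    # Go through the SI unit: scale into kilograms, then out to the target.
    return value * TO_SI[from_unit] / TO_SI[to_unit]


print(convert(2500.0, "GM", "KiloGM"))
```

This only works when both units measure the same kind of quantity, and it says nothing about whether the grams are of carbon or of nitrogen, which is why the additional annotations discussed above are needed.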

Below is a diagram of RDF triples that can be created with metadata annotations. The example makes use of other ontologies in addition to QUDT: ChEBI (Chemical Entities of Biological Interest) and EnvO (Environment Ontology).

Enhancing metadata with such annotations will facilitate the automated or semi-automated integration of diverse datasets, increasing their interoperability and reusability and making them more FAIR.

Co-Authors Margaret O'Brien, Marina Frants, Stevan Earl, Mary Martin and Christine M. Laney all contributed to this post.