MDS-Onto: A Community-Driven Effort to Standardize Terminologies in Materials and Data Sciences

Published in Research Data, Mathematics, and Statistics

Scientific data is often messy. Poor documentation practices, a lack of data management policies, datasets that are not machine-readable, and isolated user-created variables undermine reproducibility, transparency, and efficiency in research. When data is poorly documented, lacks essential metadata, or does not follow community standards, exchanging and sharing information becomes difficult, and progress is slow and unreliable.

In this work, we introduce the Materials Data Science Ontology (MDS-Onto) framework, a community-driven initiative comprising an ensemble of user-friendly tools for developing the MDS-Onto ontology (Figure 1) for data FAIRification and semantic reasoning.
‘Ontologies’ are standards that share vocabulary and relationships across domains, thereby facilitating data exchange and the interoperability of datasets, data analyses, and models trained on data. This definition raises a question: if data is usually domain specific, how do we create interoperable terms? By building on ontologies established by the International Organization for Standardization (ISO), we align terms to existing standards and keep labeling consistent, which makes data science workflows more streamlined and efficient.

MDS-Onto is a community effort built on collaborations and partnerships with industry, academia, and national laboratories. Our core development team is located at the SDLE Research Center in the Department of Materials Science and Engineering at Case Western Reserve University, and our network of domain experts spans a breadth of institutions.

Materials Data Science Ontology (MDS-Onto)

MDS-Onto is a low-level, modular ontology for the domains of Materials and Data Science. We modularized MDS-Onto to simplify term alignment, which can be challenging depending on the alignment level and the user’s experience in ontology development. A modular ontology simply means that we map our terms to MDS-Onto Concepts that were previously mapped to other mid-level ontologies, such as PMDco. For instance, to map an instrument model variable, one can map the model to mds-tool (concept layer), which is a subclass of pmd:ProcessingNode from PMDco.
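
The layered mapping described above can be pictured as a chain of subclass links. The following is a minimal sketch, not MDS-Onto tooling: the term name "instrument-model" is hypothetical, and only mds-tool and pmd:ProcessingNode come from the text. It shows how a term mapped once to an MDS-Onto Concept inherits that concept's alignment to the mid-level ontology.

```python
# Each key is aligned to (treated as a subclass of) its value.
# "instrument-model" is a hypothetical domain term used for illustration;
# mds-tool and pmd:ProcessingNode are the names given in the text.
ALIGNMENTS = {
    "instrument-model": "mds-tool",
    "mds-tool": "pmd:ProcessingNode",
}


def alignment_chain(term, alignments):
    """Return every concept reachable from `term` by following alignments."""
    chain = []
    seen = {term}
    while term in alignments:
        term = alignments[term]
        if term in seen:
            raise ValueError(f"cyclic alignment at {term!r}")
        seen.add(term)
        chain.append(term)
    return chain


print(alignment_chain("instrument-model", ALIGNMENTS))
# ['mds-tool', 'pmd:ProcessingNode']
```

The domain expert only supplies the first link; the link from the concept layer to PMDco already exists, which is the practical benefit of the modular design.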

We recommend that variables at the sub-domain level be created following Research Data Alliance (RDA) recommendations for the domain or application field. When a domain does not fit into an existing MDS-Onto Concept category, additional MDS-Onto Concepts can be created and domain or sub-domain ontologies incorporated into MDS-Onto. The MDS-Onto core development team then maps the new ontology to existing interoperable mid/top-level ontologies.

Figure 1. The Materials Data Science Ontology and its relationship to other ontologies and the Semantic Web.

MDS-Onto Tools: MDS-Onto FindTheDocs, FAIRmaterials, and FAIRLinked

Our MDS-Onto Framework has three main components in addition to the MDS-Onto Ontology. The first is FAIRmaterials, a bilingual (R/Python) software package for ontology creation, visualization, and documentation with a simple interface based on a .csv template. Users populate the .csv file with domain/subdomain terms, map these directly to mds: (or to mid-level ontologies), and run FAIRmaterials, which generates ontology files (.ttl, .owl), an image for visualization, and an .html webpage of ontology documentation. The second component is FAIRlinked, a Python package that uses MDS-Onto to translate .csv data into FAIRified .jsonld linked data. The third component, MDS-Onto FindTheDocs, is a website that offers ontology visualization through the WebVOWL graph exploration tool, .jsonld validation through the JSON-LD Playground, and the full MDS-Onto documentation. MDS-Onto FindTheDocs is also where users can download the up-to-date MDS-Onto Ontology files. A snapshot of MDS-Onto FindTheDocs can be seen in Figure 2. Figure 3 illustrates how FAIRlinked uses the MDS-Onto Ontology and raw data to create .jsonld linked data.
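
To make the template-to-ontology idea concrete, here is a minimal sketch of turning a .csv of terms into Turtle (.ttl) text. It does not use FAIRmaterials and does not reproduce its actual template or API: the column names (term, parent, definition), the class names, and the mds: and pmd: namespace IRIs are all placeholders chosen for illustration.

```python
import csv
import io

# Hypothetical template: one row per term, with its parent class and definition.
TEMPLATE = """term,parent,definition
mds:InstrumentModel,mds:Tool,"Model designation of an instrument"
mds:Tool,pmd:ProcessingNode,"Device used to process or measure a sample"
"""

# The rdfs, owl, and skos IRIs are the standard ones;
# the mds and pmd IRIs are placeholders, not the published namespaces.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "mds": "https://example.org/mds#",
    "pmd": "https://example.org/pmdco#",
}


def escape_literal(text):
    """Escape characters that cannot appear raw in a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def template_to_turtle(csv_text):
    """Emit one owl:Class declaration per template row."""
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    for row in csv.DictReader(io.StringIO(csv_text)):
        definition = escape_literal(row["definition"])
        lines.append("")
        lines.append(f"{row['term']} a owl:Class ;")
        lines.append(f"    rdfs:subClassOf {row['parent']} ;")
        lines.append(f'    skos:definition "{definition}" .')
    return "\n".join(lines) + "\n"


print(template_to_turtle(TEMPLATE))
```

The appeal of a spreadsheet-style template is that domain experts can contribute terms without writing RDF by hand; serialization is left to the tool.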

Figure 2. MDS-Onto FindTheDocs website containing the ontology files, visualization and validation tools and ontology documentation.  

We created ontologies, so what? 

Now we have several domain and sub-domain ontologies that describe unified knowledge and vocabulary in particular domains as terms and relationships, as illustrated in Figure 1. How can we make use of those ontologies beyond being tools for terminology guidance? How can we integrate ontologies to guide FAIR data creation and automated scientific analysis workflows? 

The answer is FAIRlinked, our most recent package, which was briefly introduced in the previous section. FAIRlinked was designed to fill the gap between ontology development and the implementation of FAIR principles. Its basic approach is to take ontology files with interoperable terms and relationships from MDS-Onto and create templates to be populated with raw data. These are then serialized in a second interaction to create JSON-LD files. JSON-LD is a standard data format and a W3C recommendation for linked data. By using the RDF Data Cube vocabulary with the measure dimension approach, users can choose between creating JSON-LD for an entire dataframe as a single instance and creating one JSON-LD file per row. The choice depends on how the study object and domain are organized and on what makes the most sense for that particular domain.
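
The difference between one JSON-LD document per dataframe and one per row can be sketched with the standard library alone. This is an illustration under stated assumptions, not FAIRlinked's implementation: the qb: namespace is the real RDF Data Cube namespace, but the mds: IRI, the column names, the dataset identifier, and the function names are hypothetical, and the measure dimension structure itself is not modeled here.

```python
import json

CONTEXT = {
    "qb": "http://purl.org/linked-data/cube#",  # RDF Data Cube namespace
    "mds": "https://example.org/mds#",          # placeholder, not the real IRI
}


def observation(row, dataset_id, index):
    """Describe one data row as a qb:Observation linked to its dataset."""
    node = {
        "@id": f"{dataset_id}/obs/{index}",
        "@type": "qb:Observation",
        "qb:dataSet": {"@id": dataset_id},
    }
    for column, value in row.items():
        node[f"mds:{column}"] = value  # assumes column names are ontology terms
    return node


def to_jsonld(rows, dataset_id, per_row=False):
    """Return a list of JSON-LD documents: one per row, or one for all rows."""
    observations = [observation(r, dataset_id, i) for i, r in enumerate(rows)]
    if per_row:
        return [{"@context": CONTEXT, **obs} for obs in observations]
    dataset = {"@id": dataset_id, "@type": "qb:DataSet"}
    return [{"@context": CONTEXT, "@graph": [dataset, *observations]}]


rows = [
    {"SampleId": "PV-001", "MaxPower": 312.4},
    {"SampleId": "PV-002", "MaxPower": 308.9},
]
dataset_id = "https://example.org/dataset/demo"  # hypothetical identifier

single = to_jsonld(rows, dataset_id)               # one document, three nodes
split = to_jsonld(rows, dataset_id, per_row=True)  # two documents, one per row
print(json.dumps(split[0], indent=2))
```

In both modes each observation keeps its link to the dataset, so per-row files can still be regrouped later.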

Figure 3. Workflow of the FAIRlinked package, using MDS-Onto and initial data, to create FAIRified .jsonld linked data. 

FAIRlinked creates JSON-LD files with parseable filenames that are globally unique. The convention and order of the filename components depend on community preference, standards, and relevance for the domain. All metadata is stored under keys in the .jsonld files, so in theory we do not need metadata in the file name. However, to meet the unique identifier requirement of the Findable principle, the file name should use hashes or Universally Unique Identifiers (UUIDs). Such file names would resemble 24d470987fda1278c63c3j78jb30869b8218c64f.jsonld, which is neither user friendly nor easily interpretable by a human reader.

An alternative way to meet the “Findable” principle of FAIR is to design more human-friendly parseable file names. This is the approach we adopt: our parseable file names start with the researcher's Open Researcher and Contributor ID (ORCID). For photovoltaic modules, for example, where the study object is the module ID, the file name convention adopted is orcid-sampleID-timestamp.json.
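
A parseable file name is useful only if it can be split back into its parts. Because an ORCID iD itself contains hyphens, a plain split on "-" is not enough. The sketch below is one possible way to build and parse such names, assuming a compact UTC timestamp format and accepting either a .json or .jsonld extension; the function names, the timestamp format, and the sample ID are illustrative choices, not FAIRlinked's convention.

```python
import re
from datetime import datetime, timezone

TS_FORMAT = "%Y%m%dT%H%M%SZ"  # assumed timestamp layout, e.g. 20240501T123000Z

# The ORCID iD is anchored at the start and the timestamp at the end,
# so the sample ID in the middle may itself contain hyphens.
FILENAME_RE = re.compile(
    r"^(?P<orcid>\d{4}-\d{4}-\d{4}-\d{3}[\dX])"
    r"-(?P<sample_id>.+)"
    r"-(?P<timestamp>\d{8}T\d{6}Z)"
    r"\.(?:jsonld|json)$"
)


def build_filename(orcid, sample_id, when, extension=".jsonld"):
    """Join the three components into orcid-sampleID-timestamp<extension>."""
    stamp = when.astimezone(timezone.utc).strftime(TS_FORMAT)
    return f"{orcid}-{sample_id}-{stamp}{extension}"


def parse_filename(name):
    """Split a file name back into its ORCID iD, sample ID, and timestamp."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a parseable file name: {name!r}")
    parts = match.groupdict()
    parts["timestamp"] = datetime.strptime(
        parts["timestamp"], TS_FORMAT
    ).replace(tzinfo=timezone.utc)
    return parts


# 0000-0002-1825-0097 is the example iD used in ORCID's own documentation;
# the sample ID is made up.
name = build_filename(
    "0000-0002-1825-0097",
    "PV-module-17",
    datetime(2024, 5, 1, 12, 30, 0, tzinfo=timezone.utc),
)
print(name)
print(parse_filename(name)["sample_id"])
```

Uniqueness in this scheme rests on the combination of researcher, sample, and time, so the timestamp resolution should be fine enough that one researcher never writes two files for the same sample within the same tick.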

Once data and metadata are all stored in .jsonld linked data files that are consistent throughout a domain, it becomes easier, quicker, and more efficient to write reusable scripts and workflows that extract, analyze, and model information.
