A materials terminology knowledge graph automatically constructed from text corpus

Machine learning (ML) and artificial intelligence (AI) have demonstrated significant potential in accelerating the discovery of new materials. Materials data are the foundation of ML and AI applications, which has led to the establishment of numerous materials data infrastructures for collecting, hosting, providing, and analyzing such data. Advances in computational science have helped turn these infrastructures into materials discovery platforms. However, existing materials data infrastructures often follow their own material classification schemes, which makes it difficult to align and integrate data across infrastructures. A unified, scalable, and reusable materials knowledge representation is therefore becoming increasingly important for efficient data sharing.
A knowledge graph (KG), a formal collection of term entities and the relationships between them, is well suited to this goal. With the rapid development of AI technologies, the construction of materials terminology KGs has become increasingly automated. Semi-automatic and automatic extraction methods based on natural language processing (NLP) can extract terms and vocabulary from natural language texts quickly and at scale, significantly reducing the difficulty of KG construction. Recent studies have demonstrated pipelines that use NLP techniques to automatically extract data on organic and inorganic compounds and alloys from articles in chemistry and materials science.
In this work, we present the Materials Genome Engineering Database Knowledge Graph (MGED-KG). We developed an NLP-based approach to construct MGED-KG, involving corpus preprocessing, named entity recognition (NER), and relationship construction among entities. MGED-KG enhances data sharing efficiency through query expansion, term recommendation, and data recommendation, and is integrated into our National Materials Data Management and Service Platform (NMDMS) to improve data retrieval and discovery.
Fig. 1 The semantic workflow of materials terminology extraction, construction of MGED-KG and its application.
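The three stages named above (corpus preprocessing, NER, and relationship construction) can be illustrated with a toy sketch. This is not the authors' pipeline: a simple dictionary lookup stands in for a trained NER model, the glossary entries are invented for illustration, and the rule "link a term to every other term mentioned in its explanation" is an assumption made for this example.

```python
import re

# Hypothetical mini-glossary: term -> explanation (not taken from MGED-KG).
GLOSSARY = {
    "ferrovanadium": "An iron alloy containing vanadium, used as an additive in steel.",
    "steel": "An alloy of iron and carbon.",
    "vanadium": "A transition metal used to harden steel.",
}

def preprocess(text):
    """Lowercase and collapse whitespace (a stand-in for corpus preprocessing)."""
    return re.sub(r"\s+", " ", text.strip().lower())

def recognize_terms(text, vocabulary):
    """Dictionary-based matching, a simple substitute for a trained NER model."""
    found = []
    for term in vocabulary:
        # Word boundaries keep "vanadium" from matching inside "ferrovanadium".
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.append(term)
    return sorted(found)

def build_relations(glossary):
    """Link term A to term B when B is mentioned in A's explanation (assumed rule)."""
    relations = set()
    for term, explanation in glossary.items():
        for other in recognize_terms(preprocess(explanation), glossary):
            if other != term:
                relations.add((term, "mentions", other))
    return sorted(relations)

RELATIONS = build_relations(GLOSSARY)
```

With this glossary, "ferrovanadium" ends up linked to both "steel" and "vanadium". A real pipeline would replace the dictionary lookup with a model that can also recognize terms it has not seen before.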
MGED-KG is the most comprehensive knowledge graph of materials terminology in both Chinese and English, containing 8,660 terms and their explanations. The terminology is classified hierarchically in three levels, with a total of 235 category labels. The first level comprises 11 major categories: fundamentals of materials science, metals, inorganic non-metallic materials, organic polymer materials, composites, information materials, energy materials, biomedical materials, natural materials and their products, functional materials, and nanomaterials. The second and third levels are progressively more specific. For example, the term “ferrovanadium” has the category label “metals -> steels -> iron”. Figure 2 shows the hierarchical classification structure of MGED-KG, with the first two levels of categories and the number of third-level categories below each of them.
Fig. 2 The visualization of MGED-KG in hierarchical structure with first two levels of categories and the number of third level categories below them.
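As a rough illustration of the three-level labels, the sketch below parses labels written in the "metals -> steels -> iron" form into a nested hierarchy and counts the categories at each level. Only the first label comes from the text; the other labels and all function names are hypothetical.

```python
def parse_label(label):
    """Split a label such as 'metals -> steels -> iron' into its levels."""
    return [part.strip() for part in label.split("->")]

def build_tree(labels):
    """Nest category paths into a dict-of-dicts hierarchy."""
    tree = {}
    for label in labels:
        node = tree
        for level in parse_label(label):
            node = node.setdefault(level, {})
    return tree

def count_at_depth(tree, depth):
    """Count categories at a given level (1 = first level)."""
    if depth == 1:
        return len(tree)
    return sum(count_at_depth(child, depth - 1) for child in tree.values())

# Only the first label appears in the text; the rest are invented for illustration.
LABELS = [
    "metals -> steels -> iron",
    "metals -> steels -> alloy steel",
    "metals -> nonferrous metals -> aluminium",
    "composites -> metal matrix composites -> reinforcements",
]
TREE = build_tree(LABELS)
```

Calling count_at_depth(TREE, 2) on this toy input counts the second-level categories across all first-level ones, which is the kind of summary Figure 2 presents for the full graph.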
We also developed an RDF-based ontology for MGED-KG to facilitate cross-domain interoperability. Figure 3 provides an overview of the structure of MGED-KG. The left part shows the ontology schema, and the right part illustrates some instances of terms and categories. Our work establishes a unified semantic knowledge graph foundation for materials, accelerating data sharing and integration and advancing data-driven materials research.
Fig. 3 MGED-KG Material Knowledge Graph Structure Diagram.
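To make the schema/instance split concrete, here is a minimal sketch that describes one term as subject-predicate-object triples and serializes them as N-Triples lines. The example.org namespace and the names Term, hasCategory, and subCategoryOf are hypothetical placeholders, not the vocabulary of the published ontology. Only rdf:type and rdfs:label are standard IRIs.

```python
# Hypothetical namespace and property names; the published ontology may differ.
EX = "http://example.org/mgedkg/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def iri(local_name):
    """Build an IRI in the placeholder namespace."""
    return EX + local_name.replace(" ", "_")

def term_triples(name, category_path):
    """Describe one term: its type, its label, and its deepest category."""
    subject = iri(name)
    triples = [
        (subject, RDF_TYPE, iri("Term")),
        (subject, RDFS_LABEL, name),
        (subject, iri("hasCategory"), iri(category_path[-1])),
    ]
    # Chain each category to its parent level.
    for child, parent in zip(category_path[1:], category_path[:-1]):
        triples.append((iri(child), iri("subCategoryOf"), iri(parent)))
    return triples

def to_ntriples(triples):
    """Serialize as N-Triples. Objects starting with 'http' are treated as IRIs,
    everything else as a plain literal (a simplification for this sketch)."""
    lines = []
    for s, p, o in triples:
        obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

TRIPLES = term_triples("ferrovanadium", ["metals", "steels", "iron"])
```

Plain tuples are used here so the example has no dependencies. In practice an RDF library would handle IRIs, literals, and serialization more robustly.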
MGED-KG is distributed in digital form through an online platform. We used the Django framework (https://www.djangoproject.com/), a scalable Python web framework for quickly developing and deploying complex web applications, to build the MGED-KG system (http://mged.nmdms.ustb.edu.cn/MGEDKG/) for data standardization and terminology sharing in the materials community. The MGED-KG system includes functional modules for the term catalog, term retrieval, term recommendation, and data recommendation linked to the NMDMS database.
Because it represents terminology as a semantic knowledge graph, MGED-KG makes materials terms easier to reuse and makes effective collaboration more likely. Engineers, researchers, and other professionals in different materials domains can quickly transform knowledge into the format they need and use it within intelligent systems. To demonstrate its capability, MGED-KG was successfully applied to three scenarios: query expansion, term recommendation, and data recommendation.
Fig. 4 An example of the term catalog. The left section of the page shows the three-level hierarchical categories, and the right section lists the 20 material terms classified under the metal category “Iron”.
Fig. 5 An example of a term page, containing the Chinese and English names (the latter in parentheses) of the term, its explanation, and its classification. The right section lists related terms that are associated with the current term through the term relations in MGED-KG. A language switch button for “Explanation” offers term explanations in Chinese and English. Clicking an underlined term in the explanation takes the user directly to the detail page for that term.
The query expansion function significantly enhances the user experience when retrieving material terms by automatically completing and standardizing user input. The term recommendation function leverages the connections and associations within the knowledge graph to quickly and accurately infer terms related to the user's input, thereby providing a more comprehensive and precise list of relevant terms. This helps users better understand and access knowledge in the materials domain. Additionally, the MGED-KG system is integrated into the NMDMS platform, where it establishes correlations between data instances to achieve data recommendation, thereby accelerating the discovery of valuable data.
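The sketch below shows one simple way such functions could work: query expansion as normalization plus prefix completion against the vocabulary, and term recommendation as a breadth-first search over term relations. It is an illustration under assumed rules, not the implementation used in the MGED-KG system, and the edges are invented.

```python
from collections import deque

# Hypothetical undirected term relations (not taken from MGED-KG).
EDGES = [
    ("ferrovanadium", "vanadium"),
    ("ferrovanadium", "steel"),
    ("steel", "stainless steel"),
    ("stainless steel", "chromium"),
]

def build_graph(edges):
    """Build an adjacency map from undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def expand_query(user_input, vocabulary):
    """Normalize the input and complete it to every term it is a prefix of."""
    prefix = " ".join(user_input.lower().split())
    return sorted(t for t in vocabulary if t.startswith(prefix))

def recommend(term, graph, max_hops=2):
    """Breadth-first search: related terms ordered by hop distance, then name."""
    distance = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in sorted(graph.get(current, ())):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    related = [t for t in distance if t != term]
    return sorted(related, key=lambda t: (distance[t], t))

GRAPH = build_graph(EDGES)
```

Here expand_query("  STEE ", GRAPH) normalizes the input before completing it, and recommend("ferrovanadium", GRAPH) ranks directly linked terms ahead of terms two hops away. The max_hops limit is a design choice that trades recall for relevance.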
With the continuous development of large model technology, the application prospects of the MGED-KG system in the field of materials science will become even broader. The powerful semantic understanding and generation capabilities of large models provide new methods and tools for the construction and optimization of knowledge graphs, enhancing the MGED-KG system's capabilities in materials knowledge discovery and reasoning. By integrating large models, the MGED-KG system can achieve more intelligent term and data recommendations, further reducing the difficulty of reusing material terms and improving cross-domain collaboration efficiency. For example, natural language processing technologies based on large models can help the system more accurately understand user queries and provide more relevant term and data recommendations. Moreover, the adaptive learning capabilities of large models can continuously optimize the performance of the MGED-KG system, enhancing its applicability in different scenarios. The MGED-KG system is expected to play a key role in addressing more complex material science problems, promoting knowledge sharing and innovation, and facilitating the deep integration of scientific research and engineering practices.
For details, see the following articles:
Zhang, Y., Chen, F., Liu, Z. et al. A materials terminology knowledge graph automatically constructed from text corpus. Scientific Data 11, 600 (2024).
Gong, H. et al. A repository for the publication and sharing of heterogeneous materials data. Scientific Data 9, 787 (2022).