The explosion of scientific literature has amassed a vast amount of invaluable knowledge and data. Text mining, as the key to unlocking this treasure, provides an ever-flowing wellspring of vitality for data-driven materials design. We introduce an automated text-mining pipeline for extracting superalloy compositions, synthesis and processing routines, and properties. Unlike the large-scale corpora available in chemistry, the alloy literature is comparatively scarce. For a relatively small corpus that lacks high-quality annotations, how can we extract knowledge and data with minimal expert intervention while maintaining high precision and recall? That is the question we set out to answer in our study. The whole pipeline (Fig. 1) comprises several stages: scientific document download, preprocessing, table parsing, text classification, named entity recognition (NER), table and text relation extraction (RE), and interdependency resolution.
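To give a flavor of what the rule-based composition extraction stage does, here is a minimal sketch. The element list, the pattern, and the example sentence are our own illustration, not the actual rules used in the pipeline:

```python
import re

# Hypothetical pattern: an element symbol followed by a numeric amount,
# e.g. "Cr 15.2" in a sentence reporting wt.% values. Real rules would
# also handle formula strings like "Ni-15Cr-4.5Al", units, and balances.
ELEMENT = r"(?:Ni|Cr|Co|Al|Ti|Mo|W|Ta|Re|Ru|Hf|Nb|C|B|Zr)"
PAIR = re.compile(rf"({ELEMENT})\s*[- ]?\s*(\d+(?:\.\d+)?)")

def extract_composition(text):
    """Return {element: amount} pairs matched by the toy pattern."""
    return {el: float(val) for el, val in PAIR.findall(text)}

comp = extract_composition("The alloy contains Cr 15.2, Al 4.5 and Ta 6.0 (wt.%).")
# comp == {"Cr": 15.2, "Al": 4.5, "Ta": 6.0}
```

A real implementation would add unit normalization and cross-checking against tables, but the core idea is the same: high-precision patterns written once by a domain expert, applied across the whole corpus.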
Among these stages, rule-based NER and RE methods for compositions and properties are proposed to guarantee precision and recall on a small corpus. For synthesis and processing actions, we develop a semi-supervised recommendation algorithm (Fig. 2) for token-level actions and a multi-level bootstrapping algorithm for chunk-level actions, requiring only a limited amount of domain knowledge and human-machine interaction. Using this pipeline, we obtain a machine-learning-ready dataset of alloy chemical compositions, synthesis and processing actions, and γ′ size records from a corpus of 16,604 superalloy articles published up to 2022.
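The bootstrapping idea can be caricatured as a loop that alternates between inducing context patterns from known action words and harvesting new candidates that occur in those contexts. The sketch below is a toy version with invented seed terms and scoring, not the paper's actual multi-level algorithm:

```python
def bootstrap(sentences, seeds, rounds=2, min_count=1):
    """Toy bootstrapping: grow a lexicon of processing actions.
    Patterns here are simply 'the word that follows a known action'
    (e.g. 'annealed at' yields the context word 'at')."""
    lexicon = set(seeds)
    for _ in range(rounds):
        # 1. Induce context patterns from the current lexicon.
        contexts = {}
        for s in sentences:
            toks = s.lower().split()
            for i, t in enumerate(toks[:-1]):
                if t in lexicon:
                    contexts[toks[i + 1]] = contexts.get(toks[i + 1], 0) + 1
        patterns = {w for w, c in contexts.items() if c >= min_count}
        # 2. Harvest new candidates appearing before a pattern word.
        for s in sentences:
            toks = s.lower().split()
            for i, t in enumerate(toks[:-1]):
                if toks[i + 1] in patterns and t.isalpha():
                    lexicon.add(t)
    return lexicon

actions = bootstrap(
    ["samples were annealed at 1100C",
     "the alloy was aged at 850C before testing",
     "ingots were forged at high temperature"],
    seeds={"annealed"})
# 'aged' and 'forged' are pulled in via the shared '<action> at' context
```

In practice this is where the "limited human-machine interaction" comes in: a person vets the candidates each round, so the lexicon grows without drifting into noise.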
The data are further used to capture an explicitly expressed synthesis factor, providing valuable insights for the study of phase evolution in superalloys (Fig. 3). We have shown how knowledge buried in past literature can be extracted by text mining to provide actionable insights for materials discovery. As the scientific literature grows, NLP will inevitably become a promising tool for extracting and learning from published and unpublished work, delivering it in a format that is machine-readable and AI-usable.
In recent years, large language models (LLMs), such as GPT (Generative Pre-trained Transformer), have revolutionized the field of natural language processing. These models are trained on vast amounts of unannotated text and can then be fine-tuned for specific NLP tasks. Essentially, they form a "well-read" black box that interprets language at a high level and can perform a multitude of tasks within that language. GPT-4 additionally supports multi-modal input by integrating visual information. This new wave of technology could lead to a prosperous ecosystem of real-world applications built on LLMs. We argue that, by carefully tuning prompts, it is possible to exploit the emergent abilities of LLMs for regression, classification, and information extraction on a small corpus and attain higher accuracy and recall.
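For information extraction, prompt tuning mostly means engineering the instruction and the output schema. A minimal sketch of what such a prompt might look like follows; the wording, the JSON schema, and the example record are our illustration, not a tested recipe from the paper:

```python
import json

def build_extraction_prompt(passage):
    """Assemble a prompt asking an LLM to emit structured
    composition/property records as JSON (illustrative wording)."""
    schema_example = {
        "composition": {"Cr": 15.2, "Al": 4.5},
        "property": {"name": "gamma-prime size", "value": 0.45, "unit": "um"},
    }
    return (
        "Extract the alloy composition and property records "
        "from the passage below.\n"
        "Answer with JSON only, following this schema:\n"
        f"{json.dumps(schema_example)}\n\n"
        f"Passage: {passage}\nJSON:"
    )

prompt = build_extraction_prompt(
    "The Ni-based alloy with 15.2 wt.% Cr was aged at 850C.")
```

The returned string would be sent to an LLM API of choice; constraining the model to a fixed JSON schema is what makes the output machine-parseable and comparable against the rule-based pipeline.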
As we stand on the precipice of the age of artificial general intelligence, the potential for synergy between AI and materials science is vast and promising. Creating AI-powered assistants offers unprecedented opportunities to revolutionize the landscape of materials research by applying knowledge across various disciplines and efficiently handling labor-intensive, time-consuming tasks such as literature searches, compound screening, and data analysis.
You can read more about our work in our articles in npj Computational Materials: https://www.nature.com/articles/s41524-023-01138-w and https://www.nature.com/articles/s41524-021-00687-2.