Behind the Paper

HALD, a human aging and longevity knowledge graph for precision gerontology and geroscience analyses

We presented HALD, a text-mining-based biomedical knowledge graph dataset on human aging and longevity, built from all published literature related to human aging and longevity in PubMed using multiple state-of-the-art natural language processing (NLP) techniques.

Published in Research Data

Dec 03, 2023

Zexu Wu

PhD student, Zhejiang University

The published literature is one of the most accessible sources of molecular and disease information related to aging and longevity. However, the sheer volume of biomedical literature makes retrieving information from major databases such as PubMed time-consuming and inefficient. Integrated datasets with comprehensive knowledge are therefore crucial for researchers to leverage existing resources. In the life sciences, a biomedical knowledge graph (KG) can not only link biomedical entities through defined relations, but also predict potential relationships between existing entities and discover new relational facts.

In this paper, we presented HALD, a human aging and longevity dataset of the biomedical KG from human aging and longevity-related literature in PubMed. Figure 1 illustrates the workflow of biomedical literature mining using multiple NLP techniques.

Figure 1. The workflow of HALD. (1) In the Literature Retrieval phase, we collected abstracts, PMIDs, and other information from PubMed. (2) In the Named Entity Recognition phase, we employed PubTator, Python’s re Module, Stanford CoreNLP, ScispaCy, and BERN methods to identify and normalize named entities. (3) In the Relation Extraction phase, we used NetworkX, OpenIE, and AllenNLP tools to extract relations, in which Main Verbs Formation and Negation Detection were included. (4) In the Biomarkers Identification phase, we classified the relationships into positive, association, and negative ones based on their types. Further identification as biomarkers for human aging and longevity was performed.

As of September 2023, we had annotated 339,918 abstracts from PubMed and curated 12,227 entities of 10 types (gene, RNA, carbohydrate, peptide, lipid, protein, pharmaceutical preparations, toxin, mutation, and disease entities), 115,522 relations, 1,855 aging biomarkers, and 525 longevity biomarkers in HALD. The distributions of entities and relations are shown in Fig. 2a,b.

Figure 2. The distribution and evaluation of HALD. (a) The pie chart of entity distribution. (b) The Sankey diagram of relation distribution. (c) The comparison of aging-related gene counts among HALD, Aging Atlas, GenAge and AgingBank (Pro). (d) The comparison of longevity-related gene counts among LongevityMap, HALD, and AgingBank (Pro).

The contributions of HALD are as follows:

HALD is the first human aging and longevity knowledge dataset of the biomedical knowledge graph mined from published literature using NLP technologies.
HALD provides 10 types of credible human aging and longevity biomedical entities.
HALD links biomedical entities through certain relations and predicts the potential relationships.
HALD identifies aging and longevity biomarkers from curated entities and elucidates their associations with aging-related diseases.

Literature retrieval

The following search was run directly against PubMed with the Bio.Entrez module of the Biopython package to retrieve biomedical journal articles related to human aging and longevity: (“aging” [Title/Abstract] OR “ageing” [Title/Abstract] OR “longevity” [Title/Abstract] OR “centenarian” [Title/Abstract] OR “the elderly” [Title/Abstract] OR “the aged” [Title/Abstract] OR “old people” [Title/Abstract] OR “older people” [Title/Abstract] OR “old age” [Title/Abstract] OR “gerontology” [Title/Abstract] OR “geroscience” [Title/Abstract] OR “lifespan” [Title/Abstract] OR “healthspan” [Title/Abstract] OR “life expectancy” [Title/Abstract] AND “Journal Article” [ptyp] AND “humans” [MeSH Terms] AND “English” [lang]).
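A minimal sketch of how such a retrieval might look with Bio.Entrez is given here. It is not the HALD pipeline code: the helper names build_query and search_pmids, the shortened term list, and the retmax value are assumptions for illustration, and explicit parentheses are added around the OR group only to make the grouping unambiguous.

```python
# Illustrative sketch only; not the HALD pipeline code.
TERMS = ["aging", "ageing", "longevity"]  # shortened; the search in the text has 14 terms

def build_query(terms):
    """Join title/abstract terms with OR, then add the filters named in the text."""
    any_term = " OR ".join(f'"{t}"[Title/Abstract]' for t in terms)
    filters = ['"Journal Article"[ptyp]', '"humans"[MeSH Terms]', '"English"[lang]']
    return f"({any_term}) AND " + " AND ".join(filters)

def search_pmids(query, email, retmax=100):
    """Return up to `retmax` PMIDs for `query` (needs Biopython and network access)."""
    from Bio import Entrez  # imported lazily so build_query works without Biopython
    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])
```

With Biopython installed, search_pmids(build_query(TERMS), email="you@example.org") would return a list of PMID strings; the email address here is a placeholder.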

Named entity recognition

We combined web-based, dictionary-based, rule-based, and deep learning (DL)-based methods to conduct named entity recognition (NER), and recognized 10 types of entities: gene, RNA, protein, carbohydrate, lipid, peptide, pharmaceutical preparations, toxin, mutation, and disease.
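To give a feel for the rule-based component, the sketch below matches mutation mentions with Python's re module. The rule itself (a one-letter amino acid, a position, and another one-letter amino acid, such as "V600E") and the function name find_mutations are assumptions chosen for illustration; the actual rules used in HALD are not described in this post.

```python
import re

# Hypothetical rule: one-letter amino acid, position, one-letter amino acid (e.g. "V600E").
AMINO = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b[{AMINO}]\d+[{AMINO}]\b")

def find_mutations(sentence):
    """Return (text, start, end) for each substring matching the mutation rule."""
    return [(m.group(), m.start(), m.end()) for m in MUTATION_RE.finditer(sentence)]
```

A single rule like this is deliberately narrow; combining it with dictionary lookups and model-based taggers, as the text describes, is what would give broader coverage.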

Relation extraction

When two entities co-occur in one sentence and a main verb lies between them, the two entities are likely to be related. We therefore selected sentences containing at least two entities and conducted relation extraction (RE) with the NetworkX, OpenIE, and AllenNLP tools.
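The co-occurrence heuristic can be sketched in a few lines of plain Python. This is a simplified stand-in and does not reproduce the NetworkX, OpenIE, or AllenNLP steps: the function name candidate_relations, the input layout, and the use of Penn Treebank tags starting with "VB" to spot verbs are all assumptions, and the part-of-speech tags are supplied by hand rather than produced by a parser.

```python
from itertools import combinations

def candidate_relations(tokens, entity_positions):
    """Return (entity_a, verb, entity_b) triples where a verb lies between two entities.

    tokens: list of (word, pos_tag) pairs using Penn Treebank tags.
    entity_positions: token indices of recognized entities, in any order.
    """
    triples = []
    for i, j in combinations(sorted(entity_positions), 2):
        verbs = [word for word, tag in tokens[i + 1:j] if tag.startswith("VB")]
        if verbs:
            # Keep the last verb before the second entity as a simple stand-in
            # for the "main verb"; a dependency parse would be more reliable.
            triples.append((tokens[i][0], verbs[-1], tokens[j][0]))
    return triples

# Invented example sentence with placeholder entity names.
sentence = [("GeneX", "NN"), ("is", "VBZ"), ("associated", "VBN"),
            ("with", "IN"), ("DiseaseY", "NN")]
print(candidate_relations(sentence, [0, 4]))
```

Sentences whose entities are separated only by non-verbs (for example a conjunction) yield no triple, which mirrors the requirement that a main verb sit between the two entities.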

Human aging and longevity biomarkers identification

We further identified human aging and longevity biomarkers by investigating the characteristics of the relationships between gene, RNA, protein, carbohydrate, lipid, peptide, pharmaceutical preparations, toxin, mutation entities and disease entities. The relationships between the potential human aging and longevity biomarkers and disease entities were divided into three classes as follows:

Positive relationship. Positive relationships like “lead” and “cause” were considered aging-promoting relationships.
Association relationship. Relationships that can indicate an association like “associated” and “related” were considered aging-promoting relationships.
Negative relationship. Negative relationships like “prevent” and “ameliorate” were considered longevity-promoting relationships.
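The three classes above can be written down as a small lookup. The sketch below lists only the six example verbs quoted in this post; the full vocabulary used in HALD is not given here, and the names RELATION_CLASS, BIOMARKER_ROLE, and classify are hypothetical.

```python
# Only the six example verbs quoted in the text are listed; the full
# vocabulary used by HALD is not given in this post.
RELATION_CLASS = {
    "lead": "positive", "cause": "positive",
    "associated": "association", "related": "association",
    "prevent": "negative", "ameliorate": "negative",
}
BIOMARKER_ROLE = {
    "positive": "aging-promoting",
    "association": "aging-promoting",
    "negative": "longevity-promoting",
}

def classify(verb):
    """Return (relation_class, role) for a relation verb, or None if it is not listed."""
    relation_class = RELATION_CLASS.get(verb.lower())
    if relation_class is None:
        return None
    return relation_class, BIOMARKER_ROLE[relation_class]
```

For example, classify("prevent") gives ("negative", "longevity-promoting"), while a verb outside the table returns None and would need manual review or a larger vocabulary.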

Framework of KG

Generally, the resource description framework (RDF) and graph databases are the two main storage forms for KGs. RDF makes it convenient for designers to publish and share data, while a graph database provides a user-friendly interface for browsing data. Thus, we developed HALD on a graph database to explore the human aging and longevity-related KG. The front end was built with React (https://react.dev/), and Elasticsearch (https://www.elastic.co/) was used for real-time search and management. We employed Neo4j (https://neo4j.com/) to offer an intuitive network view of the entities and relations. All analyses in this study were done in JupyterLab (https://jupyter.org/) notebooks with the Python kernel. Automatic updates are scheduled monthly to keep the KG up to date.
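As a rough picture of how one extracted triple could be written to Neo4j, the sketch below uses a parameterized Cypher MERGE through the official neo4j Python driver. The schema (an Entity label, a RELATES_TO relationship, and the property names) and the helper names are hypothetical; the schema actually used by HALD is not described in this post.

```python
# Hypothetical schema: (:Entity {name, type})-[:RELATES_TO {verb, pmid}]->(:Entity)
MERGE_TRIPLE = (
    "MERGE (a:Entity {name: $head, type: $head_type}) "
    "MERGE (b:Entity {name: $tail, type: $tail_type}) "
    "MERGE (a)-[r:RELATES_TO {verb: $verb, pmid: $pmid}]->(b)"
)

def triple_params(head, head_type, verb, tail, tail_type, pmid):
    """Build the parameter dict for MERGE_TRIPLE; values are passed, not interpolated."""
    return {"head": head, "head_type": head_type, "verb": verb,
            "tail": tail, "tail_type": tail_type, "pmid": str(pmid)}

def store_triple(uri, auth, params):
    """Write one triple to a running Neo4j server (needs the `neo4j` driver package)."""
    from neo4j import GraphDatabase  # lazy import: the helpers above need no driver
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            session.run(MERGE_TRIPLE, params)
```

Using MERGE rather than CREATE keeps repeated loads from duplicating nodes or edges, which matters if the graph is refreshed on a schedule.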

The HALD dataset aims to make research on human aging and longevity more convenient and to reduce the work of sifting through vast amounts of data. HALD also predicts biomarkers of aging and longevity from the published literature, making it a valuable reference for precision gerontology and geroscience analyses. HALD is publicly available at Figshare, an open scientific data repository. Researchers who want to explore the dataset interactively can visit https://bis.zju.edu.cn/hald.

Users are welcome to contribute data and offer suggestions at any time through the Feedback module on the website, by filling in the form and clicking the “FEEDBACK” button to submit it. We will promptly check all feedback, respond via email, and make necessary adjustments as soon as possible.

Code availability

All code used in this paper can be downloaded on GitHub at https://github.com/zexuwu/hald.
