PediCXR: Advancing the interpretation of common thoracic diseases in children

We created and released PediCXR, a high-quality, large-scale pediatric chest X-ray dataset with annotations to support research into diagnostic models for pediatric diseases, addressing the lack of such datasets.

Common thoracic diseases cause several hundred thousand deaths every year among children under five years old. The chest radiograph, or CXR, is the first-line and most commonly performed imaging examination in the assessment of pediatric patients. Computer-aided diagnosis (CAD) systems for identifying lung abnormalities in adult CXRs have recently achieved great success thanks to the availability of large labeled datasets, and many large-scale CXR datasets of adult patients have been established and released in recent years. Unfortunately, the creation of pediatric CXR datasets remains largely unexplored, and the number of benchmark pediatric CXR datasets is limited. This is the main obstacle to developing new machine learning-based CAD systems for pediatric CXR and transferring them into clinical practice.

To provide the research community with a large-scale pediatric CXR dataset with high-quality annotations, we built the PediCXR dataset in DICOM format. The dataset consists of 9,125 posteroanterior (PA) view CXR scans of patients younger than 10 years, retrospectively collected from three major hospitals in Vietnam from 2020 to 2021. All CXR scans come with both the localization of critical findings and the classification of common thoracic diseases. A group of three radiologists, each with at least 10 years of experience, annotated the images for the presence of 36 critical findings and 15 diagnoses. To the best of our knowledge, this is the first and largest pediatric CXR dataset containing lesion-level annotations and image-level labels for the detection of multiple findings and diseases.
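To make the two annotation levels concrete, here is a minimal Python sketch of how one annotated scan could be represented in memory. It is an illustration only, not the dataset's actual file schema: the class names, field names, image identifier, coordinates, and the example label strings are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    """One lesion-level annotation: a rectangle plus a finding name.

    Field names are hypothetical; coordinates are assumed to be pixels.
    """
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    finding: str

    def area(self) -> float:
        # Clamp at zero so a degenerate box never reports a negative area.
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


@dataclass
class ScanAnnotation:
    """All labels for one scan: lesion-level boxes and image-level diagnoses."""
    image_id: str
    boxes: List[BoundingBox] = field(default_factory=list)
    diagnoses: List[str] = field(default_factory=list)

    def findings(self) -> List[str]:
        # Unique finding names present on this scan, in sorted order.
        return sorted({box.finding for box in self.boxes})


# Made-up example values, used only to show the shape of the data.
example = ScanAnnotation(
    image_id="scan_0001",
    boxes=[BoundingBox(120.0, 200.0, 320.0, 380.0, "Consolidation")],
    diagnoses=["Pneumonia"],
)
print(example.findings(), example.diagnoses)
```

Keeping boxes and diagnoses in separate fields mirrors the distinction between lesion-level annotations and image-level labels, so a detection model and a classification model can each read only the part it needs.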

Several examples of pediatric CXR images with radiologists’ annotations. Local labels marked by radiologists are plotted on the original images. The global labels, which classify images into diseases, are in bold.

As the figure above shows, the local labels are rectangular bounding boxes that localize the findings, while the global labels reflect the radiologist's diagnostic impression at the image level. For algorithm development, we randomly divided the dataset into two parts: a training set of 7,728 scans (84.7%) and a test set of 1,397 scans (15.3%). To the best of our knowledge, PediCXR is currently the largest public pediatric CXR dataset with radiologist-generated annotations in both the training and test sets. We believe PediCXR provides a suitable imaging source for investigating how well supervised machine learning models can identify common lung diseases in pediatric patients. The dataset characteristics of PediCXR are shown below.
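For readers who want to see what a random split of this size looks like in practice, the following Python sketch shuffles a list of scan identifiers with a fixed seed and cuts it at 7,728. It is not the procedure we used to produce the released split: the function name `random_split`, the seed, and the placeholder identifiers are assumptions for illustration, and the released training and test sets should be used as published.

```python
import random
from typing import List, Sequence, Tuple


def random_split(
    ids: Sequence[str], n_train: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle ids reproducibly and return (train, test) lists."""
    if not 0 <= n_train <= len(ids):
        raise ValueError("n_train must be between 0 and len(ids)")
    shuffled = list(ids)  # copy so the caller's sequence is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 9,125 scans.
scan_ids = [f"scan_{i:04d}" for i in range(9125)]
train_ids, test_ids = random_split(scan_ids, n_train=7728)

# Recompute the proportions quoted in the text.
train_pct = round(100 * len(train_ids) / len(scan_ids), 1)
test_pct = round(100 * len(test_ids) / len(scan_ids), 1)
print(len(train_ids), train_pct, len(test_ids), test_pct)
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the split reproducible without affecting any other random state in the program.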

Dataset characteristics of PediCXR.

To encourage new advances in pediatric CXR interpretation using data-driven approaches, we provide a detailed description of the PediCXR data sample and make the dataset publicly available on 

Large-scale, open, high-quality data are the key to bringing medical AI algorithms into clinical settings and improving patient care. Beyond PediCXR, we are committed to creating more open datasets and releasing them to the research community. In 2022, we published "VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations" in Scientific Data [1]. In 2023, we introduced "VinDr-Mammo: A large-scale benchmark dataset for computer-aided diagnosis in full-field digital mammography" [2]. We also released VinDr-SpineXR, a large annotated medical image dataset for spinal lesion detection and classification from radiographs [3]. We believe that these imaging resources will play an important role in the development and validation of machine learning and deep learning algorithms for medical imaging research [4,5,6,7,8,9].


  1. Nguyen, Ha Q., et al. "VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations." Scientific Data 9.1 (2022): 429.
  2. Nguyen, Hieu Trung, et al. "VinDr-Mammo: A large-scale benchmark dataset for computer-aided diagnosis in full-field digital mammography." MedRxiv (2022): 2022-03.
  3. Nguyen, Hieu T., et al. "VinDr-SpineXR: A deep learning framework for spinal lesions detection and classification from radiographs." Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24. Springer International Publishing, 2021.
  4. Pham, Hieu H., et al. "Interpreting chest X-rays via CNNs that exploit hierarchical disease dependencies and uncertainty labels." Neurocomputing 437 (2021): 186-194.
  5. Tran, Thanh T., et al. "Learning to automatically diagnose multiple diseases in pediatric chest radiographs using deep convolutional neural networks." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
  6. Pham, Hieu H., et al. "An Accurate and Explainable Deep Learning System Improves Interobserver Agreement in the Interpretation of Chest Radiograph." IEEE Access 10 (2022): 104512-104531.
  7. Nguyen, Ngoc Huy, et al. "Deployment and validation of an AI system for detecting abnormal chest radiographs in clinical settings." Frontiers in Digital Health (2022): 130.
  8. Nguyen, Huyen TX, et al. "A novel multi-view deep learning approach for BI-RADS and density assessment of mammograms." 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2022.
  9. Le, Khiem H., et al. "Learning from multiple expert annotators for enhancing anomaly detection in medical image analysis." IEEE Access 11 (2023): 14105-14114.
