Behind the Paper

VinDr-CXR: The largest public chest X-ray dataset with radiologist-generated annotations for machine learning-based computer-aided diagnosis (CAD)

We introduce the largest public chest X-ray (CXR) dataset with radiologist-generated annotations in both training and test sets. It will accelerate the development and evaluation of new machine-learning models for both localization and classification of thoracic lesions and diseases on CXR scans.

Published in Research Data

Mar 27, 2023

Hieu Pham and Ha Q. Nguyen

In 2022, we published the paper "VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations" in Scientific Data and released the whole dataset on PhysioNet. After just one year, the paper has reached almost 100 citations, and the dataset has become a standard benchmark for developing and evaluating machine learning, deep learning, and computer vision methods for chest X-ray interpretation. Below, we share our story.

Why did we build the dataset?

Most existing chest radiograph datasets depend on automated rule-based labelers that use either keyword matching or an NLP model to extract disease labels from free-text radiology reports. These tools can produce labels at scale but also introduce a high rate of inconsistency, uncertainty, and error. Models trained on such noisy labels may fall short of their reported performance when evaluated in real-world settings. Furthermore, report-based approaches only associate a CXR image with one or several labels from a predefined list of findings and diagnoses, without identifying their locations.
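
To make this failure mode concrete, here is a minimal sketch of a keyword-matching labeler in Python. The keyword list, the function name `keyword_labels`, and the sample report are all hypothetical and are not taken from any existing labeler; the sketch only illustrates how naive matching can turn a negated finding into a positive label.

```python
import re

# Hypothetical keyword list for illustration; real labelers use far
# larger vocabularies and additional rules.
KEYWORDS = {
    "Pneumothorax": ["pneumothorax"],
    "Pleural effusion": ["pleural effusion", "effusion"],
    "Cardiomegaly": ["cardiomegaly", "enlarged heart"],
}


def keyword_labels(report):
    """Return every label whose keyword appears anywhere in the report."""
    text = report.lower()
    return {
        label
        for label, terms in KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms)
    }


# Made-up report text: the pneumothorax mention is negated.
report = "Heart size is normal. No pneumothorax. Small left pleural effusion."
labels = keyword_labels(report)
print(sorted(labels))  # ['Pleural effusion', 'Pneumothorax']
```

In this example, "No pneumothorax" still yields a Pneumothorax label because the matcher ignores negation, which is one source of the label noise described above.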

There are a few CXR datasets that include annotated locations of abnormalities but they are either too small for training deep learning models or not detailed enough. The interpretation of a CXR is not all about image-level classification; it is even more important, from the perspective of a radiologist, to localize the abnormalities on the image. This partly explains why the applications of computer-aided detection (CAD) systems for CXR in clinical practice are still very limited.

We faced major challenges

Building high-quality datasets of annotated images is costly and time-consuming due to several constraints: (1) medical data are hard to retrieve from hospitals or medical centers; (2) manual annotation by physicians is both time-consuming and expensive; (3) the annotation of medical images requires a consensus of several expert readers to overcome human error; and (4) efficient labeling frameworks for managing and annotating large-scale medical datasets are lacking.

Our approach

The building of the VinDr-CXR dataset is divided into three main steps: (1) data collection, (2) data filtering, and (3) data labeling. Between 2018 and 2020, we retrospectively collected more than 100,000 CXRs in DICOM format from local PACS servers of two hospitals in Vietnam.

The flow of creating the VinDr-CXR dataset: (1) raw images in DICOM format were collected retrospectively from the hospitals’ PACS and de-identified to protect patients’ privacy; (2) invalid files, such as images of other modalities, other body parts, low quality, or incorrect orientation, were automatically filtered out by a CNN-based classifier; (3) a web-based labeling tool, VinDr Lab, was developed to store, manage, and remotely annotate DICOM data: each image in the training set of 15,000 images was independently labeled by 3 radiologists, and each image in the test set of 3,000 images was labeled by the consensus of 5 radiologists.

About the dataset

The dataset contains more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam. Out of this raw data, we released 18,000 images that were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases. The released dataset is divided into a training set of 15,000 and a test set of 3,000. Each scan in the training set was independently labeled by 3 radiologists, while each scan in the test set was labeled by the consensus of 5 radiologists. All images are in DICOM format and the labels from training and test sets are made publicly available.

Examples of CXRs with radiologist’s annotations. Abnormal findings (local labels) marked by radiologists are plotted on the original images for visualization purposes. The global labels are in bold and listed at the bottom of each example.
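
Because each training image is read independently by 3 radiologists, a user of the dataset typically has to group the local labels by image and by reader before training. The sketch below shows one way to do that in Python. The column names (`image_id`, `rad_id`, `class_name`, `x_min`, `y_min`, `x_max`, `y_max`), the class names, and all rows are assumptions made up for illustration; check the released annotation files for the actual layout.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in an assumed column layout; this is not real VinDr-CXR data.
SAMPLE = """image_id,rad_id,class_name,x_min,y_min,x_max,y_max
img_001,R1,Cardiomegaly,691,1375,1653,1831
img_001,R2,Cardiomegaly,700,1380,1660,1840
img_001,R3,No finding,,,,
img_002,R1,No finding,,,,
img_002,R2,No finding,,,,
img_002,R3,No finding,,,,
"""


def boxes_by_image(csv_text):
    """Group local labels (class name plus rectangle) by image, then by reader."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Touching the entry registers the reader even if no box was drawn.
        reader_boxes = grouped[row["image_id"]][row["rad_id"]]
        if row["x_min"]:  # rows without coordinates carry no rectangle
            box = tuple(float(row[k]) for k in ("x_min", "y_min", "x_max", "y_max"))
            reader_boxes.append((row["class_name"], box))
    return grouped


def readers_flagging(grouped, image_id, class_name):
    """Count the readers who drew at least one box of the given class."""
    return sum(
        any(name == class_name for name, _ in boxes)
        for boxes in grouped[image_id].values()
    )


grouped = boxes_by_image(SAMPLE)
print(readers_flagging(grouped, "img_001", "Cardiomegaly"))  # 2
print(len(grouped["img_002"]))  # 3 readers, none of whom drew a box
```

How disagreements between the three readers are resolved (for example, keeping all boxes or requiring agreement from at least two readers) is a modeling choice left to the user; the sketch only exposes the per-reader structure.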

We believe that large-scale, open, and high-quality data are the key to bringing medical AI algorithms to clinical settings and improving patient care. Beyond VinDr-CXR [1], we are committing our time and effort to creating more open datasets and releasing them to the research community. In 2023, we introduced "VinDr-Mammo: A large-scale benchmark dataset for computer-aided diagnosis in full-field digital mammography" [2] and "PediCXR: An open, large-scale chest radiograph dataset for interpretation of common thoracic diseases in children" [2]. We also released VinDr-SpineXR, a large annotated medical image dataset for spinal lesion detection and classification from radiographs [3]. We believe that these imaging resources will play an important role in the development and validation of machine learning and deep learning algorithms for medical imaging research [4,5,6,7,8,9].

References

[1] Nguyen, Ha Q., et al. "VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations." Scientific Data 9.1 (2022): 429.
[2] Nguyen, Hieu Trung, et al. "VinDr-Mammo: A large-scale benchmark dataset for computer-aided diagnosis in full-field digital mammography." medRxiv (2022): 2022-03.
[3] Nguyen, Hieu T., et al. "VinDr-SpineXR: A deep learning framework for spinal lesions detection and classification from radiographs." Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24. Springer International Publishing, 2021.
[4] Pham, Hieu H., et al. "Interpreting chest X-rays via CNNs that exploit hierarchical disease dependencies and uncertainty labels." Neurocomputing 437 (2021): 186-194.
[5] Tran, Thanh T., et al. "Learning to automatically diagnose multiple diseases in pediatric chest radiographs using deep convolutional neural networks." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[6] Pham, Hieu H., et al. "An Accurate and Explainable Deep Learning System Improves Interobserver Agreement in the Interpretation of Chest Radiograph." IEEE Access 10 (2022): 104512-104531.
[7] Nguyen, Ngoc Huy, et al. "Deployment and validation of an AI system for detecting abnormal chest radiographs in clinical settings." Frontiers in Digital Health (2022): 130.
[8] Nguyen, Huyen TX, et al. "A novel multi-view deep learning approach for BI-RADS and density assessment of mammograms." 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2022.
[9] Le, Khiem H., et al. "Learning from multiple expert annotators for enhancing anomaly detection in medical image analysis." IEEE Access 11 (2023): 14105-14114.
