Behind the Paper

What is the diagnostic accuracy of deep learning algorithms on medical imaging and what is the quality of evidence supporting this?

Our systematic review and meta-analysis provides a critical assessment of the deep learning literature to date and proposes recommendations to improve the quality of deep learning research in the future.

Published in Healthcare & Nursing

Apr 07, 2021

Ravi Aggarwal, Viknesh Sounderajah & Hutan Ashrafian

Deep learning (DL) algorithms offer huge potential to transform medical diagnostics, especially medical imaging.¹ Whilst multiple reports in the academic literature have studied the performance of DL algorithms in diagnosing pathology from medical imaging,²,³ critical appraisal of these technologies is still in its infancy. Such appraisal is becoming especially important now that regulatory bodies in the USA and Europe have started to grant approval for the use of DL interventions in clinical practice.⁴

In our recently published paper in npj Digital Medicine, we performed a systematic review and meta-analysis assessing studies that measured the diagnostic accuracy of DL algorithms on medical imaging. In addition to better understanding the diagnostic performance, we wanted to evaluate the quality of the evidence, in particular study design and reporting standards. We chose to perform meta-analysis on studies in the fields of ophthalmology, respiratory medicine and breast disease, as these were the three specialities with the largest number of studies with available data. Through systematic searching, we identified 503 studies suitable for inclusion in the review, with 82 studies in ophthalmology, 82 in breast disease and 115 in respiratory disease included for meta-analysis and 224 studies from other specialities included for qualitative review.

Our meta-analysis revealed that:

in ophthalmology, AUCs ranged from 0.933 to 1 for diagnosing diabetic retinopathy, age-related macular degeneration and glaucoma on retinal fundus photographs and optical coherence tomography.
in respiratory imaging, AUCs ranged from 0.864 to 0.937 for diagnosing lung nodules or lung cancer on chest X-ray or CT scan.
in breast imaging, AUCs ranged from 0.868 to 0.909 for diagnosing breast cancer on mammogram, ultrasound, MRI and digital breast tomosynthesis.
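
For readers less familiar with the metric, the AUC (area under the receiver operating characteristic curve) can be read as the probability that an algorithm gives a randomly chosen diseased case a higher score than a randomly chosen healthy case. The sketch below is only an illustration of that definition: the function name `auc` and the scores are hypothetical and are not taken from any study in the review.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (diseased, healthy) pairs in which the diseased case
    receives the higher score; tied pairs count as half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical algorithm outputs (made up for illustration only)
diseased = [0.9, 0.8, 0.7, 0.4]
healthy = [0.6, 0.3, 0.2, 0.1]

# 15 of the 16 pairs are ranked correctly
print(auc(diseased, healthy))  # 0.9375
```

An AUC of 0.5 corresponds to chance-level ranking and an AUC of 1 to perfect separation of diseased and healthy cases.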

Whilst estimates of the diagnostic performance of DL algorithms seemed to be high and potentially clinically acceptable, our review identified significant heterogeneity, variance and risk of bias between the studies. This led to considerable uncertainty over the estimates, which could be explained by a lack of consensus on how to conduct and report DL diagnostic studies.
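
One common way to put a number on between-study heterogeneity is Cochran's Q and the derived I² statistic, which estimates the share of variation across studies that is not attributable to chance. The sketch below is a simplified illustration only, assuming inverse-variance weights on untransformed estimates. It is not the statistical method used in the paper, the function name `heterogeneity` is hypothetical, and the estimates and standard errors are made up.

```python
def heterogeneity(estimates, std_errors):
    """Return (fixed-effect pooled estimate, Cochran's Q, I-squared in %)."""
    # Inverse-variance weights: more precise studies count for more
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Q: weighted squared deviations of each study from the pooled estimate
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: proportion of Q in excess of its degrees of freedom
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical study-level AUCs and standard errors (illustration only)
aucs = [0.95, 0.88, 0.99, 0.80]
ses = [0.01, 0.02, 0.01, 0.03]

pooled, q, i2 = heterogeneity(aucs, ses)
print(f"pooled={pooled:.3f}  Q={q:.1f}  I2={i2:.0f}%")
```

When I² is high, as it is for these made-up values, a single pooled number hides substantial disagreement between studies and should be interpreted with caution.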

We identified three clusters of limitations amongst the studies included in our review. Firstly, there were issues with the datasets used for training and testing, including the use of retrospective data and poor-quality reference standards. Secondly, there were few prospective studies and only two randomised studies in the literature, and a majority of studies did not report accuracy on an external dataset. Thirdly, the field lacks appropriate reporting standards, with large variations in the terminology and metrics reported across studies.

In our review, we propose that the quality of DL research being conducted and reported may be improved by the development of AI-specific reporting standards. We recognise that the STARD 2015 statement⁵ (designed for the reporting of diagnostic accuracy studies) is not fully applicable to DL studies. The recently published CONSORT-AI⁶ and SPIRIT-AI⁷ guidelines (for randomised trials and interventional trial protocols respectively) have been well received. However, as many of the published AI interventions that are close to translation are in the field of diagnostics, there is also a requirement for an AI-specific extension to the STARD statement (STARD-AI).⁸ Our research group is currently in the process of convening STARD-AI and we anticipate that these guidelines will be published later in 2021. We hope that they will provide a framework that enables higher quality and more consistent reporting across future DL diagnostic studies.

Overall, whilst our meta-analysis demonstrated that DL algorithms do have high diagnostic accuracy, it is important that these findings are interpreted in light of the non-standardised design, conduct and reporting of the underlying studies. This can only be improved with guidance on study design and reporting, which will be crucial prior to widespread clinical application and implementation of this potentially transformative technology. Only then can the potential of deep learning in diagnostic healthcare be truly realised in clinical practice.

Correspondence to h.ashrafian@ic.ac.uk

References

1. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine. 2019;25(1):44-56.
2. Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet Digital Health. 2019;1(6):e271-e297.
3. Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689.
4. Muehlematter UJ, Daniore P, Vokinger KN. Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe: a comparative analysis. The Lancet Digital Health. 2021;3(3):e195-e203.
5. Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.
6. Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nature Medicine. 2020;26(9):1364-1374.
7. Cruz Rivera S, Liu X, Chan A-W, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nature Medicine. 2020;26(9):1351-1363.
8. Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nature Medicine. 2020;26:807-808.
