What is the diagnostic accuracy of deep learning algorithms on medical imaging and what is the quality of evidence supporting this?

Our systematic review and meta-analysis provides a critical assessment of the deep learning literature to date and proposes recommendations to improve the quality of deep learning research in the future.

Deep learning (DL) algorithms offer huge potential to transform medical diagnostics, especially medical imaging.1 Whilst there are multiple reports in the academic literature studying the performance of DL algorithms in diagnosing pathology from medical imaging,2,3 critical appraisal of these technologies remains in its infancy. This is becoming especially important now that regulatory bodies in the USA and Europe have started to grant approval for the use of DL interventions in clinical practice.4

In our recently published paper in npj Digital Medicine, we performed a systematic review and meta-analysis assessing studies that measured the diagnostic accuracy of DL algorithms on medical imaging. In addition to better understanding the diagnostic performance, we wanted to evaluate the quality of the evidence, in particular study design and reporting standards. We chose to perform meta-analysis on studies in the fields of ophthalmology, respiratory medicine and breast disease, as these were the three specialities with the largest number of studies with available data. Through systematic searching, we identified 503 studies suitable for inclusion in the review, with 82 studies in ophthalmology, 82 in breast disease and 115 in respiratory disease included for meta-analysis and 224 studies from other specialities included for qualitative review.

Our meta-analysis revealed that:

  • in ophthalmology, AUCs ranged from 0.933 to 1 for diagnosing diabetic retinopathy, age-related macular degeneration and glaucoma on retinal fundus photographs and optical coherence tomography.
  • in respiratory imaging, AUCs ranged from 0.864 to 0.937 for diagnosing lung nodules or lung cancer on chest X-ray or CT scan.
  • in breast imaging, AUCs ranged from 0.868 to 0.909 for diagnosing breast cancer on mammogram, ultrasound, MRI and digital breast tomosynthesis.
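For readers less familiar with the metric, the AUC figures above can be read as the probability that a randomly chosen diseased case receives a higher algorithm score than a randomly chosen healthy case. The sketch below is purely illustrative of that interpretation (the Mann-Whitney formulation of the empirical AUC) and is not code from the reviewed studies:

```python
def empirical_auc(pos_scores, neg_scores):
    """Empirical AUC: the fraction of (diseased, healthy) pairs in which
    the diseased case scores higher, with ties counted as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy example: three diseased and three healthy cases.
auc = empirical_auc([0.9, 0.8, 0.7], [0.7, 0.6, 0.4])
print(round(auc, 3))  # 8.5 winning half-pairs out of 9 pairs ≈ 0.944
```

An AUC of 1 (as reported at the upper end of the ophthalmology range) corresponds to perfect separation of the two groups; 0.5 corresponds to chance.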

Whilst estimates of the diagnostic performance of DL algorithms seemed high and potentially clinically acceptable, our review identified significant heterogeneity, variance and risk of bias across the studies. This led to considerable uncertainty over the estimates, which may be explained by a lack of consensus on how to conduct and report DL diagnostic studies.

We identified three clusters of limitations amongst the studies included in our review. Firstly, there were issues with the datasets used for training and testing, including the use of retrospective data and poor-quality reference standards. Secondly, prospective studies were scarce: only two randomised studies currently exist in the literature, and the majority of studies did not report accuracy on an external dataset. Thirdly, appropriate reporting standards were lacking in this field, with large variations in terminology and in the metrics reported across studies.

In our review, we propose that the quality of DL research being conducted and reported may be improved by the development of AI-specific reporting standards. We recognise that the STARD 2015 statement5 (designed for the reporting of diagnostic accuracy studies) is not fully applicable to DL studies. The recent publication of the CONSORT-AI6 and SPIRIT-AI7 guidelines (for randomised trials and interventional trial protocols respectively) has been well received. However, as many of the AI interventions closest to clinical translation are diagnostic, there is also a requirement for an AI-specific extension to the STARD statement (STARD-AI).8 Our research group is currently in the process of convening STARD-AI and we anticipate that these guidelines will be published later in 2021. We hope that they will provide a framework that enables higher-quality and more consistent reporting across future DL diagnostic studies.

Overall, whilst our meta-analysis demonstrated that DL algorithms do have high diagnostic accuracy, it is important that these findings are interpreted in the context of non-standardised design, conduct and reporting of studies. This can only be improved with guidance around study design and reporting, which will be crucial prior to widespread clinical application and implementation of this potentially transformative technology. Only then will the potential of deep learning in diagnostic healthcare be truly realised in clinical practice.

Correspondence to h.ashrafian@ic.ac.uk


  1. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nature medicine. 2019;25(1):44-56.
  2. Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet Digital Health. 2019;1(6):e271-e297.
  3. Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689.
  4. Muehlematter UJ, Daniore P, Vokinger KN. Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe: a comparative analysis. The Lancet Digital Health. 2021;3(3):e195-e203.
  5. Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.
  6. Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nature Medicine. 2020;26(9):1364-1374.
  7. Cruz Rivera S, Liu X, Chan A-W, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nature Medicine. 2020;26(9):1351-1363.
  8. Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nature Medicine. 2020;26:807-808.
