Deep learning (DL) algorithms offer huge potential to transform medical diagnostics, especially medical imaging.1 Whilst the academic literature contains multiple reports on the performance of DL algorithms in diagnosing pathology from medical imaging,2,3 critical appraisal of these technologies remains in its infancy. This is becoming especially important now that regulatory bodies in the USA and Europe have started to grant approval for the use of DL interventions in clinical practice.4
In our recently published paper in npj Digital Medicine, we performed a systematic review and meta-analysis assessing studies that measured the diagnostic accuracy of DL algorithms on medical imaging. In addition to better understanding the diagnostic performance, we wanted to evaluate the quality of the evidence, in particular study design and reporting standards. We chose to perform meta-analysis on studies in the fields of ophthalmology, respiratory medicine and breast disease, as these were the three specialities with the largest number of studies with available data. Through systematic searching, we identified 503 studies suitable for inclusion in the review, with 82 studies in ophthalmology, 82 in breast disease and 115 in respiratory disease included for meta-analysis and 224 studies from other specialities included for qualitative review.
Our meta-analysis revealed that:
- in ophthalmology, AUCs ranged from 0.933 to 1 for diagnosing diabetic retinopathy, age-related macular degeneration and glaucoma on retinal fundus photographs and optical coherence tomography.
- in respiratory imaging, AUCs ranged from 0.864 to 0.937 for diagnosing lung nodules or lung cancer on chest X-ray or CT scan.
- in breast imaging, AUCs ranged from 0.868 to 0.909 for diagnosing breast cancer on mammogram, ultrasound, MRI and digital breast tomosynthesis.
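For readers less familiar with the metric reported above, the AUC (area under the receiver operating characteristic curve) can be read as the probability that a randomly chosen diseased case receives a higher model score than a randomly chosen non-diseased case. A minimal sketch of that pairwise interpretation, using invented toy labels and scores (not data from the review):

```python
def auc(labels, scores):
    """Probability that a randomly chosen positive case scores higher
    than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))

# Toy example: 1 = disease present, scores are a model's predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(auc(labels, scores), 3))  # → 0.889
```

An AUC of 1 corresponds to perfect separation of diseased from non-diseased cases, and 0.5 to chance-level discrimination, which is why the values in the 0.86–1 range above appear clinically promising on their face.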
Whilst estimates of the diagnostic performance of DL algorithms seemed to be high and potentially clinically acceptable, our review identified significant heterogeneity, variance and risk of bias between the studies. This led to considerable uncertainty over the estimates, which could be explained by a lack of consensus on how to conduct and report DL diagnostic studies.
We identified three clusters of limitations amongst the studies included in our review. Firstly, there were issues with the datasets used for training and testing, including the use of retrospective data and poor-quality reference standards. Secondly, there were few prospective studies and only two randomised studies in the literature, and the majority of studies did not report accuracy on an external dataset. Thirdly, the field lacks appropriate reporting standards, with large variations in terminology and metrics reported across the studies.
In our review, we propose that the quality of DL research being conducted and reported may be improved by the development of AI-specific reporting standards. We recognise that the STARD 2015 statement5 (designed for the reporting of diagnostic accuracy studies) is not fully applicable to DL studies. The recent publication of the CONSORT-AI6 and SPIRIT-AI7 guidelines (for randomised trials and interventional trial protocols, respectively) has been well received. However, as many of the AI interventions closest to translation are diagnostic, there is also a requirement for an AI-specific extension to the STARD statement (STARD-AI).8 Our research group is currently in the process of convening STARD-AI and we anticipate that these guidelines will be published later in 2021. We hope that they will provide a framework that enables higher-quality and more consistent reporting across future DL diagnostic studies.
Overall, whilst our meta-analysis demonstrated that DL algorithms do have high diagnostic accuracy, it is important that these findings are interpreted in the context of the non-standardised design, conduct and reporting of the underlying studies. This can only be improved with guidance around study design and reporting, which will be crucial prior to widespread clinical application and implementation of this potentially transformative technology. Only then will the potential of deep learning in diagnostic healthcare be truly realised in clinical practice.
Correspondence to firstname.lastname@example.org
1. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine. 2019;25(1):44-56.
2. Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet Digital Health. 2019;1(6):e271-e297.
3. Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689.
4. Muehlematter UJ, Daniore P, Vokinger KN. Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe: a comparative analysis. The Lancet Digital Health. 2021;3(3):e195-e203.
5. Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.
6. Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nature Medicine. 2020;26(9):1364-1374.
7. Cruz Rivera S, Liu X, Chan A-W, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nature Medicine. 2020;26(9):1351-1363.
8. Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nature Medicine. 2020;26:807-808.