Can AI really interpret images as well as physicians?

The diagnostic performance of deep learning models is on par with that of human experts, according to a new literature review published Sept. 25 in The Lancet Digital Health. Many of these models, however, weren't validated on external data.

In an analysis of more than 60 studies, Xiaoxuan Liu, MBChB, of the University Hospitals Birmingham NHS Foundation Trust in the UK, and colleagues found little difference in overall sensitivity and specificity between AI platforms and physicians. Fewer than half of those studies included out-of-sample external validation, which is important for ensuring AI generalizes to other institutions and populations.

“After careful selection of studies with transparent reporting of diagnostic performance and validation of the algorithm in an out-of-sample population, we found deep learning algorithms to have equivalent sensitivity and specificity to health-care professionals,” the researchers wrote.

The number of AI-in-radiology studies being published is on the rise, and more and more report diagnostic accuracy that is better than or equivalent to that of expert readers, but bias and generalizability remain sizeable concerns, the researchers noted. Additionally, systematic reviews comparing algorithms against healthcare professionals are often disease-specific, unlike the current study, which included all diseases.

“This review is the first to systematically compare the diagnostic accuracy of all deep learning models against health-care professionals using medical imaging published to date,” Liu et al. added.

For their study, the team identified 31,587 studies across four databases published from Jan. 1, 2012, to June 6, 2019. The team extracted binary diagnostic accuracy data and constructed contingency tables to determine the sensitivity and specificity of the models. Of those studies, 69 had enough data to determine test accuracy; mean sensitivity was 79.1% and mean specificity was 88.3%.
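The sensitivity and specificity figures above come from standard 2x2 contingency-table arithmetic. As a quick illustration (the counts below are hypothetical, not taken from the review), the calculation looks like this:

```python
# Sensitivity and specificity from a 2x2 contingency table.
# These counts are illustrative only, not data from the Lancet review.
tp, fn = 85, 15   # diseased cases: correctly / incorrectly classified
tn, fp = 90, 10   # healthy cases: correctly / incorrectly classified

sensitivity = tp / (tp + fn)  # true-positive rate: 85 / 100 = 0.85
specificity = tn / (tn + fp)  # true-negative rate: 90 / 100 = 0.90

print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
```

Sensitivity measures how often the model catches true disease; specificity measures how often it correctly clears healthy patients. Both are needed because a model can trivially maximize one at the expense of the other.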

Fourteen studies compared deep learning models' performance with that of physicians; the researchers performed an out-of-sample external validation on all of them, revealing a pooled sensitivity of 87% for deep learning and 86.4% for healthcare professionals, and a pooled specificity of 92.5% for deep learning compared to 90.5% for human experts.
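A "pooled" estimate combines results across studies into a single figure. The review used formal meta-analytic methods, but the basic idea can be sketched by aggregating contingency-table counts across studies (all counts below are hypothetical):

```python
# Naive pooled sensitivity: sum contingency-table counts across studies.
# Hypothetical counts; real meta-analyses use hierarchical models that
# weight studies and account for between-study variation.
studies = [
    {"tp": 40, "fn": 5},   # hypothetical study 1
    {"tp": 70, "fn": 12},  # hypothetical study 2
    {"tp": 55, "fn": 8},   # hypothetical study 3
]

total_tp = sum(s["tp"] for s in studies)
total_fn = sum(s["fn"] for s in studies)
pooled_sensitivity = total_tp / (total_tp + total_fn)

print(f"pooled sensitivity = {pooled_sensitivity:.1%}")
```

This naive aggregation shows why pooling matters: a single small study's sensitivity can swing widely, while the pooled figure reflects the combined evidence.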

“Although this estimate seems to support the claim that deep learning algorithms can match clinician-level accuracy, several methodological deficiencies that were common across most included studies should be considered,” Liu and colleagues wrote.

For example, many studies assessed deep learning in isolation, which is not reflective of clinical practice, the authors noted. Few studies were completed in real-life clinical environments, and varying metrics were often used to report the diagnostic accuracy of deep learning models.

It's not all bad news, though. The quality of studies has improved over the past year, an "encouraging finding" for Liu et al.

Overall, they wrote, all algorithms should be tested on external data, and international standards for such protocols and reporting may help ensure deep learning studies are well executed going forward.

“From this exploratory meta-analysis, we cautiously state that the accuracy of deep learning algorithms is equivalent to health-care professionals, while acknowledging that more studies considering the integration of such algorithms in real-world settings are needed,” the authors concluded.