Algorithm’s ‘unexpected’ weakness raises larger concerns about AI’s potential in broader populations

A new investigation revealed "unexpected" shortcomings in a federally cleared artificial intelligence tool for detecting intracranial hemorrhages, prompting researchers to call for greater standardization in how AI-based clinical decision support platforms are evaluated.

Radiologists must choose among hundreds of independently developed algorithms, in addition to the nearly 80 cleared by the U.S. Food and Drug Administration. Each comes with its own supporting evidence, making quality and generalizability difficult to assess, experts explained April 3 in JACR.

The group scrutinized one specific AI decision support system from Aidoc, which screens non-contrast CT exams for intracranial hemorrhage (ICH). On inspection, sensitivity and positive predictive value proved lower than in prior studies, with certain hemorrhage characteristics and a history of neurosurgery associated with poorer performance.

First author Andrew F. Voter, PhD, with the University of Wisconsin-Madison’s School of Medicine and Public Health, and co-authors say their results raise concerns regarding AI’s effectiveness across broader patient populations.

“These results further highlight the need for standardized study design to allow for rigorous and reproducible site-to-site comparison of emerging deep learning technologies,” Voter and colleagues explained on Saturday.

The experts based their conclusions on a retrospective review of 3,605 consecutive emergency head CTs performed between July and December 2019. Both a neuroradiologist and the AI software evaluated each scan for ICH, making it the largest study of its kind to use radiologist manual review as the ground truth, the authors noted.

In total, 349 hemorrhages were discovered, with human diagnoses matching software interpretations 96.9% of the time. As mentioned, overall sensitivity (92.3%, the share of true hemorrhages the software flagged) and positive predictive value (81.3%, the share of software flags that were truly hemorrhages) were lower than in prior studies, the authors reported.

Decreased AI performance was significantly associated with prior neurosurgery, type of ICH, and number of ICHs, according to Voter et al. At the same time, image quality did not affect performance.

The team said they were unable to pinpoint the exact source of their discordant results, despite extensive searching.

And while artificial intelligence tools have improved over recent years, comparing their performance across multiple institutions and populations remains challenging.

In light of their findings, Voter and colleagues encouraged developers to train their algorithms on more diverse datasets, including local data and postsurgical patients.

Read the full study in the Journal of the American College of Radiology.

""

