Physicians with no coding experience are able to create AI algorithms to classify medical images at levels comparable to state-of-the-art platforms, according to a new study published in The Lancet Digital Health. However, some experts questioned whether those without experience should really be creating such technology.
Two physicians without any deep learning experience completed a 10-hour self-study period to learn the basics of programming before developing and testing their algorithms. Overall, the models performed well at binary classification tasks but fell short when validated on external data, according to Livia Faes, MD, with Cantonal Hospital Lucerne in Switzerland, and colleagues.
“The availability of automated deep learning might be a cornerstone for the democratization of sophisticated algorithmic modelling in health care. It allows the derivation of classification models without requiring a deep understanding of the mathematical, statistical, and programming principles,” they added. “However, the translation of this technological success to meaningful clinical effect requires concerted efforts and a careful stepwise approach to avoid biasing the results.”
For their study, the researchers used five publicly available open-source datasets, which were fed into a neural architecture search framework that automatically created deep learning models to classify common diseases. The datasets included retinal fundus images (MESSIDOR); optical coherence tomography (OCT) images (Guangzhou Medical University and Shiley Eye Institute, version 3); images of skin lesions (Human Against Machine [HAM] 10000); and pediatric and adult chest x-ray (CXR) images (Guangzhou Medical University and Shiley Eye Institute, version 3, and the National Institutes of Health [NIH] dataset, respectively).
The researchers performed a literature review of traditional models to compare their performance against the newly created machine learning models.
Overall, upon internal validation, the non-expert models performed well at binary classification tasks: sensitivity ranged from 73.3% to 97%, specificity from 67% to 100%, and area under the precision-recall curve (AUPRC) from 0.87 to 1.00. For multiclass classification tasks the models did not perform as well, with sensitivity ranging from 38% to 100% and specificity from 67% to 100%. Across the five models, AUPRC ranged from 0.57 to 1.00.
When the team performed an external validation test, the top-performing model demonstrated an AUPRC of 0.47, a sensitivity of 49%, and a positive predictive value of 52%. The worst-performing model overall was the one trained on the NIH chest x-ray dataset.
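For readers unfamiliar with the metrics reported above, the following minimal sketch shows how sensitivity, specificity, and positive predictive value are derived from a binary confusion matrix. The labels and predictions are invented purely for illustration, not taken from the study's data.

```python
def classification_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, ppv) for binary labels, 1 = disease present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # share of diseased cases caught
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # share of healthy cases cleared
    ppv = tp / (tp + fp) if (tp + fp) else 0.0          # share of positive calls that are right
    return sensitivity, specificity, ppv

# Illustrative ground truth and model predictions for 8 images
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
sens, spec, ppv = classification_metrics(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} ppv={ppv:.2f}")
# → sensitivity=0.75 specificity=0.75 ppv=0.75
```

AUPRC, the other figure the study reports, summarizes the trade-off between PPV (precision) and sensitivity (recall) across all classification thresholds rather than at a single operating point, which is why it is often preferred for imbalanced medical datasets.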
“From a methodological viewpoint, our results—as is also the case with the results reported in state-of-the-art deep learning studies—might be overly optimistic, because we were not able to test all the models out of sample, as recommended by current guidelines,” the researchers wrote.
Faes and colleagues suggested that researchers and clinicians may be able to start producing in-house models based on data from their own institutions, but warned that regulatory guidelines will be important before such models can be used in clinical practice.
In a related editorial, Tom J. Pollard, PhD, with MIT, Cambridge, Massachusetts, and colleagues acknowledged that the study is “compelling,” but also expressed ethical reservations.
“We cautiously share the authors’ optimism that removing obstacles to algorithmic modelling will lead to improvements in patient care, but the risks of bypassing mathematical, statistical, and programming expertise must be emphasized,” Pollard et al. wrote.
“The use of machine learning methods without in-depth knowledge can result in misleading or outright erroneous results that would cause harm if used to guide the delivery of care,” the editorialists added. “A reliance on simple performance metrics alone does not allow the practitioner to interpret other aspects of model development.”