AI trained with fewer than 1,000 CT scans may solve 'black box' challenge

Using fewer than 1,000 imaging cases, researchers from Massachusetts General Hospital (MGH) in Boston were able to train an artificial intelligence (AI) algorithm to detect intracranial hemorrhage (ICH) and classify its five subtypes on unenhanced head CT scans, according to research published online Dec. 17 in the journal Nature Biomedical Engineering.

The deep learning algorithm was designed to address what is often called AI's 'black box' problem by revealing the reasoning behind its decisions through an "attention map" that highlights the image regions most important to its predictions. It also eliminates the need for radiologists to annotate the large, high-quality data sets used to train most deep learning models.

Overall, the researchers found the model performed with comparable accuracy but higher sensitivity than trained radiologists. 

Because brain hemorrhage is a potentially fatal condition, a sensitive automated model that reliably detects it can expedite treatment for patients. It may also help neuroradiologists of varying levels of expertise determine sooner whether a brain scan shows bleeding and avoid delayed or missed diagnoses of ICH.

“As many facilities do not have subspecialty-trained neuroradiologists, especially at night and on weekends, non-expert healthcare providers are often required to make a decision to diagnose or exclude acute hemorrhage,” Synho Do, PhD, director of the Laboratory of Medical Imaging and Computation at Harvard University in Boston and assistant medical director at MGH, and colleagues wrote. “A reliable second opinion that is trained by neuroradiologists can help make healthcare providers more efficient and confident to enhance patient care, empower patients and cut costs.”  

For the study, the researchers developed a system that incorporated four ImageNet pre-trained deep convolutional neural networks—VGG16, ResNet-50, Inception-v3 and Inception-ResNet-v2—as well as an image pre-processing pipeline, an atlas creation module and a prediction-basis-selection module. 

After realizing that the performance of the deep learning model was associated with the quality of the imaging labels, the researchers aimed to improve the quality rather than the quantity of the training data.

To achieve this, five neuroradiologists individually labeled 904 non-contrast CT scans collected from their institution's PACS. The images were labeled as showing no hemorrhage, intraparenchymal hemorrhage, intraventricular hemorrhage, subdural hemorrhage, epidural hemorrhage or subarachnoid hemorrhage.

Of the total number of cases, only 704 were used to train the model; 100 cases with ICH and 100 cases without ICH were put aside to be used for later validation.
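The hold-out described above can be sketched in a few lines of Python. This is only an illustration of a class-balanced split, not the researchers' code: the function name `balanced_holdout`, the random seed and the made-up labels are all assumptions, since the article reports the 904 total but not how many of those cases had ICH.

```python
import random


def balanced_holdout(cases, n_per_class=100, seed=0):
    """Split (case_id, has_ich) pairs into a training set and a
    held-out set with n_per_class positive and negative cases."""
    rng = random.Random(seed)
    positives = [c for c in cases if c[1]]
    negatives = [c for c in cases if not c[1]]
    if len(positives) < n_per_class or len(negatives) < n_per_class:
        raise ValueError("not enough cases in one class for the holdout")
    rng.shuffle(positives)
    rng.shuffle(negatives)
    holdout = positives[:n_per_class] + negatives[:n_per_class]
    train = positives[n_per_class:] + negatives[n_per_class:]
    return train, holdout


# Hypothetical labels: every other case is marked as ICH purely so the
# example runs; the real class mix is not given in the article.
cases = [(case_id, case_id % 2 == 0) for case_id in range(904)]
train, holdout = balanced_holdout(cases)
print(len(train), len(holdout))  # 704 200
```

Setting the 200 cases aside before any training is what keeps the later validation honest, because the model never sees them while it is being fit.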

The researchers also tested the model on a separate prospective dataset containing 79 cases with ICH and 117 cases without ICH, which were collected from MGH's emergency department over a period of four months.

The results were then compared with those of the five neuroradiologists. The model performed with comparable accuracy but higher sensitivity than the radiologists, and it achieved more than 90 percent sensitivity and specificity on both the retrospective and prospective datasets, according to the researchers.
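For readers less familiar with the two measures, sensitivity is the share of true hemorrhage cases the model flags, and specificity is the share of hemorrhage-free cases it correctly clears. The Python sketch below shows the arithmetic; the function name and the ten toy labels are invented for illustration and are not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels,
    where 1 means ICH present and 0 means ICH absent."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # hemorrhage, flagged
    fn = sum(1 for t, p in pairs if t and not p)      # hemorrhage, missed
    tn = sum(1 for t, p in pairs if not t and not p)  # clear, not flagged
    fp = sum(1 for t, p in pairs if not t and p)      # clear, false alarm
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example (not study data): 4 ICH cases, 6 cases without ICH.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy case the model misses one of four hemorrhages and raises one false alarm among six clear scans, so sensitivity is 3/4 and specificity is 5/6.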

When classifying for ICH subtypes, the algorithm achieved an area under the curve (AUC) on the retrospective dataset that ranged between 0.92 (for epidural hemorrhage) and 0.98 (for intraparenchymal hemorrhage). For the prospective dataset, the model had an AUC of 0.88 (for subdural hemorrhage) to 0.97 (for intraventricular hemorrhage). 
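An AUC for a single subtype can be read as the probability that a randomly chosen scan with that subtype receives a higher model score than a randomly chosen scan without it. The sketch below computes it directly from that definition in plain Python; the function name `auc` and the six toy scores are assumptions for illustration, and the study's own evaluation code is not described in the article.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison: the fraction
    of (positive, negative) pairs in which the positive case scores
    higher, with ties counted as half."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy one-vs-rest example for one subtype (not study data):
# 1 = scan has the subtype, 0 = scan does not.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
subtype_auc = auc(labels, scores)
print(round(subtype_auc, 3))
```

Repeating the calculation once per subtype, each time treating that subtype as positive and everything else as negative, gives a range of values like the one reported above. The pairwise loop is quadratic, which is fine for a few hundred cases.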

The findings suggest that such algorithms could aid the development of deep learning systems for a range of applications in clinical practice.

“Currently, there is widespread belief that the answers to many crucial questions can be found from big data by using deep learning. However, a large portion of healthcare big data is unstructured and equivocal, making it unsuitable for building a deep-learning model,” the researchers concluded. “The balance between data size and quality plus careful data processing tailored to each application is the key to developing high-performance deep-learning algorithms.”