5 terms to know about machine learning

Machine learning (ML) has taken imaging by storm. But while many radiologists are familiar with the concept, assessing the results of various algorithmic approaches can be complex, wrote the authors of a recent American Journal of Roentgenology perspective.

“For those who are unfamiliar with the field of ML, the emerging research can be daunting, with a wide variation in the terms used and the metrics presented,” wrote first author Guy S. Handelman, of Belfast City Hospital in Northern Ireland, U.K., and colleagues. “How can we, as readers, tell if the predictive model being presented is good or is even better than another model presented elsewhere?”

With that in mind, Handelman et al. focused on how to better analyze ML study results. Below are five important areas readers should focus on.

1. Cross-validation

Many algorithms generate performance measures using cross-validation, the authors wrote.

A training dataset contains the subjects used to build the algorithm that will make predictions. In cross-validation, the model is refined by repeatedly assigning different groups of study patients to the training set and the testing set.

“Each iteration will not only improve the performance of the model, because the program can compare between each training set's results to see what performs best and can alter its overall predictive capability, but also improve the generalizability of the results,” they added. “As an algorithm deals with many combinations of patients, the chance of overfitting the predictive algorithm decreases.”
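
For readers who want to see what this looks like in practice, here is a minimal sketch of 5-fold cross-validation in Python using scikit-learn. It is not drawn from the AJR paper; the synthetic dataset and logistic regression model are hypothetical stand-ins for a study's patient cohort and predictive algorithm.

```python
# Illustrative 5-fold cross-validation (hypothetical data and model).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder dataset standing in for a patient cohort with imaging-derived features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)

# Each of the 5 folds serves once as the test set while the remaining folds train the model.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```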

2. ROC curve

The receiver operating characteristic (ROC) curve represents how different levels of sensitivity affect specificity, but determining which operating point to use depends on the task and the system, wrote Handelman et al.

For example, a radiologist might want higher sensitivity when screening for large areas of ischemia on brain CT, knowing the reader will see some false-positives. But if too many false-positives appear, the sensitivity can be adjusted so the reader is shown a more acceptable burden of false-positives, they added.

“Algorithms that perform better will have a higher sensitivity and specificity and thus the area under the plotted line will be greater than those that perform worse,” Handelman and colleagues wrote.
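
As an illustration only, the sketch below shows how an ROC curve and its area under the curve (AUC) are typically computed from a model's predicted probabilities; the dataset, split and logistic regression classifier are assumptions for the example, not part of the paper.

```python
# Computing ROC points and AUC from predicted probabilities (hypothetical example).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predicted probability of the positive class for each test case.
probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Each threshold trades sensitivity (true-positive rate) against 1 - specificity (false-positive rate).
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))
```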

3. Confusion matrix

Readers who are interested in a specific metric can use the confusion matrix to view that data and compare an algorithm with others, according to Handelman and colleagues. From it come metrics such as the true-positive and false-positive rates, specificity, accuracy, positive predictive value, likelihood ratios and the diagnostic odds ratio.

Accuracy, a value commonly reported in studies, is the ratio of correct predictions to all predictions made. Depending on the clinical question, Handelman et al. wrote, it may be more important to look at other metrics, such as the negative likelihood ratio, which can provide more direct information.
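
To make those relationships concrete, here is a small sketch that derives the metrics listed above from a 2x2 confusion matrix. The counts are invented purely for illustration and do not come from the paper.

```python
# Deriving common metrics from a 2x2 confusion matrix (invented counts for illustration).
tp, fp, fn, tn = 80, 10, 5, 105

sensitivity = tp / (tp + fn)               # true-positive rate
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn) # correct predictions over all predictions
ppv = tp / (tp + fp)                       # positive predictive value
lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
dor = lr_pos / lr_neg                      # diagnostic odds ratio

print(f"Sens {sensitivity:.2f}, Spec {specificity:.2f}, Acc {accuracy:.2f}, "
      f"PPV {ppv:.2f}, LR+ {lr_pos:.2f}, LR- {lr_neg:.2f}, DOR {dor:.1f}")
```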

4. Mean squared error and mean absolute error

In ML, the relationship between variables, known as regression, is expressed through an equation that minimizes the distance between a fitted line and the data points. How well the regression fits and how reliably it can make predictions is represented by the mean squared error (MSE).

There are multiple ways to express error, such as mean absolute error and root mean squared error, which are not directly comparable. However, “smaller is better,” the authors wrote, except in the case of the coefficient of determination (R²), where larger values indicate a better fit.
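
As a quick numerical sketch of how those error measures are computed for a regression fit, the snippet below uses a handful of placeholder observed and predicted values; they are not data from the study.

```python
# Computing MSE, MAE and RMSE for a regression fit (placeholder values).
import numpy as np

y_true = np.array([2.0, 3.5, 5.0, 7.2])   # observed values
y_pred = np.array([2.3, 3.0, 5.4, 6.8])   # model predictions

errors = y_true - y_pred
mse = np.mean(errors ** 2)      # mean squared error
mae = np.mean(np.abs(errors))   # mean absolute error
rmse = np.sqrt(mse)             # root mean squared error

print(f"MSE {mse:.3f}, MAE {mae:.3f}, RMSE {rmse:.3f}")
```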

5. Image segmentation evaluation

In image segmentation tasks, traditional metrics such as the ROC curve or MSE aren't useful. Instead, values such as the Dice coefficient or intersection over union, which incorporate the accuracy of image registration, are used.

For example, the authors wrote, an ML algorithm that detects a brain tumor on cross-sectional images and outlines its extent can be compared, by degree of overlap, with a radiologist performing the same task. The overlap measures the accuracy of the model on a scale of 0 to 1; acceptable values vary, but should be greater than 0.5, they wrote.
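
For readers curious how those overlap scores are calculated, here is a toy sketch comparing a model's binary tumor mask with a radiologist's reference mask; the two small arrays are made up for illustration.

```python
# Dice coefficient and intersection over union for binary segmentation masks (toy masks).
import numpy as np

# 0/1 masks: 1 marks pixels labeled as tumor by the model vs. by the radiologist.
pred = np.array([[0, 1, 1],
                 [0, 1, 1],
                 [0, 0, 0]])
ref  = np.array([[0, 0, 1],
                 [0, 1, 1],
                 [0, 1, 0]])

intersection = np.logical_and(pred, ref).sum()
union = np.logical_or(pred, ref).sum()

dice = 2 * intersection / (pred.sum() + ref.sum())  # 0 = no overlap, 1 = perfect overlap
iou = intersection / union
print(f"Dice {dice:.2f}, IoU {iou:.2f}")
```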

Overall, Handelman et al. agreed that readers should consider sample size when judging ML algorithms, but noted that data sharing will be key to verifying algorithms going forward.

“It is important that databases are expanded and maintained with open access to data and the predictive algorithm so that reported performance can be verified and competing algorithms are drawing from the same data pool,” they concluded.

""

Matt joined Chicago’s TriMed team in 2018 covering all areas of health imaging after two years reporting on the hospital field. He holds a bachelor’s in English from UIC, and enjoys a good cup of coffee and an interesting documentary.
