Publicly available imaging datasets may not be as reliable as radiologists think

A new deep dive into the accuracy of two large public medical-imaging datasets revealed that such openly available collections might not be as reliable as they seem.

An Australian researcher found labeling problems, some of which were “significant,” within the ChestXray14 (CXR14) dataset (more than 112,000 images) and the Musculoskeletal Radiographs (MURA) dataset (more than 40,000 images). The findings highlight the need to improve transparency and standardize the process of indicating a particular finding on images, often referred to as “labeling.”

“The disconnect between the dataset development and the usage of that data can lead to a variety of major problems in public datasets,” Luke Oakden-Rayner, with the Australian Institute for Machine Learning, wrote Nov. 6 in Academic Radiology. “The accuracy, meaning and clinical relevance of the labels can be significantly impacted, particularly if the dataset development is not explained in detail and the labels produced are not thoroughly checked.”

What’s more, developing and training useful AI depends heavily on accurate, well-curated datasets, Oakden-Rayner explained. If such systems are tested on data generated from an already flawed or misunderstood set of images, problems may occur “silently,” with results appearing accurate even though the system's actual clinical performance will likely falter.

A board-certified radiologist analyzed a subset of 700 images from both the CXR14 and MURA datasets, paying particular attention to the quality of the images' original labels.

Overall, Oakden-Rayner found the CXR14 labels did not accurately reflect what was seen on the images, with low positive predictive values that may be partly explained by varying radiology reporting styles and inter-observer variability. Labels also had “significant” problems related to “label disambiguation failure.” For example, a majority of cases labeled emphysema actually showed evidence of subcutaneous emphysema, Oakden-Rayner noted, a different finding that could require different treatment.
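For readers unfamiliar with the metric, positive predictive value here is the share of images carrying a given dataset label that an expert reviewer agrees actually show that finding. The short Python sketch below illustrates the arithmetic only. The function name and the eight toy label values are invented for illustration and are not drawn from the study or from either dataset.

```python
def positive_predictive_value(dataset_labels, expert_labels):
    """Share of dataset-positive cases that the expert reviewer also marks positive.

    Both arguments are equal-length sequences of 0/1 flags, one per image.
    Returns None when the dataset labels contain no positives.
    """
    if len(dataset_labels) != len(expert_labels):
        raise ValueError("label lists must have the same length")

    # Images the dataset marks positive, regardless of what the expert saw.
    labeled_positive = sum(1 for d in dataset_labels if d)
    if labeled_positive == 0:
        return None

    # Images where the dataset label and the expert review both say positive.
    confirmed = sum(1 for d, e in zip(dataset_labels, expert_labels) if d and e)
    return confirmed / labeled_positive


# Made-up toy values (NOT study data): 1 = finding present, 0 = absent.
dataset_labels = [1, 1, 1, 1, 0, 0, 1, 0]
expert_labels = [1, 0, 0, 1, 0, 1, 0, 0]

ppv = positive_predictive_value(dataset_labels, expert_labels)
print(f"PPV = {ppv:.2f}")  # prints "PPV = 0.40"
```

In this toy case the dataset flags five images as positive and the reviewer confirms two of them, giving a positive predictive value of 0.40.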

Labels in the MURA dataset were more reliable. However, the original normal/abnormal labels were inaccurate in a subset of images related to degenerative joint disease.

“In both datasets, the errors in the labels appear directly related to the weaknesses of the respective labeling methods,” Oakden-Rayner explained. “One way to partially mitigate the problems that users of the data may face is to produce a smaller second dataset purely for testing models trained on the original data, using a less flawed method, ideally involving expert visual review of cases.”