Stanford researchers release large chest x-ray dataset to train AI models

Researchers from Stanford University in California have published a large, public dataset containing more than 224,000 chest x-rays from more than 65,000 patients to train AI algorithms. The team also announced a competition inviting developers to submit their chest x-ray interpretation models to detect pathologies more accurately than certified radiologists.  

The dataset—called CheXpert—was released by Stanford’s Machine Learning Group. In total, the dataset features a collection of 224,316 chest x-rays and radiology reports from 65,240 patients. The exams were performed at Stanford Hospital’s inpatient and outpatient centers between October 2002 and July 2017.  

Additionally, each chest radiograph was labeled as either positive, negative or uncertain for the presence of 14 observations. These structured labels for the images were created by an automated rule-based labeler (made available for the public to download), which the researchers developed to extract observations from free-text radiology reports.  

“Automated chest radiograph interpretation at the level of practicing radiologists could provide substantial benefit in many medical settings, from improved workflow prioritization and clinical decision support to large-scale screening and global population health initiatives,” the researchers wrote on the CheXpert webpage.  

Led by Stanford graduate and doctoral students Jeremy Irvin and Pranav Rajpurkar, the team realized that for the development and validation of automated algorithms to progress, there was a need for a labeled dataset that was large, had strong reference standards and provided expert human performance metrics for comparison.  

“We believe that AI has the potential to make an immense impact on healthcare; our philosophy is to encourage a worldwide collaboration to solve these challenging problems and maximize the benefits of these technologies on the healthcare system,” Irvin told Health Imaging. “We believe the dataset will help encourage research on medical imaging that will improve interpretation algorithms beyond chest x-rays and move closer to the adoption of these algorithms in hospitals worldwide.” 

The Stanford researchers are co-releasing CheXpert with MIMIC-CXR, an even larger dataset of 371,920 chest x-rays associated with 227,943 imaging studies sourced from the Beth Israel Deaconess Medical Center in Boston between 2011 and 2016. The CheXpert labeler was used with both datasets to create the same kind of structured labels for the images.

For the competition, once developers train their chest x-ray algorithms, they are invited to submit their models to see if they can detect different pathologies as well as radiologists, according to the CheXpert webpage. Participants can now submit their algorithms to be tested with 500 studies from 500 patients not previously seen by the models.  

The researchers noted that eight radiologists individually annotated each of the studies and classified each of the 14 observations as “present,” “uncertain likely” (both indicating positive results), “uncertain unlikely” or “absent” (both indicating negative results).

The majority vote of five radiologist annotations serves as a strong ground truth and the remaining three radiologist annotations were used to benchmark radiologist performance, the researchers noted on the CheXpert webpage. 
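
As a rough illustration of the annotation scheme described above, the following Python sketch maps the four annotation categories to positive or negative and takes a majority vote across five annotations for a single observation. The function names and the string spellings of the categories are assumptions made for this example; it is not the researchers' code.

```python
# Illustrative sketch only: names and category strings are assumptions,
# not taken from the CheXpert code base.
POSITIVE = {"present", "uncertain likely"}
NEGATIVE = {"uncertain unlikely", "absent"}


def binarize(annotation: str) -> int:
    """Map one radiologist annotation to 1 (positive) or 0 (negative)."""
    if annotation in POSITIVE:
        return 1
    if annotation in NEGATIVE:
        return 0
    raise ValueError(f"unknown annotation: {annotation!r}")


def majority_vote(annotations: list[str]) -> int:
    """Return 1 if more than half of the annotations are positive, else 0.

    With an odd number of annotators, such as the five described in the
    article, a tie cannot occur.
    """
    votes = [binarize(a) for a in annotations]
    return int(2 * sum(votes) > len(votes))


# Hypothetical annotations from five radiologists for one observation.
example = ["present", "uncertain likely", "absent", "present", "uncertain unlikely"]
ground_truth = majority_vote(example)
```

In this hypothetical example, three of the five annotations count as positive, so the reference label for the observation is positive.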

In regard to their own chest radiograph interpretation model, the researchers noted it performed better than three radiologists for detecting cardiomegaly, edema, and pleural effusion. The radiologists more accurately detected atelectasis than the model. On consolidation, however, the model performed better than two of the three radiologists.  

Algorithm scores on the test set will be displayed on a leaderboard found on the CheXpert website beginning in February.