AJR: Radpeer has ‘little worth’ in tracking physician performance

Evan Godt | November 30, 2012 | Health Imaging

The Radpeer system, which has become a part of physician performance evaluation in many practices, is unreliable and too subjective for the evaluation of discrepant interpretations, according to a study published in the December issue of the American Journal of Roentgenology.

“We found marked variability in the assessments of a group of radiologists evaluating a sample of known discrepant interpretations using the Radpeer scoring system,” wrote Leila C. Bender, MD, of the University of Washington, Seattle, and colleagues.

In the Radpeer review process, previously interpreted cases are randomly selected for retrospective review, and the reviewer assigns a score on a four-point scale indicating agreement with the original interpretation: 1 indicates agreement, and 2 through 4 indicate one of three types of disagreement. The American College of Radiology (ACR) requires use of Radpeer or a similar system for accreditation, and the American Board of Radiology recognizes it as a component of practice quality improvement for maintenance of certification.

To assess the reliability of Radpeer, a sample of 25 discrepant cases was extracted from the quality assurance database at the University of Washington. After images were made anonymous and other information was removed, 21 subspecialist attending radiologists rated the cases using Radpeer’s scoring system.

“Interrater agreement was slight to fair compared with that expected by chance,” wrote the authors. The kappa values were 0.11 with the standard four-point scoring system and 0.20 with dichotomized scores, which grouped Radpeer ratings of 1 or 2 into one category and ratings of 3 or 4 into another.
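For readers unfamiliar with kappa, the sketch below shows one common way to compute chance-corrected agreement among more than two raters. It assumes Fleiss' kappa, which the article does not confirm as the statistic the authors used, and the ratings in it are invented toy values, not study data. The function names `fleiss_kappa` and `dichotomize` are illustrative only.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of cases, each a list of scores.

    Every case must be scored by the same number of raters.
    Returns NaN when every rating falls in a single category,
    because chance agreement is then 1 and kappa is undefined.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    if any(len(case) != n_raters for case in ratings):
        raise ValueError("every case needs the same number of ratings")

    counts = [Counter(case) for case in ratings]

    # Share of all ratings that fall in each category.
    p_cat = [
        sum(c[cat] for c in counts) / (n_cases * n_raters)
        for cat in categories
    ]

    # Observed agreement: mean proportion of agreeing rater pairs per case.
    p_case = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_obs = sum(p_case) / n_cases

    # Agreement expected by chance, given the category shares.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return float("nan")
    return (p_obs - p_exp) / (1 - p_exp)


def dichotomize(ratings):
    """Collapse scores 1-2 into 0 and scores 3-4 into 1."""
    return [[0 if score <= 2 else 1 for score in case] for case in ratings]


if __name__ == "__main__":
    # Hypothetical toy data: 3 cases, 4 raters each (NOT from the study).
    toy = [[1, 2, 3, 4], [1, 1, 2, 2], [3, 3, 4, 2]]
    print("four-point kappa:", fleiss_kappa(toy, [1, 2, 3, 4]))
    print("dichotomized kappa:", fleiss_kappa(dichotomize(toy), [0, 1]))
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance. Collapsing the scale to two categories removes disagreements between adjacent scores such as 1 versus 2, which is one reason a dichotomized kappa can differ from the four-point one.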

“Our findings imply that the Radpeer scoring system has little worth and underscore the point that QA efforts should not be limited to the collection of Radpeer scores,” wrote Bender and colleagues. “The findings cause us to question the fundamental value of Radpeer as a means of comparing facility and physician performance.”

The authors noted that in 20 cases the raters disagreed about whether a discrepancy had occurred at all.

They suggested that more resources be devoted to developing a more robust and objective assessment procedure, which would improve both the validity of the scores and the relevance of the peer review process.

