Machine learning’s success may depend on addressing 'gray areas' of cancer diagnosis

Much is made of AI’s potential to improve radiologists’ efficiency and boost their detection capabilities, but when it comes to cancer care, a few experts believe the coming tech revolution may run into problems.

That was the sentiment expressed by pathologists from the University of Texas at Austin and Brigham and Women’s Hospital in a Dec. 12 New England Journal of Medicine perspective. Machine learning, they wrote, will certainly help clinicians interpret more images with better accuracy, but it can’t solve the problem at the heart of diagnosing cancer: the lack of a histopathological “gold standard.”

“Diagnoses of early-stage cancer made using machine learning algorithms will undoubtedly be more consistent and more replicable than those based on human interpretation,” Adewole S. Adamson, MD, with the Texas institution’s division of dermatology, and colleagues wrote. “But they won’t necessarily be closer to the truth—that is, algorithms may not be any better than humans at determining which tumors are destined to cause symptoms or death.”

This is because there is no single answer to the question of what constitutes cancer, they argued. Clinically, physicians are interested in cancer as a dynamic process, which begins as a tumor that may spread and cause symptoms if left untreated. Pathologically, however, identifying cancer requires static observation, achieved by examining individual cells, surrounding tissue, biomarkers and the relationship between the three.

Pathologists, moreover, often cannot agree on the underlying histopathological diagnosis. Their interpretations vary in prostate and thyroid cancer, breast lesions and suspected melanoma, for example. So what could happen if an algorithm is trained on the disagreement of pathologists?

For one, it could worsen the existing problem of overdiagnosis, the authors explained. Machine learning algorithms can read digitized pathology slides in seconds, much faster than a human and at a fraction of the price. This, in turn, may encourage clinicians to perform more biopsies.

“Higher throughput—more tissue, more patients—will only increase opportunities for overdiagnosis,” Adamson and colleagues wrote.

What can be done?

The experts suggest that algorithm creators incorporate an external standard based on cancer judgements from a diverse panel of pathologists. This could capitalize on the disagreement and train the platform to discriminate between three categories: total agreement on the presence of cancer, total agreement on its absence and disagreement on whether cancer is present.

“We believe this intermediate category contains important information about lesions that are in the gray zone between ‘cancer’ and ‘not cancer,’” Adamson and co-authors wrote.

Using these three categories could call attention to slides containing uncertainties that may require a closer look from multiple experts, encourage more conservative treatment of lesions whose status is uncertain and spur more research into how to handle intermediate lesions.
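As a rough illustration of the three-category idea, the following minimal Python sketch turns a panel's votes on each slide into one of the three labels and lists the slides that fall into the disagreement category. It is not the authors' method: the names `PanelLabel`, `panel_label` and `needs_review`, the slide identifiers and the five-member panel are all hypothetical, and the sketch assumes each pathologist gives a simple yes-or-no judgement.

```python
from enum import Enum


class PanelLabel(Enum):
    """The three categories described in the article (names are illustrative)."""
    CANCER = "total agreement: cancer"
    NOT_CANCER = "total agreement: not cancer"
    DISAGREEMENT = "disagreement"


def panel_label(votes):
    """Map one slide's panel votes (True = cancer) to a three-way label."""
    if not votes:
        raise ValueError("at least one pathologist vote is required")
    if all(votes):
        return PanelLabel.CANCER
    if not any(votes):
        return PanelLabel.NOT_CANCER
    return PanelLabel.DISAGREEMENT


# Hypothetical votes from a five-member panel on three slides (made-up data).
slides = {
    "slide_a": [True, True, True, True, True],
    "slide_b": [False, False, False, False, False],
    "slide_c": [True, False, True, False, False],
}

# These labels could serve as training targets in place of a single reader's call.
labels = {slide_id: panel_label(votes) for slide_id, votes in slides.items()}

# Slides in the gray zone are the ones that may warrant a closer look.
needs_review = [
    slide_id
    for slide_id, label in labels.items()
    if label is PanelLabel.DISAGREEMENT
]
```

Requiring unanimity for the two "agreement" labels follows the article's wording ("total agreement"); a real labeling scheme might use a different threshold, which is a design choice the source does not specify.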

There will be instances in which these three categories fall short, the authors acknowledged, such as with the Gleason scoring system used to grade prostate cancer. In these cases, researchers must think about the level of detail necessary to classify prostate cancers, and whether the three categories are specific enough.

But as is the case with any new approach, there are benefits and drawbacks. Adamson and co-authors maintain that addressing the gray areas of cancer should be a priority.

“We believe that the possibility of training machine learning algorithms to recognize an intermediate category between ‘cancer’ and ‘not cancer’ should be given serious consideration before this technology is widely adopted,” the experts concluded. “Highlighting the existence of gray areas could present an important opportunity for pathologists to discuss decisions about what constitutes cancer.”