SIIM: Simple search, indexing tools can substitute for NLP
Simple search and indexing tools, although not a replacement for a sophisticated natural language processing (NLP) application, can help faculty identify cases for research and education purposes, according to a presentation at the 2009 Society for Imaging Informatics in Medicine (SIIM) conference this week in Charlotte, N.C.

For the past 114 years, and until medical image interpretation moves to structured reporting with controlled terminologies, the majority of the description of image pixel meaning is captured as free text in diagnostic imaging reports, according to Jonathan Thirman from Yale University in New Haven, Conn., and colleagues. NLP of these radiology reports has been the subject of extensive research. They said that a "common request from radiology faculty, absent the knowledge of the complexities of NLP, however, is for a simple radiology report search tool that can be used to identify cases for research and education purposes."

Thirman and colleagues extracted 3.3 million free-text radiology reports directly from the Sybase database of their GE Healthcare Centricity PACS test system. To de-identify the reports, they removed the metadata headers and footers added by the Cerner RadNet RIS.

After assigning medical concepts to each report, the researchers used the open-source Lucene tool to index the reports for search. An Apache Tomcat server, along with Sun Microsystems' JavaServer Pages, hosted the system.
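The indexing step can be illustrated with a minimal inverted index in Python. This is only a sketch of the general idea behind text indexing, not the researchers' Lucene setup; the function names (`tokenize`, `build_index`, `search_all`) and the sample reports are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(reports):
    """Map each token to the set of report IDs whose text contains it."""
    index = defaultdict(set)
    for report_id, text in reports.items():
        for token in tokenize(text):
            index[token].add(report_id)
    return index


def search_all(index, query):
    """Return the IDs of reports that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Invented sample reports (hypothetical IDs and text, for illustration only).
reports = {
    "A1": "No acute fracture. Mild degenerative changes.",
    "A2": "Acute fracture of the distal radius.",
    "A3": "Lungs are clear. No pleural effusion.",
}
index = build_index(reports)
```

In this sketch, a query for "acute fracture" matches both A1 and A2, even though A1 reports that no fracture was found. Plain keyword matching of this kind does not interpret negation, which is one reason such tools are not a replacement for NLP.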

Using this infrastructure, the investigators said that their system offers three kinds of search: simple text searches of radiology reports; concept searches, which accept either Unified Medical Language System (UMLS) concepts or free text; and an advanced search that allows filtering results by modality, age, sex and procedure name.

Thirman and colleagues said that the system returns results as lists of studies whose reports match the search criteria, ranked by the relevance of each report to the search terms. The results include the modality, the procedure name, the age of the patient, the date of the study, and the accession number. The display of the results allows browsing of the de-identified reports, along with the mapped concepts for each report.
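Filtered, relevance-ranked search of this kind can be sketched in a few lines of Python. The scoring below is a simple TF-IDF weighting chosen for illustration; it is not Lucene's scoring formula, and the field names, the `search` function, and the sample studies are all assumptions rather than details from the presentation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(studies, query, **filters):
    """Return accession numbers ranked by a simple TF-IDF score.

    `filters` are exact-match constraints on metadata fields,
    for example modality="CT" or sex="M".
    """
    terms = tokenize(query)
    counts = {s["accession"]: Counter(tokenize(s["report"])) for s in studies}
    n = len(studies)
    # Document frequency: how many reports contain each query term.
    df = {t: sum(1 for c in counts.values() if c[t] > 0) for t in terms}
    scored = []
    for s in studies:
        # Skip studies that fail any metadata filter.
        if any(s.get(k) != v for k, v in filters.items()):
            continue
        c = counts[s["accession"]]
        score = sum(c[t] * math.log(1 + n / df[t]) for t in terms if df[t])
        if score > 0:
            scored.append((score, s["accession"]))
    # Highest score first; ties broken by accession number.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [accession for _, accession in scored]


# Invented sample studies (hypothetical values, for illustration only).
studies = [
    {"accession": "X100", "modality": "CT", "sex": "F", "age": 54,
     "procedure": "CT CHEST",
     "report": "Pulmonary nodule in the right upper lobe. "
               "Second nodule in the left lower lobe."},
    {"accession": "X101", "modality": "CR", "sex": "M", "age": 61,
     "procedure": "CHEST 2 VIEWS",
     "report": "Possible nodule at the right base."},
    {"accession": "X102", "modality": "CT", "sex": "M", "age": 47,
     "procedure": "CT ABDOMEN",
     "report": "No focal liver lesion."},
]
```

With this sample data, `search(studies, "nodule")` ranks X100 ahead of X101 because the term appears twice in its report, and `search(studies, "nodule", modality="CT")` narrows the list to X100 alone.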

They noted that the mapping process takes between 0.5 and 180 seconds per report, so mapping a large corpus can take more than one month of processing time. Using these tools, the researchers mapped 3.3 million radiology reports to 260,975 concepts in a subset of their thesaurus. The maximum number of concepts for a single report was 39; the mean was 20.
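A back-of-the-envelope calculation shows why the per-report mapping time matters at this scale. The sketch below assumes the reports are mapped one after another in a single process, which the presentation summary does not state; the 30-day month and the function name `sequential_days` are likewise assumptions.

```python
SECONDS_PER_DAY = 86_400
N_REPORTS = 3_300_000  # corpus size reported by the researchers


def sequential_days(seconds_per_report, n_reports=N_REPORTS):
    """Days needed to map the corpus one report at a time."""
    return seconds_per_report * n_reports / SECONDS_PER_DAY


fastest = sequential_days(0.5)   # about 19.1 days at the 0.5-second minimum
slowest = sequential_days(180)   # 6,875 days if every report took 180 seconds

# Average seconds per report at which the whole run takes one 30-day month.
break_even = 30 * SECONDS_PER_DAY / N_REPORTS  # about 0.79 seconds
```

Under these assumptions, even the fastest case takes roughly 19 days, and any average above about 0.79 seconds per report pushes the total past one month.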

Based on their results, Thirman and colleagues concluded that such a system can provide an "interim service until more complete NLP systems are more commonly available. Even after the inevitable transition to fully structured reporting, large collections of radiology reports will still remain for which NLP and even simple searching and indexing will prolong their usefulness."