Blanket policies for protecting patient identity, such as Safe Harbor, leave different organizations vulnerable to re-identification at different rates, which justifies estimating re-identification risk locally before sharing data, according to a study published in the March issue of the Journal of the American Medical Informatics Association.
“To realize the benefits of sharing data while minimizing privacy concerns, many healthcare organizations have turned to ‘de-identification,’ a technique that strips explicit identifying information, such as personal names or Social Security numbers, from disclosed records,” wrote Kathleen Benitez, a health systems analyst programmer, and Bradley Malin, PhD, an assistant professor from the department of biomedical informatics and the school of medicine at Vanderbilt University in Nashville, Tenn.
Presently, healthcare organizations tend to employ at least two policy tiers: public use (removing a number of explicit identifiers) and restricted access research (retaining more detailed features, such as dates and geocodes), according to the authors.
Because investigations into the effectiveness of de-identification policies are limited and because these policies are often applied without knowledge of the risk of “re-identification,” the authors sought to estimate re-identification risk for data-sharing policies of the HIPAA Privacy Rule and evaluate the risk of a specific re-identification attack using voter registration lists.
“Most risk evaluation metrics for individual level data focus on one of the following factors: (1) the number, or proportion, of unique individuals, or (2) … the identifiability of the most vulnerable record in the dataset,” wrote the authors.
Benitez and Malin defined three risk metrics: the expected number of re-identifications, the estimated proportion of a population in a group of size g or less, and the monetary cost per re-identification. For their evaluation, they drew on HIPAA policies for secondary data, voter registration access policies for each state and population descriptors derived from 2000 U.S. Census demographic summary statistics.
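To make the three metrics concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a toy population, hypothetical field names (birth_year, sex, zip3), an attacker who guesses uniformly at random among people sharing the same attribute values, and a hypothetical registry price of $300. Under those assumptions, each group of indistinguishable people contributes exactly one expected re-identification.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def proportion_in_small_groups(sizes, g):
    """Fraction of the population that falls in a group of size g or less."""
    total = sum(sizes.values())
    at_risk = sum(n for n in sizes.values() if n <= g)
    return at_risk / total if total else 0.0

def expected_reidentifications(sizes):
    """Expected correct matches under uniform guessing (an assumption of this sketch).

    Each of the n people in a group is matched with probability 1/n,
    so every non-empty group contributes n * (1/n) = 1.
    """
    return float(len(sizes))

def cost_per_reidentification(registry_price, expected):
    """Price paid for the identified resource divided by expected matches."""
    return registry_price / expected if expected else float("inf")

# Hypothetical toy population; none of these values come from the study.
population = [
    {"birth_year": 1970, "sex": "F", "zip3": "372"},
    {"birth_year": 1970, "sex": "F", "zip3": "372"},
    {"birth_year": 1970, "sex": "M", "zip3": "372"},
    {"birth_year": 1985, "sex": "F", "zip3": "381"},
    {"birth_year": 1985, "sex": "F", "zip3": "381"},
    {"birth_year": 1985, "sex": "F", "zip3": "381"},
]

sizes = group_sizes(population, ["birth_year", "sex", "zip3"])
unique_share = proportion_in_small_groups(sizes, g=1)   # share who are unique
expected = expected_reidentifications(sizes)
cost = cost_per_reidentification(300.0, expected)       # hypothetical $300 registry
print(unique_share, expected, cost)
```

Setting g to 1 gives the share of people who are unique on the chosen attributes. Retaining more detailed attributes splits the population into smaller groups and raises that share.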
For each state, the authors estimated the risk posed to hypothetical datasets protected by the HIPAA Safe Harbor and Limited Dataset policies, first by an attacker with full knowledge of patient identifiers and then by one whose knowledge is limited to voter registries.
The percentage of a state's population estimated to be vulnerable to unique re-identification ranges from 0.01 percent to 0.25 percent when protected by Safe Harbor, and from 10 percent to 60 percent when protected by Limited Datasets, according to the study.
“In the voter attack, vulnerability drops for many states, and for some states is 0 percent, due to the variable availability of voter registries in the real world. We also find that re-identification cost ranges from $0 to $17,000, further confirming risk variability,” the authors said.
“Our analysis provides a basis for comparing different privacy protection analyses both theoretically and with respect to real-world attacks. As such, the approach may be useful to privacy officials defining new policies,” the authors wrote. “We believe that with the methods…and awareness of how different policies interact to affect privacy, a policy maker can make more informed policy decisions tailored to the needs and concerns of particular datasets.”