Personal Genetic Data Exposed To Security Vulnerabilities
January 18, 2013

Private Genetic Data Is Just A Few Clicks Away Say MIT Researchers

redOrbit Staff & Wire Reports - Your Universe Online

Armed with nothing but a laptop and an Internet connection, researchers from the Whitehead Institute for Biomedical Research at MIT say they were able to uncover the identities of almost 50 people who donated DNA to genetic research studies. Their work raises serious questions about the privacy practices used to safeguard the personal data of volunteers in genomic research.

The results of their so-called “vulnerability research” study appeared in yesterday's issue of the journal Science. Experts say that the team's work explores a new, previously unidentified way that interested parties could outfox existing privacy controls that are designed to protect sensitive genetic information.

The researchers found that they were easily able to identify men who had donated DNA to public databases because they could trace their Y chromosomes along with their surnames. They checked last names in genealogy databases and then narrowed their search by matching last names to the donors' ages and states of residence. Using only last names, they were able to find the full names and identities of the participants, even in cases where their genetic information was stored in de-identified form in large databases.

“This is an important result that points out the potential for breaches of privacy in genomics studies,” says Yaniv Erlich, a Whitehead Fellow and lead researcher in the study.

Erlich's team started by analyzing unique genetic markers known as short tandem repeats that are located on the male Y chromosome (Y-STRs). The sample DNA they used was collected by the Center for the Study of Human Polymorphisms (CEPH) and was made publicly available as part of the 1000 Genomes Project.

Men receive their Y chromosome exclusively from their fathers, just as they typically receive their last names, meaning that there is a strong correlation between surnames and the genetic information found on the Y chromosome in research databases.

Fully aware of this relationship between Y chromosomes and surnames, hobby genealogists and genealogy companies have already created publicly accessible databases that store Y-STR data by last names. Using a simple strategy known as “surname inference,” Erlich's team was able to find out the family names of the men by submitting their Y-STRs to these databases.
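To make the idea of surname inference concrete, here is a minimal sketch in Python. It is not the team's actual method or software: the database, the repeat counts, the `infer_surnames` function and the mismatch threshold are all invented for illustration, and real genealogy databases use far more markers and more careful statistics. The sketch simply compares a query Y-STR profile against stored profiles and reports the surnames whose profiles differ at no more than a set number of markers.

```python
from collections import Counter

# Hypothetical toy database: (surname, {Y-STR marker: repeat count}).
# The surnames and repeat counts are made up for this illustration.
TOY_DATABASE = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 12}),
    ("Brown", {"DYS19": 16, "DYS390": 25, "DYS391": 11, "DYS393": 14}),
]


def marker_distance(a, b):
    """Count shared markers whose repeat counts differ; None if no markers are shared."""
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum(1 for marker in shared if a[marker] != b[marker])


def infer_surnames(query, database, max_distance=1):
    """Return surnames with at least one close profile, most frequent first."""
    votes = Counter()
    for surname, profile in database:
        distance = marker_distance(query, profile)
        # Assumption: a profile counts as a match if it differs at
        # no more than max_distance of the shared markers.
        if distance is not None and distance <= max_distance:
            votes[surname] += 1
    return [surname for surname, _ in votes.most_common()]


query_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
print(infer_surnames(query_profile, TOY_DATABASE))
```

With this toy data the query matches only the two "Smith" entries, so a single candidate surname comes back. Allowing a small number of mismatches reflects the fact that Y-STR repeat counts can change slightly between generations.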

For an initial trial run, Erlich used the information from his colleague Craig Venter, a prominent geneticist who helped to map the human genome and also donated to the project himself.

“There were thousands of Venters, so we thought, what happens if we know the age and state of residency of the individual?” Erlich explained to Bloomberg writer Elizabeth Lopatto. The team used these two pieces of information because they are among the few data points that are legally allowed for use under the Health Insurance Portability and Accountability Act of 1996. “So we got two matches, one of which was Craig Venter.”

Erlich pointed out that not every name would necessarily be this easy to identify. But for white, middle-class males, a last name and a state of residence will on average turn up about 12 candidates. “At that point, you can just call all 12 and ask if they participated.”
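The narrowing step Erlich describes, going from a surname to a handful of candidates using age and state, amounts to a simple filter. The sketch below is an illustration only: the `Record` type, the `narrow_candidates` function, the one-year tolerance and every record in it are hypothetical, and the names are deliberately fictional.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    full_name: str
    surname: str
    birth_year: int
    state: str


# Invented records standing in for a public directory; none are real people.
TOY_RECORDS = [
    Record("Alex Example", "Example", 1950, "CA"),
    Record("Blake Example", "Example", 1951, "NY"),
    Record("Casey Example", "Example", 1975, "CA"),
    Record("Drew Sample", "Sample", 1950, "CA"),
]


def narrow_candidates(records, surname, age, state, as_of_year, tolerance=1):
    """Keep records matching the surname and state with a compatible birth year."""
    # An age only pins down the birth year approximately, so allow
    # a small tolerance (an assumption made for this sketch).
    approx_birth_year = as_of_year - age
    return [
        r for r in records
        if r.surname == surname
        and r.state == state
        and abs(r.birth_year - approx_birth_year) <= tolerance
    ]


candidates = narrow_candidates(
    TOY_RECORDS, "Example", age=62, state="CA", as_of_year=2012
)
print([r.full_name for r in candidates])
```

In this toy run, three records share the surname but only one also matches the state and approximate birth year, which mirrors how two extra attributes can shrink thousands of namesakes down to a short list.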

Using names and states, the team trawled through a variety of publicly available online information sources, including obituaries, genealogical websites and demographic data from the National Institute of General Medical Sciences (NIGMS) Human Genetic Cell Repository. They were able to identify almost 50 U.S. men and women who had participated in CEPH projects.

Perhaps even more disturbing, Erlich's team was able to connect and identify individuals who were distantly related to the original DNA donor. “We show that if, for example, your Uncle Dave submitted his DNA to a genetic genealogy database, you could be identified,” explains Melissa Gymrek, a researcher in the Erlich lab and first author of the Science paper, in a press statement. “In fact, even your fourth cousin Patrick, whom you've never met, could identify you if his DNA is in the database, as long as he is paternally related to you.”

Considering that uncovering this kind of information didn't even require illegal hacking, it comes as no surprise that Erlich's findings have both the research community and privacy activists worried about the possibility of privacy violations against unsuspecting research participants.

Princeton geneticist Leonid Kruglyak told Bloomberg that while this privacy loophole isn't exactly catastrophic yet, it could eventually become a much bigger problem.

“This isn't an immediate thing, but what if someone wanted to enable this and designed a piece of software to make it easy?” Kruglyak said in a telephone interview with the news agency. “I don't think this is something that people sitting at their keyboards at home would be able to do.”

However, if this were to happen, says Kruglyak, and someone invented a piece of software that basically did the trawling work for you, then employers or insurance companies could potentially find out whether an individual has a genetic disease, or people who thought they were related might find out that they actually aren't.

Of utmost importance going forward is that people who participate in these studies are fully aware of the risks to their privacy, says Laura Lyman Rodriguez, director of policy, communications, and education at the National Human Genome Research Institute in Bethesda, Maryland.

“This risk that this will affect people is low,” said Rodriguez in an interview with Bloomberg. “That doesn't mean the risk is zero. Informed consent is always important, to talk about the risks.”

Erlich agrees with Rodriguez's assessment. And while he knows that his team's work has stirred up a potential hornet's nest, he stresses they do not want regulators to start cracking down on the sharing of genetic information that is so vital to public research.

“Our aim is to better illuminate the current status of identifiability of genetic data,” he explains. “More knowledge empowers participants to weigh the risks and benefits and make more informed decisions when considering whether to share their own data. We also hope that this study will eventually result in better security algorithms, better policy guidelines, and better legislation to help mitigate some of the risks described.”