Last week, the Wall Street Journal reported on a paper in the journal Science (article free with registration), regarding the ability to identify supposedly anonymous donors to genetic research. Science carried an accompanying perspectives article and news summary.
The upshot: imagine a fictitious Mr. Hogswobble (we'll call him "H" in view of my limited typing skills), who donates a blood sample so his DNA can be sequenced as part of a study of the genetics of a large number of people, with the goal of learning something that can eventually help diagnose or treat human disease. H does this because he wants to support good science and medicine, but he'd rather not have his identity known, on the off chance that it could make it harder for him or his family to get insurance, for example, at some unknown time in the future. So the researchers tell H that they will do everything they can to keep his personal identity anonymous. He will not be identified in any scientific publication. The sample and the data gained from it will be "deidentified"; that is, no personally identifying information (his name, initials, Social Security number, and so on) will be kept in the same place with it. Maybe a linking record exists somewhere, maybe not, but if it does, it is under lock and key and held securely. His sample is given a unique identifier, maybe a number, like "42" (yes, that was the number in The Hitchhiker's Guide to the Galaxy).
But the de-identified specimen and data are made publicly available, in the interest of open access for other scientists to work on them. This kind of sharing is critical to the free operation of good science. To make scientific sense of the data, the public release likely includes certain "metadata," such as H's country or state of residence, how old he was when the sample was obtained, maybe even some level of medical information relevant to the scientific research. But most people looking at the data could tell only that it comes from some guy, not from H personally. Now, to be sure, this information could be used to narrow down the field substantially (there are only so many 55-year-old men in California, for example), but other information, say, how many of those men have hypertension, would NOT be readily available, because of privacy laws governing medical records.
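To make the narrowing-down concrete, here is a minimal sketch in Python. Every name, age, and record in it is invented for illustration; the point is only how quickly a couple of innocuous metadata fields shrink an "anonymous" donor's hiding place.

```python
# Hypothetical illustration (all data invented): a tiny stand-in for the
# public at large, with the kind of information people freely disclose.
population = [
    {"name": "A. Smith",      "age": 55, "state": "CA"},
    {"name": "B. Jones",      "age": 55, "state": "NY"},
    {"name": "C. Lee",        "age": 40, "state": "CA"},
    {"name": "D. Hogswobble", "age": 55, "state": "CA"},
]

# The research record reveals only: donor was 55 and lived in California.
candidates = [p for p in population if p["age"] == 55 and p["state"] == "CA"]

# The "anonymous" donor is now one of just two people.
print(len(candidates))
```

In a real population the candidate pool is of course larger, but each extra metadata field (sex, year of sampling, a diagnosis) multiplies the filters and cuts it down again.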
The problem is that we as a population freely make lots of other information about ourselves public. (No, I'm not including whether we own a gun; I don't want to go there.) That is the seam the researchers behind the Science paper exploited. The "metadata" mattered greatly in their work, but the treasure trove was a public genealogy service. Send us a sample of your DNA, along with some personal information (like your name), and we will make all of it public to help you and similarly interested people find your long-lost relatives, for whatever reason you or they have for being interested.
So there are two public databases: the more limited one, with the DNA data and some metadata, and the broader one, with DNA data and names, including, quite possibly, one or more men named Hogswobble. From the first database, the scientific research one, a list of genetic markers can be obtained (in this case, short tandem repeats on the Y chromosome, but we will call them, collectively, "Steve"), and that list can be compiled, then compared with the genealogy database to see how many Hogswobbles have DNA with "Steve" in them. That gives one a guess of whether any of the donors in the first study, the supposedly anonymous donors, are named Hogswobble. Surf the 'Net for other publicly available information, and these researchers could finger the identities of 50 donors to an actual scientific study intended to cover a total of 1,000 people. So that's 5%: 50 out of 1,000.
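The two-database matching described above can be sketched in a few lines of Python. Everything here is invented for illustration: the marker names, the genealogy records, and the simple subset test are hypothetical stand-ins for the real marker comparison, not the researchers' actual method.

```python
# Hypothetical sketch of the linkage attack (all data invented).
# "Steve" is the set of genetic markers read from one "anonymous"
# research sample in the first (scientific) database.
steve = {"M1", "M7", "M42"}

# The second database: DNA profiles volunteered publicly, with surnames,
# by people hunting for long-lost relatives.
genealogy_db = [
    {"surname": "Hogswobble", "markers": {"M1", "M7", "M42", "M9"}},
    {"surname": "Smith",      "markers": {"M2", "M5"}},
    {"surname": "Jones",      "markers": {"M1", "M7"}},
]

# Step 1: find surnames whose public profile contains all of "Steve"
# (<= is Python's subset test for sets).
surname_hits = [r["surname"] for r in genealogy_db if steve <= r["markers"]]
print(surname_hits)

# Step 2 (not shown): combine the candidate surname with the metadata
# (age, state) and other public information to pin down one individual.
```

Only the Hogswobble record contains every marker in "Steve," so the supposedly anonymous donor acquires a probable surname, and the metadata does the rest.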
A fundamental ethical tenet of human subject research is that measures must be in place to protect the privacy and confidentiality of research subjects. But in the age of big data and research on genomics and other large-scale population biology, assurance of confidentiality can seem like it's founded on quicksand. What to do about it? Take whatever measures reasonably can be taken. In the informed consent process, tell a research subject that it is simply not possible to provide an absolute guarantee of confidentiality. Train researchers on ethical behavior ("do not hack Steve," for example), but realize that in an open-data environment, the sort of steps described here could be done by just about any smart wise guy with Internet access. Limit the amount of metadata available; NIH is doing just that, although some tough judgment calls may be involved. Limit the availability of data? Now things are getting touchy. Better not to over-react, the scientists reasonably counsel.
Laws are in place, such as “GINA,” the Genetic Information Nondiscrimination Act of 2008, to prevent at least some types of discrimination (e.g., health insurance, employment) based on genetic information. These issues are with us to stay. In medical research, protecting our privacy and confidentiality has limits.