Paul Ohm, The Probability of Privacy

Comment by: Michael Froomkin

PLSC 2009

Workshop draft abstract:

Data collectors and aggregators defend themselves against claims that they are invading privacy by invoking a verb of relatively recent vintage—“to anonymize.” By anonymizing the data—by removing or replacing all of the names or other personal identifiers—they argue that they are negating the risk of any privacy harm. Thus, Google anonymizes data in its search query database after nine months; proxy email and web browsing services promise Internet anonymity; and network researchers trade sensitive data only after anonymizing them first.

Recently, two splashy news stories revealed that anonymization is not all it is cracked up to be. First, America Online released twenty million search queries from 650,000 users. Next, Netflix released a database containing 100 million movie ratings from nearly 500,000 users. In both cases, the personal identifiers in the databases were anonymized, and in both cases, researchers were able to “deanonymize” or “reidentify” at least some of the people in the database.

Even before these results, computer scientists had begun to theorize deanonymization. According to this research, none of which has yet been rigorously imported into legal scholarship, the utility and anonymity of data are linked. The only way to anonymize a database perfectly is to strip all of the information from it; any database that is useful is also imperfectly anonymous; and the more useful a database, the easier it is to reidentify the personal information in it.
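The reidentification research referenced above often turns on what is called a "linkage attack": a record stripped of names can still be matched against a public dataset that shares non-identifying attributes ("quasi-identifiers") such as ZIP code, birth year, and sex. The following is a toy sketch (not from the Article; all names and records are invented) of how such a join works:

```python
# Toy linkage attack: re-identifying an "anonymized" table by joining it
# with public auxiliary data on shared quasi-identifier columns.
# All data here is fabricated for illustration.

# "Anonymized" records: names removed, but quasi-identifiers retained.
anonymized = [
    {"zip": "02138", "birth_year": 1954, "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "birth_year": 1970, "sex": "M", "diagnosis": "asthma"},
]

# Public auxiliary data (e.g., a voter roll) that does include names.
voter_roll = [
    {"name": "Alice", "zip": "02138", "birth_year": 1954, "sex": "F"},
    {"name": "Bob",   "zip": "02139", "birth_year": 1970, "sex": "M"},
]

def reidentify(anon_rows, public_rows, keys=("zip", "birth_year", "sex")):
    """Match each anonymized row to named rows sharing its quasi-identifiers."""
    matches = []
    for a in anon_rows:
        for p in public_rows:
            if all(a[k] == p[k] for k in keys):
                matches.append({"name": p["name"], **a})
    return matches

for m in reidentify(anonymized, voter_roll):
    print(m["name"], "->", m["diagnosis"])  # prints Alice -> flu, Bob -> asthma
```

The point of the sketch is the tradeoff stated above: the quasi-identifier columns are precisely what make the database useful to researchers, yet they are also what make the join, and hence reidentification, possible.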

This Article takes a comprehensive look at both claims of anonymization and theories of reidentification, weaving them into law and policy. It compares anonymization standards and practices in online and data privacy with those in health policy, where these issues have been grappled with for decades.

The Article concludes that claims of anonymization should be viewed with great suspicion. Data is never “anonymized,” and it is better to speak of “the probability of privacy” of different practices. Finally, the Article surveys research into how to reduce the risk of reidentification, and it incorporates this research into a set of prescriptions for various data privacy laws.