Robert Sprague, An Ontology of Privacy Law Derived from Probabilistic Topic Modeling Applied to Scholarly Works Using Latent Dirichlet Allocation

PLSC 2013

Workshop draft abstract:

Privacy, being an evolutionary product of social development, has been a human need and desire for millennia. Privacy law scholarship, in contrast, is a relatively recent phenomenon. Of all the privacy-related law review articles published in the history of the United States, for example, over ninety percent were published after 1990, and half of those in the past decade. Within this recent profusion of scholarship lies a fundamental conundrum: there is no clear definition of privacy, nor even consensus on what would constitute an adequate definition. Fundamental categories of privacy have been identified and analyzed (e.g., seclusion, intimacy, surveillance, anonymity, control of information). But most calls for privacy arise from context, as well as from advancing technologies, meaning the legal system often has difficulty identifying and protecting rights to privacy. Without a coherent construction of privacy principles shared by the community of scholars, the legal system can never explicitly articulate those principles.

This paper will report preliminary results from a research project aimed at identifying fundamental privacy law principles derived from the writings of legal scholars themselves using probabilistic topic modeling, a suite of algorithms that discover hidden thematic structures in large archives of documents. Topic modeling algorithms are statistical methods that analyze the words of texts to discover the themes (topics) that run through them, how those themes are connected to each other, and how they change over time. For example, in Warren and Brandeis’s Harvard Law Review article “The Right to Privacy,” the word “property” is identified as the most statistically probable primary topic in the article, which makes sense since Warren and Brandeis were postulating privacy as a form of intangible property right. Latent Dirichlet allocation, which identifies sets of terms that tend to co-occur, is incorporated into the topic modeling analysis to identify the words most closely associated with each identified topic. In “The Right to Privacy,” in addition to identifying “property” as the primary topic, the process also identifies the words “privacy” and “individual” as co-occurring most frequently with the topic “property.” Latent Dirichlet allocation therefore provides insight into the context in which each identified topic occurs.
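To make the description above concrete, the following is a minimal sketch of latent Dirichlet allocation fitted by collapsed Gibbs sampling, written in plain Python. It is an illustration only and not the project's actual implementation: the function names (fit_lda, top_terms), the hyperparameters alpha and beta, the number of topics, and the four-document toy corpus are all assumptions introduced here. Each topic is estimated as a probability distribution over words, and each document as a mixture of topics.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (phi, theta): phi[k] maps each word to its probability under
    topic k; theta[d][k] is the estimated share of topic k in document d.
    alpha and beta are assumed symmetric Dirichlet priors.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                        # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return phi, theta


def top_terms(topic, n=3):
    """Return the n most probable words of one topic distribution."""
    return sorted(topic, key=topic.get, reverse=True)[:n]


# Hypothetical toy corpus (invented for illustration, not project data).
docs = [
    "property right privacy individual property".split(),
    "privacy property individual right".split(),
    "surveillance technology data surveillance".split(),
    "data technology surveillance anonymity".split(),
]
phi, theta = fit_lda(docs)
for k, topic in enumerate(phi):
    print(k, top_terms(topic))
```

On a real corpus the documents would first be tokenized and stripped of stop words, and the number of topics would have to be chosen and validated; those steps are outside this sketch.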

All published law review articles that cite “The Right to Privacy” (some 3,500 articles) are being converted to plain text. “The Right to Privacy” was selected as the focal point of the document corpus because it is the original published scholarly call for a formal legal right to privacy in the United States; hence, the vast majority of privacy law scholarship cites to it. Probabilistic topic modeling using latent Dirichlet allocation is being applied to the document corpus in time slices to reveal the evolution of fundamental privacy law concepts expressed in the legal literature published from 1890 through 2012. Studies in different disciplines have demonstrated the ability of latent Dirichlet allocation to analyze the rich underlying structures of a particular domain, depicting emerging and sustained trends in a given discourse. The ultimate goal of this project is to identify the fundamental conceptual structure of privacy law in the United States as reflected by over a century of legal scholarly work.
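The time-slicing step can be sketched as follows. This is a simplified illustration under stated assumptions: the ten-year slice width, the function names (slice_by_period, term_share), and the sample records are hypothetical, and a plain relative term frequency stands in for the per-slice topic model so that the example stays short.

```python
from collections import Counter, defaultdict


def slice_by_period(records, start=1890, width=10):
    """Group (year, tokens) records into consecutive time slices.

    Each slice is keyed by its first year; start and width are assumed
    values chosen for illustration.
    """
    slices = defaultdict(list)
    for year, tokens in records:
        slice_start = start + ((year - start) // width) * width
        slices[slice_start].append(tokens)
    return dict(sorted(slices.items()))


def term_share(slices, term):
    """Relative frequency of one term within each time slice."""
    shares = {}
    for slice_start, docs in slices.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        shares[slice_start] = counts[term] / total if total else 0.0
    return shares


# Hypothetical records: (publication year, tokenized text).
records = [
    (1890, ["property", "right", "privacy"]),
    (1895, ["property", "individual"]),
    (1965, ["surveillance", "privacy", "data"]),
    (2012, ["data", "privacy", "technology", "data"]),
]
slices = slice_by_period(records)
shares = term_share(slices, "property")
for slice_start, share in shares.items():
    print(slice_start, round(share, 2))
```

In a fuller analysis, a topic model would be fitted to each slice (or a dynamic topic model to the whole sequence) and the resulting topics compared across slices.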

The proposed paper will provide an overview of the topic modeling process using latent Dirichlet allocation to explain and validate the underlying analytical methodology. Preliminary results of applying the statistical modeling to the law scholarship document corpus as of May 2013 will be presented and discussed.