consent in the presence of correlation
Dec 11, 2018
Alex Hayes


This post explores some ideas for a normative ethics of personal data.

To begin, I view ethical actions as those that empower individuals to have agency over their own lives. This line of thinking leads to broadly agreed-upon standards of behavior. To ensure that people maintain control over what is theirs, we must obtain consent before engaging in emotional or physical relationships with them. For this consent to be meaningful, it must be active, ongoing and informed.

I propose that we treat data use as an ongoing relationship between an individual and a data analyst. Just as emotional and physical relationships require consent, information sharing relationships also require consent. To see if a particular behavior is ethical, we can then ask: have we obtained consent to use data in this way? Was this consent active, ongoing and informed?

All too often, the answer is no. Beyond flagrant violations of consent, we must also consider how data projects can evolve in unanticipated ways, and how data itself is correlated and has other rich structure[1].

Personal Data

When we talk about data autonomy, we are talking about the autonomy of individuals, and their personal data. Personal data means measurements or observations where the unit of observation is a human being. These measurements belong to the person they measure, no matter who collected them or how they were collected.

Examples of personal data include:

  • Demographic info such as age, gender, ethnicity, etc
  • Biological info such as genetic and medical records
  • Behavioral info such as daily routines
  • Social info such as who your friends are
  • Financial info such as your wealth and spending habits

When we access and analyze personal data, these actions impact those who have shared their data with us. To be ethical, we must let these individuals control the impacts of our data analyses on their lives.

Data Use Bill of Rights

I propose that you and all people have the right to data autonomy. That is, you and all people have the right:

  • to refuse to share your personal data,
  • to revoke consent to data access and use at any time,
  • to have your data deleted at any time,
  • to be told the truth about how your data will be used,
  • to know who has your data,
  • to choose who you will share your data with,
  • to control the visibility and accessibility of any publicly available data,
  • to place arbitrary conditions (such as anonymity) on use of your data,
  • to view your data and correct any errors in your data,
  • to have corrections and deletions occur in a timely manner, and
  • to have your data stored securely.

These rights represent a healthy starting point, but they are far from complete[2].

For example, data analysis is an ongoing process, and data projects often change in unanticipated ways. Data-sharers have a right to be kept apprised of these changes, because otherwise they lose the capacity to give ongoing and informed consent.

Further, data sharing can have unexpected and far-reaching consequences. When data projects have vague goals and no clear path forward, it may be impossible to meaningfully inform individuals about the potential consequences of sharing their data.

Similarly, we should consider that consent “to do whatever you want with my data” is unlikely to be meaningfully informed, and we should not take such “permission” at face value.


Even more complications arise when we work with lots of personal data all at once. Collections of personal data have emergent properties that aren’t present in observations on a single individual.

Suppose a biotech company partners with a regional hospital. As part of a long-term research study on environmental impacts on health, they sequence the genomes of 5,000 local residents. The locals agree that the hospital and biotech company can use their data for a large, open-ended set of health research projects. This happened in my hometown while I was going to high school.

So where’s the wrinkle?

While both of my parents consented to participation in the study, my sister and I did not. Yet 23andMe, the company that conducted the sequencing, has nearly as much information on my sister and me as they do on my parents[3].

If person \(x\) consents to share their data, and their data is correlated with person \(y\)’s data, person \(y\) essentially has a percentage of their data shared without consent. In some cases where the correlation is strong, such as with genetic information, this percentage is far from trivial.
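To make the genetic case concrete, here is a minimal sketch of how much two parents' genotypes reveal about a child who never consented. It assumes simple Mendelian inheritance at a single site with two alleles, and it ignores mutation and everything else that makes real genomes messier. The function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product


def transmitted_alleles(genotype):
    """Alleles a parent can pass on, each with probability 1/2.

    A genotype is the number of copies (0, 1 or 2) of some variant.
    """
    return {0: (0, 0), 1: (0, 1), 2: (1, 1)}[genotype]


def child_genotype_distribution(mother, father):
    """Exact distribution of a child's genotype given both parents' genotypes."""
    dist = {}
    # Each of the four (maternal allele, paternal allele) pairs is equally likely.
    for m, f in product(transmitted_alleles(mother), transmitted_alleles(father)):
        dist[m + f] = dist.get(m + f, Fraction(0)) + Fraction(1, 4)
    return dist


# Both parents homozygous: the child's genotype is known with certainty.
print(child_genotype_distribution(0, 2))
# Both parents heterozygous: the child is uncertain, but still constrained.
print(child_genotype_distribution(1, 1))
```

Under these assumptions, whenever both parents are homozygous at a site the child's genotype at that site is fully determined, even though the child shared nothing.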

With genetics, study participants know ahead of time that their genetics are correlated with outcomes, and many of these associations are a part of the scientific record. When these associations are known, study participants can give informed consent: they broadly know what additional information they give away when they share their genetic information.

Other times, such associations are not known ahead of time. When analysts have personal data on many people, they can find previously unknown associations, revealing information that study participants did not consent to share. Social network data is a prime example of this. It turns out that if a bunch of Facebook users tell you who their friends are, you can use the resulting graph to predict each user’s politics, wealth, educational status and sexuality with high accuracy[4].
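As a rough illustration of why a friendship graph leaks information, consider guessing an attribute a user never disclosed by taking a majority vote over what their friends disclosed. This is a toy sketch, not the method behind any real analysis: the graph, the names, the labels `"X"` and `"Y"`, and the function `guess_attribute` are all made up for the example.

```python
from collections import Counter

# Hypothetical toy friendship graph: user -> set of friends.
friends = {
    "ana": {"ben", "cy", "dee"},
    "ben": {"ana", "cy"},
    "cy": {"ana", "ben"},
    "dee": {"ana", "eli"},
    "eli": {"dee"},
}

# Attribute values that some users chose to disclose; "ana" and "eli" did not.
disclosed = {"ben": "X", "cy": "X", "dee": "Y"}


def guess_attribute(user, friends, disclosed):
    """Guess an undisclosed attribute by majority vote over a user's friends."""
    votes = Counter(disclosed[f] for f in friends[user] if f in disclosed)
    if not votes:
        return None  # no friend disclosed anything, so there is no guess
    return votes.most_common(1)[0][0]


# Two of ana's three friends disclosed "X", so the guess for ana is "X".
print(guess_attribute("ana", friends, disclosed))
```

The point of the sketch is that ana's own choice not to disclose never enters the computation. Her friends' disclosures are enough to produce a guess about her.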

The Big Picture

Even as a student, I have had an astonishing amount of personal data plopped into my lap. I imagine a large number of data scientists find themselves in similar situations. We should think about what constitutes ethical data use before data is sitting in front of us, especially since most data comes with a myriad of incentives to use it but little oversight.

When we have personal data, it is easy to make decisions for other people, whether intentionally or unintentionally. Additionally, when people choose to share their data with us, we are unlikely to know the full impact of that decision until late in the data analysis. When new analytical techniques are constantly being developed, it isn’t really possible for anyone to be fully informed about how shared data might eventually be used.

When we plan data projects, we need to keep in mind that: (1) the eventual impact of our project is unknown, and may be very different from what we imagine, and (2) the data we use is correlated with personal information that we do not have consent to see or use. We may discover hitherto unknown associations[5].

When we consider correlated data, what rights do we have?

I don’t know. Perhaps there are some areas where we can come up with genuine quantitative insights, but on the whole I suspect this is the wrong approach. For the time being, I believe our best bet is to ask ourselves: is this use of data consensual? Have I been given active, ongoing and informed consent?


  1. I unfortunately have little exposure to the privacy and data ethics literatures, so any naivete, mistakes, and reinvention of the wheel here are entirely my own fault. Pointers to relevant reading are much appreciated.

  2. This bill of rights is a translation of the Relationship Bill of Rights to the realm of personal data.

  3. Earlier this year, 23andMe sold this treasure trove of genetic information to GlaxoSmithKline, a pharmaceutical company, for research purposes. I don’t know how the data is actually being used, but I find it concerning that a private company whose end goal is profit has data that is highly correlated with my personal outcomes.

  4. Dataclysm by Christian Rudder is a fascinating collection of analyses of social network data.

    Social networks highlight yet more considerations. For example, there is a difference between asking for consent to use data, and being able to find data. Even if you can find data, it isn’t always clear if someone originally intended to share it, or if they even are aware they did. Visibility on social media, and more generally the accessibility of data, can have big consequences.

    Then there is data that is genuinely a part of the public record. I strongly believe that the public record is fair game to analyze. But what counts as the public record, and what precisely is it okay to do with that information? I haven’t clarified my thinking on this. Also, I suspect that a large amount of personal information currently in the public domain is there nonconsensually.

  5. I am also curious about the responsibilities of methodologists. I have recently gone from apathetic to somewhat concerned here. Over the last three months I’ve seen two research groups demonstrate new methodologies in ethically dubious ways.

    A number of researchers I’ve talked to are of the opinion that “someone will do the math if we won’t.” I’m uncomfortable with this. I think you at least have some obligation to consider the potential for misuse of your method, and potentially to withhold it if that potential is large.

    On the other hand, the uncertainty about eventual outcomes makes decisions like this a lot harder. Perhaps, at the least, technological demonstrations should suggest beneficial uses rather than detrimental ones.
