Bias in Big Data: Implications for Multi-Sector Data Sharing

Bias in data is everywhere, from the moment we pose a question to be answered to the point when we implement solutions. The Northwestern Institute for Sexual and Gender Minority Health and Wellbeing hosted a half-day workshop on this topic. The sessions focused on defining what bias in big data is, how it emerges in our work, and what its real-world implications are. The lessons learned through practitioners’ work with sexual and gender minority communities are valuable and necessary as we build towards equitable results in our multi-sector data sharing initiatives.

Data bias exists, and in the age of big data, that bias can be amplified and more easily hidden

Bias appears in every phase of research: from the moment we identify a real-world problem, through the methodology we design to understand how and why the problem came to be, to the point when we apply solutions. Through an activity led by Dr. Michelle Birkett, workshop participants named several ways in which we see bias in research. The following are some questions to consider when structuring your own research or data sharing approach:

  • How are people with lived experience integrated into the different phases of your data collection and data sharing?
  • Who gets to say what is a problem worth examining further?
  • Is the language we’re using to develop a methodology accessible? What biases come into play when it comes to deciding who gets funded?
  • When we receive the results, what do we deem worthwhile or notable in the data? What biases might our algorithms contain?
  • Is there a push to check if the solution works? How does this solution integrate people with lived experience, or help people retain their sense of autonomy?

Data bias can reproduce historical inequalities, and it can also be used to challenge them

After participants learned the foundational idea that data is not unbiased, keynote speaker Yeshimabeit Milner, founder of Data for Black Lives, illustrated how data continues to hurt vulnerable populations. For Milner, our history and values influence the “objective functions” we seek to optimize in data models, and ultimately the algorithms we choose. She argued that already-biased data values, combined with these algorithms, produce measures that reproduce inequity, such as risk ratios, credit scores, and car insurance payment calculations. Additionally, some of the data values we use, such as zip codes, can serve as proxies for socioeconomic status or race. Her key argument was, “some narratives were created by data, and can only be disrupted by data.” She highlighted the tenets of Data for Black Lives, including data as accountability and data as collective action.
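One rough way to see what it means for a variable such as zip code to act as a proxy is to ask how well it predicts a protected attribute on its own. The sketch below is a minimal illustration in Python, not a method presented at the workshop; the records, group labels, and function names are all hypothetical and invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy records, invented purely for illustration: (zip_code, group).
records = [
    ("60601", "A"), ("60601", "A"), ("60601", "A"), ("60601", "B"),
    ("60602", "B"), ("60602", "B"), ("60602", "B"), ("60602", "A"),
]


def proxy_accuracy(rows):
    """Share of rows whose group is guessed correctly from zip code alone,
    by always guessing the most common group within that zip code."""
    by_zip = defaultdict(Counter)
    for zip_code, group in rows:
        by_zip[zip_code][group] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_zip.values())
    return correct / len(rows)


def baseline_accuracy(rows):
    """Share guessed correctly when zip code is ignored entirely,
    by always guessing the most common group overall."""
    overall = Counter(group for _, group in rows)
    return overall.most_common(1)[0][1] / len(rows)


proxy = proxy_accuracy(records)
baseline = baseline_accuracy(records)
print(f"guessing from zip code: {proxy:.2f}, guessing without it: {baseline:.2f}")
```

In this toy data, zip code alone recovers the group label for 75% of records, compared with 50% without it. When the gap is large, removing the protected attribute from a model does not remove the signal, because the proxy still carries it.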

For Milner, data can be used to inform equitable interventions. For instance, she touched on the public pressure placed on Facebook to put anonymized data into a public data trust. She also highlighted a case in St. Paul, MN, where community residents halted a data-sharing plan, proposed without community input, that would have merged private data sets to identify youth who were “at-risk” for future involvement in the criminal justice system.

To mitigate data bias, we must adapt our strategies to reflect the communities we serve

The workshop brought the conversation back to how data bias affects sexual and gender minority populations, who face a particular challenge: what do we do when there’s not enough data collected or data collection is poor and inconsistent? Dr. Gregory Phillips II and Dr. Lauren Beach presented some considerations that may be useful across different populations.

For instance, integrating data with an identifier across various data sources could facilitate important research that is not otherwise feasible; or, it could lead to a collapse of context and an invasion of privacy. In other words, data users might lose the nuance that comes with looking at certain variables, such as race and socioeconomic status, alongside health care data. For other data we collect, such as sexual and gender identity, how people identify may vary, so standardizing other variables may allow nuances to emerge in the data without overcomplicating the results.

Other considerations to keep in mind included shaping principles and a framework for data analysis and integration, creating stronger agreements around data privacy and security, and incorporating community engagement in data collection and sharing as a form of accountability.

So, what does this mean for multi-sector work and data sharing?

Data sharing and collaboration across sectors provide an opportunity to mitigate biases: supporting this kind of work can help develop new methods of capturing and analyzing data. However, collaboration alone doesn’t mitigate bias; mitigation is an active process that includes developing a deeper understanding of stigmas in our society and of how health disparities are produced within vulnerable populations, and applying this knowledge to shape new interventions. For more information, check out the #BiasInBigData website for suggested readings and materials.