Identifying bias when sensitive attribute data is unavailable


The perils of automated decision-making systems are becoming increasingly apparent, with racial and gender bias documented in algorithmic hiring decisions, health care provision, and beyond. Decisions made by algorithmic systems may reflect issues with the historical data used to build them, and understanding discriminatory patterns in these systems can be a challenging task [1]. Moreover, to search for and understand possible bias in their algorithmic decision-making, organizations must first know which individuals belong to each race, sex, or other legally protected group. In practice, however, access to such sensitive attribute data may be limited or nonexistent. 

Existing mandates and practices

Businesses may be legally restricted from asking their clients about their race, sex, and other protected characteristics, or they may only be allowed to ask in certain circumstances [2]. Where such data is collected, it may be inconsistent or available only for a portion of the relevant population [3],[4]. While access to this data alone is not sufficient to ensure equitable decision-making, a lack of access limits an organization's ability to assess its tools for the very inequities it may seek to eliminate [5].

Policies and practices governing the collection of this kind of data are myriad and context-dependent. For example, within the credit industry, the Equal Credit Opportunity Act (ECOA) limits lenders’ ability to collect sensitive attribute data, while the Home Mortgage Disclosure Act (HMDA) requires financial institutions to collect data on the race, sex, ethnicity, and other characteristics of mortgage applicants and submit annual reports with this information to the Federal Reserve Board. The varying mandates of the ECOA and HMDA are products of complex legislative histories and debates about whether collecting sensitive attribute data elucidates or amplifies discrimination in lending [5]. 

Effects of a changing policy landscape

Within the technology industry, issues of possible bias in algorithmic systems are a high priority for many companies. Companies like Facebook and LinkedIn are studying and adjusting their ad delivery mechanisms and presentation of search results to prevent bias. Companies offering hiring tools that use artificial intelligence have committed to studying and rooting out bias-related issues in their models. And while mandates regarding sensitive attribute collection in this area have historically been less clear than in industries like consumer lending, new laws and proposed legislation governing use of consumer data, privacy, and bias in decision-making are starting to reshape that landscape. The California Consumer Privacy Act (CCPA), which took effect this year, gives consumers the right to know what personal information companies have stored about them and to request that such data be deleted. The European Union's General Data Protection Regulation (GDPR) includes a similar "right to erasure." Many effects of these new mandates — including how they will impact organizations' efforts to examine possible bias in the long term — remain to be seen.

Even when data on personal characteristics is unavailable, organizations often still seek to understand possible bias and discrimination. To address this problem of missing data, numerous techniques have emerged for inferring individuals’ protected characteristics from available data, like their name or the area they live in. While not perfect, these techniques have been used in real, high-stakes settings to infer protected characteristics and make decisions based on those inferences [5]. In our next post, we’ll explore one technique that has been used by the Consumer Financial Protection Bureau to construct race proxies for mortgage applicants [6]. 
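To make the idea concrete, here is a minimal sketch of the kind of Bayesian combination that underlies surname-and-geography proxy methods such as the CFPB's BISG methodology [6]. It assumes surname and geography are conditionally independent given group membership, so the posterior is proportional to p(group | surname) × p(group | geography) / p(group). All names, tracts, and probabilities below are made-up illustrations, not real census figures, and this simplified sketch is not the CFPB's actual implementation.

```python
# Hedged sketch of a BISG-style race/ethnicity proxy. The demographic
# groups, surnames, tract labels, and all probabilities are hypothetical
# placeholders for illustration only.

GROUPS = ["group_a", "group_b", "group_c"]  # placeholder demographic groups

# p(group | surname): hypothetical surname-based probabilities
P_GROUP_GIVEN_SURNAME = {
    "garcia": [0.10, 0.80, 0.10],
    "smith":  [0.70, 0.10, 0.20],
}

# p(group | geography): hypothetical area-level demographics
P_GROUP_GIVEN_GEO = {
    "tract_1": [0.60, 0.30, 0.10],
    "tract_2": [0.20, 0.50, 0.30],
}

# p(group): hypothetical overall population shares
P_GROUP = [0.50, 0.30, 0.20]

def proxy_probabilities(surname: str, geo: str) -> dict:
    """Combine surname and geography evidence, assuming they are
    conditionally independent given group membership:
        p(group | surname, geo) ∝ p(group | surname) * p(group | geo) / p(group)
    """
    surname_p = P_GROUP_GIVEN_SURNAME[surname.lower()]
    geo_p = P_GROUP_GIVEN_GEO[geo]
    unnormalized = [s * g / base for s, g, base in zip(surname_p, geo_p, P_GROUP)]
    total = sum(unnormalized)
    # Normalize so the posterior probabilities sum to 1
    return {group: u / total for group, u in zip(GROUPS, unnormalized)}

probs = proxy_probabilities("garcia", "tract_2")
```

In a real application, the surname table would come from census surname data and the geography table from census block-group demographics; the resulting probabilities are then used either as soft weights or thresholded into a single predicted group, with all the attendant error the post's caveats imply.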

References:
[1]: Barocas, S., & Selbst, A. D. (2016). Big data’s disparate impact. Calif. L. Rev., 104, 671.
[2]: 12 CFR Part 1002 (Regulation B)
[3]: Agency for Healthcare Research and Quality. (2012). Race, Ethnicity and Language Data: Standardization for Health Care Quality Improvement
[4]: Hasnain‐Wynia, R., & Baker, D. W. (2006). Obtaining data on patient race, ethnicity, and primary language in health care organizations: current challenges and proposed solutions. Health services research, 41(4p1), 1501-1518.
[5]: Bogen, M., Rieke, A., & Ahmed, S. (2020, January). Awareness in practice: tensions in access to sensitive attribute data for antidiscrimination. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 492-500).
[6]: Consumer Financial Protection Bureau. (2014). Using publicly available information to proxy for unidentified race and ethnicity: A methodology and assessment. Washington, DC: CFPB.

Links: 

  • Dastin, J. (2018). Amazon scraps secret AI recruiting tool that showed bias against women. San Francisco, CA: Reuters. 
  • Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453.
  • Equal Credit Opportunity Act
  • Home Mortgage Disclosure Act
  • Scheiber, N. & Isaac, M. (2018). Facebook halts ad targeting cited in bias complaints. New York Times. 
  • Chan, R. (2018). LinkedIn is using AI to make recruiting diverse candidates a no-brainer. Business Insider. 
  • Larsen, L. (2018). HireVue Assessments and Preventing Algorithmic Bias. HireVue. 
  • California Consumer Privacy Act
  • General Data Protection Regulation