Welcome back to Fiddler’s AI Explained Series. In this article and the accompanying talk, we’re following up on the ideas introduced in our last segment on algorithmic fairness and bias. We’ve already talked about why machine learning systems can end up giving preferential or prejudiced treatment to certain groups of people for a variety of reasons, like pre-existing societal biases and unbalanced training data. In this segment, we’ll discuss a paper that we recently published called “Characterizing Intersectional Group Fairness with Worst-Case Comparisons.” This paper was a joint effort between Northeastern University and Fiddler Labs, and was authored by Avijit Ghosh, Lea Genuit, and Mary Reagan. It was accepted as a workshop paper at AAAI 2021.

Humans are multi-faceted: We belong to different subgroups across dimensions like gender, sexual orientation, race, religion, or national origin. Worldwide, many of these groups have faced discrimination and are legally protected. Intersectional fairness takes the concept of fairness and extends it to the ways that people’s identities overlap and intersect, forming subgroups like “heterosexual Jewish male.”

It’s important to measure intersectional fairness in order to get a complete picture of the bias that may exist in our AI systems. Looking at fairness one dimension at a time doesn’t always tell us the whole story. For example, consider the graph below showing decisions made by an algorithm:

This type of problem has been referred to as “fairness gerrymandering.” The dark blue circles represent people the algorithm passed, while the light blue circles denote the people it failed. The same number of women passed as men, and the same number of Black people passed as White people. This one-dimensional assessment hides the bias in the system: all Black women and White men failed, while all Black men and White women passed. We need an intersectional measure of fairness in order to catch this kind of imbalance across subgroups.
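The effect described above can be reproduced with a few lines of code. The following is a minimal sketch on a hypothetical toy dataset (five people per subgroup is an arbitrary choice, and the `pass_rates` helper is illustrative, not part of the paper): pass rates look identical when grouped by gender or by race alone, yet every intersectional subgroup is either all-pass or all-fail.

```python
from collections import defaultdict

# Hypothetical toy data mirroring the example: (gender, race, passed).
# Five people per subgroup is an arbitrary choice for illustration.
records = (
    [("woman", "Black", 0)] * 5
    + [("woman", "White", 1)] * 5
    + [("man", "Black", 1)] * 5
    + [("man", "White", 0)] * 5
)


def pass_rates(records, key):
    """Return the pass rate for each group produced by key(record)."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = key(rec)
        passed[group] += rec[2]
        total[group] += 1
    return {g: passed[g] / total[g] for g in total}


by_gender = pass_rates(records, key=lambda r: r[0])
by_race = pass_rates(records, key=lambda r: r[1])
by_subgroup = pass_rates(records, key=lambda r: (r[0], r[1]))

print(by_gender)    # one-dimensional view: both genders pass at the same rate
print(by_race)      # one-dimensional view: both races pass at the same rate
print(by_subgroup)  # intersectional view: each subgroup is all-pass or all-fail
```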

In our paper, we present a method of expanding fairness metrics to measure intersectional fairness, which we call the **Worst-Case Disparity Framework**. Based on the Rawlsian principle of distributive justice, this framework aims to improve the treatment of the worst-treated subgroup. Its goal is simple: Find the largest difference in fairness between two subgroups (the “worst-case disparity”) and then minimize this difference.

We select a fairness metric f_{metric}, compute the fairness metric for each subgroup in the data, f_{metric}(group_i), then find the min/max ratio. To ensure intersectional fairness, our goal is for this ratio to be close to 1.

\frac{\min \big\{f_{metric}(group_i), \forall i \in N\big\}}{\max \big\{f_{metric}(group_i), \forall i \in N\big\}}
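As a rough illustration of this min/max ratio, here is a minimal sketch in Python. The function name `worst_case_ratio`, the subgroup labels, and the metric values are all hypothetical; returning 1.0 when every subgroup scores 0 is a convention assumed here, not something the paper specifies.

```python
def worst_case_ratio(values):
    """Min/max ratio of per-subgroup metric values (1.0 means no disparity)."""
    values = list(values)
    if not values:
        raise ValueError("need at least one subgroup value")
    lo, hi = min(values), max(values)
    if lo < 0:
        raise ValueError("metric values must be non-negative")
    if hi == 0:
        # Assumed convention: all subgroups score 0, so there is no disparity.
        return 1.0
    return lo / hi


# Hypothetical metric values for three subgroups.
metric_by_subgroup = {"sg_1": 0.50, "sg_2": 0.40, "sg_3": 0.25}
ratio = worst_case_ratio(metric_by_subgroup.values())
print(ratio)
```

Because the function only needs one number per subgroup, the same helper could in principle be reused for any non-negative metric computed per subgroup.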

Let’s look at some of the traditional fairness metrics and see how our framework helps us expand them.

One of the most commonly used fairness metrics is **demographic parity**, which says that each subgroup should receive positive outcomes at equal rates:

P(\hat{Y}\mid A \in sg_i) = P(\hat{Y}\mid A \in sg_j) \forall i,j \in N, i \neq j

We can extend this metric using our Worst-Case Disparity Framework by looking across all the subgroups to find the minimum pass rate and comparing it with the overall maximum pass rate. If this demographic parity ratio is far from 1, that means there is a large disparity between the “worst off” and “best off” subgroup.

DPR = \frac{\min \big\{P(\hat{Y}\mid A \in sg_i), \forall i \in N\big\}}{\max \big\{P(\hat{Y}\mid A \in sg_i), \forall i \in N\big\}}
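A minimal sketch of the demographic parity ratio computed from raw model outputs might look like the following. It assumes binary predictions where 1 is the positive outcome; the function name and the example labels and predictions are hypothetical.

```python
from collections import Counter


def demographic_parity_ratio(subgroups, predictions):
    """Min/max ratio of positive-prediction rates across subgroups."""
    totals = Counter(subgroups)
    positives = Counter(
        sg for sg, y_hat in zip(subgroups, predictions) if y_hat == 1
    )
    rates = {sg: positives[sg] / totals[sg] for sg in totals}
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no subgroup received a positive prediction")
    return min(rates.values()) / highest, rates


# Hypothetical subgroup labels and binary predictions, one entry per person.
subgroups = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
predictions = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

dpr, rates = demographic_parity_ratio(subgroups, predictions)
print(rates)
print(dpr)
```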

**Disparate impact** measures indirect and unintentional discrimination in which certain decisions disproportionately affect members of a protected group. Disparate impact compares the pass rate of one group against another.

How do we make this metric intersectional using our framework? We calculate the pass rate ratio for every possible pair of subgroups. Then we simply take the minimum of these ratios, which represents the worst-case disparate impact:

DI = \min\big\{ \frac{ P(\hat{Y}\mid A \in sg_i)}{P(\hat{Y}\mid A \in sg_j) }; \forall i,j \in N, i \neq j \big\}
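The pairwise formulation can be sketched as follows. The function name and pass rates are hypothetical. Note that when all pass rates are positive, the smallest pairwise ratio is simply the lowest pass rate divided by the highest, so the benefit of enumerating pairs is mainly that it identifies which two subgroups form the worst case.

```python
from itertools import permutations


def worst_case_disparate_impact(pass_rates):
    """Smallest pass-rate ratio over all ordered pairs of distinct subgroups."""
    ratios = {
        (i, j): pass_rates[i] / pass_rates[j]
        for i, j in permutations(pass_rates, 2)
        if pass_rates[j] > 0  # skip ratios that would divide by zero
    }
    if not ratios:
        raise ValueError("no pair of subgroups has a defined ratio")
    worst_pair = min(ratios, key=ratios.get)
    return ratios[worst_pair], worst_pair


# Hypothetical pass rates per subgroup.
pass_rates = {"sg_1": 0.6, "sg_2": 0.45, "sg_3": 0.3}
di, worst_pair = worst_case_disparate_impact(pass_rates)
print(di, worst_pair)
```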

In a similar manner, many other fairness metrics such as group benefit and equal opportunity can be expanded using the Worst-Case Disparity Framework. For more discussion on this, see our paper.

So far, the fairness metrics we have discussed are defined for binary classification models. How can we use the Worst-Case Disparity Framework with other types of models?

Some models have a range of possible classes. We expand our framework to handle these cases by measuring the odds ratio for each subgroup across each possible discrete output. The minimum of these values represents the worst-case disparity, where subgroup membership is most likely to bias toward one classification over another.

M\text{-}EOddR = \min\big\{ \frac{ \min \big\{P(\hat{Y}=y_k\mid A \in sg_i), \forall i \in N\big\}}{\max \big\{P(\hat{Y}=y_k\mid A \in sg_i), \forall i \in N \big\}}; \forall k \in K \big\}
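Read literally, the formula computes a min/max ratio of prediction rates for each class and then takes the minimum over classes. A minimal sketch under that reading is shown below; the function name and the example data are hypothetical.

```python
from collections import Counter, defaultdict


def multiclass_worst_case_ratio(subgroups, predictions):
    """For each class, take the min/max ratio of per-subgroup prediction
    rates; return the smallest such ratio and the per-class ratios."""
    counts = defaultdict(Counter)
    for sg, y_hat in zip(subgroups, predictions):
        counts[sg][y_hat] += 1

    per_class = {}
    for y in sorted(set(predictions)):
        rates = [counts[sg][y] / sum(counts[sg].values()) for sg in counts]
        # max(rates) > 0 because class y was predicted at least once.
        per_class[y] = min(rates) / max(rates)
    return min(per_class.values()), per_class


# Hypothetical three-class predictions for two subgroups.
subgroups = ["A", "A", "A", "A", "B", "B", "B", "B"]
predictions = [0, 0, 1, 2, 0, 1, 1, 2]

worst, per_class = multiclass_worst_case_ratio(subgroups, predictions)
print(per_class)
print(worst)
```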

What if the model’s output is not discrete but continuous? For example, the model might output a probability between 0 and 1. In this graph, red could represent the model’s outputs for Asian men and blue could represent outputs for Hispanic women.

With our framework, we can think of the worst-case disparity as representing the maximum distance between any two subgroups. There are many different approaches you can use to calculate the distance between two distributions. The one we’ve chosen to leverage for our paper is the Kullback-Leibler divergence, also known as relative entropy:

D_{KL}(\pi_1 \| \pi_2) = \int_{-\infty}^{\infty} \pi_1(x) \log\Big( \frac{\pi_1(x)}{\pi_2(x)} \Big) dx

By finding the KL divergence between all possible pairs of subgroups, you can calculate the worst-case disparity by taking the maximum of these values. Ideally, the max distance will be close to 0.

W\text{-}D_{KL} = \max\big\{ \int_{-\infty}^{\infty} \pi_{sg_i}(x) \log\Big( \frac{\pi_{sg_i}(x)}{\pi_{sg_j}(x)} \Big) dx; \forall i,j \in N, i \neq j \big\}
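In practice the densities are not known and have to be estimated from model scores. The sketch below is one possible approximation, not the paper's implementation: it assumes scores lie in [0, 1], bins them into a histogram, and adds a small smoothing constant so that empty bins do not produce a division by zero. The bin count, the smoothing constant, the function names, and the scores are all assumptions made for illustration. Because KL divergence is not symmetric, ordered pairs are enumerated.

```python
import math
from itertools import permutations


def histogram(scores, bins=10, eps=1e-6):
    """Smoothed probability histogram of scores assumed to lie in [0, 1]."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    total = len(scores) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """Discrete KL divergence between two histograms with the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def worst_case_kl(scores_by_subgroup, bins=10):
    """Largest KL divergence over all ordered pairs of distinct subgroups."""
    hists = {sg: histogram(s, bins) for sg, s in scores_by_subgroup.items()}
    divergences = {
        (i, j): kl_divergence(hists[i], hists[j])
        for i, j in permutations(hists, 2)
    }
    worst_pair = max(divergences, key=divergences.get)
    return divergences[worst_pair], worst_pair


# Hypothetical model scores per subgroup.
scores_by_subgroup = {
    "sg_1": [0.12, 0.22, 0.32, 0.42],
    "sg_2": [0.18, 0.28, 0.38, 0.48],
    "sg_3": [0.62, 0.72, 0.82, 0.92],
}
worst_kl, worst_pair = worst_case_kl(scores_by_subgroup)
print(worst_kl, worst_pair)
```

A histogram estimate is sensitive to the number of bins and the smoothing constant, so the resulting value is best compared across models that use the same settings.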

If your model returns a ranked list, there are additional considerations to make sure that the ranking is fair. Let’s look at two different kinds of methods for intersectional fair ranking.

One approach to fair ranking is to look at representation in the ranked list and compare it to overall representation in the population. The **skew** metric at rank K measures how over- or under-represented a subgroup is in the top K of the ranking (for example, the top 10 or 100) compared to the entire population. A skew of 1 is ideal because it shows there is no representational disparity.

For intersectional fairness, we want to ensure that no group is particularly skewed compared to others. So we look at the worst-case min/max skew ratio across all subgroups:

SR@K = \frac{\min\big\{Skew_{sg_i}@K(\tau), \forall i \in N \big\}}{\max\big\{Skew_{sg_i}@K(\tau), \forall i \in N \big\}}
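A minimal sketch of the skew ratio follows. It assumes skew is the plain ratio of a subgroup's share of the top K to its share of the full ranked list (consistent with an ideal value of 1), represents the ranking as a list of subgroup labels in rank order, and assumes K is no larger than the list. The function names and the example ranking are hypothetical.

```python
def skew_at_k(ranking, subgroup, k):
    """Share of `subgroup` in the top k divided by its share of the full ranking."""
    top_share = ranking[:k].count(subgroup) / k
    overall_share = ranking.count(subgroup) / len(ranking)
    return top_share / overall_share


def skew_ratio_at_k(ranking, k):
    """Min/max ratio of skew@k across all subgroups present in the ranking."""
    skews = {sg: skew_at_k(ranking, sg, k) for sg in set(ranking)}
    return min(skews.values()) / max(skews.values()), skews


# Hypothetical ranking: the subgroup label of the item at each rank, best first.
ranking = ["A", "A", "B", "A", "C", "B", "C", "B", "C", "A", "B", "C"]

sr, skews = skew_ratio_at_k(ranking, k=6)
print(skews)
print(sr)
```

In this example each subgroup makes up a third of the list, but A holds half of the top six positions and C only one sixth, so the ratio falls well below 1.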

Representation isn’t the only way to look at fairness for ranking. You can also consider the idea of visual attention. The paper “Quantifying the Impact of User Attention on Fair Group Representation in Ranked Lists” by Sapiezynski et al. shows that when people look at a ranked list, they give the most attention to the items at the top:

Thus, a fair ranking could try to provide equal exposure to all subgroups, where a subgroup’s exposure is measured as the average attention a person in the subgroup receives:

MA_{sg_i} = \frac{1}{|sg_i|} \sum_{k=1}^{|\tau|} Att(k) \quad \text{where } sg_k^\tau = sg_i

The worst case can again be represented as the most disparate attention ratio between any two subgroups:

AR = \frac{\min\big\{MA_{sg_i}, \forall i \in N \big\}}{\max\big\{MA_{sg_i}, \forall i \in N \big\}}
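The mean attention and attention ratio can be sketched as below. The attention function is an assumption: the sketch uses a geometric decay with a hypothetical parameter `p`, but any function of rank that decreases toward the bottom of the list could be substituted. The function names and the example ranking are likewise illustrative.

```python
def geometric_attention(rank, p=0.5):
    """Assumed attention model: attention decays geometrically with rank (1-indexed)."""
    return p * (1 - p) ** (rank - 1)


def mean_attention(ranking, attention=geometric_attention):
    """Average attention received by the items of each subgroup."""
    totals = {}
    sizes = {}
    for rank, sg in enumerate(ranking, start=1):
        totals[sg] = totals.get(sg, 0.0) + attention(rank)
        sizes[sg] = sizes.get(sg, 0) + 1
    return {sg: totals[sg] / sizes[sg] for sg in totals}


def attention_ratio(ranking, attention=geometric_attention):
    """Min/max ratio of mean attention across subgroups."""
    ma = mean_attention(ranking, attention)
    return min(ma.values()) / max(ma.values()), ma


# Hypothetical ranking: the subgroup label of the item at each rank, best first.
ranking = ["A", "B", "A", "B"]

ar, ma = attention_ratio(ranking)
print(ma)
print(ar)
```

Even though A and B alternate in this ranking, A always appears one position higher, so under the assumed decay B receives half the average attention that A does.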

To wrap up our discussion, let’s show how to apply our worst-case framework to a real-life modeling scenario. We’ve taken a pre-trained TensorFlow model from Google and tested it on a law school admissions dataset. Here is the graph of false negative rates across all subgroups:

If we look at each attribute on its own and don’t consider the outcomes intersectionally, we might draw some incorrect conclusions. We could be led to think that men always have a higher false negative rate than women. However, Black women have higher false negative rates than White, Asian, and Hispanic men.

To apply our framework to this problem, we would take the subgroup with the lowest false negative rate and the subgroup with the highest false negative rate and calculate the min/max ratio.

The min/max ratio is 0.002398/0.065327 ≈ 0.037, which is far from the ideal value of 1. If we were trying to optimize the model to be fairer, there are many approaches we might take, but one is to add constraints during training. We would assert that the model needs to meet a certain worst-case disparity constraint while learning.
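The arithmetic above can be checked in a few lines. The two false negative rates come from the text; the `meets_constraint` helper and the threshold of 0.8 are hypothetical and only illustrate what checking a worst-case disparity constraint could look like.

```python
# Lowest and highest subgroup false negative rates quoted in the text.
fnr_min = 0.002398
fnr_max = 0.065327

ratio = fnr_min / fnr_max
print(round(ratio, 3))


def meets_constraint(ratio, threshold):
    """Return True when the worst-case ratio is at or above the threshold."""
    return ratio >= threshold


# Hypothetical threshold chosen purely for illustration.
print(meets_constraint(ratio, threshold=0.8))
```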

We’ve proposed a simple framework to extend conventional fairness metrics. The worst-case disparity metric packs a lot of information into one number and can be helpful for optimizing a model to minimize unfairness across subgroups. We’ve already deployed some of these intersectional fairness metrics into our fairness monitoring system at Fiddler to help teams have more confidence in their AI systems.

Of course, there are limitations to our framework. For example, it doesn’t help measure fairness across continuous features that are difficult or impossible to bucket into subgroups, like age or the gender spectrum. There may also be people who belong to multiple subgroups or whose group membership is only partially measured. We are conducting research on how to incorporate metrics that take these issues into account.