Fiddler Labs was selected to present at Plug and Play’s Fall Expo on October 24th, 2019, as part of its Fintech batch. We were very pleased to see so many attendees in the audience, from investors and corporations to startups innovating in the space.
Plug and Play is the ultimate innovation platform, bringing together the best startups, investors, and the world’s largest corporations. PnP’s fintech arm has partnered with over 60 corporations and allowed us to have access to a large number of partners.
Fiddler’s CEO, Krishna Gade (pictured below), spoke about explainability and compliance for banks deploying AI/ML models today.
At Fiddler, our mission is to enable businesses of all sizes to unlock the AI black box and deliver trustworthy and responsible AI experiences.
We had a great time meeting with people in the industry and spreading the word about explainability!
As artificial intelligence (AI) adoption grows, so do the risks of today’s typical black-box AI. These risks include customer mistrust, brand risk and compliance risk. As recently as last month, concerns about AI-driven facial recognition that was biased against certain demographics resulted in a PR backlash.
With customer protection in mind, regulators are staying ahead of this technology and introducing the first wave of AI regulations meant to address AI transparency. This is a step in the right direction in terms of helping customers trust AI-driven experiences while enabling businesses to reap the benefits of AI adoption.
This first group of regulations relates to the understanding of an AI-driven, automated decision by a customer. This is especially important for key decisions like lending, insurance and health care but is also applicable to personalization, recommendations, etc.
The General Data Protection Regulation (GDPR), specifically Articles 13 and 22, was the first regulation on automated decision-making to state that anyone subject to an automated decision has the right to be informed and the right to a meaningful explanation. According to clause 2(f) of Article 13:
“[Information about] the existence of automated decision-making, including profiling … and … meaningful information about the logic involved [is needed] to ensure fair and transparent processing.”
One of the most frequently asked questions is what the “right to explanation” means in the context of AI. Does “meaningful information about the logic involved” mean that companies have to disclose the actual algorithm or source code? Would explaining the mechanics of the algorithm be really helpful for the individuals? It might make more sense to provide information on what inputs were used and how they influenced the output of the algorithm.
For example, if a loan application or insurance claim is denied using an algorithm or machine learning model, under Articles 13 and 22, the loan or insurance officer would need to provide specific details about the impact of the user’s data on the decision. Or, they could provide general parameters of the algorithm or model used to make that decision.
Similar laws working their way through the U.S. state legislatures of Washington, Illinois and Massachusetts are:
WA House Bill 1655, which establishes guidelines for “the use of automated decision systems in order to protect consumers, improve transparency, and create more market predictability.”
MA Bill H.2701, which establishes a commission on “automated decision-making, artificial intelligence, transparency, fairness, and individual rights.”
IL HB3415, which states that “predictive data analytics in determining creditworthiness or in making hiring decisions…may not include information that correlates with the race or zip code of the applicant.”
Fortunately, advances in AI have kept pace with these needs. Recent research in machine learning (ML) model interpretability makes compliance with these regulations feasible. Cutting-edge techniques like Integrated Gradients from Google Brain along with SHAP and LIME from the University of Washington enable unlocking the AI black box to get meaningful explanations for consumers.
Ensuring fair automated decisions is another related area of upcoming regulations. While there is no consensus in the research community on the right set of fairness metrics, some approaches like equality of opportunity are already required by law in use cases like hiring. Integrating AI explainability in the ML lifecycle can also help provide insights for fair and unbiased automated decisions. Assessing and monitoring these biases, along with data quality and model interpretability approaches, provides a good playbook towards developing fair and ethical AI.
The recent June 26 US House Committee hearing is a sign that financial services need to get ready for upcoming regulations that ensure transparent AI systems. All these regulations will help increase trust in AI models and accelerate their adoption across industries toward the longer-term goal of trustworthy AI.
Two different explanation algorithm types, best in different situations.
Some of the most accurate predictive models today are black box models, meaning it is hard to really understand how they work. To address this problem, techniques have arisen to understand feature importance: for a given prediction, how important is each input feature value to that prediction? Two well-known techniques are SHapley Additive exPlanations (SHAP) and Integrated Gradients (IG). In fact, they each represent a different type of explanation algorithm: a Shapley-value-based algorithm (SHAP) and a gradient-based algorithm (IG).
There is a fundamental difference between these two algorithm types. This post describes that difference. First, we need some background. Below, we review Shapley values, Shapley-value-based methods (including SHAP), and gradient-based methods (including IG). Finally, we get back to our central question: When should you use a Shapley-value-based algorithm (like SHAP) versus a gradient-based explanation algorithm (like IG)?
What are Shapley values?
The Shapley value (proposed by Lloyd Shapley in 1953) is a classic method to distribute the total gains of a collaborative game to a coalition of cooperating players. It is provably the only distribution with certain desirable properties (fully listed on Wikipedia).
In our case, we formulate a game for the prediction at each instance. We consider the “total gains” to be the prediction value for that instance, and the “players” to be the model features of that instance. The collaborative game is all of the model features cooperating to form a prediction value. The Shapley value efficiency property says the feature attributions should sum to the prediction value. The attributions can be negative or positive, since a feature can lower or raise a predicted value.
There is a variant called the Aumann-Shapley value, extending the definition of the Shapley value to a game with many (or infinitely many) players, where each player plays only a minor role, if the worth function (the gains from including a coalition of players) is differentiable.
What is a Shapley-value-based explanation method?
A Shapley-value-based explanation method tries to approximate Shapley values of a given prediction by examining the effect of removing a feature under all possible combinations of presence or absence of the other features. In other words, this method looks at function values over subsets of features like F(x1, <absent>, x3, x4, …, <absent>, …, xn). How to evaluate a function F with one or more absent features is subtle.
For example, SHAP (SHapley Additive exPlanations) estimates the model’s behavior on an input with certain features absent by averaging over samples from those features drawn from the training set. In other words, F(x1, <absent>, x3, …, xn) is estimated by the expected prediction when the missing feature x2 is sampled from the dataset.
Exactly how that sample is chosen is important (for example marginal versus conditional distribution versus cluster centers of background data), but I will skip the fine details here.
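As a rough illustration of the idea above, here is a minimal sketch of estimating F with an absent feature by marginal sampling. The helper name value_with_absent, the toy linear model, and the two-row background set are all assumptions made up for this example; they are not taken from the SHAP library.

```python
import numpy as np

def value_with_absent(model, x, present, background):
    """Estimate F(x) when some features are absent.

    Features where `present` is False are replaced by the values found in
    each background row (marginal sampling); the predictions are averaged.
    """
    # Broadcasting: one filled-in copy of x per background row.
    filled = np.where(present, x, background)
    return float(np.mean(model(filled)))

# Hypothetical toy model: a linear score over three features.
def toy_model(X):
    return X @ np.array([2.0, -1.0, 0.5])

# Hypothetical background data standing in for samples from the training set.
background = np.array([[0.0, 0.0, 0.0],
                       [2.0, 4.0, 2.0]])
x = np.array([1.0, 3.0, 4.0])

# F(x1, <absent>, x3): the second feature is filled in from the background.
present = np.array([True, False, True])
estimate = value_with_absent(toy_model, x, present, background)
print(estimate)  # 2.0
```

Swapping the background set (for example, conditional samples or cluster centers) changes the estimate, which is why the choice of sample matters.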
Once we define the model function (F) for all subsets of the features, we can apply the Shapley values algorithm to compute feature attributions. Each feature’s Shapley value is the contribution of the feature for all possible subsets of the other features.
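To make this concrete, here is a small sketch of the exact Shapley computation over all subsets. The toy game (a model F = x0 + 2*x1 + x0*x2 evaluated at x = (1, 2, 3), with absent features set to a zero baseline) is an assumption chosen for illustration; real methods differ in how they define absence.

```python
from itertools import combinations
from math import factorial

def toy_value(coalition):
    """Hypothetical game: F = x0 + 2*x1 + x0*x2 at x = (1, 2, 3).

    Absent features are set to 0 (an assumed baseline).
    """
    x = [v if i in coalition else 0.0 for i, v in enumerate((1.0, 2.0, 3.0))]
    return x[0] + 2 * x[1] + x[0] * x[2]

def shapley_values(value, n):
    """Exact Shapley values for a set function `value` over n players."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size that excludes player i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of player i to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

phi = shapley_values(toy_value, n=3)
print(phi)  # approximately [2.5, 4.0, 1.5]
```

The attributions sum to 8, the difference between the value with all features present and the value with none, which is the efficiency property in action. The interaction term x0*x2 is split evenly between the two features involved.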
The “kernel SHAP” method from the SHAP paper computes the Shapley values of all features simultaneously by defining a weighted least squares regression whose solution is the Shapley values for all the features.
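The regression view can be sketched on a tiny game where every coalition can be enumerated. This is an illustration only, not the SHAP library's implementation: the function name, the toy game, and the use of a large finite weight to stand in for the infinite weights on the empty and full coalitions are all assumptions of this sketch.

```python
from itertools import product
from math import comb

import numpy as np

def toy_value(coalition):
    """Hypothetical game: F = x0 + 2*x1 + x0*x2 at x = (1, 2, 3); absent means 0."""
    x = [v if i in coalition else 0.0 for i, v in enumerate((1.0, 2.0, 3.0))]
    return x[0] + 2 * x[1] + x[0] * x[2]

def kernel_shap_all_coalitions(value, n, big_weight=1e6):
    """Weighted least squares over every coalition, using the Shapley kernel."""
    rows, targets, weights = [], [], []
    for z in product([0, 1], repeat=n):
        k = sum(z)
        if k in (0, n):
            # The kernel is infinite here; a large finite weight approximates it.
            weight = big_weight
        else:
            weight = (n - 1) / (comb(n, k) * k * (n - k))
        rows.append([1.0, *z])  # intercept column plus presence indicators
        targets.append(value({i for i in range(n) if z[i]}))
        weights.append(weight)
    sqrt_w = np.sqrt(np.array(weights))
    X = np.array(rows) * sqrt_w[:, None]
    y = np.array(targets) * sqrt_w
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1:]  # base value, per-feature attributions

base_value, phi = kernel_shap_all_coalitions(toy_value, n=3)
print(np.round(phi, 3))  # close to [2.5, 4.0, 1.5]
```

On this toy game the regression solution matches the exact Shapley values up to the error introduced by the finite weight. In practice the coalitions are sampled rather than enumerated, which is where the approximation comes in.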
The high-level point is that all these methods rely on taking subsets of features. This makes the theoretical version exponential in runtime: for N features, there are 2^N combinations of presence and absence. That is too expensive for most N, so these methods approximate. Even with approximations, kernel SHAP can be slow. Also, we don’t know of any systematic study of how good the approximation is.
There are versions of SHAP specialized to different model architectures for speed. For example, Tree SHAP computes all the subsets by cleverly keeping track of what proportion of all possible subsets flow down into each of the leaves of the tree. However, if your model architecture does not have a specialized algorithm like this, you have to fall back on kernel SHAP, or another naive (unoptimized) Shapley-value-based method.
A Shapley-value-based method is attractive as it only requires black box access to the model (i.e. computing outputs from inputs), and there is a version agnostic to the model architecture. For instance, it does not matter whether the model function is discrete or continuous. The downside is that exactly computing the subsets is exponential in the number of features.
What is a gradient-based explanation method?
A gradient-based explanation method tries to explain a given prediction by using the gradient of (i.e. change in) the output with respect to the input features. Some methods like Integrated Gradients (IG), GradCAM, and SmoothGrad literally apply the gradient operator. Other methods like DeepLift and LRP apply “discrete gradients.”
Let me describe IG, which has the advantage that it tries to approximate Aumann-Shapley values, which are axiomatically justified. IG operates by considering a straight line path, in feature space, from the input at hand (e.g., an image from a training set) to a certain baseline input (e.g., a black image), and integrating the gradient of the prediction with respect to input features (e.g., image pixels) along this path.
This paper explains the intuition of the IG algorithm as follows. As the input varies along the straight line path between the baseline and the input at hand, the prediction moves along a trajectory from uncertainty to certainty (the final prediction probability). At each point on this trajectory, one can use the gradient with respect to the input features to attribute the change in the prediction probability back to the input features. IG aggregates these gradients along the trajectory using a path integral.
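The path integral described above can be approximated with a Riemann sum. The sketch below is a minimal illustration, assuming a hypothetical three-feature logistic model whose gradient is written out by hand; in a real framework the gradient would come from automatic differentiation.

```python
import numpy as np

# Hypothetical differentiable model: logistic regression with fixed weights.
w = np.array([1.5, -2.0, 0.5])

def predict(x):
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def grad(x):
    # Analytic gradient of the sigmoid output with respect to the inputs.
    p = predict(x)
    return p * (1.0 - p) * w

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG with a midpoint Riemann sum along the straight-line path."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)
    return (x - baseline) * avg_grad

x = np.array([1.0, 0.5, 2.0])
baseline = np.zeros(3)  # assumed "absence of signal" input
attributions = integrated_gradients(grad, x, baseline)

# The attributions should sum to (approximately) the change in prediction.
print(attributions.sum(), predict(x) - predict(baseline))
```

The second feature receives a negative attribution here because its weight is negative, so it pulls the prediction down relative to the baseline.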
IG (roughly) requires the prediction to be a continuous and piecewise differentiable function of the input features. (More precisely, it requires that the function be continuous everywhere and that the partial derivative along each input dimension satisfy Lebesgue’s integrability condition, i.e., the set of discontinuous points has measure zero.)
Note it is important to choose a good baseline for IG to make sensible feature attributions. For example, if a black image is chosen as baseline, IG won’t attribute importance to a completely black pixel in an actual image. The baseline value should both have a near-zero prediction, and also faithfully represent a complete absence of signal.
IG is attractive as it is broadly applicable to all differentiable models, easy to implement in most machine learning frameworks (e.g., TensorFlow, PyTorch, Caffe), and computationally scalable to massive deep networks like Inception and ResNet with millions of neurons.
When should you use a Shapley-value-based versus a gradient-based explanation method?
Finally, the payoff! Our advice: If the model function is piecewise differentiable and you have access to the model gradient, use IG. Otherwise, use a Shapley-value-based method.
Any model trained using gradient descent is differentiable. For example: neural networks, logistic regression, support vector machines. You can use IG with these. The major class of non-differentiable models is trees: boosted trees, random forests. They encode discrete values at the leaves. These require a Shapley-value-based method, like Tree SHAP.
The IG algorithm is faster than a naive Shapley-value-based method like kernel SHAP, as it only requires computing the gradients of the model output on a few different inputs (typically 50). In contrast, a Shapley-value-based method requires computing the model output on a large number of inputs sampled from the exponentially huge subspace of all possible combinations of feature values. Computing gradients of differentiable models is efficient and well supported in most machine learning frameworks. However, a differentiable model is a prerequisite for IG. By contrast, a Shapley-value-based method makes no such assumptions.
Several types of input features that look discrete (hence might require a Shapley-value-based method) actually can be mapped to differentiable model types (which let us use IG). Let us walk through one example: text sentiment. Suppose we wish to attribute the sentiment prediction to the words in some input text. At first, it seems that such models may be non-differentiable as the input is discrete (a collection of words). However, differentiable models like deep neural networks can handle words by first mapping them to a high-dimensional continuous space using word embeddings. The model’s prediction is a differentiable function of these embeddings. This makes it amenable to IG. Specifically, we attribute the prediction score to the embedding vectors. Since attributions are additive, we sum the attributions (retaining the sign) along the fields of each embedding vector and map it to the specific input word that the embedding corresponds to.
A crucial question for IG is: what is the baseline prediction? For this text example, one option is to use the embedding vector corresponding to empty text. Some models take fixed length inputs by padding short sentences with a special “no word” token. In such cases, we can take the baseline as the embedding of a sentence with just “no word” tokens.
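The text example can be sketched end to end on a toy model. Everything here is hypothetical: the vocabulary, the randomly initialized embedding table, the mean-pooled logistic "sentiment" model, and the choice of an all-zero padding embedding as the baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and embedding table; <pad> is the "no word" token.
vocab = {"<pad>": 0, "not": 1, "a": 2, "good": 3, "movie": 4}
emb_table = rng.normal(size=(len(vocab), 4))
emb_table[vocab["<pad>"]] = 0.0
w = rng.normal(size=4)

def predict(E):
    """Toy sentiment score: sigmoid of the mean embedding projected on w."""
    return 1.0 / (1.0 + np.exp(-(E.mean(axis=0) @ w)))

def grad(E):
    # Gradient of the score with respect to every embedding entry.
    p = predict(E)
    return np.tile(p * (1.0 - p) * w / E.shape[0], (E.shape[0], 1))

tokens = ["not", "a", "good", "movie"]
E = emb_table[[vocab[t] for t in tokens]]
baseline = emb_table[[vocab["<pad>"]] * len(tokens)]  # sentence of only <pad>

# Integrated gradients over the embedding entries (midpoint Riemann sum).
alphas = (np.arange(50) + 0.5) / 50
avg_grad = np.mean([grad(baseline + a * (E - baseline)) for a in alphas], axis=0)
emb_attr = (E - baseline) * avg_grad  # shape: (tokens, embedding dims)

# Sum along each embedding vector, keeping the sign, to get one score per word.
word_attr = dict(zip(tokens, emb_attr.sum(axis=1)))
print(word_attr)
```

Because attributions are additive, the per-word scores still sum to the difference between the prediction on the sentence and the prediction on the baseline.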
In many cases (a differentiable model with a gradient), you can use integrated gradients (IG) to get a more certain and possibly faster explanation of feature importance for a prediction. However, a Shapley-value-based method is required for other (non-differentiable) model types.
At Fiddler, we support both SHAP and IG. (Full disclosure: Ankur Taly, a co-author of IG, works at Fiddler, and is a co-author of this post.) Feel free to email email@example.com for more information, or just to say hi!
You can’t always change a human’s input to see the output.
At Fiddler Labs, we place great emphasis on model explanations being faithful to the model’s behavior. Ideally, feature importance explanations should surface and appropriately quantify all and only those factors that are causally responsible for the prediction. This is especially important if we want explanations to be legally compliant (e.g., GDPR, article 13 section 2f, people have a right to ‘[information about] the existence of automated decision-making, including profiling .. and .. meaningful information about the logic involved’), and actionable. Even when making post-processing explanations human-intelligible, we must preserve faithfulness to the model.
How do we differentiate between features that are correlated with the outcome, and those that cause the outcome? In other words, how do we think about the causality of a feature to a model output, or to a real-world task? Let’s take those one at a time.
Explaining causality in models is hard
When explaining a model prediction, we’d like to quantify the contribution of each (causal) feature to the prediction.
For example, in a credit risk model, we might like to know how important income or zip code is to the prediction.
Note that zip code may be causal to a model’s prediction (i.e. changing zip code may change the model prediction) even though it may not be causal to the underlying task (i.e. changing zip code may not change the decision of whether to grant a loan). However, these two things may be related if this model’s output is used in the real-world decision process.
The good news is that since we have input-output access to the model, we can probe it with arbitrary inputs. This allows examining counterfactuals, inputs that are different from those of the prediction being explained. These counterfactuals might be elsewhere in the dataset, or they might not.
Shapley values (a classic result from game theory) offer an elegant, axiomatic approach to quantify feature contributions.
One challenge is they rely on probing with an exponentially large set of counterfactuals, too large to compute. Hence, there are several papers on approximating Shapley values, especially for specific classes of model functions.
However, a more fundamental challenge is that when features are correlated, not all counterfactuals may be realistic. There is no clear consensus on how to address this issue, and existing approaches differ on the exact set of counterfactuals to be considered.
To overcome these challenges, it is tempting to rely on observational data. For instance, using the observed data to define the counterfactuals for applying Shapley values. Or more simply, fitting an interpretable model on it to mimic the main model’s prediction and then explaining the interpretable model in lieu of the main model. But, this can be dangerous.
Consider a credit risk model with features including the applicant’s income and zip code. Say the model internally only relies on the zip code (i.e., it redlines applicants). Explanations based on observational data might reveal that the applicant’s income, by virtue of being correlated to zip code, is as predictive of the model’s output. This may mislead us to explain the model’s output in terms of the applicant’s income. In fact, a naive explanation algorithm will split attributions equally between two perfectly correlated features.
To learn more, we can intervene in features. One counterfactual changing zip code but not income will reveal that zip code causes the model’s prediction to change. A second counterfactual that changes income but not zip code will reveal that income does not. These two together will allow us to conclude that zip code is causal to the model’s prediction, and income is not.
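A small simulation makes the contrast between observation and intervention visible. The data and the redlining model below are invented for illustration: income is generated to be strongly correlated with zip code, and the model looks only at zip code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: 1 marks a redlined zip code; income is correlated with it.
zip_risky = rng.integers(0, 2, size=n)
income = 80.0 - 40.0 * zip_risky + rng.normal(0.0, 5.0, size=n)

def model(zip_risky, income):
    """Hypothetical credit model that relies only on zip code."""
    return np.where(zip_risky == 1, 0.1, 0.9)

scores = model(zip_risky, income)

# Observational view: income looks highly predictive of the model's output.
obs_corr = np.corrcoef(income, scores)[0, 1]

# Interventions: change one feature while holding the other fixed.
effect_zip = np.mean(np.abs(model(1 - zip_risky, income) - scores))
effect_income = np.mean(np.abs(model(zip_risky, income + 20.0) - scores))

print(obs_corr, effect_zip, effect_income)
```

Income correlates strongly with the score, yet changing income alone never moves the prediction, while flipping zip code always does.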
Explaining causality requires the right counterfactuals.
Explaining causality in the real world is harder
Above we outlined a method to try to explain causality in models: study what happens when features change. To do so in the real world, you have to be able to apply interventions. This is commonly called a “randomized controlled trial” (also known as “A/B testing” when there are two variants, especially in the tech industry). You divide a population into two or more groups randomly, and apply different interventions to each group. The randomization ensures that the only differences among the groups are your intervention. Therefore, you can conclude that your intervention causes the measurable differences in the groups.
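The logic of randomization can be illustrated with a simulated A/B test. All numbers here are assumptions for the sketch: a true effect of 2.0, a hidden trait that also influences the outcome, and a coin-flip assignment.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
true_effect = 2.0  # assumed effect of the intervention

# A hidden trait that influences the outcome but not the assignment.
hidden_trait = rng.normal(size=n)

# Random assignment: each person has a 50% chance of getting the intervention.
treated = rng.random(n) < 0.5

outcome = 5.0 + hidden_trait + true_effect * treated + rng.normal(size=n)

# Because assignment is random, the hidden trait is balanced across groups,
# so the difference in group means estimates the causal effect.
estimated_effect = outcome[treated].mean() - outcome[~treated].mean()
print(estimated_effect)
```

If assignment instead depended on the hidden trait, the same difference in means would mix the effect of the intervention with the effect of the trait.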
The challenge in applying this method to real-world tasks is that not all interventions are feasible. You can’t ethically ask someone to take up smoking. In the real world, you may not be able to get the data you need to properly examine causality.
We can probe models as we wish, but not people.
Natural experiments can provide us an opportunity to examine situations where we would not normally intervene, like in epidemiology and economics. However, these provide us a limited toolkit, leaving many questions in these fields up for debate.
There are proposals for other theories that allow us to use domain knowledge to separate correlation from causation. These are subject to ongoing debate and research.
Now you know why explaining causality in models is hard, and explaining it in the real world is even harder.
To learn more about explaining models, email us at firstname.lastname@example.org. (Photo credit: pixabay.) This post was co-written with Ankur Taly.
First, as Yogi Berra said, “It’s tough to make predictions, especially about the future.” Where is my flying car?
Second, the title reads like clickbait, but surprisingly it appears to be pretty close to the actual survey, which asked AI researchers when “high-level machine intelligence” will arrive, defined as “when unaided machines can accomplish every task better and more cheaply than human workers.” What is a ‘task’ in this definition? Does “every task” even make sense? Can we enumerate all tasks?
Third and most important, is high-level intelligence just accomplishing tasks? This is the real difference between artificial and human intelligence: humans define goals, AI tries to achieve them. Is the hammer going to displace the carpenter? They each have a purpose.
This difference between artificial and human intelligence is crucial to understand, both to interpret all the crazy headlines in the popular press, and more importantly, to make practical, informed judgements about the technology.
The rest of this post walks through some types of artificial intelligence, types of human intelligence, and given how different they are, plausible and implausible risks of artificial intelligence. The short story: unlike humans, every AI technology has a perfectly mathematically well-defined goal, often a labeled dataset.
Types of artificial intelligence
In supervised learning, you define a prediction goal and gather a training set with labels corresponding to the goal. Suppose you want to identify whether a picture has Denzel Washington in it. Then your training set is a set of pictures, each labeled as containing Denzel Washington or not. The label has to be applied outside of the system, most likely by people. If your goal is to do facial recognition, your labeled dataset is pictures along with a label (the person in the picture). Again, you have to gather the labels somehow, likely with people. If your goal is to match a face with another face, you need a label of whether the match was successful or not. Always labels.
Almost all the machine learning you read about is supervised learning. Deep learning, neural networks, decision trees, random forests, logistic regression, all training on labeled datasets.
In unsupervised learning, again you define a goal. A very common unsupervised learning technique is clustering (e.g., the well-known k-means clustering). Again, the goal is very well-defined: find clusters minimizing some mathematical cost function. For example, where the distance between points within the same cluster is small, and the distance between points not within the same cluster is large. All of these goals are so well-defined they have mathematical formalism:
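For k-means, for example, the standard objective is the within-cluster sum of squares, shown here as an illustration of that formalism:

```latex
\[
\underset{S_1, \dots, S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
\]
```

Here S_1 through S_k are the k clusters, x is a data point, and mu_i is the mean of the points assigned to cluster S_i.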
This formula feels very different from how humans specify goals. Most humans don’t understand these symbols at all. They are not formal. Also, a “goal-oriented” mindset in a human is unusual enough that it has a special term.
In reinforcement learning, you define a reward function to reward (or penalize) actions that move towards a goal. This is the technology people have been using recently for games like chess and Go, where it may take many actions to reach a particular goal (like checkmate), so you need a reward function that gives hints along the way. Again, not only a well-defined goal, but even a well-defined on-the-way-to-goal reward function.
These are types of artificial intelligence (“machine learning”) that are currently hot because of recent huge gains in accuracy, but there are plenty of others that people have studied.
Genetic algorithms are another way of solving problems inspired by biology. One takes a population of mathematical constructs (essentially functions), and selects those that perform best on a problem. Although people get emotional about the biological analogy, still the fitness function that defines “best” is a concrete, completely specified mathematical function chosen by a human.
There is computer-generated art. For example, deep dream (gallery) is a way to generate images from deep learning neural networks. This would seem to be more human and less goal-oriented, but in fact people are still directing. The authors described the goal at a high level: “Whatever you see there, I want more of it!” Depending on which layer of the network is asked, the features amplified might be low level (like lines, shapes, edges, colors, see the addaxes below) or higher level (like objects).
Expert systems are a way to make decisions using if-then rules on a formally expressed body of knowledge. They were somewhat popular in the 1980s. These are a type of “Good Old Fashioned Artificial Intelligence” (GOFAI), a term for AI based on manipulating symbols.
Another common difference between human and artificial intelligence is that humans learn over a long time, while AI is often retrained from the beginning for each problem. This difference, however, is being narrowed. Transfer learning is the process of training a model, and then using or tweaking the model for use in a different context. This is industry practice in computer vision, where deep learning neural networks are often trained using features from previously trained networks.
One interesting research project in long-term learning is NELL, Never-Ending Language Learning. NELL crawls the web collecting text, and trying to extract beliefs (facts) like “airtran is a transportation system”, along with a confidence. It’s been crawling since 2010, and as of July 2017 has accumulated over 117 million candidate beliefs, of which 3.8 million are high-confidence (at least 0.9 of 1.0).
In every case above, humans not only specify a goal, but have to specify it unambiguously, often even formally with mathematics.
Types of human intelligence
What are the types of human intelligence? It’s hard to even come up with a list. Psychologists have been studying this for decades. Philosophers have been wrestling with it for millennia.
IQ (the Intelligence Quotient) is measured with verbal and visual tests, sometimes abstract. It is predicated on the idea that there is a general intelligence (sometimes called the “g factor”) common to all cognitive ability. This idea is not accepted by everyone, and IQ itself is hotly debated. For example, some believe that people with the same latent ability but from different demographic groups might be measured differently, called Differential Item Functioning, or simply measurement bias.
People describe fluid intelligence (the ability to solve novel problems) and crystallized intelligence (the ability to use knowledge and experience).
The concept of emotional intelligence shows up in the popular press: the ability of a person to recognize their own emotions and those of others, and use emotional thinking to guide behavior. It is unclear how accepted this is by the academic community.
More widely accepted are the Big Five personality traits: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. This is not intelligence (or is it?), but illustrates a strong difference with computer intelligence. “Personality” is a set of stable traits or behavior patterns that predict a person’s behavior. What is the personality of an artificial intelligence? The notion doesn’t seem to apply.
With humor, art, or the search for meaning, we get farther and farther from well-defined problems, yet closer and closer to humanity.
Risks of artificial intelligence
Can artificial intelligence surpass human intelligence?
One risk that captures the popular press is The Singularity. The writer and mathematician Vernor Vinge gives a compelling description in an essay from 1993: “Within thirty years, we will have the technological means to create superhuman intelligence. Shortly after, the human era will be ended.”
There are at least two ways to interpret this risk. The common way is that some magical critical mass will cause a phase change in machine intelligence. I’ve never understood this argument. “More” doesn’t mean “different.” The argument is something like “As we mimic the human brain closely, something near-human (or super-human) will happen.” Maybe?
Yes, the availability of lots of computing power and lots of data has resulted in a phase change in AI results. Speech recognition, automatic translation, computer vision, and other problem domains have been completely transformed. In 2012, when researchers at Toronto re-applied neural networks to computer vision, the error rates on a well-known dataset started dropping fast, until within a few years the computers were beating the humans. But computers were doing the same well-defined task as before, only better.
The more compelling observation is: “The chance of a singularity might be small, but the consequences are so serious we should think carefully.”
Another way to interpret the risk of the singularity is that the entire system will have a phase change. That system includes computers, technology, networks, and humans setting goals. This seems more correct and entirely plausible. The Internet was a phase change, as were mobile phones. Under this interpretation, there are plenty of plausible risks of AI.
One plausible risk is algorithmic bias. If algorithms are involved in important decisions, we’d like them to be trustworthy. (In a previous post, we discussed how to measure algorithmic fairness.)
Tay, a Microsoft chatbot, was taught by Twitter to be racist and woman-hating within 24 hours. But Tay didn’t really understand anything; it just “learned” and mimicked.
Amazon’s facial recognition software, Rekognition, falsely matched 28 U.S. Congresspeople (mostly people of color) with known criminals. Amazon’s response was that the ACLU (who conducted the test) used an unreliable cutoff of only 80 percent confidence. (They recommended 95 percent.)
MIT researcher Joy Buolamwini showed that gender identification error rates in several computer vision systems were much higher for people with dark skin.
All of these untrustworthy results arise at least partially from the training data. In Tay’s case, it was deliberately fed hateful data by Twitter users. In the computer vision systems, there may well have been less data for people of color.
Another plausible risk is automation. As artificial intelligence becomes more cost-efficient at solving problems like driving cars or weeding farm plots, the people who used to do those tasks may be thrown out of work. This is the risk of AI plus capitalism: businesses will each try to be cost effective. We can only address this at a societal level, which makes it very difficult.
One final risk is bad goals, possibly aggravated by single-mindedness. This is memorably illustrated by the paper-clip problem, first described by Nick Bostrom in 2003: “It also seems perfectly possible to have a superintelligence whose sole goal is something completely arbitrary, such as to manufacture as many paperclips as possible, and who would resist with all its might any attempt to alter this goal. For better or worse, artificial intellects need not share our human motivational tendencies.” There is even a web game inspired by this idea.
Understand your goals
How do we address some of the plausible risks above? A complete answer is another full post (or book, or lifetime). But let’s mention one piece: understand the goals you’ve given your AI. Since all AI is simply optimizing a well-defined mathematical function, that is the language you use to say what problem you want to solve.
Does that mean you should start reading up on integrals and gradient descent algorithms? I can feel your eyeballs closing. Not necessarily!
The goals are a negotiation between what your business needs (human language) and how it can be measured and optimized (AI language). You need people to speak to both sides. That is often a business or product owner in collaboration with a data scientist or quantitative researcher.
Let me give an example. Suppose you want to recommend content using a model. You choose to optimize the model to increase engagement with the content, as measured by clicks. Voila, now you understand one reason the Internet is full of clickbait: the goal is wrong. You actually care about more than clicks. Companies modify the goal without touching the AI by trying to filter out content that doesn’t meet policies. That is one reasonable strategy. Another strategy might be to add a heavy penalty to the training dataset if the AI recommends content later found to be against policy. Now we are starting to really think through how our goal affects the AI.
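As a minimal sketch of that second strategy, assume each training example records whether the content was clicked and whether it was later found to violate policy. The function name `training_reward`, the constant `POLICY_PENALTY`, and its value are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical illustration: fold a policy penalty into the training target.
POLICY_PENALTY = 10.0  # assumed weight; a real system would have to tune this


def training_reward(clicked: bool, violates_policy: bool) -> float:
    """Return the value the recommender is trained to maximize for one example."""
    reward = 1.0 if clicked else 0.0
    if violates_policy:
        # A click on policy-violating content now counts heavily against the model.
        reward -= POLICY_PENALTY
    return reward


# Made-up examples, for illustration only.
examples = [
    {"item": "news article", "clicked": True, "violates_policy": False},
    {"item": "clickbait", "clicked": True, "violates_policy": True},
    {"item": "ignored post", "clicked": False, "violates_policy": False},
]
rewards = {
    e["item"]: training_reward(e["clicked"], e["violates_policy"]) for e in examples
}
```

With a clicks-only goal, the clickbait item would look exactly as good as the news article. With the penalty, it becomes the worst outcome in the dataset.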
This example also explains why content systems can be so jumpy: you click on a video on YouTube, or a pin on Pinterest, or a book on Amazon, and the system immediately recommends a big pile of things that are almost exactly the same. Why? The click is usually measured in the short-term, so the system optimizes for short-term engagement. This is a well-known recommender challenge, centered around mathematically defining a good goal. Perhaps a part of the goal should be whether the recommendation is irritating, or whether there is long-term engagement.
Another example: if your model is accurate, but your dataset or measurements don’t look at under-represented minorities in your business, you may be performing poorly for them. Your goal may really be to be accurate for all sorts of different people.
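One way to check this is to report accuracy per group rather than a single overall number. The sketch below uses made-up labels and a hypothetical helper, `accuracy_by_group`. It assumes you have a group label for each example.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict of accuracy for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group


# Toy data, invented for illustration: 8 majority examples and 2 minority examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
groups = ["majority"] * 8 + ["minority"] * 2

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
```

In this toy data the overall accuracy is 80 percent, yet every minority example is misclassified. The single number hides the problem.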
A third example: if your model is accurate, but you don’t understand why, that might be a risk for some critical applications, like healthcare or finance. If you have to understand why, you might need to use a human-understandable (“white box”) model, or explanation technology for the model you have. Understandability can be a goal.
Conclusion: we need to understand AI
AI cannot fully replace humans, despite what you read in the popular press. The biggest difference between human and artificial intelligence is that only humans choose goals. So far, AIs do not.
If you can take away one thing about artificial intelligence: understand its goals. Any AI technology has a perfectly well-defined goal, often a labeled dataset. To the extent the definition (or dataset) is flawed, so too will be the results.
One way to understand your AI better is to explain your models. We formed Fiddler Labs to help. Feel free to reach us at email@example.com.
‘In 1935, the Federal Home Loan Bank Board asked the Home Owners’ Loan Corporation to look at 239 cities and create “residential security maps” to indicate the level of security for real-estate investments in each surveyed city. On the maps, “Type D” neighborhoods were outlined in red and were considered the most risky for mortgage support…
‘In the 1960s, sociologist John McKnight coined the term “redlining” to describe the discriminatory practice of fencing off areas where banks would avoid investments based on community demographics. During the heyday of redlining, the areas most frequently discriminated against were black inner city neighborhoods…’
Redlining is clearly unfair, since the decision to invest was not based on an individual homeowner’s ability to repay the loan, but rather on location; and that basis systematically denied loans to one racial group, black people. In fact, part 1 of a Pulitzer Prize-winning series in the Atlanta Journal-Constitution in 1988 suggests that location was more important than income: “Among stable neighborhoods of the same income [in metro Atlanta], white neighborhoods always received the most bank loans per 1,000 single-family homes. Integrated neighborhoods always received fewer. Black neighborhoods — including the mayor’s neighborhood — always received the fewest.”
More recently, in 2018, WUNC reported that blacks and Latinos in some cities in North Carolina were denied mortgages at higher rates than whites:
“Lenders and their trade organizations do not dispute the fact that they turn away people of color at rates far greater than whites. But they maintain that the disparity can be explained by two factors the industry has fought to keep hidden: the prospective borrowers’ credit history and overall debt-to-income ratio. They singled out the three-digit credit score — which banks use to determine whether a borrower is likely to repay a loan — as especially important in lending decisions.”
The WUNC example raises an interesting point: it is possible to look unfair via one measure (loan denial rates by demographic group), but not by another (ability to pay as judged by credit history and debt-to-income ratio). Measuring fairness is complicated. In this case, we can’t tell if the lending practices are fair because the data on credit history and debt-to-income ratio for these particular groups are not available to us to evaluate lenders’ explanations of the disparity.
In 2007, the Federal Reserve Board (FRB) reported on credit scoring and its effects on the availability and affordability of credit. They concluded that the credit characteristics included in credit history scoring models are not a proxy for race, although different demographic groups have substantially different credit scores on average, and “for given credit scores, credit outcomes — including measures of loan performance, availability, and affordability — differ for different demographic groups.” This FRB study supports the lenders’ claims that credit score might explain disparity in mortgage denial rates (since demographic groups have different credit scores), while also pointing out that credit outcomes are different for different groups.
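To see how two measures can disagree, here is a small sketch with invented counts. The groups "A" and "B", the credit bands, and every number are hypothetical. The sketch says nothing about which situation holds in the real lending data.

```python
# Made-up counts, for illustration only:
# (group, credit_band) -> (applicants, denied)
applications = {
    ("A", "high"): (80, 8),
    ("A", "low"): (20, 10),
    ("B", "high"): (40, 4),
    ("B", "low"): (60, 30),
}


def denial_rate(cells):
    """Denial rate over a collection of (applicants, denied) pairs."""
    cells = list(cells)
    applicants = sum(n for n, _ in cells)
    denied = sum(d for _, d in cells)
    return denied / applicants


# Measure 1: denial rate by group, ignoring credit band.
overall = {
    g: denial_rate(v for (grp, _), v in applications.items() if grp == g)
    for g in ("A", "B")
}

# Measure 2: denial rate by group within each credit band.
by_band = {key: denial_rate([v]) for key, v in applications.items()}
```

In this toy data, group B is denied almost twice as often as group A overall (34 percent versus 18 percent), yet within each credit band the two groups are denied at identical rates. Without the band-level data, we could not tell these two pictures apart.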
Is this fair or not?
As machine learning (ML) becomes widespread, there is growing interest in fairness, accountability, and transparency in ML (e.g., the FAT* conference and FAT/ML workshops).
Some researchers say that fairness is not a statistical concept, and that no statistic will fully capture it. Even so, there are many statistical definitions that people use to approximate (if not define) fairness.
First, here are two legal concepts that come up in many discussions on fairness:
Disparate treatment: “unequal behavior toward someone because of a protected characteristic (e.g., race or gender) under Title VII of the United States Civil Rights Act.” Redlining is disparate treatment if the intent is to deny black people loans.
Disparate impact: “practices .. that adversely affect one group of people of a protected characteristic more than another, even though rules applied .. are formally neutral.” (“The disparate impact doctrine was formalized in the landmark U.S. Supreme Court case Griggs v. Duke Power Co. (1971). In 1955, the Duke Power Company instituted a policy that mandated employees have a high school diploma to be considered for promotion, which had the effect of drastically limiting the eligibility of black employees. The Court found that this requirement had little relation to job performance, and thus deemed it to have an unjustified — and illegal — disparate impact.” [Corb2018])
[Lipt2017] points out that these are legal concepts of disparity, and creates corresponding terms for technical concepts of parity applied to machine learning classifiers:
Treatment parity: a classifier should be blind to a given protected characteristic. Also called anti-classification in [Corb2018], or “fairness through unawareness.”
Impact parity: the fraction of people given a positive decision should be equal across different groups. This is also called demographic parity, statistical parity, or independence of the protected class and the score [Fair2018].
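Treatment parity is a constraint on what the classifier may see, so there is little to compute. Impact parity, on the other hand, can be measured directly. The sketch below uses made-up decisions and a hypothetical helper, `positive_rate_by_group`, to compare the fraction of positive decisions across two groups.

```python
def positive_rate_by_group(decisions, groups):
    """Fraction of positive (1) decisions within each group."""
    rates = {}
    for g in set(groups):
        member_decisions = [d for d, grp in zip(decisions, groups) if grp == g]
        rates[g] = sum(member_decisions) / len(member_decisions)
    return rates


# Toy decisions, invented for illustration: 1 = approved, 0 = denied.
decisions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rate_by_group(decisions, groups)

# Impact parity holds when this gap is zero (or close to it, in practice).
parity_gap = max(rates.values()) - min(rates.values())
```

Here group "a" is approved 75 percent of the time and group "b" 25 percent, so the gap is 0.5 and impact parity is clearly violated.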
There is a large body of literature on algorithmic fairness. From [Corb2018], two more definitions:
Classification parity: some given measure of classification error is equal across groups defined by the protected attributes. [Hard2016] called this equal opportunity if the measure is true positive rates, and equalized odds if there were two equalized measures, true positive rates and false positive rates.
Calibration: outcomes are independent of protected attributes conditional on risk score. That is, reality conforms to risk score. For example, about 20% of all loans predicted to have a 20% chance of default actually do.
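Both definitions can be checked with a few lines of code. The sketch below is an illustration on made-up data, not a reference implementation. The helper names are hypothetical, and `rates` assumes every group has at least one positive and one negative example.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def rates_by_group(y_true, y_pred, groups):
    """Classification parity check: (TPR, FPR) for each group."""
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        out[g] = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return out


def observed_rate_by_score(scores, outcomes):
    """Calibration check: observed outcome frequency for each distinct score."""
    table = {}
    for s in sorted(set(scores)):
        hits = [o for sc, o in zip(scores, outcomes) if sc == s]
        table[s] = sum(hits) / len(hits)
    return table


# Toy data, invented for illustration.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
error_rates = rates_by_group(y_true, y_pred, groups)

scores = [0.2, 0.2, 0.2, 0.2, 0.2, 0.5, 0.5]
outcomes = [1, 0, 0, 0, 0, 1, 0]
calibration = observed_rate_by_score(scores, outcomes)
```

Equal opportunity compares only the first number of each pair in `error_rates`, while equalized odds compares both. To check calibration across groups, you would run `observed_rate_by_score` once per group and compare the tables.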
There is a lack of consensus in the research community on an ideal statistical definition of fairness. In fact, there are impossibility results on achieving multiple fairness notions simultaneously ([Klei2016] [Chou2017]). As we noted previously, some researchers say that fairness is not a statistical concept.
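A short calculation gives a feel for why such impossibility results arise. For any binary classifier, the confusion matrix forces the false positive rate to equal p / (1 - p) × (1 - PPV) / PPV × (1 - FNR), where p is the group's base rate, PPV is the positive predictive value, and FNR is the false negative rate. The sketch below plugs in hypothetical numbers. The function name and all values are assumptions made for illustration.

```python
def implied_false_positive_rate(base_rate, ppv, fnr):
    """FPR forced by the confusion-matrix identity, given base rate, PPV, and FNR."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical numbers: both groups get the same PPV and the same FNR,
# but the groups have different base rates.
fpr_a = implied_false_positive_rate(base_rate=0.3, ppv=0.6, fnr=0.2)
fpr_b = implied_false_positive_rate(base_rate=0.5, ppv=0.6, fnr=0.2)
```

With these numbers the two groups end up with false positive rates of roughly 0.23 and 0.53. When base rates differ, equalizing PPV and FNR forces the false positive rates apart, so the three cannot all be equal at once (outside of degenerate cases such as a perfect classifier).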
No definition is perfect
Each statistical definition described above has counterarguments.
Treatment parity unfairly ignores real differences. [Corb2018] describes the case of the COMPAS score used to predict recidivism (whether someone will commit a crime if released from jail). After controlling for COMPAS score and other factors, women are less likely to recidivate. Thus, ignoring sex in this prediction might unfairly punish women. Note that the Equal Credit Opportunity Act legally mandates treatment parity: “Creditors may ask you for [protected class information like race] in certain situations, but they may not use it when deciding whether to give you credit or when setting the terms of your credit.” Thus, [Corb2018] implies that this sort of unfairness is enshrined in law.
Impact parity doesn’t ensure fairness (people argue against quotas), and can cripple a model’s accuracy, harming the model’s utility to society. [Hard2016] discusses this issue (using the term “demographic parity”) in its introduction.
Corbett-Davies and Goel [Corb2018] argue at length that classification parity is naturally violated: “when base rates of violent recidivism differ across groups, the true risk distributions will necessarily differ as well — and this difference persists regardless of which features are used in the prediction.”
They also argue that calibration is not sufficient to prevent unfairness. Their hypothetical example is a bank that gives loans based solely on the default rate within a zip code, ignoring other attributes like income. Suppose that (1) within zip code, white and black applicants have similar default rates; and (2) black applicants live in zip codes with relatively high default rates. Then the bank’s plan would unfairly punish creditworthy black applicants, but still be calibrated.
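The hypothetical bank can be worked through with invented numbers. In the sketch below, the zip code names, the counts, and the approval cutoff are all assumptions. The only things taken from the example are its two conditions: default rates match across groups within a zip code, and black applicants are concentrated in the higher-default zip code.

```python
# Made-up counts: (group, zip_code) -> (applicants, eventual defaulters)
population = {
    ("white", "zip_1"): (80, 8),
    ("white", "zip_2"): (20, 6),
    ("black", "zip_1"): (20, 2),
    ("black", "zip_2"): (80, 24),
}
APPROVAL_CUTOFF = 0.2  # hypothetical: approve zip codes with a default rate below this


def zip_default_rate(zip_code):
    """Default rate across all applicants in a zip code."""
    cells = [v for (_, z), v in population.items() if z == zip_code]
    return sum(d for _, d in cells) / sum(n for n, _ in cells)


# The bank's risk score depends only on zip code.
score = {z: zip_default_rate(z) for z in ("zip_1", "zip_2")}

# Calibration: within every group, the observed default rate matches the score.
calibrated = all(
    abs(defaults / n - score[z]) < 1e-9
    for (_, z), (n, defaults) in population.items()
)


def creditworthy_approval_rate(group):
    """Share of a group's non-defaulting applicants who live in an approved zip code."""
    approved = total = 0
    for (g, z), (n, defaults) in population.items():
        if g != group:
            continue
        repayers = n - defaults
        total += repayers
        if score[z] < APPROVAL_CUTOFF:
            approved += repayers
    return approved / total


white_rate = creditworthy_approval_rate("white")
black_rate = creditworthy_approval_rate("black")
```

The score is calibrated for both groups, yet about 84 percent of creditworthy white applicants are approved compared with about 24 percent of creditworthy black applicants.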
In summary, fairness likely has no single measure. We took a whirlwind tour of four statistical definitions, two motivated by history and two more recently motivated by machine learning, and summarized the counterarguments to each.
This also means it is challenging to automatically decide if an algorithm is fair. Open-source fairness-measuring packages reflect this by offering many different measures.
However, this doesn’t mean we should ignore statistical measures. They can give us an idea of whether we should look more carefully. Food for thought: we should feed our brains well, since they are the most likely to make the final call.
(A note: this subject is rightfully contentious. Our intention is to add to the conversation in a productive, respectful way. We welcome feedback of any kind.)
[Chou2017] Chouldechova, Alexandra. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” Big Data 5, no. 2 (June 1, 2017): 153–63. https://doi.org/10.1089/big.2016.0047.
[Corb2018] Corbett-Davies, Sam, and Sharad Goel. “The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning.” ArXiv:1808.00023 [Cs], July 31, 2018. http://arxiv.org/abs/1808.00023.
[Hard2016] Hardt, Moritz, Eric Price, and Nathan Srebro. “Equality of Opportunity in Supervised Learning.” ArXiv:1610.02413 [Cs], October 7, 2016. http://arxiv.org/abs/1610.02413.
[Klei2016] Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” ArXiv:1609.05807 [Cs, Stat], September 19, 2016. http://arxiv.org/abs/1609.05807.
[Lipt2017] Lipton, Zachary C., Alexandra Chouldechova, and Julian McAuley. “Does Mitigating ML’s Impact Disparity Require Treatment Disparity?” ArXiv:1711.07076 [Cs, Stat], November 19, 2017. http://arxiv.org/abs/1711.07076.