As artificial intelligence (AI) adoption grows, so do the risks of today’s typical black-box AI. These risks include customer mistrust, brand risk and compliance risk. As recently as last month, concerns about AI-driven facial recognition that was biased against certain demographics resulted in a PR backlash.
With customer protection in mind, regulators are staying ahead of this technology and introducing the first wave of AI regulations meant to address AI transparency. This is a step in the right direction in terms of helping customers trust AI-driven experiences while enabling businesses to reap the benefits of AI adoption.
This first group of regulations relates to the understanding of an AI-driven, automated decision by a customer. This is especially important for key decisions like lending, insurance and health care but is also applicable to personalization, recommendations, etc.
The General Data Protection Regulation (GDPR), specifically Articles 13 and 22, was the first regulation on automated decision-making to state that anyone subject to an automated decision has the right to be informed and the right to a meaningful explanation. According to clause 2(f) of Article 13:
“[Information about] the existence of automated decision-making, including profiling … and … meaningful information about the logic involved [is needed] to ensure fair and transparent processing.”
One of the most frequently asked questions is what the “right to explanation” means in the context of AI. Does “meaningful information about the logic involved” mean that companies have to disclose the actual algorithm or source code? Would explaining the mechanics of the algorithm be really helpful for the individuals? It might make more sense to provide information on what inputs were used and how they influenced the output of the algorithm.
For example, if a loan application or insurance claim is denied using an algorithm or machine learning model, under Articles 13 and 22, the loan or insurance officer would need to provide specific details about the impact of the user’s data on the decision. Or, they could provide general parameters of the algorithm or model used to make that decision.
Similar laws working their way through the U.S. state legislatures of Washington, Illinois and Massachusetts are:
WA House Bill 1655, which establishes guidelines for “the use of automated decision systems in order to protect consumers, improve transparency, and create more market predictability.”
MA Bill H.2701, which establishes a commission on “automated decision-making, artificial intelligence, transparency, fairness, and individual rights.”
IL HB3415, which states that “predictive data analytics in determining creditworthiness or in making hiring decisions…may not include information that correlates with the race or zip code of the applicant.”
Fortunately, advances in AI have kept pace with these needs. Recent research in machine learning (ML) model interpretability makes compliance with these regulations feasible. Techniques like Integrated Gradients from Google Brain, along with SHAP and LIME from the University of Washington, help open the AI black box and produce meaningful explanations for consumers.
Ensuring fair automated decisions is another related area of upcoming regulations. While there is no consensus in the research community on the right set of fairness metrics, some approaches like equality of opportunity are already required by law in use cases like hiring. Integrating AI explainability in the ML lifecycle can also help provide insights for fair and unbiased automated decisions. Assessing and monitoring these biases, along with data quality and model interpretability approaches, provides a good playbook towards developing fair and ethical AI.
The recent June 26 US House Committee hearing is a sign that financial services need to get ready for upcoming regulations that ensure transparent AI systems. All these regulations will help increase trust in AI models and accelerate their adoption across industries toward the longer-term goal of trustworthy AI.
Last week, the Explainable AI Summit, hosted by Fiddler Labs, returned to discuss top-of-mind issues that leaders face when implementing AI. Over eighty data scientists and business leaders joined us at Galvanize to hear from the keynote speaker and Fiddler’s head of data science, Ankur Taly, and our distinguished panelists moderated by our CEO Krishna Gade:
Manasi Joshi, Engineering Director of ML Productivity, Google Brain
Reprising some topics from our summit in February, the H2 summit focused on explainability techniques and industry-specific challenges.
Takeaway #1: In financial services, companies are working through regulatory and technical hurdles to integrate machine learning techniques into their business model.
Financial services understand the potential of AI and want to adopt machine learning techniques. But they are reasonably wary of running afoul of regulations. If someone suspects a creditor has been discriminatory, the Federal Trade Commission explicitly suggests that he or she consider suing the creditor in federal district court.
Banks and insurance companies already subject models to strict, months-long validation by legal teams and risk teams. But if using opaque deep learning methods means forgoing certainty around model fairness, then these methods cannot be a priority.
However, some use cases are less regulated or not regulated at all, allowing financial services to explore AI integration selectively. Especially if regulators continue to accept that AI may never reach full explainability, AI usage in financial services will increase.
Takeaway #2: And across all industries, leaders are prioritizing trustworthiness of their models.
Most companies understand the risk to their brand and consumer trust if models go awry and make poor decisions. So leaders are implementing more checks before models are promoted to production.
Once in production, externally facing models generate questions and concerns from customers themselves. Business leaders are seeing the need to build explainability tools to address these questions surrounding content selection. Fortunately, many explainability tools are available in open source, like Google’s TCAV and TensorFlow Model Analysis.
And as automated ML platforms attract hundreds of thousands of users, platform developers are ramping up education about incorrect usage. Education is necessary but not sufficient. ML platforms should also let modelers inspect model behavior on subgroups of their choice, so they can see whether their models exhibit potential bias.
Takeaway #3: Integrated Gradients is a best-in-class technique to attribute a deep network’s (or any differentiable model’s) prediction to its input features.
A major component of explaining an AI model is the attribution problem. For any given prediction by a model, how can we attribute that prediction to the model’s features?
Currently, several approaches in use are ineffective. For example, an ablation study (dropping a feature and observing the change in prediction) is computationally expensive and misleading when features interact.
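To see why ablation can mislead when features interact, here is a minimal sketch. The toy model (a logical OR of two inputs) and the function names are our own illustrative assumptions, not anything presented at the summit.

```python
def model(x1, x2):
    """Hypothetical toy model with interacting features: fires if either input is on."""
    return 1.0 if (x1 or x2) else 0.0


def ablation_attributions(f, x, baseline):
    """Attribute f(x) by resetting one feature at a time to its baseline value."""
    full = f(*x)
    attributions = []
    for i in range(len(x)):
        ablated = list(x)
        ablated[i] = baseline[i]
        attributions.append(full - f(*ablated))
    return attributions


x = (1, 1)
baseline = (0, 0)
attrs = ablation_attributions(model, x, baseline)
total_change = model(*x) - model(*baseline)
print(attrs, total_change)  # [0.0, 0.0] 1.0
```

Each feature gets zero attribution because the other feature alone keeps the prediction at 1, even though together they move the prediction from 0 to 1.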
To define a better attribution approach, Ankur Taly and his co-creators first established the desirable criteria, or axioms. One axiom, for instance, is insensitivity: a variable that has no effect on the output should get no attribution. These axioms then uniquely define the Integrated Gradients method, which is described by the equation below.
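For reference, the standard statement of Integrated Gradients from the published paper is given here. The notation is our assumption: F is the model, x the input, x' the baseline input, and α the position along the straight-line path from x' to x.

```latex
\mathrm{IG}_i(x) \;=\; (x_i - x'_i) \int_{0}^{1}
  \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha,
\qquad
\sum_{i} \mathrm{IG}_i(x) \;=\; F(x) - F(x').
```

The second identity (often called completeness) says the attributions add up to the difference between the prediction at the input and the prediction at the baseline.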
Integrated gradients is easy to apply and widely applicable to differentiable models. Data science teams should consider this method to evaluate feature attribution inexpensively and accurately.
Thank you to Galvanize for hosting the event, to our panelists, and to our engaged audience! We look forward to our next in-person event, and in the meantime, stay tuned for our first webinar. For more information, please email firstname.lastname@example.org.
Two different explanation algorithm types, best in different situations.
Some of the most accurate predictive models today are black box models, meaning it is hard to really understand how they work. To address this problem, techniques have arisen to understand feature importance: for a given prediction, how important is each input feature value to that prediction? Two well-known techniques are SHapley Additive exPlanations (SHAP) and Integrated Gradients (IG). In fact, they each represent a different type of explanation algorithm: a Shapley-value-based algorithm (SHAP) and a gradient-based algorithm (IG).
There is a fundamental difference between these two algorithm types. This post describes that difference. First, we need some background. Below, we review Shapley values, Shapley-value-based methods (including SHAP), and gradient-based methods (including IG). Finally, we get back to our central question: When should you use a Shapley-value-based algorithm (like SHAP) versus a gradient-based explanation algorithm (like IG)?
What are Shapley values?
The Shapley value (proposed by Lloyd Shapley in 1953) is a classic method to distribute the total gains of a collaborative game to a coalition of cooperating players. It is provably the only distribution with certain desirable properties (fully listed on Wikipedia).
In our case, we formulate a game for the prediction at each instance. We consider the “total gains” to be the prediction value for that instance, and the “players” to be the model features of that instance. The collaborative game is all of the model features cooperating to form a prediction value. The Shapley value efficiency property says the feature attributions should sum to the prediction value. The attributions can be negative or positive, since a feature can lower or raise a predicted value.
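As a minimal sketch of this game formulation, the code below computes exact Shapley values for a made-up three-feature model. The model, the feature values, and the choice to treat an absent feature as 0 are all illustrative assumptions. Because the empty coalition is worth 0 here, the attributions sum to the prediction itself.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n):
    """Exact Shapley values for an n-player game.

    `value` maps a frozenset of player indices to that coalition's worth.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability that exactly `size` other players precede player i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


x = [2.0, 3.0, -1.0]  # hypothetical feature values for one instance


def prediction(present):
    """Hypothetical model x1 + x2 * x3, with absent features set to 0."""
    x1, x2, x3 = (x[i] if i in present else 0.0 for i in range(3))
    return x1 + x2 * x3


phi = shapley_values(prediction, 3)
print(phi)  # approximately [2.0, -1.5, -1.5]
```

The first feature acts alone and keeps its full contribution of 2. The other two features only matter together, so they split their joint effect of -3 evenly. Note the negative attributions: these features lower the prediction.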
There is a variant called the Aumann-Shapley value, which extends the definition of the Shapley value to games with many (or infinitely many) players, where each player plays only a minor role. It applies when the worth function (the gains from including a coalition of players) is differentiable.
What is a Shapley-value-based explanation method?
A Shapley-value-based explanation method tries to approximate Shapley values of a given prediction by examining the effect of removing a feature under all possible combinations of presence or absence of the other features. In other words, this method looks at function values over subsets of features like F(x1, <absent>, x3, x4, …, <absent>, …, xn). How to evaluate a function F with one or more absent features is subtle.
For example, SHAP (SHapley Additive exPlanations) estimates the model’s behavior on an input with certain features absent by averaging over samples from those features drawn from the training set. In other words, F(x1, <absent>, x3, …, xn) is estimated by the expected prediction when the missing feature x2 is sampled from the dataset.
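The averaging idea can be sketched in a few lines. This is a simplified illustration, not the SHAP library's implementation: the model `f`, the background rows, and the function name are made up, and absent features are filled in from every background row (a marginal-distribution choice).

```python
def expected_prediction(f, x, absent, background):
    """Estimate f(x) with the features in `absent` missing.

    Absent features are filled in from each background row in turn,
    and the resulting predictions are averaged.
    """
    total = 0.0
    for row in background:
        filled = [row[i] if i in absent else x[i] for i in range(len(x))]
        total += f(filled)
    return total / len(background)


def f(v):
    """Hypothetical model."""
    return 2.0 * v[0] + v[1] * v[2]


# Made-up background sample standing in for rows of the training set.
background = [[0.0, 1.0, 1.0], [1.0, 3.0, 0.0], [2.0, 2.0, 2.0]]
x = [1.0, 5.0, 2.0]

est = expected_prediction(f, x, absent={1}, background=background)
print(est)  # 6.0
```

With no features absent, the same function simply returns f(x), which is 12.0 for this input.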
Exactly how that sample is chosen is important (for example marginal versus conditional distribution versus cluster centers of background data), but I will skip the fine details here.
Once we define the model function (F) for all subsets of the features, we can apply the Shapley values algorithm to compute feature attributions. Each feature’s Shapley value is a weighted average of the feature’s marginal contribution (the change in F when the feature is added) over all possible subsets of the other features.
The “kernel SHAP” method from the SHAP paper computes the Shapley values of all features simultaneously by defining a weighted least squares regression whose solution is the Shapley values for all the features.
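To make the regression view concrete, here is a rough sketch that enumerates every coalition of a tiny made-up game and solves the weighted least squares problem with NumPy. It is not the SHAP library's code: the real method samples coalitions rather than enumerating them, and the empty and full coalitions carry infinite weight in theory, which we approximate with a large finite weight (`big`, our assumption).

```python
from itertools import product
from math import comb

import numpy as np


def kernel_shap_exact(value, n, big=1e6):
    """Solve the Shapley-kernel weighted least squares over all 2^n coalitions.

    `value` maps a 0/1 tuple (feature present or absent) to a model output.
    """
    rows, targets, weights = [], [], []
    for z in product([0, 1], repeat=n):
        size = sum(z)
        if size == 0 or size == n:
            w = big  # stands in for the infinite weight of these two coalitions
        else:
            w = (n - 1) / (comb(n, size) * size * (n - size))
        rows.append([1.0, *z])  # intercept plus presence indicators
        targets.append(value(z))
        weights.append(w)
    sqrt_w = np.sqrt(np.array(weights))
    A = np.array(rows) * sqrt_w[:, None]
    b = np.array(targets) * sqrt_w
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef[0], coef[1:]  # base value, per-feature attributions


def value(z):
    """Hypothetical game: feature 0 adds 2; features 1 and 2 jointly add -3."""
    return 2.0 * z[0] - 3.0 * z[1] * z[2]


base, phi = kernel_shap_exact(value, 3)
print(base, phi)  # close to 0 and [2.0, -1.5, -1.5]
```

For this game the exact Shapley values are 2, -1.5 and -1.5, and the regression recovers them up to the small error introduced by the finite `big` weight.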
The high-level point is that all these methods rely on taking subsets of features. This makes the theoretical version exponential in runtime: for N features, there are 2^N combinations of presence and absence. That is too expensive for most N, so these methods approximate. Even with approximations, kernel SHAP can be slow. Also, we don’t know of any systematic study of how good the approximation is.
There are versions of SHAP specialized to different model architectures for speed. For example, Tree SHAP computes all the subsets by cleverly keeping track of what proportion of all possible subsets flow down into each of the leaves of the tree. However, if your model architecture does not have a specialized algorithm like this, you have to fall back on kernel SHAP, or another naive (unoptimized) Shapley-value-based method.
A Shapley-value-based method is attractive as it only requires black box access to the model (i.e. computing outputs from inputs), and there is a version agnostic to the model architecture. For instance, it does not matter whether the model function is discrete or continuous. The downside is that exactly computing the subsets is exponential in the number of features.
What is a gradient-based explanation method?
A gradient-based explanation method tries to explain a given prediction by using the gradient of (i.e. change in) the output with respect to the input features. Some methods like Integrated Gradients (IG), GradCAM, and SmoothGrad literally apply the gradient operator. Other methods like DeepLift and LRP apply “discrete gradients.”
Let me describe IG, which has the advantage that it tries to approximate Aumann-Shapley values, which are axiomatically justified. IG operates by considering a straight line path, in feature space, from the input at hand (e.g., an image from a training set) to a certain baseline input (e.g., a black image), and integrating the gradient of the prediction with respect to input features (e.g., image pixels) along this path.
This paper explains the intuition of the IG algorithm as follows. As the input varies along the straight line path between the baseline and the input at hand, the prediction moves along a trajectory from uncertainty to certainty (the final prediction probability). At each point on this trajectory, one can use the gradient with respect to the input features to attribute the change in the prediction probability back to the input features. IG aggregates these gradients along the trajectory using a path integral.
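The path integral can be approximated with a simple Riemann sum. Below is a minimal sketch on a made-up logistic regression with an analytic gradient. The weights, the input, the all-zero baseline and the step count are illustrative assumptions; for a real network you would get gradients from your ML framework instead.

```python
import math

WEIGHTS = [1.5, -2.0, 0.5]  # hypothetical logistic regression weights
BIAS = 0.0


def predict(x):
    """Sigmoid of a linear score."""
    score = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-score))


def gradient(x):
    """Analytic gradient of predict with respect to the input features."""
    p = predict(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of IG along the straight-line path."""
    diffs = [xi - bi for xi, bi in zip(x, baseline)]
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * d for bi, d in zip(baseline, diffs)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    # Average gradient along the path, scaled by the input-baseline difference.
    return [d * t / steps for d, t in zip(diffs, totals)]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
gap = sum(attributions) - (predict(x) - predict(baseline))
print(attributions, gap)
```

A useful sanity check is that the attributions sum (up to approximation error) to the difference between the prediction at the input and at the baseline, so `gap` should be close to zero.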
IG (roughly) requires the prediction to be a continuous and piecewise differentiable function of the input features. (More precisely, it requires that the function be continuous everywhere and that the partial derivative along each input dimension satisfy Lebesgue’s integrability condition, i.e., the set of discontinuous points has measure zero.)
Note it is important to choose a good baseline for IG to make sensible feature attributions. For example, if a black image is chosen as baseline, IG won’t attribute importance to a completely black pixel in an actual image. The baseline value should both have a near-zero prediction, and also faithfully represent a complete absence of signal.
IG is attractive as it is broadly applicable to all differentiable models, easy to implement in most machine learning frameworks (e.g., TensorFlow, PyTorch, Caffe), and computationally scalable to massive deep networks like Inception and ResNet with millions of neurons.
When should you use a Shapley-value-based versus a gradient-based explanation method?
Finally, the payoff! Our advice: If the model function is piecewise differentiable and you have access to the model gradient, use IG. Otherwise, use a Shapley-value-based method.
Any model trained using gradient descent is differentiable. For example: neural networks, logistic regression, support vector machines. You can use IG with these. The major class of non-differentiable models is trees: boosted trees, random forests. They encode discrete values at the leaves. These require a Shapley-value-based method, like Tree SHAP.
The IG algorithm is faster than a naive Shapley-value-based method like kernel SHAP, as it only requires computing the gradients of the model output on a few different inputs (typically 50). In contrast, a Shapley-value-based method requires computing the model output on a large number of inputs sampled from the exponentially huge subspace of all possible combinations of feature values. Computing gradients of differentiable models is efficient and well supported in most machine learning frameworks. However, a differentiable model is a prerequisite for IG. By contrast, a Shapley-value-based method makes no such assumptions.
Several types of input features that look discrete (hence might require a Shapley-value-based method) actually can be mapped to differentiable model types (which let us use IG). Let us walk through one example: text sentiment. Suppose we wish to attribute the sentiment prediction to the words in some input text. At first, it seems that such models may be non-differentiable as the input is discrete (a collection of words). However, differentiable models like deep neural networks can handle words by first mapping them to a high-dimensional continuous space using word embeddings. The model’s prediction is a differentiable function of these embeddings. This makes it amenable to IG. Specifically, we attribute the prediction score to the embedding vectors. Since attributions are additive, we sum the attributions (retaining the sign) along the fields of each embedding vector and map it to the specific input word that the embedding corresponds to.
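The final aggregation step can be sketched as follows. The tokens and the per-dimension attribution numbers are made up for illustration; in practice the attributions would come from running IG on the embedding layer.

```python
def word_attributions(tokens, embedding_attributions):
    """Collapse per-dimension attributions into one signed score per word.

    `embedding_attributions[i]` holds the attribution for each field of
    the embedding vector of `tokens[i]`.
    """
    return [
        (token, sum(dims))
        for token, dims in zip(tokens, embedding_attributions)
    ]


tokens = ["not", "good"]
# Made-up IG attributions over a 3-dimensional embedding for each word.
embedding_attributions = [
    [-0.25, -0.5, 0.25],
    [0.5, 0.25, -0.25],
]

scores = word_attributions(tokens, embedding_attributions)
print(scores)  # [('not', -0.5), ('good', 0.5)]
```

Summing with the sign retained matters: positive and negative fields of the same embedding can partly cancel, and the net value is what gets attributed to the word.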
A crucial question for IG is: what is the baseline input? For this text example, one option is to use the embedding vector corresponding to empty text. Some models take fixed length inputs by padding short sentences with a special “no word” token. In such cases, we can take the baseline as the embedding of a sentence with just “no word” tokens.
In many cases (a differentiable model with a gradient), you can use integrated gradients (IG) to get a more certain and possibly faster explanation of feature importance for a prediction. However, a Shapley-value-based method is required for other (non-differentiable) model types.
At Fiddler, we support both SHAP and IG. (Full disclosure: Ankur Taly, a co-author of IG, works at Fiddler, and is a co-author of this post.) Feel free to email email@example.com for more information, or just to say hi!
What does debugging look like in the new world of machine learning models? One way uses model explanations.
Machine learning (ML) models are popping up everywhere. There is a lot of technical innovation (e.g., deep learning, explainable AI) that has made them more accurate, more broadly applicable, and usable by more people in more business applications. The lists are everywhere: banking, healthcare, tech, all of the above.
In a deep learning neural network, instead of lines of code written by people, we are looking at possibly millions of weights linked together into an incomprehensible network.
So how do we find bugs in this network? One way is to explain your model predictions. Let’s look at two types of bugs we can find through explanations (data leakage and data bias), illustrated with examples from predicting loan default. Both of these are actually data bugs, but a model summarizes the data, so they show up in the model.
Bug #1: data leakage
Most ML models are supervised. You choose a precise prediction goal (also called the “prediction target”), gather a dataset with features, and label each example with the target. Then you train a model to use the features to predict the target. Surprisingly often there are features in the dataset that relate to the prediction target but are not useful for prediction. For example, they might be added from the future (i.e. long after prediction time), or otherwise unavailable at prediction time.
Here is an example from the Lending Club dataset. We can use this dataset to model loan default, with the loan_status field as our prediction target. It takes the values “Fully Paid” (okay) or “Charged Off” (bank declared a loss, i.e. the borrower defaulted). In this dataset, there are also fields such as total_pymnt (the payments received) and loan_amnt (amount borrowed). Here are a few example values:
Notice anything? Whenever the loan has defaulted (“Charged Off”), the total payments are less than the loan amount, and delta (=loan_amnt-total_pymnt) is positive. Well, that’s not terribly surprising. Rather, it’s nearly the definition of default: by the end of the loan term, the borrower paid less than what was loaned. Now, delta doesn’t have to be positive for a default: you could default after paying back the entire loan principal amount but not all of the interest. But, in this data, 98% of the time that delta is negative, the loan was fully paid; and 100% of the time that delta is positive, the loan was charged off. Including total_pymnt gives us nearly perfect information, but we don’t get total_pymnt until after the entire loan term (3 years)!
Including both loan_amnt and total_pymnt in the data potentially allows nearly perfect prediction, but we won’t really have total_pymnt for the real prediction task. Including them both in the training data is data leakage of the prediction target.
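A quick way to spot this kind of leakage is to check how well the sign of delta alone separates the target. The sketch below uses made-up rows (not actual Lending Club records) with the field names from the text; the function name is ours.

```python
# Made-up rows for illustration only; not real Lending Club data.
loans = [
    {"loan_amnt": 10000, "total_pymnt": 11500.0, "loan_status": "Fully Paid"},
    {"loan_amnt": 8000, "total_pymnt": 3200.0, "loan_status": "Charged Off"},
    {"loan_amnt": 15000, "total_pymnt": 16900.0, "loan_status": "Fully Paid"},
    {"loan_amnt": 5000, "total_pymnt": 1250.0, "loan_status": "Charged Off"},
]


def charged_off_rate_by_delta_sign(rows):
    """Share of charged-off loans among rows with positive vs non-positive delta."""
    groups = {"positive": [], "non_positive": []}
    for row in rows:
        delta = row["loan_amnt"] - row["total_pymnt"]
        key = "positive" if delta > 0 else "non_positive"
        groups[key].append(row["loan_status"] == "Charged Off")
    return {k: sum(v) / len(v) for k, v in groups.items() if v}


rates = charged_off_rate_by_delta_sign(loans)
print(rates)  # {'positive': 1.0, 'non_positive': 0.0}
```

If a single derived feature splits the target this cleanly, it is worth asking whether that feature would really be available at prediction time.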
If we make a (cheating) model, it will perform very well. Too well. And, if we run a feature importance algorithm on some predictions (a common form of model explanation), we’ll see these two variables come up as important, and with any luck realize this data leakage.
Below, the Fiddler explanation UI shows “delta” stands out as a huge factor in raising this example prediction.
There are other, more subtle potential data leakages in this dataset. For example, the grade and sub_grade are assigned by a Lending Club proprietary model, which almost completely determines the interest rate. So, if you want to build your own risk scoring model without Lending Club, then grade, sub_grade, and int_rate are all data leakage. They wouldn’t allow you to perfectly predict default, but presumably they would help, or Lending Club would not use their own model. Moreover, for their model, they include FICO score, yet another proprietary risk score, but one that most financial institutions buy and use. If you don’t want to use FICO score, then that is also data leakage.
Data leakage is any predictive data that you can’t or won’t use for prediction. A model built on data with leakage is buggy.
Bug #2: data bias
Suppose through poor data collection or a bug in preprocessing, our data is biased. More specifically, there is a spurious correlation between a feature and the prediction target. In that case, explaining predictions will show an unexpected feature often being important.
We can simulate a data processing bug in our lending data by dropping all the charged off loans from zip codes starting with 1 through 5. Before this bug, zip code is not very predictive of chargeoff (an AUC of 0.54, only slightly above random). After this bug, any zip code starting with 1 through 5 will never be charged off, and the AUC jumps to 0.78. So, zip code will show up as an important feature in predicting (no) loan default from data examples in those zip codes. In this example, we could investigate by looking at predictions where zip code was important. If we are observant, we might notice the pattern, and realize the bias.
Below is what charge-off rate would look like if summarized by the first digit of zip code. Some zips would have no charge-offs, while the rest had a rate similar to the dataset overall.
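The simulated bug and the per-digit summary can be sketched on a handful of made-up rows. The `zip_code` field name, the masked zip format and the function names are assumptions for illustration; the AUC figures quoted above come from the real data and are not reproduced here.

```python
def drop_charged_off_in_prefixes(rows, prefixes=("1", "2", "3", "4", "5")):
    """Simulate the preprocessing bug: drop charged-off loans in some zip prefixes."""
    return [
        r for r in rows
        if not (r["loan_status"] == "Charged Off" and r["zip_code"][0] in prefixes)
    ]


def charge_off_rate_by_first_digit(rows):
    """Charge-off rate grouped by the first digit of the zip code."""
    counts = {}
    for r in rows:
        digit = r["zip_code"][0]
        total, bad = counts.get(digit, (0, 0))
        counts[digit] = (total + 1, bad + (r["loan_status"] == "Charged Off"))
    return {d: bad / total for d, (total, bad) in sorted(counts.items())}


# Made-up rows for illustration only.
rows = [
    {"zip_code": "100xx", "loan_status": "Charged Off"},
    {"zip_code": "100xx", "loan_status": "Fully Paid"},
    {"zip_code": "300xx", "loan_status": "Charged Off"},
    {"zip_code": "300xx", "loan_status": "Fully Paid"},
    {"zip_code": "900xx", "loan_status": "Charged Off"},
    {"zip_code": "900xx", "loan_status": "Fully Paid"},
]

before = charge_off_rate_by_first_digit(rows)
after = charge_off_rate_by_first_digit(drop_charged_off_in_prefixes(rows))
print(before)  # {'1': 0.5, '3': 0.5, '9': 0.5}
print(after)   # {'1': 0.0, '3': 0.0, '9': 0.5}
```

After the bug, the affected prefixes show a charge-off rate of zero while the others are unchanged, which is exactly the spurious pattern a model will learn.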
Below, the Fiddler explanation UI shows zip code prefix stands out as a huge factor in lowering this example prediction.
A model built from this biased data is not useful for making predictions on (unbiased) data we haven’t seen yet. It is only accurate in the biased data. Thus, a model built on biased data is buggy.
Other model debugging methods
There are many other possibilities for model debugging that don’t involve model explanations. For example:
Look for overfitting or underfitting. If your model architecture is too simple, it will underfit. If it is too complex, it will overfit.
Regression tests on a golden set of predictions that you understand. If these fail, you might be able to narrow down which scenarios are broken.
Since explanations aren’t involved with these methods, I won’t say more here.
If you are not sure your model is using your data appropriately, use explanations of feature importance to examine its behavior. You might see data leakage or data bias. Then, you can fix your data, which is the best way to fix your model.