As artificial intelligence (AI) adoption grows, so do the risks of today’s typical black-box AI. These risks include customer mistrust, brand risk and compliance risk. As recently as last month, concerns about AI-driven facial recognition that was biased against certain demographics resulted in a PR backlash.
With customer protection in mind, regulators are staying ahead of this technology and introducing the first wave of AI regulations meant to address AI transparency. This is a step in the right direction in terms of helping customers trust AI-driven experiences while enabling businesses to reap the benefits of AI adoption.
This first group of regulations relates to the understanding of an AI-driven, automated decision by a customer. This is especially important for key decisions like lending, insurance and health care but is also applicable to personalization, recommendations, etc.
The General Data Protection Regulation (GDPR), specifically Articles 13 and 22, was the first regulation about automated decision-making that states anyone given an automated decision has the right to be informed and the right to a meaningful explanation. According to clause 2(f) of Article 13:
“[Information about] the existence of automated decision-making, including profiling … and … meaningful information about the logic involved [is needed] to ensure fair and transparent processing.”
One of the most frequently asked questions is what the “right to explanation” means in the context of AI. Does “meaningful information about the logic involved” mean that companies have to disclose the actual algorithm or source code? Would explaining the mechanics of the algorithm be really helpful for the individuals? It might make more sense to provide information on what inputs were used and how they influenced the output of the algorithm.
For example, if a loan application or insurance claim is denied using an algorithm or machine learning model, under Articles 13 and 22, the loan or insurance officer would need to provide specific details about the impact of the user’s data to the decision. Or, they could provide general parameters of the algorithm or model used to make that decision.
Similar laws working their way through the U.S. state legislatures of Washington, Illinois and Massachusetts are
WA House Bill 1655, which establishes guidelines for “the use of automated decision systems in order to protect consumers, improve transparency, and create more market predictability.”
MA Bill H.2701, which establishes a commission on “automated decision-making, artificial intelligence, transparency, fairness, and individual rights.”
IL HB3415, which states that “predictive data analytics in determining creditworthiness or in making hiring decisions…may not include information that correlates with the race of zip code of the applicant.”
Fortunately, advances in AI have kept pace with these needs. Recent research in machine learning (ML) model interpretability makes compliance to these regulations feasible. Cutting-edge techniques like Integrated Gradients from Google Brain along with SHAP and LIME from the University of Washington enable unlocking the AI black box to get meaningful explanations for consumers.
Ensuring fair automated decisions is another related area of upcoming regulations. While there is no consensus in the research community on the right set of fairness metrics, some approaches like equality of opportunity are already required by law in use cases like hiring. Integrating AI explainability in the ML lifecycle can also help provide insights for fair and unbiased automated decisions. Assessing and monitoring these biases, along with data quality and model interpretability approaches, provides a good playbook towards developing fair and ethical AI.
The recent June 26 US House Committee hearing is a sign that financial services need to get ready for upcoming regulations that ensure transparent AI systems. All these regulations will help increase trust in AI models and accelerate their adoption across industries toward the longer-term goal of trustworthy AI.
Last week, the Explainable AI Summit, hosted by Fiddler Labs, returned to discuss top-of-mind issues that leaders face when implementing AI. Over eighty data scientists and business leaders joined us at Galvanize to hear from the keynote speaker and Fiddler’s head of data science, Ankur Taly, and our distinguished panelists moderated by our CEO Krishna Gade:
Manasi Joshi, Engineering Director of ML Productivity, Google Brain
Reprising some topics from our summit in February, the H2 summit focused on explainability techniques and industry-specific challenges.
Takeaway #1: In financial services, companies are working through regulatory and technical hurdles to integrate machine learning techniques into their business model.
Financial services understand the potential of AI and want to adopt machine learning techniques. But they are reasonably wary of running afoul of regulations. If someone suspects a creditor has been discriminatory, the Federal Trade Commission explicitly suggests that he or she consider suing the creditor in federal district court.
Banks and insurance companies already subject models to strict, months-long validation by legal teams and risk teams. But if using opaque deep learning methods means forgoing certainty around model fairness, then these methods cannot be a priority.
However, some use cases are less regulated or not regulated at all, allowing financial services to explore AI integration selectively. Especially if regulators continue to accept that AI may never reach full explainability, AI usage in financial services will increase.
Takeaway #2: And across all industries, leaders are prioritizing trustworthiness of their models.
Most companies understand the risk to their brand and consumer trust if models go awry and make poor decisions. So leaders are implementing more checks before models are promoted to production.
Once in production, externally facing models generate questions and concerns from customers themselves. Business leaders are seeing the need to build explainability tools to address these questions surrounding content selection. Fortunately, many explainability tools are available in open source, like Google’s TCAV and Tensorflow Model Analyzer.
And as automated ML platforms attract hundreds of thousands of users, platform developers are ramping up education about incorrect usage. Ramping up education is necessary but not sufficient. ML Platforms should assist modelers with capabilities to inspect model behavior against sub groups of choice to inform if there is potential bias as manifested by their models.
Takeaway #3: Integrated Gradients is a best-in-class technique to attribute a deep network’s (or any differentiable model’s) prediction to its input features.
A major component of explaining an AI model is the attribution problem. For any given prediction by a model, how can we attribute that prediction to the model’s features?
Currently, several approaches in use are ineffective. For example, an ablation study (dropping a feature and observing the change in prediction) is computationally expensive and misleading when features interact.
To define a better attribution approach, Ankur Taly and his co-creators first established the desirable criteria, or axioms. One axiom, for instance, is insensitivity: a variable that has no effect on the output should get no attribution. These axioms then uniquely define the Integrated Gradients method, which is described by the equation below.
Integrated gradients is easy to apply and widely applicable to differentiable models. Data science teams should consider this method to evaluate feature attribution inexpensively and accurately.
Thank you to Galvanize for hosting the event, to our panelists, and to our engaged audience! We look forward to our next in-person event, and in the meantime, stay tuned for our first webinar. For more information, please email email@example.com.
Two different explanation algorithm types, best in different situations.
Some of the most accurate predictive models today are black box models, meaning it is hard to really understand how they work. To address this problem, techniques have arisen to understand feature importance: for a given prediction, how important is each input feature value to that prediction? Two well-known techniques are SHapley Additive exPlanations (SHAP) and Integrated Gradients (IG). In fact, they each represent a different type of explanation algorithm: a Shapley-value-based algorithm (SHAP) and a gradient-based algorithm (IG).
There is a fundamental difference between these two algorithm types. This post describes that difference. First, we need some background. Below, we review Shapley values, Shapley-value-based methods (including SHAP), and gradient-based methods (including IG). Finally, we get back to our central question: When should you use a Shapley-value-based algorithm (like SHAP) versus a gradient-based explanation explanation algorithm (like IG)?
What are Shapley values?
The Shapley value (proposed by Lloyd Shapley in 1953) is a classic method to distribute the total gains of a collaborative game to a coalition of cooperating players. It is provably the only distribution with certain desirable properties (fully listed on Wikipedia).
In our case, we formulate a game for the prediction at each instance. We consider the “total gains” to be the prediction value for that instance, and the “players” to be the model features of that instance. The collaborative game is all of the model features cooperating to form a prediction value. The Shapley value efficiency property says the feature attributions should sum to the prediction value. The attributions can be negative or positive, since a feature can lower or raise a predicted value.
There is a variant called the Aumann-Shapley value, extending the definition of the Shapley value to a game with many (or infinitely many) players, where each player plays only a minor role, if the worth function (the gains from including a coalition of players) is differentiable.
What is a Shapley-value-based explanation method?
A Shapley-value-based explanation method tries to approximate Shapley values of a given prediction by examining the effect of removing a feature under all possible combinations of presence or absence of the other features. In other words, this method looks at function values over subsets of features like F(x1, <absent>, x3, x4, …, <absent>, …, xn). How to evaluate a function F with one or more absent features is subtle.
For example, SHAP (SHapely Additive exPlanations) estimates the model’s behavior on an input with certain features absent by averaging over samples from those features drawn from the training set. In other words, F(x1, <absent>, x3, …, xn) is estimated by the expected prediction when the missing feature x2 is sampled from the dataset.
Exactly how that sample is chosen is important (for example marginal versus conditional distribution versus cluster centers of background data), but I will skip the fine details here.
Once we define the model function (F) for all subsets of the features, we can apply the Shapley values algorithm to compute feature attributions. Each feature’s Shapley value is the contribution of the feature for all possible subsets of the other features.
The “kernel SHAP” method from the SHAP paper computes the Shapley values of all features simultaneously by defining a weighted least squares regression whose solution is the Shapley values for all the features.
The high-level point is that all these methods rely on taking subsets of features. This makes the theoretical version exponential in runtime: for N features, there are 2N combinations of presence and absence. That is too expensive for most N, so these methods approximate. Even with approximations, kernel SHAP can be slow. Also, we don’t know of any systematic study of how good the approximation is.
There are versions of SHAP specialized to different model architectures for speed. For example, Tree SHAP computes all the subsets by cleverly keeping track of what proportion of all possible subsets flow down into each of the leaves of the tree. However, if your model architecture does not have a specialized algorithm like this, you have to fall back on kernel SHAP, or another naive (unoptimized) Shapley-value-based method.
A Shapley-value-based method is attractive as it only requires black box access to the model (i.e. computing outputs from inputs), and there is a version agnostic to the model architecture. For instance, it does not matter whether the model function is discrete or continuous. The downside is that exactly computing the subsets is exponential in the number of features.
What is a gradient-based explanation method?
A gradient-based explanation method tries to explain a given prediction by using the gradient of (i.e. change in) the output with respect to the input features. Some methods like Integrated Gradients (IG), GradCAM, and SmoothGrad literally apply the gradient operator. Other methods like DeepLift and LRP apply “discrete gradients.”
Let me describe IG, which has the advantage that it tries to approximate Aumann-Shapley values, which are axiomatically justified. IG operates by considering a straight line path, in feature space, from the input at hand (e.g., an image from a training set) to a certain baseline input (e.g., a black image), and integrating the gradient of the prediction with respect to input features (e.g., image pixels) along this path.
This paper explains the intuition of the IG algorithm as follows. As the input varies along the straight line path between the baseline and the input at hand, the prediction moves along a trajectory from uncertainty to certainty (the final prediction probability). At each point on this trajectory, one can use the gradient with respect to the input features to attribute the change in the prediction probability back to the input features. IG aggregates these gradients along the trajectory using a path integral.
IG (roughly) requires the prediction to be a continuous and piecewise differentiable function of the input features. (More precisely, it requires the function is continuous everywhere and the partial derivative along each input dimension satisfies Lebesgue’s integrability condition, i.e., the set of discontinuous points has measure zero.)
Note it is important to choose a good baseline for IG to make sensible feature attributions. For example, if a black image is chosen as baseline, IG won’t attribute importance to a completely black pixel in an actual image. The baseline value should both have a near-zero prediction, and also faithfully represent a complete absence of signal.
IG is attractive as it is broadly applicable to all differentiable models, easy to implement in most machine learning frameworks (e.g., TensorFlow, PyTorch, Caffe), and computationally scalable to massive deep networks like Inception and ResNet with millions of neurons.
When should you use a Shapley-value-based versus a gradient-based explanation method?
Finally, the payoff! Our advice: If the model function is piecewise differentiable and you have access to the model gradient, use IG. Otherwise, use a Shapley-value-based method.
Any model trained using gradient descent is differentiable. For example: neural networks, logistic regression, support vector machines. You can use IG with these. The major class of non-differentiable models is trees: boosted trees, random forests. They encode discrete values at the leaves. These require a Shapley-value-based method, like Tree SHAP.
The IG algorithm is faster than a naive Shapley-value-based method like kernel SHAP, as it only requires computing the gradients of the model output on a few different inputs (typically 50). In contrast, a Shapley-value-based method requires computing the model output on a large number of inputs sampled from the exponentially huge subspace of all possible combinations of feature values. Computing gradients of differentiable models is efficient and well supported in most machine learning frameworks. However, a differentiable model is a prerequisite for IG. By contrast, a Shapley-value-based method makes no such assumptions.
Several types of input features that look discrete (hence might require a Shapley-value-based method) actually can be mapped to differentiable model types (which let us use IG). Let us walk through one example: text sentiment. Suppose we wish to attribute the sentiment prediction to the words in some input text. At first, it seems that such models may be non-differentiable as the input is discrete (a collection of words). However, differentiable models like deep neural networks can handle words by first mapping them to a high-dimensional continuous space using word embeddings. The model’s prediction is a differentiable function of these embeddings. This makes it amenable to IG. Specifically, we attribute the prediction score to the embedding vectors. Since attributions are additive, we sum the attributions (retaining the sign) along the fields of each embedding vector and map it to the specific input word that the embedding corresponds to.
A crucial question for IG is: what is the baseline prediction? For this text example, one option is to use the embedding vector corresponding to empty text. Some models take fixed length inputs by padding short sentences with a special “no word” token. In such cases, we can take the baseline as the embedding of a sentence with just “no word” tokens.
In many cases (a differentiable model with a gradient), you can use integrated gradients (IG) to get a more certain and possibly faster explanation of feature importance for a prediction. However, a Shapley-value-based method is required for other (non-differentiable) model types.
At Fiddler, we support both SHAP and IG. (Full disclosure: Ankur Taly, a co-author of IG, works at Fiddler, and is a co-author of this post.) Feel free to email firstname.lastname@example.org for more information, or just to say hi!
You can’t always change a human’s input to see the output.
At Fiddler Labs, we place great emphasis on model explanations being faithful to the model’s behavior. Ideally, feature importance explanations should surface and appropriately quantify all and only those factors that are causally responsible for the prediction. This is especially important if we want explanations to be legally compliant (e.g., GDPR, article 13 section 2f, people have a right to ‘[information about] the existence of automated decision-making, including profiling .. and .. meaningful information about the logic involved’), and actionable. Even when making post-processing explanations human-intelligible, we must preserve faithfulness to the model.
How do we differentiate between features that are correlated with the outcome, and those that cause the outcome? In other words, how do we think about the causality of a feature to a model output, or to a real-world task? Let’s take those one at a time.
Explaining causality in models is hard
When explaining a model prediction, we’d like to quantify the contribution of each (causal) feature to the prediction.
For example, in a credit risk model, we might like to know how important income or zip code is to the prediction.
Note that zip code may be causal to a model’s prediction (i.e. changing zip code may change the model prediction) even though it may not be causal to the underlying task (i.e. changing zip code may not change the decision of whether to grant a loan). However, these two things may be related if this model’s output is used in the real-world decision process.
The good news is that since we have input-output access to the model, we can probe it with arbitrary inputs. This allows examining counterfactuals, inputs that are different from those of the prediction being explained. These counterfactuals might be elsewhere in the dataset, or they might not.
Shapley values (a classic result from game theory) offer an elegant, axiomatic approach to quantify feature contributions.
One challenge is they rely on probing with an exponentially large set of counterfactuals, too large to compute. Hence, there are several papers on approximating Shapley values, especially for specific classes of model functions.
However, a more fundamental challenge is that when features are correlated, not all counterfactuals may be realistic. There is no clear consensus on how to address this issue, and existing approaches differ on the exact set of counterfactuals to be considered.
To overcome these challenges, it is tempting to rely on observational data. For instance, using the observed data to define the counterfactuals for applying Shapley values. Or more simply, fitting an interpretable model on it to mimic the main model’s prediction and then explaining the interpretable model in lieu of the main model. But, this can be dangerous.
Consider a credit risk model with features including the applicant’s income and zip code. Say the model internally only relies on the zip code (i.e., it redlines applicants). Explanations based on observational data might reveal that the applicant’s income, by virtue of being correlated to zip code, is as predictive of the model’s output. This may mislead us to explain the model’s output in terms of the applicant’s income. In fact, a naive explanation algorithm will split attributions equally between two perfectly correlated features.
To learn more, we can intervene in features. One counterfactual changing zip code but not income will reveal that zip code causes the model’s prediction to change. A second counterfactual that changes income but not zip code will reveal that income does not. These two together will allow us to conclude that zip code is causal to the model’s prediction, and income is not.
Explaining causality requires the right counterfactuals.
Explaining causality in the real world is harder
Above we outlined a method to try to explain causality in models: study what happens when features change. To do so in the real world, you have to be able to apply interventions. This is commonly called a “randomized controlled trial” (also known as an “A/B testing” when there are two variants, especially in the tech industry). You divide a population into two or more groups randomly, and apply different interventions to each group. The randomization ensures that the only differences among the groups are your intervention. Therefore, you can conclude that your intervention causes the measurable differences in the groups.
The challenge in applying this method to real-world tasks is that not all interventions are feasible. You can’t ethically ask someone to take up smoking. In the real world, you may not be able to get the data you need to properly examine causality.
We can probe models as we wish, but not people.
Natural experiments can provide us an opportunity to examine situations where we would not normally intervene, like in epidemiology and economics. However, these provide us a limited toolkit, leaving many questions in these fields up for debate.
There are proposals for other theories that allow us to use domain knowledge to separate correlation from causation. These are subject to ongoing debate and research.
Now you know why explaining causality in models is hard, and explaining it in the real world is even harder.
To learn more about explaining models, email us at email@example.com. (Photo credit: pixabay.) This post was co-written with Ankur Taly.
What does debugging look like in the new world of machine learning models? One way uses model explanations.
Machine learning (ML) models are popping up everywhere. There is a lot of technical innovation (e.g., deep learning, explainable AI) that has made them more accurate, more broadly applicable, and usable by more people in more business applications. The lists are everywhere: banking, healthcare, tech, all of the above.
In a deep learning neural network, instead of lines of code written by people, we are looking at possibly millions of weights linked together into an incomprehensible network. (picture credit)
So how do we find bugs in this network? One way is to explain your model predictions. Let’s look at two types of bugs we can find through explanations (data leakage and data bias), illustrated with examples from predicting loan default. Both of these are actually data bugs, but a model summarizes the data, so they show up in the model.
Bug #1: data leakage
Most ML models are supervised. You choose a precise prediction goal (also called the “prediction target”), gather a dataset with features, and label each example with the target. Then you train a model to use the features to predict the target. Surprisingly often there are features in the dataset that relate to the prediction target but are not useful for prediction. For example, they might be added from the future (i.e. long after prediction time), or otherwise unavailable at prediction time.
Here is an example from the Lending Club dataset. We can use this dataset to try modeling predicting loan default with loan_status field as our prediction target. It takes the values “Fully Paid” (okay) or “Charged Off” (bank declared a loss, i.e. the borrower defaulted). In this dataset, there are also fields such as total_pymnt (the payments received) and loan_amnt (amount borrowed). Here are a few example values:
Notice anything? Whenever the loan has defaulted (“Charged Off”), the total payments are less than the loan amount, and delta (=loan_amnt-total_pymnt) is positive. Well, that’s not terribly surprising. Rather, it’s nearly the definition of default: by the end of the loan term, the borrower paid less than what was loaned. Now, delta doesn’t have to be positive for a default: you could default after paying back the entire loan principal amount but not all of the interest. But, in this data, 98% of the time if delta is negative, the loan was fully paid; and 100% of the time delta is positive, the loan was charged off. Including total_pymnt gives us nearly perfect information, but we don’t get total_pymnt until after the entire loan term (3 years)!
Including both loan_amnt and total_pymnt in the data potentially allows nearly perfect prediction, but we won’t really have total_pymnt for the real prediction task. Including them both in the training data is data leakage of the prediction target.
If we make a (cheating) model, it will perform very well. Too well. And, if we run a feature importance algorithm on some predictions (a common form of model explanation), we’ll see these two variables come up as important, and with any luck realize this data leakage.
Below, the Fiddler explanation UI shows “delta” stands out as a huge factor in raising this example prediction.
There are other, more subtle potential data leakages in this dataset. For example, the grade and sub_grade are assigned by a Lending Club proprietary model, which almost completely determines the interest rate. So, if you want to build your own risk scoring model without Lending Club, then grade, sub_grade, and int_rate are all data leakage. They wouldn’t allow you to perfectly predict default, but presumably they would help, or Lending Club would not use their own model. Moreover, for their model, they include FICO score, yet another proprietary risk score, but one that most financial institutions buy and use. If you don’t want to use FICO score, then that is also data leakage.
Data leakage is any predictive data that you can’t or won’t use for prediction. A model built on data with leakage is buggy.
Bug #2: data bias
Suppose through poor data collection or a bug in preprocessing, our data in biased. More specifically, there is a spurious correlation between a feature and the prediction target. In that case, explaining predictions will show an unexpected feature often being important.
We can simulate a data processing bug in our lending data by dropping all the charged off loans from zip codes starting with 1 through 5. Before this bug, zip code is not very predictive of chargeoff (an AUC of 0.54, only slightly above random). After this bug, any zip code starting with 1 through 5 will never be charged off, and the AUC jumps to 0.78. So, zip code will show up as an important feature in predicting (no) loan default from data examples in those zip codes. In this example, we could investigate by looking at predictions where zip code was important. If we are observant, we might notice the pattern, and realize the bias.
Below is what charge-off rate would look like if summarized by the first digit of zip code. Some zips would have no charge-offs, while the rest had a rate similar to the dataset overall.
Below, the Fiddler explanation UI shows zip code prefix stands out as a huge factor in lowering this example prediction.
A model built from this biased data is not useful for making predictions on (unbiased) data we haven’t seen yet. It is only accurate in the biased data. Thus, a model built on biased data is buggy.
Other model debugging methods
There are many other possibilities for model debugging that don’t involve model explanations. For example:
Look for overfitting or underfitting. If your model architecture is too simple, it will underfit. If it is too complex, it will overfit.
Regression tests on a golden set of predictions that you understand. If these fail, you might be able to narrow down which scenarios are broken.
Since explanations aren’t involved with these methods, I won’t say more here.
If you are not sure your model is using your data appropriately, use explanations of feature importance to examine its behavior. You might see data leakage or data bias. Then, you can fix your data, which is the best way to fix your model.
It is a bipartisan sentiment that, left unchecked, AI can pose a risk to fairness in financial services. While the exact extent of this danger might be debated, governments in the US and abroad acknowledge the necessity and assert the right to regulate financial institutions for this purpose.
The June 26 hearing was the first wake-up call for financial services: they need to be prepared to respond and comply with future legislation requiring transparency and fairness.
In this post, we review the notable events of this hearing, and we explore how the US House is beginning to examine the risks and benefits of AI in financial services.
Two new House Task Forces to regulate fintech and AI
The fintech task force should have a nearer-term focus on applications (e.g. underwriting, payments, immediate regulation).
The AI task force should have a longer-term focus on risks (e.g. fraud, job automation, digital identification).
And explicitly, Chairwoman Waters explained her overall interest in regulation:
Make sure that responsible innovation is encouraged, and that regulators and the law are adapting to the changing landscape to best protect consumers, investors, and small businesses.
The appointed chairman of the Task Force on AI, Congressman Bill Foster (D-IL), extolled AI’s potential in a similar statement, but also cautioned,
It is crucial that the application of AI to financial services contributes to an economy that is fair for all Americans.
This first hearing did find ample AI applications in financial services. But it also concluded that these worried sentiments are neither misrepresentative of their constituents nor misplaced.
Risks of AI
In a humorous exchange later in the hearing, Congresswoman Sylvia Garcia (D-TX) asks a witness, Dr. Bonnie Buchanan of the University of Surrey, to address the average American and explain AI in 25 words or less. It does not go well.
DR. BUCHANAN I would say it’s a group of technologies and processes that can look at determining general pattern recognition, universal approximation of relationships, and trying to detect patterns from noisy data or sensory perception.
REP. GARCIA I think that probably confused them more.
DR. BUCHANAN Oh, sorry.
Beyond making jokes, Congresswoman Garcia has a point. AI is extraordinarily complex. Not only that, to many Americans it can be threatening. As Garcia later expresses, “I think there’s an idea that all these robots are going to take over all the jobs, and everybody’s going to get into our information.”
In his opening statement, task force ranking member Congressman French Hill (R-AR) tries to preempt at least the first concern. He cites a World Economic Forum study that the 75 million jobs lost because of AI will be more than offset by 130 million new jobs. But Americans are still anxious about AI development.
overwhelming support for careful management of robots and/or AI (82% support)
more trust in tech companies than in the US government to manage AI in the interest of the public
mixed support for developing high-level machine intelligence (defined as “when machines are able to perform almost all tasks that are economically relevant today better than the median human today”)
This public apprehension about AI development is mirrored by concerns from the task force and experts. Personal privacy is mentioned nine times throughout the hearing, notably in Congressman Anthony Gonzalez’s (R-OH) broad question on “balancing innovation with empowering consumers with their data,” which the panel does not quite adequately address.
But more often, the witnesses discuss fairnessand how AI models could discriminate unnoticed. Most notably, Dr. Nicol Turner-Lee, a fellow at the the Brookings Institution, suggests implementing guardrails to prevent biased training data from “replicat[ing] and amplify[ing] stereotypes historically prescribed to people of color and other vulnerable populations.”
And she’s not alone. A separate April 2019 Brookings report seconds this concern of an unfairness “whereby algorithms deny credit or increase interest rates using a host of variables that are fundamentally driven by historical discriminatory factors that remain embedded in society.”
So if we’re so worried, why bother introducing the Pandora’s box of AI to financial services at all?
Benefits of AI
AI’s potential benefits, according to Congressman Hill, are to “gather enormous amounts of data, detect abnormalities, and solve complex problems.” In financial services, this means actually fairer and more accurate models for fraud, insurance, and underwriting. This can simultaneously improve bank profitability and extend services to the previously underbanked.
Both Hill and Foster cite a National Bureau of Economic Research working paper finding where in one case, algorithmic lending models discriminate 40% less than face-to-face lenders. Furthermore, Dr. Douglas Merrill, CEO of ZestFinance and expert witness, claims that customers using his company’s AI tools experience higher approval rates for credit cards, auto loans, and personal loans, each with no increase in defaults.
Moreover, Hill frames his statement with an important point about how AI could reshape the industry: this advancement will work “for both disruptive innovators and for our incumbent financial players.” At first this might seem counterintuitive.
“Disruptive innovators,” more agile and hindered less by legacy processes, can have an advantage in implementing new technology. But without the immense budgets and customer bases of “incumbent financial players,” how can these disruptors succeed? And will incumbents, stuck in old ways, ever adopt AI?
Mr. Jesse McWaters, financial innovation lead at the World Economic Forum and the final expert witness, addresses this apparent paradox, discussing what will “redraw the map of what we consider the financial sector.” Third-party AI service providers — from traditional banks to small fintech companies — can “help smaller community banks remain digitally relevant to their customers” and “enable financial institutions to leapfrog forward.”
Enabling competitive markets, especially in concentrated industries like financial services, is an unadulterated benefit according to free market enthusiasts in Congress. However, “redrawing the map” in this manner makes the financial sector larger and more complex. Congress will have to develop policy responding to not only more complex models, but also a more complex financial system.
This system poses risks both to corporations, acting in the interest of shareholders, and to the government, acting in the interest of consumers.
Business and government look at risks
Businesses are already acting to avert potential losses from AI model failure and system complexity. A June 2019 Gartner report predicts that 75% of large organizations will hire AI behavioral forensic experts to reduce brand and reputation risk by 2023.
However, governments recognize that business-led initiatives, if motivated to protect company brand and profits, may only go so far. For a government to protect consumers, investors, and small businesses (the relevant parties according to Chairwoman Waters), a gap may still remain.
As governments explore how to fill this gap, they are establishing principles that will underpin future guidance and regulation. The themes are consistent across governing bodies:
AI systems need to be trustworthy.
They therefore require some government guidance or regulation from government representing the people.
This guidance should encourage fairness, privacy, and transparency.
In the US, President Donald Trump signed an executive order in February 2019 “to Maintain American Leadership in Artificial Intelligence,” directing federal agencies to, among other goals, “foster public trust in AI systems by establishing guidance for AI development and use.” The Republican White House and Democratic House of Representatives seem to clash at every turn, but they align here.
The EU is also establishing a regulatory framework for ensuring trustworthy AI. Likewise included among the seven requirements in their latest communication from April 2019: privacy, transparency, and fairness.
And June’s G20 summit drew upon similar ideas to create their own set of principles, including fairness and transparency, but also adding explainability.
These governing bodies are in a fact-finding stage, establishing principles and learning what they are up against before guiding policy. In the words of Chairman Foster, the task force must understand “how this technology will shape the questions that policymakers will have to grapple with in the coming years.”
Conclusion: Explain your models
An hour before Congresswoman Garcia’s amusing challenge, Dr. Buchanan reflected upon a couple common themes of concern.
Policymakers need to be concerned about the explainability of artificial intelligence models. And we should avoid black-box modeling where humans cannot determine the underlying process or outcomes of the machine learning or deep learning algorithms.
But through this statement, she suggests a solution: make these AI models explainable. If humans can indeed understand the inputs, process, and outputs of a model, we can trust our AI. Then throughout AI applications in financial services, we can promote fairness for all Americans.
Zhang, Baobao and Allan Dafoe. “Artificial Intelligence: American Attitudes and Trends.” Oxford, UK: Center for the Governance of AI, Future of Humanity Institute, University of Oxford, 2019. https://ssrn.com/abstract=3312874
In today’s world, data has played a huge role in the success of technology giants like Google, Amazon, and Facebook. All of these companies have built massively scalable infrastructure to process data and provide great product experiences for their users. In the last 5 years, we’ve seen a real emergence of AI as a new technology stack. For example, Facebook built an end-to-end platform called FBLearner that enables an ML Engineer or a Data Scientist build Machine Learning pipelines, run lots of experiments, share model architectures and datasets with team members, scale ML algorithms for billions of Facebook users worldwide. Since its inception, millions of models have been trained on FBLearner and every day these models answer billions of real-time queries to personalize News Feed, show relevant Ads, recommend Friend connections, etc.
However, for most other companies building AI applications remains extremely expensive. This is primarily due to a lack of systems and toolsfor supporting end-to-end machine learning (ML) application development — from data preparation and labeling to operationalization and monitoring .
The goal of this post is 2-fold:
List the challenges with adopting AI successfully: data management, model training, evaluation, deployment, and monitoring;
List the tools I think we need to create to allow developers to meet these challenges: a data-centric IDE with capabilities like explainable recommendations, robust dataset management, model-aware testing, model deployment, measurement, and monitoring capabilities.
Challenges of adopting AI
In order to build an end-to-end ML platform, a data scientist has to go through multiple hoops of the following workflow .
End-to-End ML Workflow
A big challenge to building AI applications is that different stages of the workflow require new software abstractions that can accommodate complex interactions with the underlying data used in AI training or prediction. For example:
Data Management requires a data scientist to build and operate systems like Hive, Hadoop, Airflow, Kafka, Spark etc to assemble data from different tables, clean datasets, procure labeling data, construct features and make them ready for training. In most companies, data scientists rely on their data engineering teams to maintain this infrastructure and help build ETL pipelines to get feature datasets ready.
Training models is more of an art than science. It requires understanding which features work and what modeling algorithms are suitable to the problem at hand. Although there are libraries like PyTorch, TensorFlow, Scikit-Learn etc, there is a lot of manual work in feature selection, parameter optimization, and experimentation.
Model evaluation is often performed as a team activity since it requires other people to review the model performance across a variety of metrics from AUC, ROC, Precision/Recall and ensure that model is calibrated well, etc. In the case of Facebook, this was built into FBLearner, where every model created on the platform would get an auto-generated dashboard showing all these statistics.
Deploying models requires data scientists to first pick the optimal model and make it ready to be deployed to production. If the model is going to impact business metrics of the product and will be consumed in a realtime manner, we need to deploy it to only a small % of traffic and run an A/B test with an existing production model. Once the A/B test is positive in terms of business metrics, the model gets rolled out to 100% of production traffic.
Inference of the models is closely tied with deployment, there can be 2 ways a model can be made available for consumption to make predictions.
batch inference, where a data pipeline is built to scan through a dataset and make predictions on each record or a batch of records.
realtime inference, where a micro-service hosts the model and makes predictions in a low-latency manner.
Monitoring predictions is very important because unlike traditional applications, model performance is non-deterministic and depends on various factors such as seasonality, new user behavior trends, data pipeline unreliability leading to broken features. For example, a perfectly functioning Ads model might need to be updated when a new holiday season arrives or a model trained to show content recommendations in the US may not do very well for users signing up internationally. There is also a need for alerts and notifications to detect model degradation quickly and take action.
As we can see, the workflow to build machine learning models is significantly different from building general software applications. f models are becoming first-class citizens in the modern enterprise stack, they need better tools. As Tesla’s Director of AI Andrej Karpathy succinctly puts it, AI is Software 2.0 and it needs new tools .
If we compare the stack of Software 1.0 with 2.0, I claim we require transformational thinking to build the new developer stack for AI.
We need new tools for AI engineering
In Software 1.0, we have seen a vast amount of tooling built in the past few decades to help developers write code, share it with other developers, get it reviewed, debug it, release it to production and monitor its performance. If we were to map these tools in the 2.0 stack, there is a big gap!
What would an ideal Developer Toolkit look like for an AI engineer?
To start with, we need to take a data-first approach as we build this toolkit because, unlike Software 1.0, the fundamental unit of input for 2.0 is data.
Integrated Development Environment (IDE): Traditional IDEs focus on helping developers write code, focus on features like syntax highlighting, code checkpointing, unit testing, code refactoring, etc.
For machine learning, we need an IDE that allows easy import and exploration of data, cleaning and massaging of tables. Jupyter notebooks are somewhat useful, but they have their own problems, including the lack of versioning and review tools. A powerful 2.0 IDE would be more data-centric, starts with allowing the data scientist to slice and dice data, edit the model architecture either via code or UI and debug the model on egregious cases where it might be not performing well. I see traction in this space with products like StreamLit  reimagining IDEs for ML.
Tools like Git, Jenkins, Puppet, Docker have been very successful in traditional software development by taking care of continuous integration and deployment of software. When it comes to machine learning, the following steps would constitute the release process.
Model Versioning: As more models get into production, managing the various versions of them becomes important. Git can be reused for models, however, it won’t scale for large datasets. The reason to version datasets is that to be able to reproduce a model, we need the snapshot of the data the model was trained upon. Naive implementations of this could explode the amount of data we’re versioning, think 1-copy-of-dataset-per-model-version. DVC  which is an open-source version control system is a good start and is gaining momentum.
Unit Testing is another important part of the build & release cycle. For ML, we need unit tests that catch not only code quality bugs but also data quality bugs.
Canary Tests are minimal tests to quickly and automatically verify that the everything we depend on is ready. We typically run Canary tests before other time-consuming tests, and before wasting time investigating the code when the other tests are failing . In Machine Learning, it means being able to replay a previous set of examples on the new Model and ensuring that it meets certain minimal set of conditions.
A/B Testing is a method of comparing two versions of an application change to determine which one performs better . For ML, AB testing is an experiment where two or more variations of the ML model are exposed to users at random, and statistical analysis is used to determine which variation performs better for a given conversion goal. For example in the dashboard below, we’re measuring click conversion on an A/B experiment dashboard that my team built at Pinterest, and it shows the performance of the ML experiments against business metrics like repins, likes, etc. CometML  lets data scientists keep track of ML experiments and collaborate with their team members.
Debugging: One of the main features of an IDE is the ability to debug the code and find exactly the line where the error occurred. For machine learning, this becomes a hard problem because models are often opaque and therefore exactly pinpointing why a particular example was misclassified is difficult. However, if we can understand the relationship between feature variables and the target variable in a consistent manner, it goes a long way in debugging models, also calledmodel interpretability, which is an active area of research. At Fiddler, we’re working on a product offering that allows data scientists to debug any kind of models and perform root cause analysis.
Profiling: Performance analysis is an important part of SDLC in 1.0 and profiling tools allow engineers to figure out slowness of an application and improve it. For models, it is also about improving performance metrics like AUC, log loss, etc. Often times, a given model could have a higher score on an aggregate metric but it can be performing poorly on certain instances or subsets of the dataset. This is where tools like Manifold  can enhance the capabilities of traditional performance analysis.
Monitoring: While superficially, application monitoring might seem similar to model monitoring and could actually be a good place to start, we need to track a different class of metrics for machine learning. Monitoring is crucial for models that automatically incorporate new data in a continual or ongoing fashion at training time, and is always needed for models that serve predictions in an on-demand fashion. We can categorize monitoring into 4 broad classes:
Feature Monitoring: This is to ensure that features are stable over time, certain data invariants are upheld, any checks w.r.t privacy can be made as well as continuous insight into statistics like feature correlations.
Model Ops Monitoring: Staleness, regressions in serving latency, throughput, RAM usage, etc.
Model Performance Monitoring: Regressions in prediction quality at inference time.
Model Bias Monitoring: Unknown introductions of bias both direct and latent.
I walked through 1) some challenges to successfully deploying AI (data management, model training, evaluation, deployment, and monitoring), 2) some tools I propose we need to meet these challenges (a data-centric IDE with capabilities like slicing & dicing of data, robust dataset management, model-aware testing, and model deployment, measurement, and monitoring capabilities). If you are interested in some of these tools, we’re working on them at Fiddler Labs. And if you’re interested in building these tools, we would love to hear from you at https://angel.co/fiddler-labs
A gentle introduction to a white box machine learning model called a GA2M, a Generalized Additive Model (GAM) with interaction terms.
This post is a gentle introduction to a white box machine learning model called a GA2M.
We’ll walk through:
What is a white box model, and why would you want one?
A classic example white box model: logistic regression
What is a GAM, and why would you want one?
What is a GA2M, and why would you want one?
When should you choose a GAM, a GA2M, or something else?
The purpose of all these machine learning models is to make a prediction towards a goal specified by a human. Think of a model that can predict loan default, or the presence of someone’s face in a picture.
The short story: A generalized additive model (GAM) is a white box model that is more flexible than logistic regression, but still interpretable. A GA2M is a GAM with interaction terms, which allows it to be more flexible still, but with a more complicated interpretation. GAMs and GA2Ms are an intriguing addition to your toolbox, interpretable at the expense of not fitting every kind of data. A picture:
For more about what that all means, read on.
White box models
The term “white box” comes from software engineering. It means software whose internals you can view, compared to a “black box” whose internals you cannot view. By this definition, a neural network could be a white box model if you can see the weights (picture credit):
However, by white box people really mean something they can understand. A white box model is a model whose internals a person can see and reason about. This is subjective, but most people would agree the weights shown above don’t give us information about how the model works in such a way as we could usefully describe it, or predict what the model is going to do in the future.
Compare the picture above to this one about risk of death from pneumonia by age from :
Now that isn’t a whole model. Rather, it’s the impact of one feature (age) on the risk score. The green lines are error bars (±1 standard deviation in 100 rounds of bagging). The red line in the middle of them is the best estimate. In the paper, they observe:
Risk score is flat until about age 50. Risk score here is negative, meaning less risk of death than the average in the dataset.
Risk score rises sharply at 65. This could be due to retirement. In future, it might be interesting to gather data about retirement.
The error bars are narrowest in ages 66-85. Perhaps that is where the most data is.
Risk score rises again at 85. The error bars also widen again. Maybe this jump is not real.
Risk score drops above 100. This may be due to lack of data, or something else. In the paper, they suggested one might wish to “fix” this region of the model by changing it to predict at the same level as ages 85-100 instead of dropping. This fix is using domain knowledge (“risk of pneumonia likely doesn’t go down after age 85”) to address possible model artifacts.
Risk score between 66 and 85 is relatively flat.
All this from one graph of one model feature. There are facts, like the shape of the graph, and then speculation about why the graph might behave that way. The facts are useful to understand the data. The speculation cannot be answered by any tool, but may be useful to suggest further actions, like collecting new features (say, about retirement) or new instances (like points below age 50 or above 100), or new analyses (like looking carefully at data instances around ages 85-86 for differences).
These aren’t simulations of what the model would do. These are the internals of the model itself, so that graph is accurately describing the exact effect of age on risk score. There are 55 other components to this model, but each can be examined and reasoned about.
This is the power of a white box model.
This example also shows the dangers. By seeing everything, we may believe we understand everything, and speculate wildly or “fix” inappropriately. As always, we have to exercise judgment to use data properly.
In summary: make a white box model to
learn about your model, not from simulations or approximations, but the actual internals
improve your model, by giving you ideas of directions to pursue
“fix” your model, i.e., align it with your intuition or domain knowledge
One final possibility: regulations dictate that you need to fully describe your model. In that case, it could be useful to have human-readable internals for reference.
Here are some examples of white box and black box models:
White box models
Black box models
Logistic regression GAMs GA2Ms Decision trees (short and few trees)
Neural networks (including deep learning) Boosted trees and random forests (many trees) Support vector machines
Now let’s walk through three specific white box models.
A classic: logistic regression
Logistic regression was developed in the early 1800s, and re-popularized in the 1900s. It’s been around for a long time, for many reasons. It solves a common problem (predict the probability of an event), and it’s interpretable. Let’s explore what that means. Here is the logistic equation defining the model:
There are three types of variables in this model equation:
p is the probability of an event we’re predicting. For example, defaulting on a loan
The x’s are features. For example, loan amount.
The 𝛽’s (betas) are the coefficients, which we fit using a computer.
The betas are fit once to the entire dataset. The x’s are different for each instance in the dataset. The p represents an aggregate of dataset behavior: any dataset instance either happened (1) or didn’t (0), but in aggregate, we’d like the right-hand side and the left-hand side to be as close as possible.
The “log(p/(1-p))” is the log odds, also called the “logit of the probability”. The odds are (probability the event happened)/(probability the event won’t happen), or p/(1-p). Then we apply the natural logarithm to translate p, which takes the range 0 to 1, to a quantity which can range from -∞ to +∞, suitable for a linear model.
This model is linear, but for the log odds. That is, the right-hand side is a linear equation, but it is fit to the log odds, not the probability of an event.
This model is interpretable as follows: a unit increase in xi is a log-odds increase in 𝛽i.
For example, suppose we’re predicting probability of loan default, and our model has a feature coefficient 𝛽1=0.15 for the loan amount feature x1. That means a unit increase in the feature corresponds to a log odds increase of 0.15 in default. We can take the natural exponent to get the odds ratio, exp(0.15)=1.1618. That means:
for this model, a unit increase (of say, a thousand dollars) in loan amount corresponds to a 16% increase in the odds of loan default, holding all other factors constant.
This statement is what people mean when they say logistic regression is interpretable.
To summarize why logistic regression is a white box model:
The input response terms (𝛽ixi terms) can be interpreted independently of each other
The terms are in interpretable units: the coefficients (betas) are in units of log odds.
So why would we use anything other than the friendly, venerable model of logistic regression?
Well, if the features and log odds don’t have a linear relationship, this model won’t fit well. I always think of trying to fit a line to a parabola:
If you have non-linear data (the black parabola), a linear fit (the blue dashed line) will never be great. No line fits the curve.
This equation is quite similar to logistic regression. It has the same three types of elements:
E(Y) is an aggregate of dataset behavior, like the “p” in the equation above. In fact, it may well be the probability of an event, the same p.
g(.) is a link function, like the logit (or log odds) from the logistic equation above.
fi(xi) is a term for each dataset instance feature x1,…,xm.
The big difference is instead of a linear term 𝛽ixi for a feature, now we have a function fi(xi). In their book, Hastie and Tibshirani specify a “smooth” function like a cubic spline. Lou et al.  looked at other functions for the fi, which they call “shape functions.”
A GAM also has white box features:
The input response terms (f(xi) terms) can be interpreted independently of each other
The terms are in interpretable units. For the logit link function, these are log odds.
Now a term, instead of being a constant (beta), is a function, so instead of reporting the log odds as a number, we visualize it with a graph. In fact, the graph above of pneumonia risk of death by age is one term (shape function) in a GAM.
So why would we use anything other than a GAM? It’s already flexible and interpretable. Same reason as before: it might not be accurate enough. In particular, we’ve assumed that each feature response can be modeled with its own function, independent of the others.
But what if there are interactions between the features? Several black box models (boosted trees, neural networks) can model interaction terms. Let us walk through a white box model that also can: GA2Ms.
GAMs with interaction terms (GA2Ms)
GA2Ms were investigated in 2013 by Lou et al. . The authors pronounce them with the letters “gee ay two em”, but in house we’ve taken to calling them “interaction GAMs” because it’s more pronounceable. Here is the model equation:
This equation is quite similar to the GAM equation from the previous section, except it adds functions that can account for two feature variables at once, i.e. interaction terms.
Microsoft just released a library InterpretML that implements GA2Ms in python. In that library, they call them “Explainable Boosting Machines.”
Lou et al. say these are still white box models because the “shape function” for an interaction term is a heatmap. The two features are along the X and Y axis, and the color in the middle shows the function response. Here is an example from Microsoft’s library fit to predicting loan default on a dataset of loan performance from lending club:
For this example graph:
The upper right corner is the most red. That means the probability of default goes up the most when dti (debt to income ratio) and fico_range_midpoint (the FICO credit score) are both high.
The left strip is also red, but turns blue near the bottom. That means that very low dti is usually bad, except if fico_range_midpoint is also low.
This particular heatmap is hard to reason about. This is likely only the interaction effect without the single-feature terms. So, it could be that the probability of default overall isn’t higher at high-dti and high-fico, but rather just higher than either of the primary effects predict by themselves. To investigate further, we could probably look at some examples around the borders. But, for this blog post, we’ll skip the deep dive.
In practice, this library fits all single-feature functions, then N interaction terms, where you pick N. It is not easy to pick N. The interaction terms are worthwhile if they add enough accuracy to be worth the extra complexity of staring at heatmaps to interpret them. That is a judgement call that depends on your business situation.
When should we use GAMs or GA2Ms?
To perform machine learning, first pick a goal. Then pick a technology that will best use your data to meet the goal. There are thousands of books and millions of papers on that subject. But, here is a drastically simplified way to think about how GA2Ms fit in to possible model technologies: they are on a spectrum from interpretability to modeling feature interactions.
Use GAMs if they are accurate enough. It gives the advantages of a white box model: separable terms with interpretable units.
Use GA2Ms if they are significantly more accurate than GAMs, especially if you believe from your domain knowledge that there are real feature interactions, but they are not too complex. This also gives the advantages of a white box model, with more effort to interpret.
Try boosted trees (xgboost or lightgbm) if you don’t know a lot about the data, since it is quite robust to quirks in data. These are black box models.
When features interact highly with each other, like pixels in images or the context in audio, you may well need neural networks or something else that can capture complex interactions. These are deeply black box.
In all cases, you may well need domain-specific data preprocessing, like squaring images, or standardizing features (subtracting the mean and dividing by the standard deviation). That is a topic for another day.
Now hopefully the diagram we started with makes more sense.
Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-Day Readmission.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721–1730. KDD ’15. New York, NY, USA: ACM, 2015. https://doi.org/10.1145/2783258.2788613.
Lou, Yin, Rich Caruana, and Johannes Gehrke. “Intelligible Models for Classification and Regression.” In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150–158. KDD ’12. New York, NY, USA: ACM, 2012. https://doi.org/10.1145/2339530.2339556.
A few weeks ago, I was on the panel for Explainable AI at the PRMIA Fintech Horizons Conference in SF. The participants were predominantly from the finance industry like Banks, Hedge Funds and Fintech Startups.
We had a very interesting discussion on topics like:
Automated AI vs Human-Centered AI
How catastrophic can it be when a Business Risk is left unmanaged?
Example: Boeing 737 Max 8 failure
Special challenges in Quantitative Finance with AI. Can we quantify Model Risk in terms of a $-value?
Who in the organizations needs to care about Explainable AI? Is it the Data Scientist? Chief Risk Officer? Business Owner?
Bob Mark – Former CRO & Treasurer of CIBC, Managing Partner Black Diamond Risk – Moderator
We’re excited to introduce Ankur Taly, the newest member of our team. Ankur joins us as the Head of Data Science.
Previously, he was a Staff Research Scientist at Google Brain where he worked on Machine Learning Interpretability and was most well-known for his contribution to developing and applying Integrated Gradients — a new interpretability algorithm for Deep Neural Networks. Ankur has a broad research background and has published in several areas including Computer Security, Programming Languages, Formal Verification, and Machine Learning. Ankur obtained his Ph.D. in CS from Stanford University and a B. Tech in CS from IIT Bombay.
Ankur passionately believes in the need for Explainable AI and is excited to join Fiddler labs. In his own words:
“Explainability is one of the key missing pieces in the ongoing machine learning (ML) revolution. As ML models continue to become more complex and opaque, being able to explain their predictions is getting increasingly important and challenging. The ability to explain a model’s predictions would enable users to build trust in the model, business stakeholders to derive actionable insights and strategies, regulators to assess model fairness and risk, and data scientists to iterate on the model in a principled manner. In response to this need, there has been a large surge in research on explaining various aspects of ML models. Fiddler Labs has taken up the ambitious task of driving this research to industrial practice, by making it available as a cutting-edge enterprise product catering to several business needs. This is incredibly promising, and I am super excited to join Fiddler on this journey!”