Should you explain your predictions with SHAP or IG?

Two different explanation algorithm types, best in different situations.

Some of the most accurate predictive models today are black box models, meaning it is hard to really understand how they work. To address this problem, techniques have arisen to understand feature importance: for a given prediction, how important is each input feature value to that prediction? Two well-known techniques are SHapley Additive exPlanations (SHAP) and Integrated Gradients (IG). In fact, they each represent a different type of explanation algorithm: a Shapley-value-based algorithm (SHAP) and a gradient-based algorithm (IG).

There is a fundamental difference between these two algorithm types. This post describes that difference. First, we need some background. Below, we review Shapley values, Shapley-value-based methods (including SHAP), and gradient-based methods (including IG). Finally, we get back to our central question: When should you use a Shapley-value-based algorithm (like SHAP) versus a gradient-based explanation algorithm (like IG)?

What are Shapley values?

The Shapley value (proposed by Lloyd Shapley in 1953) is a classic method to distribute the total gains of a collaborative game to a coalition of cooperating players. It is provably the only distribution with certain desirable properties (fully listed on Wikipedia). 

In our case, we formulate a game for the prediction at each instance. We consider the “total gains” to be the prediction value for that instance, and the “players” to be the model features of that instance. The collaborative game is all of the model features cooperating to form a prediction value. The Shapley value efficiency property says the feature attributions should sum to the prediction value. The attributions can be negative or positive, since a feature can lower or raise a predicted value.

A variant, the Aumann-Shapley value, extends the definition of the Shapley value to games with many (or infinitely many) players, where each player plays only a minor role. It applies when the worth function (the gains from including a coalition of players) is differentiable.

What is a Shapley-value-based explanation method?

A Shapley-value-based explanation method tries to approximate Shapley values of a given prediction by examining the effect of removing a feature under all possible combinations of presence or absence of the other features. In other words, this method looks at function values over subsets of features like F(x1, <absent>, x3, x4, …, <absent>, xn). How to evaluate a function F with one or more absent features is subtle.

For example, SHAP (SHapley Additive exPlanations) estimates the model’s behavior on an input with certain features absent by averaging over values of those features sampled from the training set. In other words, F(x1, <absent>, x3, …, xn) is estimated by the expected prediction when the missing feature x2 is sampled from the dataset.

Exactly how that sample is chosen is important (for example marginal versus conditional distribution versus cluster centers of background data), but I will skip the fine details here.

Once we define the model function (F) for all subsets of the features, we can apply the Shapley value formula to compute feature attributions. Each feature’s Shapley value is a weighted average of that feature’s marginal contribution over all possible subsets of the other features.
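To make the subset computation concrete, here is a minimal sketch of exact Shapley values for a toy model. The names toy_model, coalition_value, and exact_shapley, and the two-row background set, are assumptions for illustration only. Absent features are filled in from background rows and the predictions averaged, which is just one of the possible choices. Because it enumerates every subset, it is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

def coalition_value(model, x, background, present):
    """Estimate F when only `present` features are known: absent features
    are taken from each background row, and the predictions are averaged."""
    total = 0.0
    for row in background:
        mixed = [x[i] if i in present else row[i] for i in range(len(x))]
        total += model(mixed)
    return total / len(background)

def exact_shapley(model, x, background):
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for a coalition of this size that excludes i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = (coalition_value(model, x, background, s | {i})
                        - coalition_value(model, x, background, s))
                phi[i] += weight * gain
    return phi

# Hypothetical toy model with an interaction between features 1 and 2.
def toy_model(v):
    return 2.0 * v[0] + v[1] * v[2]

background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
x = [3.0, 2.0, 4.0]
phi = exact_shapley(toy_model, x, background)
base = coalition_value(toy_model, x, background, set())
print(phi, base)
```

In this sketch the attributions plus the base value (the average prediction over the background rows) add up to the prediction for x, which is the efficiency property in action.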

The “kernel SHAP” method from the SHAP paper computes the Shapley values of all features simultaneously by defining a weighted least squares regression whose solution is the Shapley values for all the features.
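The regression idea can be sketched in a few lines. This is a simplified illustration, not the SHAP library's implementation: it enumerates every coalition instead of sampling them, fills absent features from a single all-zero baseline, and approximates the two hard constraints (the empty and the full coalition) with a very large weight. The names solve_linear, kernel_shap, toy_model, and big_weight are assumptions made up for this example.

```python
from itertools import combinations
from math import comb

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def kernel_shap(value, n, big_weight=1e6):
    """value(frozenset of present feature indices) -> model output."""
    rows, targets, weights = [], [], []
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            z = [1.0 if i in subset else 0.0 for i in range(n)]
            rows.append([1.0] + z)  # leading 1.0 is the intercept (base value)
            targets.append(value(frozenset(subset)))
            if size in (0, n):
                # Empty and full coalitions act as (near) hard constraints.
                weights.append(big_weight)
            else:
                # Shapley kernel weight for a coalition of this size.
                weights.append((n - 1) / (comb(n, size) * size * (n - size)))
    k = n + 1
    # Normal equations of the weighted least squares problem.
    ata = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
            for j in range(k)] for i in range(k)]
    atb = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
           for i in range(k)]
    solution = solve_linear(ata, atb)
    return solution[0], solution[1:]

# Hypothetical toy model: an interaction of features 0 and 1; feature 2 is ignored.
def toy_model(v):
    return v[0] * v[1]

x = [1.0, 1.0, 5.0]
baseline = [0.0, 0.0, 0.0]

def value(present):
    mixed = [x[i] if i in present else baseline[i] for i in range(len(x))]
    return toy_model(mixed)

base_value, phi = kernel_shap(value, len(x))
print(base_value, phi)
```

For this toy model the two interacting features should share the prediction roughly equally, and the ignored feature should get an attribution near zero.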

The high-level point is that all these methods rely on taking subsets of features. This makes the theoretical version exponential in runtime: for N features, there are 2^N combinations of presence and absence. That is too expensive for most N, so these methods approximate. Even with approximations, kernel SHAP can be slow. Also, we don’t know of any systematic study of how good the approximation is.
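To give a feel for the numbers and for what "approximate" can mean, here is a small sketch. It uses permutation sampling, which is one common way to approximate Shapley values and is not the regression that kernel SHAP uses. The names sampled_shapley and toy_model, the all-zero baseline standing in for "absent", and the choice of 200 permutations are all assumptions for illustration.

```python
import random

def sampled_shapley(model, x, baseline, num_permutations=200, seed=0):
    """Approximate Shapley values by averaging each feature's marginal
    contribution over random orderings in which features are turned on."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)  # start with every feature "absent"
        prev = model(current)
        for i in order:
            current[i] = x[i]     # turn feature i on
            value = model(current)
            phi[i] += value - prev
            prev = value
    return [p / num_permutations for p in phi]

# Hypothetical toy model: an interaction term plus an independent term.
def toy_model(v):
    return v[0] * v[1] + v[2]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
estimates = sampled_shapley(toy_model, x, baseline)

# With only 30 features, exhaustive enumeration already needs 2^30 subsets.
n_subsets = 2 ** 30
print(estimates, n_subsets)
```

Each sampled ordering costs only N + 1 model calls, so the work grows with the number of samples rather than with 2^N. The price is sampling noise in the attributions of features that interact.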

There are versions of SHAP specialized to different model architectures for speed. For example, Tree SHAP computes all the subsets by cleverly keeping track of what proportion of all possible subsets flow down into each of the leaves of the tree. However, if your model architecture does not have a specialized algorithm like this, you have to fall back on kernel SHAP, or another naive (unoptimized) Shapley-value-based method.

A Shapley-value-based method is attractive as it only requires black box access to the model (i.e. computing outputs from inputs), and there is a version agnostic to the model architecture. For instance, it does not matter whether the model function is discrete or continuous. The downside is that computing Shapley values exactly takes time exponential in the number of features, because it requires evaluating every subset.

What is a gradient-based explanation method?

A gradient-based explanation method tries to explain a given prediction by using the gradient of (i.e. change in) the output with respect to the input features. Some methods like Integrated Gradients (IG), GradCAM, and SmoothGrad literally apply the gradient operator. Other methods like DeepLift and LRP apply “discrete gradients.”

Figure 1 from the IG paper, showing three paths between a baseline (r1 , r2) and an input (s1, s2). Path P2, used by Integrated Gradients, simultaneously moves all features from off to on. Path P1 moves along the edges, turning features on in sequence. Other paths like P1 along different edges correspond to different sequences. SHAP computes the expected attribution over all such edge paths like P1.

Let me describe IG, which has the advantage that it tries to approximate Aumann-Shapley values, which are axiomatically justified. IG operates by considering a straight line path, in feature space, from the input at hand (e.g., an image from a training set) to a certain baseline input (e.g., a black image), and integrating the gradient of the prediction with respect to input features (e.g., image pixels) along this path.

This paper explains the intuition of the IG algorithm as follows. As the input varies along the straight line path between the baseline and the input at hand, the prediction moves along a trajectory from uncertainty to certainty (the final prediction probability). At each point on this trajectory, one can use the gradient with respect to the input features to attribute the change in the prediction probability back to the input features. IG aggregates these gradients along the trajectory using a path integral.
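The path integral can be approximated with a simple sum. The sketch below assumes a tiny logistic model with made-up weights and a hand-written gradient, an all-zero baseline, and a 50-step midpoint rule; in a real framework the gradient would come from automatic differentiation. The names WEIGHTS, BIAS, model, gradient, and integrated_gradients are illustrative, not part of any library.

```python
import math

WEIGHTS = [1.5, -2.0, 0.5]  # hypothetical model parameters
BIAS = 0.1

def model(x):
    z = sum(w * v for w, v in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x):
    # Analytic gradient of the logistic output with respect to each input.
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=50):
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th interval
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        g = gradient(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the average gradient by how far each feature moved.
    return [(v - b) * g for v, b, g in zip(x, baseline, avg_grad)]

x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
print(attributions)
print(sum(attributions), model(x) - model(baseline))
```

The two numbers on the last line should nearly agree: the attributions account for the change in prediction between the baseline and the input. Note also that each attribution is scaled by (input minus baseline), so a feature equal to its baseline value receives zero attribution.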

IG (roughly) requires the prediction to be a continuous and piecewise differentiable function of the input features. (More precisely, it requires that the function be continuous everywhere and that the partial derivative along each input dimension satisfy Lebesgue’s integrability condition, i.e., the set of points of discontinuity has measure zero.)

Figure 2 from the IG paper, showing which pixels were most important to each image label.

Note it is important to choose a good baseline for IG to make sensible feature attributions. For example, if a black image is chosen as baseline, IG won’t attribute importance to a completely black pixel in an actual image. The baseline value should both have a near-zero prediction, and also faithfully represent a complete absence of signal.

IG is attractive as it is broadly applicable to all differentiable models, easy to implement in most machine learning frameworks (e.g., TensorFlow, PyTorch, Caffe), and computationally scalable to massive deep networks like Inception and ResNet with millions of neurons.

When should you use a Shapley-value-based versus a gradient-based explanation method?

Finally, the payoff! Our advice: If the model function is piecewise differentiable and you have access to the model gradient, use IG. Otherwise, use a Shapley-value-based method.

Any model trained using gradient descent is differentiable. For example: neural networks, logistic regression, support vector machines. You can use IG with these. The major class of non-differentiable models is trees: boosted trees, random forests. They encode discrete values at the leaves. These require a Shapley-value-based method, like Tree SHAP.

The IG algorithm is faster than a naive Shapley-value-based method like kernel SHAP, as it only requires computing the gradients of the model output on a few different inputs (typically 50). In contrast, a Shapley-value-based method requires computing the model output on a large number of inputs sampled from the exponentially huge subspace of all possible combinations of feature values. Computing gradients of differentiable models is efficient and well supported in most machine learning frameworks. However, a differentiable model is a prerequisite for IG. By contrast, a Shapley-value-based method makes no such assumptions. 

Several types of input features that look discrete (hence might require a Shapley-value-based method) actually can be mapped to differentiable model types (which let us use IG). Let us walk through one example: text sentiment. Suppose we wish to attribute the sentiment prediction to the words in some input text. At first, it seems that such models may be non-differentiable as the input is discrete (a collection of words). However, differentiable models like deep neural networks can handle words by first mapping them to a high-dimensional continuous space using word embeddings. The model’s prediction is a differentiable function of these embeddings. This makes it amenable to IG. Specifically, we attribute the prediction score to the embedding vectors. Since attributions are additive, we sum the attributions (retaining the sign) along the fields of each embedding vector and map it to the specific input word that the embedding corresponds to.
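The last step, collapsing per-dimension attributions into one number per word, is easy to sketch. The attribution values below are invented for illustration (in practice they would come from running IG on the embedding inputs), and word_attributions is a hypothetical helper name.

```python
def word_attributions(tokens, embedding_attributions):
    """Sum the signed attributions across each token's embedding dimensions."""
    return [(token, sum(vector))
            for token, vector in zip(tokens, embedding_attributions)]

tokens = ["not", "good"]
# One row per token, one column per embedding dimension (made-up values).
embedding_attributions = [
    [-0.5, -0.25, 0.125],
    [0.25, 0.5, -0.125],
]

per_word = word_attributions(tokens, embedding_attributions)
print(per_word)
```

Keeping the sign matters here: a word can push the sentiment score down even if some of its embedding dimensions push it up.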

A crucial question for IG is: what is the baseline input? For this text example, one option is to use the embedding vector corresponding to empty text. Some models take fixed length inputs by padding short sentences with a special “no word” token. In such cases, we can take the baseline as the embedding of a sentence with just “no word” tokens.

For more on IG, see the paper or this how-to.

Conclusion

In many cases (a differentiable model with a gradient), you can use integrated gradients (IG) to get a more certain and possibly faster explanation of feature importance for a prediction. However, a Shapley-value-based method is required for other (non-differentiable) model types.

At Fiddler, we support both SHAP and IG. (Full disclosure: Ankur Taly, a co-author of IG, works at Fiddler, and is a co-author of this post.) Feel free to email info@fiddler.ai for more information, or just to say hi!

Causality in model explanations and in the real world

You can’t always change a human’s input to see the output.

At Fiddler Labs, we place great emphasis on model explanations being faithful to the model’s behavior. Ideally, feature importance explanations should surface and appropriately quantify all and only those factors that are causally responsible for the prediction. This is especially important if we want explanations to be legally compliant (e.g., GDPR, article 13 section 2f, people have a right to ‘[information about] the existence of automated decision-making, including profiling .. and .. meaningful information about the logic involved’), and actionable. Even when making post-processing explanations human-intelligible, we must preserve faithfulness to the model.

How do we differentiate between features that are correlated with the outcome, and those that cause the outcome? In other words, how do we think about the causality of a feature to a model output, or to a real-world task? Let’s take those one at a time.

Explaining causality in models is hard

When explaining a model prediction, we’d like to quantify the contribution of each (causal) feature to the prediction.

For example, in a credit risk model, we might like to know how important income or zip code is to the prediction.

Note that zip code may be causal to a model’s prediction (i.e. changing zip code may change the model prediction) even though it may not be causal to the underlying task (i.e. changing zip code may not change the decision of whether to grant a loan). However, these two things may be related if this model’s output is used in the real-world decision process.

The good news is that since we have input-output access to the model, we can probe it with arbitrary inputs. This allows examining counterfactuals, inputs that are different from those of the prediction being explained. These counterfactuals might be elsewhere in the dataset, or they might not.

Shapley values (a classic result from game theory) offer an elegant, axiomatic approach to quantify feature contributions.

One challenge is they rely on probing with an exponentially large set of counterfactuals, too large to compute. Hence, there are several papers on approximating Shapley values, especially for specific classes of model functions. 

However, a more fundamental challenge is that when features are correlated, not all counterfactuals may be realistic. There is no clear consensus on how to address this issue, and existing approaches differ on the exact set of counterfactuals to be considered.

To overcome these challenges, it is tempting to rely on observational data. For instance, we could use the observed data to define the counterfactuals for applying Shapley values. Or, more simply, we could fit an interpretable model on it to mimic the main model’s prediction and then explain the interpretable model in lieu of the main model. But this can be dangerous.

Consider a credit risk model with features including the applicant’s income and zip code. Say the model internally only relies on the zip code (i.e., it redlines applicants). Explanations based on observational data might reveal that the applicant’s income, by virtue of being correlated with zip code, is just as predictive of the model’s output as zip code is. This may mislead us to explain the model’s output in terms of the applicant’s income. In fact, a naive explanation algorithm will split attributions equally between two perfectly correlated features.

To learn more, we can intervene in features. One counterfactual changing zip code but not income will reveal that zip code causes the model’s prediction to change. A second counterfactual that changes income but not zip code will reveal that income does not. These two together will allow us to conclude that zip code is causal to the model’s prediction, and income is not.
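The two counterfactuals above can be written as a small probe. Everything here is hypothetical: redlining_model stands in for a model that secretly looks only at zip code, and the applicant record and replacement values are invented.

```python
def redlining_model(applicant):
    # Hypothetical model that relies only on zip code, never on income.
    return 0.9 if applicant["zip_code"].startswith("9") else 0.2

def changes_prediction(model, applicant, feature, new_value):
    """Intervene on one feature, hold the rest fixed, and report whether
    the model's prediction moves."""
    counterfactual = {**applicant, feature: new_value}
    return model(counterfactual) != model(applicant)

applicant = {"zip_code": "94110", "income": 50000}

zip_matters = changes_prediction(redlining_model, applicant, "zip_code", "10001")
income_matters = changes_prediction(redlining_model, applicant, "income", 150000)
print(zip_matters, income_matters)
```

A single probe value per feature is only suggestive, of course. A feature could matter for other values, so a careful check would try several interventions before concluding that a feature has no effect.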

Explaining causality requires the right counterfactuals.

Explaining causality in the real world is harder

Above we outlined a method to try to explain causality in models: study what happens when features change. To do so in the real world, you have to be able to apply interventions. This is commonly called a “randomized controlled trial” (also known as “A/B testing” when there are two variants, especially in the tech industry). You divide a population into two or more groups randomly, and apply different interventions to each group. The randomization ensures that the only systematic difference among the groups is your intervention. Therefore, you can conclude that your intervention causes the measurable differences in the groups.

The challenge in applying this method to real-world tasks is that not all interventions are feasible. You can’t ethically ask someone to take up smoking. In the real world, you may not be able to get the data you need to properly examine causality.

We can probe models as we wish, but not people.

Natural experiments can provide us an opportunity to examine situations where we would not normally intervene, like in epidemiology and economics. However, these provide us a limited toolkit, leaving many questions in these fields up for debate.

There are proposals for other theories that allow us to use domain knowledge to separate correlation from causation. These are subject to ongoing debate and research.

Now you know why explaining causality in models is hard, and explaining it in the real world is even harder.

To learn more about explaining models, email us at info@fiddler.ai. (Photo credit: pixabay.) This post was co-written with Ankur Taly.

“Hey, what’s that?” Debugging predictions using explanations

What does debugging look like in the new world of machine learning models? One way uses model explanations.

Machine learning (ML) models are popping up everywhere. There is a lot of technical innovation (e.g., deep learning, explainable AI) that has made them more accurate, more broadly applicable, and usable by more people in more business applications. The lists are everywhere: banking, healthcare, tech, all of the above.

However, as with any computer program, models can have errors, or more colloquially, bugs. The process of finding those bugs is quite different from previous technology, and requires a new developer stack. “Soon we won’t program computers, we’ll train them like dogs” (Wired, 2016). “Gradient descent can write code better than you. I’m sorry” (Andrej Karpathy, 2017).

In a deep learning neural network, instead of lines of code written by people, we are looking at possibly millions of weights linked together into an incomprehensible network.

How do we find bugs in this network?

So how do we find bugs in this network? One way is to explain your model predictions. Let’s look at two types of bugs we can find through explanations (data leakage and data bias), illustrated with examples from predicting loan default. Both of these are actually data bugs, but a model summarizes the data, so they show up in the model.

Bug #1: data leakage

Most ML models are supervised. You choose a precise prediction goal (also called the “prediction target”), gather a dataset with features, and label each example with the target. Then you train a model to use the features to predict the target. Surprisingly often there are features in the dataset that relate to the prediction target but are not useful for prediction. For example, they might be added from the future (i.e. long after prediction time), or otherwise unavailable at prediction time.

Here is an example from the Lending Club dataset. We can use this dataset to try to predict loan default, with the loan_status field as our prediction target. It takes the values “Fully Paid” (okay) or “Charged Off” (bank declared a loss, i.e. the borrower defaulted). In this dataset, there are also fields such as total_pymnt (the payments received) and loan_amnt (amount borrowed). Here are a few example values:

Whenever the loan is “Charged Off”, delta is positive. But, we don’t know delta at loan time.

Notice anything? Whenever the loan has defaulted (“Charged Off”), the total payments are less than the loan amount, and delta (= loan_amnt - total_pymnt) is positive. Well, that’s not terribly surprising. Rather, it’s nearly the definition of default: by the end of the loan term, the borrower paid less than what was loaned. Now, delta doesn’t have to be positive for a default: you could default after paying back the entire loan principal amount but not all of the interest. But, in this data, 98% of the time that delta is negative, the loan was fully paid; and 100% of the time that delta is positive, the loan was charged off. Including total_pymnt gives us nearly perfect information, but we don’t get total_pymnt until after the entire loan term (3 years)!

Including both loan_amnt and total_pymnt in the data potentially allows nearly perfect prediction, but we won’t really have total_pymnt for the real prediction task. Including them both in the training data is data leakage of the prediction target.
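A quick way to spot this kind of leak is to check how well a trivial rule on the suspect fields predicts the target. The rows below are invented for illustration and are not actual Lending Club records; only the field names come from the dataset described above, and leakage_rule_accuracy is a hypothetical helper.

```python
# Made-up example rows using the field names discussed above.
loans = [
    {"loan_amnt": 10000.0, "total_pymnt": 11500.0, "loan_status": "Fully Paid"},
    {"loan_amnt": 8000.0, "total_pymnt": 8900.0, "loan_status": "Fully Paid"},
    {"loan_amnt": 12000.0, "total_pymnt": 3000.0, "loan_status": "Charged Off"},
    {"loan_amnt": 5000.0, "total_pymnt": 1200.0, "loan_status": "Charged Off"},
    # Principal repaid but not all interest: a default with negative delta.
    {"loan_amnt": 6000.0, "total_pymnt": 6100.0, "loan_status": "Charged Off"},
]

def leakage_rule_accuracy(rows):
    """Accuracy of the rule 'delta > 0 means Charged Off'."""
    hits = 0
    for row in rows:
        delta = row["loan_amnt"] - row["total_pymnt"]
        predicted = "Charged Off" if delta > 0 else "Fully Paid"
        hits += predicted == row["loan_status"]
    return hits / len(rows)

accuracy = leakage_rule_accuracy(loans)
print(accuracy)
```

If a one-line rule like this gets close to perfect accuracy on real data, it is worth asking whether the fields involved would actually be known at prediction time.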

If we make a (cheating) model, it will perform very well. Too well. And, if we run a feature importance algorithm on some predictions (a common form of model explanation), we’ll see these two variables come up as important, and with any luck realize this data leakage.

Below, the Fiddler explanation UI shows “delta” stands out as a huge factor in raising this example prediction.

“delta” really stands out, because it’s data leakage.

There are other, more subtle potential data leakages in this dataset. For example, the grade and sub_grade are assigned by a Lending Club proprietary model, which almost completely determines the interest rate. So, if you want to build your own risk scoring model without Lending Club, then grade, sub_grade, and int_rate are all data leakage. They wouldn’t allow you to perfectly predict default, but presumably they would help, or Lending Club would not use their own model. Moreover, for their model, they include FICO score, yet another proprietary risk score, but one that most financial institutions buy and use. If you don’t want to use FICO score, then that is also data leakage.

Data leakage is any predictive data that you can’t or won’t use for prediction. A model built on data with leakage is buggy.

Bug #2: data bias

Suppose through poor data collection or a bug in preprocessing, our data is biased. More specifically, there is a spurious correlation between a feature and the prediction target. In that case, explaining predictions will show an unexpected feature often being important.

We can simulate a data processing bug in our lending data by dropping all the charged off loans from zip codes starting with 1 through 5. Before this bug, zip code is not very predictive of chargeoff (an AUC of 0.54, only slightly above random). After this bug, any zip code starting with 1 through 5 will never be charged off, and the AUC jumps to 0.78. So, zip code will show up as an important feature in predicting (no) loan default from data examples in those zip codes. In this example, we could investigate by looking at predictions where zip code was important. If we are observant, we might notice the pattern, and realize the bias.
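Here is a minimal sketch of that simulated bug on synthetic data. The rows are generated, not real Lending Club data: every zip code prefix starts with the same 20% charge-off rate, and the function names are made up for this example.

```python
def make_loans():
    """Synthetic loans: 10 per zip prefix, 2 of them charged off."""
    rows = []
    for digit in "0123456789":
        for k in range(10):
            status = "Charged Off" if k < 2 else "Fully Paid"
            rows.append({"zip_code": f"{digit}{k:04d}", "loan_status": status})
    return rows

def drop_charged_off(rows, buggy_prefixes="12345"):
    """The simulated bug: lose charged-off loans for some zip prefixes."""
    return [r for r in rows
            if not (r["loan_status"] == "Charged Off"
                    and r["zip_code"][0] in buggy_prefixes)]

def charge_off_rate_by_prefix(rows):
    totals, charged = {}, {}
    for r in rows:
        prefix = r["zip_code"][0]
        totals[prefix] = totals.get(prefix, 0) + 1
        charged[prefix] = charged.get(prefix, 0) + (r["loan_status"] == "Charged Off")
    return {p: charged[p] / totals[p] for p in sorted(totals)}

before = charge_off_rate_by_prefix(make_loans())
after = charge_off_rate_by_prefix(drop_charged_off(make_loans()))
print(before)
print(after)
```

After the bug, the affected prefixes show a charge-off rate of zero while the others are unchanged, which is exactly the kind of pattern a model will happily learn.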

Below is what charge-off rate would look like if summarized by the first digit of zip code. Some zips would have no charge-offs, while the rest had a rate similar to the dataset overall.

In this buggy dataset, there are no charged-off loans with zip codes starting with 1, 2, 3, 4, 5.

Below, the Fiddler explanation UI shows zip code prefix stands out as a huge factor in lowering this example prediction.

“zip_code_prefix” really stands out, because the model has a bug related to zip code.

A model built from this biased data is not useful for making predictions on (unbiased) data we haven’t seen yet. It is only accurate in the biased data. Thus, a model built on biased data is buggy.

Other model debugging methods

There are many other possibilities for model debugging that don’t involve model explanations. For example:

  1. Look for overfitting or underfitting. If your model architecture is too simple, it will underfit. If it is too complex, it will overfit.
  2. Regression tests on a golden set of predictions that you understand. If these fail, you might be able to narrow down which scenarios are broken.

Since explanations aren’t involved with these methods, I won’t say more here.

Summary

If you are not sure your model is using your data appropriately, use explanations of feature importance to examine its behavior. You might see data leakage or data bias. Then, you can fix your data, which is the best way to fix your model.

Fiddler is building an Explainable AI Engine that can help debug models. Email us at info@fiddler.ai.

Can Congress help keep AI fair for consumers?

A Congressional hearing on June 26 was a wake-up call for financial services.

How do firms ensure that AI systems are not having a disparate impact on vulnerable communities, and what safeguards should regulators and Congress put in place to protect consumers?

To what extent should companies be required to audit these algorithms so that they don’t unfairly discriminate? Who should determine the standards for that?

We need to ensure that AI does not create biases in lending toward discrimination.

These aren’t questions from an academic discourse or the editorial pages. These were posed to the witnesses of a June 26 hearing before the US House Committee on Financial Services — by both Democrats and Republicans, representatives of Illinois, North Carolina, and Arkansas.

It is a bipartisan sentiment that, left unchecked, AI can pose a risk to fairness in financial services. While the exact extent of this danger might be debated, governments in the US and abroad acknowledge the necessity and assert the right to regulate financial institutions for this purpose.

The June 26 hearing was the first wake-up call for financial services: they need to be prepared to respond and comply with future legislation requiring transparency and fairness.

In this post, we review the notable events of this hearing, and we explore how the US House is beginning to examine the risks and benefits of AI in financial services.

Two new House Task Forces to regulate fintech and AI

On May 9 of this year, the chairwoman of the US House Committee on Financial Services, Congresswoman Maxine Waters (D-CA), announced the creation of two task forces: one on fintech, and one on AI.

Generally, task forces convene to investigate a specific issue that might require a change in policy. These investigations may involve hearings that call forth experts to inform the task force.

These two task forces overlap in jurisdiction, but the committee’s objectives implied some distinctions:

  • The fintech task force should have a nearer-term focus on applications (e.g. underwriting, payments, immediate regulation).
  • The AI task force should have a longer-term focus on risks (e.g. fraud, job automation, digital identification).

And explicitly, Chairwoman Waters explained her overall interest in regulation:

Make sure that responsible innovation is encouraged, and that regulators and the law are adapting to the changing landscape to best protect consumers, investors, and small businesses.

The appointed chairman of the Task Force on AI, Congressman Bill Foster (D-IL), extolled AI’s potential in a similar statement, but also cautioned,

It is crucial that the application of AI to financial services contributes to an economy that is fair for all Americans.

This first hearing did find ample AI applications in financial services. But it also concluded that these worried sentiments are neither misrepresentative of their constituents nor misplaced.

From left to right: Maxine Waters (D-CA), Chairwoman of the US House Committee on Financial Services; Bill Foster (D-IL), Chairman of the Task Force on AI; French Hill (R-AR), Ranking Member on the Task Force on AI

Risks of AI

In a humorous exchange later in the hearing, Congresswoman Sylvia Garcia (D-TX) asks a witness, Dr. Bonnie Buchanan of the University of Surrey, to address the average American and explain AI in 25 words or less. It does not go well.

DR. BUCHANAN
I would say it’s a group of technologies and processes that can look at determining general pattern recognition, universal approximation of relationships, and trying to detect patterns from noisy data or sensory perception.

REP. GARCIA
I think that probably confused them more.

DR. BUCHANAN
Oh, sorry.

Beyond making jokes, Congresswoman Garcia has a point. AI is extraordinarily complex. Not only that, to many Americans it can be threatening. As Garcia later expresses, “I think there’s an idea that all these robots are going to take over all the jobs, and everybody’s going to get into our information.”

In his opening statement, task force ranking member Congressman French Hill (R-AR) tries to preempt at least the first concern. He cites a World Economic Forum study finding that the 75 million jobs lost because of AI will be more than offset by 130 million new jobs. But Americans are still anxious about AI development.

In a June 2018 survey of 2,000 Americans conducted by Oxford’s Center for the Governance of AI, researchers observed

  • overwhelming support for careful management of robots and/or AI (82% support)
  • more trust in tech companies than in the US government to manage AI in the interest of the public
  • mixed support for developing high-level machine intelligence (defined as “when machines are able to perform almost all tasks that are economically relevant today better than the median human today”)

This public apprehension about AI development is mirrored by concerns from the task force and experts. Personal privacy is mentioned nine times throughout the hearing, notably in Congressman Anthony Gonzalez’s (R-OH) broad question on “balancing innovation with empowering consumers with their data,” which the panel does not quite adequately address.

But more often, the witnesses discuss fairness and how AI models could discriminate unnoticed. Most notably, Dr. Nicol Turner-Lee, a fellow at the Brookings Institution, suggests implementing guardrails to prevent biased training data from “replicat[ing] and amplify[ing] stereotypes historically prescribed to people of color and other vulnerable populations.”

And she’s not alone. A separate April 2019 Brookings report seconds this concern of an unfairness “whereby algorithms deny credit or increase interest rates using a host of variables that are fundamentally driven by historical discriminatory factors that remain embedded in society.”

So if we’re so worried, why bother introducing the Pandora’s box of AI to financial services at all?

Benefits of AI

AI’s potential benefits, according to Congressman Hill, are to “gather enormous amounts of data, detect abnormalities, and solve complex problems.” In financial services, this means actually fairer and more accurate models for fraud, insurance, and underwriting. This can simultaneously improve bank profitability and extend services to the previously underbanked.

Both Hill and Foster cite a National Bureau of Economic Research working paper finding that, in one case, algorithmic lending models discriminate 40% less than face-to-face lenders. Furthermore, Dr. Douglas Merrill, CEO of ZestFinance and an expert witness, claims that customers using his company’s AI tools experience higher approval rates for credit cards, auto loans, and personal loans, each with no increase in defaults.

Moreover, Hill frames his statement with an important point about how AI could reshape the industry: this advancement will work “for both disruptive innovators and for our incumbent financial players.” At first this might seem counterintuitive.

“Disruptive innovators,” more agile and hindered less by legacy processes, can have an advantage in implementing new technology. But without the immense budgets and customer bases of “incumbent financial players,” how can these disruptors succeed? And will incumbents, stuck in old ways, ever adopt AI?

Mr. Jesse McWaters, financial innovation lead at the World Economic Forum and the final expert witness, addresses this apparent paradox, discussing what will “redraw the map of what we consider the financial sector.” Third-party AI service providers — from traditional banks to small fintech companies — can “help smaller community banks remain digitally relevant to their customers” and “enable financial institutions to leapfrog forward.”

Enabling competitive markets, especially in concentrated industries like financial services, is an unadulterated benefit according to free market enthusiasts in Congress. However, “redrawing the map” in this manner makes the financial sector larger and more complex. Congress will have to develop policy responding to not only more complex models, but also a more complex financial system.

This system poses risks both to corporations, acting in the interest of shareholders, and to the government, acting in the interest of consumers.

Business and government look at risks

Businesses are already acting to avert potential losses from AI model failure and system complexity. A June 2019 Gartner report predicts that 75% of large organizations will hire AI behavioral forensic experts to reduce brand and reputation risk by 2023.

However, governments recognize that business-led initiatives, if motivated to protect company brand and profits, may only go so far. For a government to protect consumers, investors, and small businesses (the relevant parties according to Chairwoman Waters), a gap may still remain.

As governments explore how to fill this gap, they are establishing principles that will underpin future guidance and regulation. The themes are consistent across governing bodies:

  • AI systems need to be trustworthy.
  • They therefore require some guidance or regulation from a government representing the people.
  • This guidance should encourage fairness, privacy, and transparency.

In the US, President Donald Trump signed an executive order in February 2019 “to Maintain American Leadership in Artificial Intelligence,” directing federal agencies to, among other goals, “foster public trust in AI systems by establishing guidance for AI development and use.” The Republican White House and Democratic House of Representatives seem to clash at every turn, but they align here.

The EU is also establishing a regulatory framework for ensuring trustworthy AI. Likewise included among the seven requirements in their latest communication from April 2019: privacy, transparency, and fairness.

And June’s G20 summit drew upon similar ideas to create their own set of principles, including fairness and transparency, but also adding explainability.

These governing bodies are in a fact-finding stage, establishing principles and learning what they are up against before guiding policy. In the words of Chairman Foster, the task force must understand “how this technology will shape the questions that policymakers will have to grapple with in the coming years.”

Conclusion: Explain your models

An hour before Congresswoman Garcia’s amusing challenge, Dr. Buchanan reflected upon a couple of common themes of concern.

Policymakers need to be concerned about the explainability of artificial intelligence models. And we should avoid black-box modeling where humans cannot determine the underlying process or outcomes of the machine learning or deep learning algorithms.

But through this statement, she suggests a solution: make these AI models explainable. If humans can indeed understand the inputs, process, and outputs of a model, we can trust our AI. Then throughout AI applications in financial services, we can promote fairness for all Americans.

Sources

  1. United States House Committee of Financial Services. “Perspectives on Artificial Intelligence: Where We Are and the Next Frontier in Financial Services.” https://financialservices.house.gov/calendar/eventsingle.aspx?EventID=403824. Accessed July 18, 2019.
  2. United States House Committee of Financial Services. “Waters Announces Committee Task Forces on Financial Technology and Artificial Intelligence.” https://financialservices.house.gov/news/documentsingle.aspx?DocumentID=403738. Accessed July 18, 2019.
  3. Leopold, Till Alexander, Vesselina Ratcheva, and Saadia Zahidi. “The Future of Jobs Report 2018.” World Economic Forum. http://www3.weforum.org/docs/WEF_Future_of_Jobs_2018.pdf
  4. Zhang, Baobao and Allan Dafoe. “Artificial Intelligence: American Attitudes and Trends.” Oxford, UK: Center for the Governance of AI, Future of Humanity Institute, University of Oxford, 2019. https://ssrn.com/abstract=3312874
  5. Klein, Aaron. “Credit Denial in the Age of AI.” Brookings Institution. April 11, 2019. https://www.brookings.edu/research/credit-denial-in-the-age-of-ai/
  6. Bartlett, Robert, Adair Morse, Richard Stanton, Nancy Wallace, “Consumer-Lending Discrimination in the FinTech Era.” National Bureau of Economic Research, June 2019. https://www.nber.org/papers/w25943
  7. Snyder, Scott. “How Banks Can Keep Up with Digital Disruptors.” Philadelphia, PA: The Wharton School of the University of Pennsylvania, 2017. https://knowledge.wharton.upenn.edu/article/banking-and-fintech/
  8. “Gartner Predicts 75% of Large Organizations Will Hire AI Behavior Forensic Experts to Reduce Brand and Reputation Risk by 2023.” Gartner. June 6, 2019. https://www.gartner.com/en/newsroom/press-releases/2019-06-06-gartner-predicts-75–of-large-organizations-will-hire
  9. United States, Executive Office of the President [Donald Trump]. Executive order 13859: Executive Order on Maintaining American Leadership in Artificial Intelligence. February 11, 2019. https://www.whitehouse.gov/presidential-actions/executive-order-maintaining-american-leadership-artificial-intelligence/
  10. “Building Trust in Human-Centric Artificial Intelligence.” European Commission. April 8, 2019. https://ec.europa.eu/futurium/en/ai-alliance-consultation/guidelines#Top
  11. “G20 Ministerial Statement on Trade and Digital Economy.” June 9, 2019. http://trade.ec.europa.eu/doclib/press/index.cfm?id=2027

AI needs a new developer stack!

In today’s world, data has played a huge role in the success of technology giants like Google, Amazon, and Facebook. All of these companies have built massively scalable infrastructure to process data and provide great product experiences for their users. In the last 5 years, we’ve seen a real emergence of AI as a new technology stack. For example, Facebook built an end-to-end platform called FBLearner that enables an ML Engineer or a Data Scientist to build Machine Learning pipelines, run lots of experiments, share model architectures and datasets with team members, and scale ML algorithms for billions of Facebook users worldwide. Since its inception, millions of models have been trained on FBLearner, and every day these models answer billions of real-time queries to personalize News Feed, show relevant Ads, recommend Friend connections, etc.

However, for most other companies building AI applications remains extremely expensive. This is primarily due to a lack of systems and tools for supporting end-to-end machine learning (ML) application development — from data preparation and labeling to operationalization and monitoring [1][9][10][11].

The goal of this post is 2-fold:

  1. List the challenges with adopting AI successfully: data management, model training, evaluation, deployment, and monitoring;
  2. List the tools I think we need to create to allow developers to meet these challenges: a data-centric IDE with capabilities like explainable recommendations, robust dataset management, model-aware testing, model deployment, measurement, and monitoring capabilities.

Challenges of adopting AI

In order to build an end-to-end ML platform, a data scientist has to jump through the multiple hoops of the following workflow [3].

End-to-End ML Workflow

A big challenge to building AI applications is that different stages of the workflow require new software abstractions that can accommodate complex interactions with the underlying data used in AI training or prediction. For example:

Data Management requires a data scientist to build and operate systems like Hive, Hadoop, Airflow, Kafka, Spark etc to assemble data from different tables, clean datasets, procure labeling data, construct features and make them ready for training. In most companies, data scientists rely on their data engineering teams to maintain this infrastructure and help build ETL pipelines to get feature datasets ready.

Training models is more of an art than science. It requires understanding which features work and what modeling algorithms are suitable to the problem at hand. Although there are libraries like PyTorch, TensorFlow, Scikit-Learn etc, there is a lot of manual work in feature selection, parameter optimization, and experimentation.

Model evaluation is often performed as a team activity, since it requires other people to review the model performance across a variety of metrics, such as AUC, ROC, and Precision/Recall, and to ensure that the model is well calibrated. In the case of Facebook, this was built into FBLearner, where every model created on the platform would get an auto-generated dashboard showing all these statistics.

Deploying models requires data scientists to first pick the optimal model and make it ready to be deployed to production. If the model is going to impact business metrics of the product and will be consumed in a realtime manner, we need to deploy it to only a small % of traffic and run an A/B test with an existing production model. Once the A/B test is positive in terms of business metrics, the model gets rolled out to 100% of production traffic.
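The staged rollout described above needs a stable way to decide which requests see the new model. A minimal sketch, assuming users are identified by a string ID: hash the ID into a bucket so that the same user always lands in the same variant. The function name `assign_variant` and the salt value are hypothetical, chosen only for illustration.

```python
import hashlib


def assign_variant(user_id: str, treatment_percent: float,
                   salt: str = "model-v2-rollout") -> str:
    """Deterministically map a user to 'treatment' (new model) or 'control'.

    treatment_percent is a percentage between 0 and 100. The salt is a
    hypothetical experiment name, so different experiments split users
    independently of each other.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999
    return "treatment" if bucket < treatment_percent * 100 else "control"


# Start with 5% of traffic; raise the percentage once the A/B test is positive.
print(assign_variant("user-42", treatment_percent=5))
```

Because the assignment depends only on the salt and the user ID, raising the percentage keeps every user who was already in the treatment group in that group.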

Inference is closely tied to deployment. There are 2 ways a model can be made available to make predictions:

  • batch inference, where a data pipeline is built to scan through a dataset and make predictions on each record or a batch of records.
  • realtime inference, where a micro-service hosts the model and makes predictions in a low-latency manner.

Monitoring predictions is very important because, unlike traditional applications, model performance is non-deterministic and depends on various factors such as seasonality, new user behavior trends, and data pipeline unreliability leading to broken features. For example, a perfectly functioning Ads model might need to be updated when a new holiday season arrives, or a model trained to show content recommendations in the US may not do very well for users signing up internationally. There is also a need for alerts and notifications to detect model degradation quickly and take action.

As we can see, the workflow to build machine learning models is significantly different from building general software applications. If models are becoming first-class citizens in the modern enterprise stack, they need better tools. As Tesla’s Director of AI Andrej Karpathy succinctly puts it, AI is Software 2.0 and it needs new tools [2].

If we compare the stack of Software 1.0 with 2.0, I claim we require transformational thinking to build the new developer stack for AI.

We need new tools for AI engineering

In Software 1.0, we have seen a vast amount of tooling built in the past few decades to help developers write code, share it with other developers, get it reviewed, debug it, release it to production and monitor its performance. If we were to map these tools in the 2.0 stack, there is a big gap!

What would an ideal Developer Toolkit look like for an AI engineer?

To start with, we need to take a data-first approach as we build this toolkit because, unlike Software 1.0, the fundamental unit of input for 2.0 is data.

Integrated Development Environment (IDE): Traditional IDEs focus on helping developers write code, with features like syntax highlighting, code checkpointing, unit testing, code refactoring, etc.

For machine learning, we need an IDE that allows easy import and exploration of data, and cleaning and massaging of tables. Jupyter notebooks are somewhat useful, but they have their own problems, including the lack of versioning and review tools. A powerful 2.0 IDE would be more data-centric: it would start by allowing the data scientist to slice and dice data, edit the model architecture either via code or UI, and debug the model on egregious cases where it might not be performing well. I see traction in this space with products like StreamLit [13] reimagining IDEs for ML.

Tools like Git, Jenkins, Puppet, Docker have been very successful in traditional software development by taking care of continuous integration and deployment of software. When it comes to machine learning, the following steps would constitute the release process.

Model Versioning: As more models get into production, managing their various versions becomes important. Git can be reused for models; however, it won’t scale for large datasets. The reason to version datasets is that, to reproduce a model, we need a snapshot of the data the model was trained on. Naive implementations could explode the amount of data we’re versioning (think one copy of the dataset per model version). DVC [12], an open-source version control system, is a good start and is gaining momentum.
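One way to avoid a copy of the dataset per model version is to store each snapshot once and have model versions point to it by a content hash. The sketch below illustrates that idea only; it is not how DVC is implemented, and the names `dataset_fingerprint` and `register_model_version` are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a SHA-256 fingerprint of a dataset given as a list of dicts.

    The fingerprint is sensitive to row order and to every value, so any
    change to the training data yields a different fingerprint.
    """
    h = hashlib.sha256()
    for row in rows:
        h.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def register_model_version(registry, model_name, version, rows):
    """Record which dataset snapshot a model version was trained on."""
    fingerprint = dataset_fingerprint(rows)
    registry[(model_name, version)] = fingerprint
    return fingerprint


registry = {}
train_rows = [{"loan_amount": 12000, "defaulted": 0},
              {"loan_amount": 30000, "defaulted": 1}]
# Two model versions trained on the same data share one fingerprint,
# so the snapshot itself only needs to be stored once.
register_model_version(registry, "credit_risk", "v1", train_rows)
register_model_version(registry, "credit_risk", "v2", train_rows)
```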

Unit Testing is another important part of the build & release cycle. For ML, we need unit tests that catch not only code quality bugs but also data quality bugs.
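A data quality check can be written like any other unit test. The following is a minimal sketch, assuming rows arrive as dictionaries and that each column has a known type and valid range; the schema format and the function name `find_data_quality_issues` are assumptions made for illustration.

```python
def find_data_quality_issues(rows, schema):
    """Return a list of human-readable data quality issues.

    schema maps a column name to (expected_type, min_value, max_value).
    """
    issues = []
    for i, row in enumerate(rows):
        for col, (col_type, lo, hi) in schema.items():
            value = row.get(col)
            if value is None:
                issues.append(f"row {i}: {col} is missing")
            elif not isinstance(value, col_type):
                issues.append(f"row {i}: {col} has type {type(value).__name__}")
            elif not lo <= value <= hi:
                issues.append(f"row {i}: {col}={value} outside [{lo}, {hi}]")
    return issues


# Hypothetical schema for a credit model's input features.
schema = {"annual_income": ((int, float), 0, 10_000_000),
          "fico_score": (int, 300, 850)}
rows = [{"annual_income": 55000.0, "fico_score": 710},
        {"annual_income": -10, "fico_score": 710},     # negative income
        {"annual_income": 42000, "fico_score": None}]  # missing score
for issue in find_data_quality_issues(rows, schema):
    print(issue)
```

In a test suite, the assertion would simply be that the returned list is empty for each new batch of training data.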

Canary Tests are minimal tests to quickly and automatically verify that everything we depend on is ready. We typically run canary tests before other time-consuming tests, and before wasting time investigating the code when the other tests are failing [8]. In machine learning, this means being able to replay a previous set of examples on the new model and ensuring that it meets a certain minimal set of conditions.
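The replay idea can be sketched in a few lines. This assumes the model is exposed as a `predict` callable and that the minimal condition is an accuracy floor on the replayed examples; both are simplifying assumptions, and `canary_check` is a hypothetical name.

```python
def canary_check(predict, replay_examples, min_accuracy=0.9):
    """Replay stored (features, expected_label) pairs against a new model.

    Returns (passed, accuracy). The accuracy floor stands in for whatever
    minimal conditions a team chooses for its canary.
    """
    if not replay_examples:
        raise ValueError("need at least one replay example")
    correct = sum(1 for features, expected in replay_examples
                  if predict(features) == expected)
    accuracy = correct / len(replay_examples)
    return accuracy >= min_accuracy, accuracy


# Toy stand-in for a new model: flags loans above a fixed amount.
def new_model(features):
    return int(features["loan_amount"] > 20000)


replay = [({"loan_amount": 5000}, 0), ({"loan_amount": 25000}, 1),
          ({"loan_amount": 40000}, 1), ({"loan_amount": 15000}, 0)]
passed, accuracy = canary_check(new_model, replay)
```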

A/B Testing is a method of comparing two versions of an application change to determine which one performs better [7]. For ML, A/B testing is an experiment where two or more variations of the ML model are exposed to users at random, and statistical analysis is used to determine which variation performs better for a given conversion goal. For example, in the dashboard below, we’re measuring click conversion on an A/B experiment dashboard that my team built at Pinterest, and it shows the performance of the ML experiments against business metrics like repins, likes, etc. CometML [14] lets data scientists keep track of ML experiments and collaborate with their team members.
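The statistical analysis step can be as simple as a two-proportion z-test on conversion counts. The sketch below assumes one binary conversion per user and large enough samples for the normal approximation; it is one common choice of test, not the method used by any dashboard mentioned here.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B.

    Returns (z, p_value) for a two-sided test. Assumes the pooled rate is
    strictly between 0 and 1, otherwise the standard error is zero.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: production model A vs. candidate model B.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```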

Debugging: One of the main features of an IDE is the ability to debug the code and find exactly the line where the error occurred. For machine learning, this becomes a hard problem because models are often opaque, so pinpointing exactly why a particular example was misclassified is difficult. However, if we can understand the relationship between feature variables and the target variable in a consistent manner, it goes a long way in debugging models. This is called model interpretability, and it is an active area of research. At Fiddler, we’re working on a product offering that allows data scientists to debug any kind of model and perform root cause analysis.

Profiling: Performance analysis is an important part of the SDLC in 1.0, and profiling tools allow engineers to find out why an application is slow and improve it. For models, it is also about improving performance metrics like AUC, log loss, etc. Often, a given model has a higher score on an aggregate metric but performs poorly on certain instances or subsets of the dataset. This is where tools like Manifold [5] can enhance the capabilities of traditional performance analysis.
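To see how an aggregate metric can hide a weak subset, it helps to compute the same metric per slice. A minimal sketch, assuming each record carries a label, a prediction, and a slicing attribute; the field names and the data are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Compute accuracy separately for each value of slice_key."""
    totals = defaultdict(lambda: [0, 0])  # slice value -> [correct, count]
    for r in records:
        bucket = totals[r[slice_key]]
        bucket[0] += int(r["label"] == r["prediction"])
        bucket[1] += 1
    return {k: correct / count for k, (correct, count) in totals.items()}


# Hypothetical evaluation records: 8 US users, 2 international users.
records = ([{"country": "US", "label": 1, "prediction": 1}] * 8
           + [{"country": "intl", "label": 1, "prediction": 0}] * 2)
overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_slice = accuracy_by_slice(records, "country")
# overall accuracy looks acceptable, while the "intl" slice fails entirely
```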

Monitoring: While superficially, application monitoring might seem similar to model monitoring and could actually be a good place to start, we need to track a different class of metrics for machine learning. Monitoring is crucial for models that automatically incorporate new data in a continual or ongoing fashion at training time, and is always needed for models that serve predictions in an on-demand fashion. We can categorize monitoring into 4 broad classes:

  • Feature Monitoring: This ensures that features are stable over time and that certain data invariants are upheld. It also allows privacy checks and gives continuous insight into statistics like feature correlations.
  • Model Ops Monitoring: Staleness, regressions in serving latency, throughput, RAM usage, etc.
  • Model Performance Monitoring: Regressions in prediction quality at inference time.
  • Model Bias Monitoring: Unknown introductions of bias both direct and latent.
Annual Income vs Probability of defaulting on a granted Loan in a Credit Risk Model trained on a public lending dataset.
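Feature monitoring, the first class in the list above, is often implemented by comparing the distribution of a feature at serving time against its distribution at training time. The sketch below uses the population stability index (PSI) as one possible drift score. The choice of PSI, the bin edges, and the data are assumptions for illustration; the post does not prescribe a specific metric.

```python
import math


def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI between a baseline sample and a new sample of one feature.

    bin_edges are the interior cut points; values below the first edge fall
    in bin 0. Empty bins are floored at eps to keep the logarithm finite.
    """
    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            idx = sum(1 for edge in bin_edges if v >= edge)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical annual income feature, in thousands of dollars.
training_income = [10, 20, 30, 40] * 25
serving_income = [30, 40] * 50  # low incomes have disappeared
psi_same = population_stability_index(training_income, training_income, [25])
psi_shifted = population_stability_index(training_income, serving_income, [25])
```

A common rule of thumb treats a PSI above roughly 0.25 as a shift worth an alert, though the right threshold depends on the feature and the application.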

Conclusion

I walked through 1) some challenges to successfully deploying AI (data management, model training, evaluation, deployment, and monitoring), 2) some tools I propose we need to meet these challenges (a data-centric IDE with capabilities like slicing & dicing of data, robust dataset management, model-aware testing, and model deployment, measurement, and monitoring capabilities). If you are interested in some of these tools, we’re working on them at Fiddler Labs. And if you’re interested in building these tools, we would love to hear from you at https://angel.co/fiddler-labs

References

  1. https://arxiv.org/pdf/1705.07538.pdf
  2. https://medium.com/@karpathy/software-2-0-a64152b37c35
  3. https://towardsdatascience.com/technology-fridays-how-michelangelo-horovod-and-pyro-are-helping-build-machine-learning-at-uber-28f49fea55a6
  4. https://christophm.github.io/interpretable-ml-book/
  5. https://eng.uber.com/manifold/
  6. http://www.fiddler.ai
  7. https://www.optimizely.com/optimization-glossary/ab-testing/
  8. https://dzone.com/articles/canary-tests
  9. https://research.fb.com/wp-content/uploads/2017/12/hpca-2018-facebook.pdf
  10. http://stevenwhang.com/tfx_paper.pdf
  11. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
  12. https://dvc.org/
  13. https://streamlit.io/
  14. https://www.comet.ml/

A gentle introduction to GA2Ms, a white box model

A gentle introduction to a white box machine learning model called a GA2M, a Generalized Additive Model (GAM) with interaction terms.

This post is a gentle introduction to a white box machine learning model called a GA2M.

We’ll walk through:

  • What is a white box model, and why would you want one?
  • A classic example white box model: logistic regression
  • What is a GAM, and why would you want one?
  • What is a GA2M, and why would you want one?
  • When should you choose a GAM, a GA2M, or something else?

The purpose of all these machine learning models is to make a prediction towards a goal specified by a human. Think of a model that can predict loan default, or the presence of someone’s face in a picture.

The short story: A generalized additive model (GAM) is a white box model that is more flexible than logistic regression, but still interpretable. A GA2M is a GAM with interaction terms, which allows it to be more flexible still, but with a more complicated interpretation. GAMs and GA2Ms are an intriguing addition to your toolbox, interpretable at the expense of not fitting every kind of data. A picture:

For more about what that all means, read on.

White box models

The term “white box” comes from software engineering. It means software whose internals you can view, compared to a “black box” whose internals you cannot view. By this definition, a neural network could be a white box model if you can see the weights (picture credit):

However, by white box people really mean something they can understand. A white box model is a model whose internals a person can see and reason about. This is subjective, but most people would agree the weights shown above don’t give us information about how the model works in such a way as we could usefully describe it, or predict what the model is going to do in the future.

Compare the picture above to this one about risk of death from pneumonia by age from [1]:

Now that isn’t a whole model. Rather, it’s the impact of one feature (age) on the risk score. The green lines are error bars (±1 standard deviation in 100 rounds of bagging). The red line in the middle of them is the best estimate. In the paper, they observe:

  1. Risk score is flat until about age 50. Risk score here is negative, meaning less risk of death than the average in the dataset.
  2. Risk score rises sharply at 65. This could be due to retirement. In future, it might be interesting to gather data about retirement.
  3. The error bars are narrowest in ages 66-85. Perhaps that is where the most data is.
  4. Risk score rises again at 85. The error bars also widen again. Maybe this jump is not real.
  5. Risk score drops above 100. This may be due to lack of data, or something else. In the paper, they suggested one might wish to “fix” this region of the model by changing it to predict at the same level as ages 85-100 instead of dropping. This fix is using domain knowledge (“risk of pneumonia likely doesn’t go down after age 85”) to address possible model artifacts.
  6. Risk score between 66 and 85 is relatively flat.

All this from one graph of one model feature. There are facts, like the shape of the graph, and then speculation about why the graph might behave that way. The facts are useful to understand the data. The speculation cannot be answered by any tool, but may be useful to suggest further actions, like collecting new features (say, about retirement) or new instances (like points below age 50 or above 100), or new analyses (like looking carefully at data instances around ages 85-86 for differences).

These aren’t simulations of what the model would do. These are the internals of the model itself, so that graph is accurately describing the exact effect of age on risk score. There are 55 other components to this model, but each can be examined and reasoned about.

This is the power of a white box model.

This example also shows the dangers. By seeing everything, we may believe we understand everything, and speculate wildly or “fix” inappropriately. As always, we have to exercise judgment to use data properly.

In summary: make a white box model to

  • learn about your model, not from simulations or approximations, but the actual internals
  • improve your model, by giving you ideas of directions to pursue
  • “fix” your model, i.e., align it with your intuition or domain knowledge

One final possibility: regulations dictate that you need to fully describe your model. In that case, it could be useful to have human-readable internals for reference.

Here are some examples of white box and black box models:

White box models:

  • Logistic regression
  • GAMs
  • GA2Ms
  • Decision trees (short and few trees)

Black box models:

  • Neural networks (including deep learning)
  • Boosted trees and random forests (many trees)
  • Support vector machines

Now let’s walk through three specific white box models.

A classic: logistic regression

Logistic regression was developed in the early 1800s, and re-popularized in the 1900s. It’s been around for a long time, for many reasons. It solves a common problem (predict the probability of an event), and it’s interpretable. Let’s explore what that means. Here is the logistic equation defining the model:
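In standard notation, and consistent with the variable definitions and the log odds discussion that follow, the equation can be written as below. The intercept term β0 and the feature count m are assumptions of this standard form.

```latex
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m
```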

There are three types of variables in this model equation:

  • p is the probability of an event we’re predicting. For example, defaulting on a loan
  • The x’s are features. For example, loan amount.
  • The 𝛽’s (betas) are the coefficients, which we fit using a computer.

The betas are fit once to the entire dataset. The x’s are different for each instance in the dataset. The p represents an aggregate of dataset behavior: any dataset instance either happened (1) or didn’t (0), but in aggregate, we’d like the right-hand side and the left-hand side to be as close as possible.

The “log(p/(1-p))” is the log odds, also called the “logit of the probability”. The odds are (probability the event happens)/(probability the event doesn’t happen), or p/(1-p). Taking the natural logarithm of the odds translates p, which takes the range 0 to 1, into a quantity that can range from -∞ to +∞, suitable for a linear model.

This model is linear, but for the log odds. That is, the right-hand side is a linear equation, but it is fit to the log odds, not the probability of an event.

This model is interpretable as follows: a unit increase in xi corresponds to a log-odds increase of 𝛽i.

For example, suppose we’re predicting probability of loan default, and our model has a feature coefficient 𝛽1=0.15 for the loan amount feature x1. That means a unit increase in the feature corresponds to a log odds increase of 0.15 in default. We can take the natural exponent to get the odds ratio, exp(0.15)=1.1618. That means:

for this model, a unit increase (of say, a thousand dollars) in loan amount corresponds to a 16% increase in the odds of loan default, holding all other factors constant.

This statement is what people mean when they say logistic regression is interpretable.
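The arithmetic behind that statement is short enough to check directly. The snippet below reproduces the odds ratio from the coefficient 𝛽1=0.15 and then shows what it does to one baseline probability; the 10% baseline is an assumed value for illustration.

```python
import math

beta_1 = 0.15                    # coefficient for loan amount, from the text
odds_ratio = math.exp(beta_1)    # multiplicative change in odds per unit
print(round(odds_ratio, 4))      # 1.1618

# Effect on a probability: odds scale by the odds ratio, probabilities do not.
baseline_p = 0.10                          # assumed baseline default probability
baseline_odds = baseline_p / (1 - baseline_p)
new_odds = baseline_odds * odds_ratio      # odds after a one-unit increase
new_p = new_odds / (1 + new_odds)          # convert odds back to a probability
print(round(new_p, 4))
```

Note that the 16% increase applies to the odds. The resulting change in probability depends on the baseline, which is why the interpretation is stated in terms of odds.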

To summarize why logistic regression is a white box model:

  • The input response terms (𝛽ixi terms) can be interpreted independently of each other
  • The terms are in interpretable units: the coefficients (betas) are in units of log odds.

So why would we use anything other than the friendly, venerable model of logistic regression?

Well, if the features and log odds don’t have a linear relationship, this model won’t fit well. I always think of trying to fit a line to a parabola:

If you have non-linear data (the black parabola), a linear fit (the blue dashed line) will never be great. No line fits the curve.
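A small numeric check makes the same point. This sketch fits an ordinary least squares line to points sampled from y = x² on a symmetric range; the sample points are chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [x / 2 for x in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
ys = [x ** 2 for x in xs]           # the parabola
slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
max_abs_residual = max(abs(r) for r in residuals)
print(slope, intercept, max_abs_residual)
```

The best line is flat, sitting at the average height of the parabola, and its error at the ends of the range is larger than the fitted value itself.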

Generalized Additive Models (GAMs)

Generalized Additive Models (GAMs) were developed in the 1990s by Hastie and Tibshirani. (See also chapter 9 of their book “The Elements of Statistical Learning”.) Here is the equation defining the model:
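In standard notation, and consistent with the elements described next, the equation can be written as below. The intercept β0 is an assumption of this standard form.

```latex
g\big(E(Y)\big) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)
```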

This equation is quite similar to logistic regression. It has the same three types of elements:

  • E(Y) is an aggregate of dataset behavior, like the “p” in the equation above. In fact, it may well be the probability of an event, the same p.
  • g(.) is a link function, like the logit (or log odds) from the logistic equation above.
  • fi(xi) is a term for each dataset instance feature x1,…,xm.

The big difference is instead of a linear term 𝛽ixi for a feature, now we have a function fi(xi). In their book, Hastie and Tibshirani specify a “smooth” function like a cubic spline. Lou et al. [2] looked at other functions for the fi, which they call “shape functions.”

A GAM also has white box features:

  • The input response terms (fi(xi) terms) can be interpreted independently of each other
  • The terms are in interpretable units. For the logit link function, these are log odds.

Now a term, instead of being a constant (beta), is a function, so instead of reporting the log odds as a number, we visualize it with a graph. In fact, the graph above of pneumonia risk of death by age is one term (shape function) in a GAM.
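To make the additive structure concrete, here is a minimal sketch of how a fitted GAM with a logit link turns an instance into a prediction: look up each feature's shape function value, add them to the intercept, and apply the inverse link. The step-shaped functions and every number below are invented for illustration, not taken from the pneumonia model.

```python
import bisect
import math


def make_step_function(edges, values):
    """Piecewise-constant shape function.

    values[0] applies below edges[0], values[i] applies from edges[i-1]
    up to (not including) edges[i], and values[-1] applies from edges[-1].
    """
    if len(values) != len(edges) + 1:
        raise ValueError("need exactly one more value than edges")

    def shape(x):
        return values[bisect.bisect_right(edges, x)]

    return shape


def gam_predict(intercept, shape_functions, instance):
    """Return (probability, per-feature log odds contributions)."""
    contributions = {name: f(instance[name])
                     for name, f in shape_functions.items()}
    log_odds = intercept + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))  # inverse of the logit
    return probability, contributions


# Hypothetical shape functions, in units of log odds.
shape_functions = {
    "age": make_step_function([50, 65, 85], [-0.3, 0.0, 0.4, 0.9]),
    "respiration_rate": make_step_function([20, 30], [-0.1, 0.2, 0.7]),
}
probability, contributions = gam_predict(
    -2.0, shape_functions, {"age": 70, "respiration_rate": 24})
print(contributions, round(probability, 3))
```

Each entry in `contributions` can be read on its own, which is exactly the white box property: the age term says what age adds to the log odds regardless of the other features.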

So why would we use anything other than a GAM? It’s already flexible and interpretable. Same reason as before: it might not be accurate enough. In particular, we’ve assumed that each feature response can be modeled with its own function, independent of the others.

But what if there are interactions between the features? Several black box models (boosted trees, neural networks) can model interaction terms. Let us walk through a white box model that also can: GA2Ms.

GAMs with interaction terms (GA2Ms)

GA2Ms were investigated in 2013 by Lou et al. [3]. The authors pronounce them with the letters “gee ay two em”, but in house we’ve taken to calling them “interaction GAMs” because it’s more pronounceable. Here is the model equation:
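In standard notation, and consistent with the description that follows, the equation can be written as below. The intercept β0 is an assumption of this standard form, and in practice the second sum runs only over a selected subset of feature pairs.

```latex
g\big(E(Y)\big) = \beta_0 + \sum_{i} f_i(x_i) + \sum_{i < j} f_{ij}(x_i, x_j)
```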

This equation is quite similar to the GAM equation from the previous section, except it adds functions that can account for two feature variables at once, i.e. interaction terms.

Microsoft just released a library, InterpretML, that implements GA2Ms in Python. In that library, they call them “Explainable Boosting Machines.”

Lou et al. say these are still white box models because the “shape function” for an interaction term is a heatmap. The two features are along the X and Y axes, and the color in the middle shows the function response. Here is an example from Microsoft’s library, fit to predict loan default on a dataset of loan performance from Lending Club:

For this example graph:

  • The upper right corner is the most red. That means the probability of default goes up the most when dti (debt to income ratio) and fico_range_midpoint (the FICO credit score) are both high.
  • The left strip is also red, but turns blue near the bottom. That means that very low dti is usually bad, except if fico_range_midpoint is also low.

This particular heatmap is hard to reason about. This is likely only the interaction effect without the single-feature terms. So, it could be that the probability of default overall isn’t higher at high-dti and high-fico, but rather just higher than either of the primary effects predict by themselves. To investigate further, we could probably look at some examples around the borders. But, for this blog post, we’ll skip the deep dive.
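A toy calculation shows why an interaction heatmap alone can mislead. In the sketch below, every number is invented: the interaction term is positive (red) for high dti and high FICO, yet the total log odds for that cell is still the lowest, because the single-feature FICO term dominates.

```python
# All values are hypothetical log odds contributions, not fitted values.
intercept = -2.0
main_dti = {"low": -0.4, "high": 0.6}     # single-feature term for dti
main_fico = {"low": 0.8, "high": -1.0}    # single-feature term for FICO
interaction = {("high", "high"): 0.3}     # pairwise term; other cells are 0


def total_log_odds(dti, fico):
    """Intercept plus both main effects plus the pairwise interaction."""
    return (intercept + main_dti[dti] + main_fico[fico]
            + interaction.get((dti, fico), 0.0))


for dti in ("low", "high"):
    for fico in ("low", "high"):
        print(dti, fico, round(total_log_odds(dti, fico), 2))
```

Here the high-dti, high-FICO cell is the reddest cell of the interaction heatmap, but its total is lower than any other combination. The interaction only says that risk is higher than the two single-feature terms would predict by themselves.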

In practice, this library fits all single-feature functions, then N interaction terms, where you pick N. It is not easy to pick N. The interaction terms are worthwhile if they add enough accuracy to be worth the extra complexity of staring at heatmaps to interpret them. That is a judgement call that depends on your business situation.

When should we use GAMs or GA2Ms?

To perform machine learning, first pick a goal. Then pick a technology that will best use your data to meet the goal. There are thousands of books and millions of papers on that subject. But here is a drastically simplified way to think about how GA2Ms fit into possible model technologies: they are on a spectrum from interpretability to modeling feature interactions.

  • Use GAMs if they are accurate enough. They give the advantages of a white box model: separable terms with interpretable units.
  • Use GA2Ms if they are significantly more accurate than GAMs, especially if you believe from your domain knowledge that there are real feature interactions, but they are not too complex. These also give the advantages of a white box model, with more effort to interpret.
  • Try boosted trees (xgboost or lightgbm) if you don’t know a lot about the data, since they are quite robust to quirks in data. These are black box models.
  • When features interact highly with each other, like pixels in images or the context in audio, you may well need neural networks or something else that can capture complex interactions. These are deeply black box.

In all cases, you may well need domain-specific data preprocessing, like squaring images, or standardizing features (subtracting the mean and dividing by the standard deviation). That is a topic for another day.

Now hopefully the diagram we started with makes more sense.

At Fiddler Labs, we help you explain your AI. Email us at info@fiddler.ai.

References

  1. Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-Day Readmission.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721–1730. KDD ’15. New York, NY, USA: ACM, 2015. https://doi.org/10.1145/2783258.2788613.
  2. Lou, Yin, Rich Caruana, and Johannes Gehrke. “Intelligible Models for Classification and Regression.” In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150–158. KDD ’12. New York, NY, USA: ACM, 2012. https://doi.org/10.1145/2339530.2339556.
  3. Lou, Yin, Rich Caruana, Giles Hooker, and Johannes Gehrke. Accurate Intelligible Models with Pairwise Interactions, 2017. https://www.microsoft.com/en-us/research/publication/accurate-intelligible-models-pairwise-interactions/.

Taming the Data: Complexities in Machine Learning and Explainable AI

FinTech Horizons

A few weeks ago, I was on the panel for Explainable AI at the PRMIA Fintech Horizons Conference in SF. The participants were predominantly from the finance industry like Banks, Hedge Funds and Fintech Startups.

We had a very interesting discussion on topics like:

  • Automated AI vs Human-Centered AI
  • How catastrophic can it be when a Business Risk is left unmanaged?
    • Example: Boeing 737 Max 8 failure
  • Special challenges in Quantitative Finance with AI. Can we quantify Model Risk in terms of a $-value?
  • Who in the organizations needs to care about Explainable AI? Is it the Data Scientist? Chief Risk Officer? Business Owner?
Podcast of the discussion

Participants:

Bob Mark – Former CRO & Treasurer of CIBC, Managing Partner Black Diamond Risk – Moderator

Krishna Gade – Founder & CEO, Fiddler Labs 

Jos Gheerardyn – Founder & CEO, Yields.IO

Hersh Shefrin – Mario L. Belotti Professor of Finance, Santa Clara University

Starting from left: Bob Mark, Jos Gheerardyn, Krishna Gade, and Hersh Shefrin

Welcome Ankur Taly!

We’re excited to introduce Ankur Taly, the newest member of our team. Ankur joins us as the Head of Data Science.

Previously, he was a Staff Research Scientist at Google Brain, where he worked on Machine Learning Interpretability and was best known for his contribution to developing and applying Integrated Gradients, a new interpretability algorithm for Deep Neural Networks. Ankur has a broad research background and has published in several areas including Computer Security, Programming Languages, Formal Verification, and Machine Learning. Ankur obtained his Ph.D. in CS from Stanford University and a B. Tech in CS from IIT Bombay.

Ankur passionately believes in the need for Explainable AI and is excited to join Fiddler Labs. In his own words:

“Explainability is one of the key missing pieces in the ongoing machine learning (ML) revolution. As ML models continue to become more complex and opaque, being able to explain their predictions is getting increasingly important and challenging. The ability to explain a model’s predictions would enable users to build trust in the model, business stakeholders to derive actionable insights and strategies, regulators to assess model fairness and risk, and data scientists to iterate on the model in a principled manner. In response to this need, there has been a large surge in research on explaining various aspects of ML models. Fiddler Labs has taken up the ambitious task of driving this research to industrial practice, by making it available as a cutting-edge enterprise product catering to several business needs. This is incredibly promising, and I am super excited to join Fiddler on this journey!”

Ankur Taly

Humans choose, AI does not

For the non-technical reader who sees scary headlines: every AI has a goal, usually a labeled dataset, precisely defined by a human.

Artificial intelligence isn’t human

“Artificial Intelligence Will Best Humans at Everything by 2060, Experts Say.” Well.

First, as Yogi Berra said, “It’s tough to make predictions, especially about the future.” Where is my flying car?

Second, the title reads like clickbait, but surprisingly it appears to be pretty close to the actual survey, which asked AI researchers when “high-level machine intelligence” will arrive, defined as “when unaided machines can accomplish every task better and more cheaply than human workers.” What is a ‘task’ in this definition? Does “every task” even make sense? Can we enumerate all tasks?

Third and most important, is high-level intelligence just accomplishing tasks? This is the real difference between artificial and human intelligence: humans define goals, AI tries to achieve them. Is the hammer going to displace the carpenter? They each have a purpose.

This difference between artificial and human intelligence is crucial to understand, both to interpret all the crazy headlines in the popular press, and more importantly, to make practical, informed judgements about the technology.

The rest of this post walks through some types of artificial intelligence, types of human intelligence, and given how different they are, plausible and implausible risks of artificial intelligence. The short story: unlike humans, every AI technology has a perfectly mathematically well-defined goal, often a labeled dataset.

Types of artificial intelligence

In supervised learning, you define a prediction goal and gather a training set with labels corresponding to the goal. Suppose you want to identify whether a picture has Denzel Washington in it. Then your training set is a set of pictures, each labeled as containing Denzel Washington or not. The label has to be applied outside of the system, most likely by people. If your goal is to do facial recognition, your labeled dataset is pictures along with a label (the person in the picture). Again, you have to gather the labels somehow, likely with people. If your goal is to match a face with another face, you need a label of whether the match was successful or not. Always labels.
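To make "always labels" concrete, here is a minimal sketch in Python. It assumes a toy dataset in which two made-up numbers stand in for each picture, and it uses a hypothetical nearest-centroid rule as the model. The only point is that the labels come from outside the system.

```python
from statistics import mean

# Toy labeled dataset: (features, label). Features and labels are made up;
# in practice a person would supply each label.
training_set = [
    ((0.9, 0.8), "denzel"),
    ((0.8, 0.9), "denzel"),
    ((0.1, 0.2), "not_denzel"),
    ((0.2, 0.1), "not_denzel"),
]

def fit_centroids(examples):
    """Average the feature vectors for each label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(dim) for dim in zip(*rows))
        for label, rows in by_label.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the features."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))

centroids = fit_centroids(training_set)
prediction = predict(centroids, (0.85, 0.75))
```

Without the human-supplied labels in `training_set`, there is nothing for `fit_centroids` to learn from.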

Almost all the machine learning you read about is supervised learning. Deep learning, neural networks, decision trees, random forests, logistic regression: all are trained on labeled datasets.

In unsupervised learning, again you define a goal. A very common unsupervised learning technique is clustering (e.g., the well-known k-means clustering). Again, the goal is very well-defined: find clusters minimizing some mathematical cost function. For example, where the distance between points within the same cluster is small, and the distance between points not within the same cluster is large. All of these goals are so well-defined they have mathematical formalism:
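For k-means, one standard way to write this goal is the following. It is shown as an illustration; the symbols $S_i$ for the $i$-th cluster and $\mu_i$ for its mean are notation assumed here.

```latex
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```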

This formula feels very different from how humans specify goals. Most humans don’t understand these symbols at all, and human goals are rarely this formal. Also, a “goal-oriented” mindset in a human is unusual enough that it has a special term.

In reinforcement learning, you define a reward function to reward (or penalize) actions that move towards a goal. This is the technology people have been using recently for games like chess and Go, where it may take many actions to reach a particular goal (like checkmate), so you need a reward function that gives hints along the way. Again, not only a well-defined goal, but even a well-defined on-the-way-to-goal reward function.
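As a rough sketch of what "a reward function that gives hints along the way" can look like, consider a hypothetical game played on squares 0 through 4, where the goal is to reach square 4. The reward values and the discount factor `gamma` below are assumptions chosen for illustration.

```python
GOAL = 4  # hypothetical goal position on a line of squares 0..4

def reward(state, next_state):
    """Reward for one move: big payoff at the goal, small hints along the way."""
    if next_state == GOAL:
        return 1.0
    # Hint: a small bonus for moving closer, a small penalty otherwise.
    return 0.1 if abs(GOAL - next_state) < abs(GOAL - state) else -0.1

def discounted_return(states, gamma=0.9):
    """Sum of rewards along a path, with later rewards discounted by gamma."""
    total = 0.0
    for step, (s, s_next) in enumerate(zip(states, states[1:])):
        total += (gamma ** step) * reward(s, s_next)
    return total

direct = discounted_return([0, 1, 2, 3, 4])
wandering = discounted_return([0, 1, 0, 1, 2, 3, 4])
```

Under this reward function the direct path scores higher than the wandering one, so a learner maximizing the return is pushed toward the goal the human wrote down.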

These are types of artificial intelligence (“machine learning”) that are currently hot because of recent huge gains in accuracy, but there are plenty of others that people have studied.

Genetic algorithms are another way of solving problems inspired by biology. One takes a population of mathematical constructs (essentially functions), and selects those that perform best on a problem. Although people get emotional about the biological analogy, still the fitness function that defines “best” is a concrete, completely specified mathematical function chosen by a human.
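Here is a minimal sketch of that idea, assuming a deliberately trivial problem (find an integer close to a hypothetical `TARGET`) and a very simple select-and-mutate loop. Real genetic algorithms usually add crossover and richer encodings; what matters here is that `fitness` is an ordinary function written by a person.

```python
import random

TARGET = 42  # hypothetical problem: find the integer closest to TARGET

def fitness(candidate):
    """Human-chosen fitness function: higher is better, best possible is 0."""
    return -abs(candidate - TARGET)

def evolve(generations=200, population_size=20, seed=0):
    rng = random.Random(seed)  # seeded so the run is repeatable
    population = [rng.randint(0, 100) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        # Mutation: each survivor produces one slightly altered child.
        children = [s + rng.choice([-1, 0, 1]) for s in survivors]
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
```

Because the fittest candidates always survive, the best fitness never gets worse from one generation to the next. Change `fitness` and the same loop pursues a different goal.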

There is computer-generated art. For example, deep dream (gallery) is a way to generate images from deep learning neural networks. This would seem to be more human and less goal-oriented, but in fact people are still directing. The authors described the goal at a high level: “Whatever you see there, I want more of it!” Depending on which layer of the network is asked, the features amplified might be low level (like lines, shapes, edges, colors, see the addaxes below) or higher level (like objects).

Original photo of addaxes by Zachi Evenor, processed photo from Google

Expert systems are a way to make decisions using if-then rules on a formally expressed body of knowledge. They were somewhat popular in the 1980s. These are a type of “Good Old Fashioned Artificial Intelligence” (GOFAI), a term for AI based on manipulating symbols.

Another common difference between human and artificial intelligence is that humans learn over a long time, while AI is often retrained from the beginning for each problem. This difference, however, is being narrowed. Transfer learning is the process of training a model, and then using or tweaking the model for use in a different context. This is industry practice in computer vision, where deep learning neural networks are often trained using features from previously trained networks (example).

One interesting research project in long-term learning is NELL, Never-Ending Language Learning. NELL crawls the web collecting text, and tries to extract beliefs (facts) like “airtran is a transportation system”, along with a confidence. It’s been crawling since 2010, and as of July 2017 has accumulated over 117 million candidate beliefs, of which 3.8 million are high-confidence (at least 0.9 of 1.0).

In every case above, humans not only specify a goal, but have to specify it unambiguously, often even formally with mathematics.

Types of human intelligence

What are the types of human intelligence? It’s hard to even come up with a list. Psychologists have been studying this for decades. Philosophers have been wrestling with it for millennia.

IQ (the Intelligence Quotient) is measured with verbal and visual tests, sometimes abstract. It is predicated on the idea that there is a general intelligence (sometimes called the “g factor”) common to all cognitive ability. This idea is not accepted by everyone, and IQ itself is hotly debated. For example, some believe that people with the same latent ability but from different demographic groups might be measured differently, called Differential Item Functioning, or simply measurement bias.

People describe fluid intelligence (the ability to solve novel problems) and crystallized intelligence (the ability to use knowledge and experience).

The concept of emotional intelligence shows up in the popular press: the ability of a person to recognize their own emotions and those of others, and use emotional thinking to guide behavior. It is unclear how accepted this is by the academic community.

More widely accepted are the Big Five personality traits: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. This is not intelligence (or is it?), but illustrates a strong difference with computer intelligence. “Personality” is a set of stable traits or behavior patterns that predict a person’s behavior. What is the personality of an artificial intelligence? The notion doesn’t seem to apply.

With humor, art, or the search for meaning, we get farther and farther from well-defined problems, yet closer and closer to humanity.

Risks of artificial intelligence

Can artificial intelligence surpass human intelligence?

One risk that captures the popular press is The Singularity. The writer and mathematician Vernor Vinge gives a compelling description in an essay from 1993: “Within thirty years, we will have the technological means to create superhuman intelligence. Shortly after, the human era will be ended.”

There are at least two ways to interpret this risk. The common way is that some magical critical mass will cause a phase change in machine intelligence. I’ve never understood this argument. “More” doesn’t mean “different.” The argument is something like “As we mimic the human brain closely, something near-human (or super-human) will happen.” Maybe?

Yes, the availability of lots of computing power and lots of data has resulted in a phase change in AI results. Speech recognition, automatic translation, computer vision, and other problem domains have been completely transformed. In 2012, when researchers at Toronto re-applied neural networks to computer vision, the error rates on a well-known dataset started dropping fast, until within a few years the computers were beating the humans. But computers were doing the same well-defined task as before, only better.

ImageNet Large Scale Visual Recognition Challenge error rates. 0.1 is a 10% error rate. After neural networks were re-applied in 2012, error rates dropped fast. They beat Andrej Karpathy by 2015.

The more compelling observation is: “The chance of a singularity might be small, but the consequences are so serious we should think carefully.”

Another way to interpret the risk of the singularity is that the entire system will have a phase change. That system includes computers, technology, networks, and humans setting goals. This seems more correct and entirely plausible. The Internet was a phase change, as were mobile phones. Under this interpretation, there are plenty of plausible risks of AI.

One plausible risk is algorithmic bias. If algorithms are involved in important decisions, we’d like them to be trustworthy. (In a previous post, we discussed how to measure algorithmic fairness.)

Tay, a Microsoft chatbot, was taught by Twitter to be racist and woman-hating within 24 hours. But Tay didn’t really understand anything; it just “learned” and mimicked.

Tay, the offensive chatbot.

Amazon’s facial recognition software, Rekognition, falsely matched 28 U.S. Congresspeople (disproportionately people of color) with known criminals. Amazon’s response was that the ACLU (who conducted the test) used an unreliable cutoff of only 80 percent confidence. (They recommended 95 percent.)

MIT researcher Joy Buolamwini showed that gender identification error rates in several computer vision systems were much higher for people with dark skin.

All of these untrustworthy results arise at least partially from the training data. In Tay’s case, it was deliberately fed hateful data by Twitter users. In the computer vision systems, there may well have been less data for people of color.

Another plausible risk is automation. As artificial intelligence becomes more cost-efficient at solving problems like driving cars or weeding farm plots, the people who used to do those tasks may be thrown out of work. This is the risk of AI plus capitalism: businesses will each try to be cost effective. We can only address this at a societal level, which makes it very difficult.

One final risk is bad goals, possibly aggravated by single-mindedness. This is memorably illustrated by the paper-clip problem, first described by Nick Bostrom in 2003: “It also seems perfectly possible to have a superintelligence whose sole goal is something completely arbitrary, such as to manufacture as many paperclips as possible, and who would resist with all its might any attempt to alter this goal. For better or worse, artificial intellects need not share our human motivational tendencies.” There is even a web game inspired by this idea.

Understand your goals

How do we address some of the plausible risks above? A complete answer is another full post (or book, or lifetime). But let’s mention one piece: understand the goals you’ve given your AI. Since all AI is simply optimizing a well-defined mathematical function, that is the language you use to say what problem you want to solve.

Does that mean you should start reading up on integrals and gradient descent algorithms? I can feel your eyeballs closing. Not necessarily!

The goals are a negotiation between what your business needs (human language) and how it can be measured and optimized (AI language). You need people to speak to both sides. That is often a business or product owner in collaboration with a data scientist or quantitative researcher.

Let me give an example. Suppose you want to recommend content using a model. You choose to optimize the model to increase engagement with the content, as measured by clicks. Voila, now you understand one reason the Internet is full of clickbait: the goal is wrong. You actually care about more than clicks. Companies modify the goal without touching the AI by trying to filter out content that doesn’t meet policies. That is one reasonable strategy. Another strategy might be to add a heavy penalty to the training dataset if the AI recommends content later found to be against policy. Now we are starting to really think through how our goal affects the AI.
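The effect of changing the goal can be sketched in a few lines of Python. The items, click probabilities, and penalty size below are all made up, and the sketch applies the penalty in a scoring function rather than in a training dataset, which is a simplification of the strategy described above.

```python
# Hypothetical items: predicted click probability and whether the content
# was later found to violate policy. All numbers are made up.
items = [
    {"title": "You won't believe this trick", "p_click": 0.30, "violates_policy": True},
    {"title": "In-depth tutorial", "p_click": 0.12, "violates_policy": False},
    {"title": "Weekly news summary", "p_click": 0.08, "violates_policy": False},
]

def clicks_only(item):
    """Goal 1: expected clicks and nothing else."""
    return item["p_click"]

def clicks_with_penalty(item, penalty=1.0):
    """Goal 2: expected clicks minus a heavy penalty for policy violations."""
    return item["p_click"] - (penalty if item["violates_policy"] else 0.0)

top_by_clicks = max(items, key=clicks_only)["title"]
top_with_penalty = max(items, key=clicks_with_penalty)["title"]
```

With the first goal the clickbait item wins; with the second, the tutorial does. The optimization step (`max`) is identical in both cases, and only the human-written goal differs.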

This example also explains why content systems can be so jumpy: you click on a video on YouTube, or a pin on Pinterest, or a book on Amazon, and the system immediately recommends a big pile of things that are almost exactly the same. Why? The click is usually measured in the short-term, so the system optimizes for short-term engagement. This is a well-known recommender challenge, centered around mathematically defining a good goal. Perhaps a part of the goal should be whether the recommendation is irritating, or whether there is long-term engagement.

Another example: if your model is accurate, but your dataset or measurements don’t look at under-represented minorities in your business, you may be performing poorly for them. Your goal may really be to be accurate for all sorts of different people.

A third example: if your model is accurate, but you don’t understand why, that might be a risk for some critical applications, like healthcare or finance. If you have to understand why, you might need to use a human-understandable (“white box”) model, or explanation technology for the model you have. Understandability can be a goal.

Conclusion: we need to understand AI

AI cannot fully replace humans, despite what you read in the popular press. The biggest difference between human and artificial intelligence is that only humans choose goals. So far, AIs do not.

If you can take away one thing about artificial intelligence: understand its goals. Any AI technology has a perfectly well-defined goal, often a labeled dataset. To the extent the definition (or dataset) is flawed, so too will be the results.

One way to understand your AI better is to explain your models. We formed Fiddler Labs to help. Feel free to reach us at info@fiddler.ai.

A gentle introduction to algorithmic fairness

A gentle introduction to issues of algorithmic fairness: some U.S. history, legal motivations, and four definitions with counterarguments.

History

In the United States, there is a long history of fairness issues in lending.

For example, redlining:

‘In 1935, the Federal Home Loan Bank Board asked the Home Owners’ Loan Corporation to look at 239 cities and create “residential security maps” to indicate the level of security for real-estate investments in each surveyed city. On the maps, “Type D” neighborhoods were outlined in red and were considered the most risky for mortgage support..

‘In the 1960s, sociologist John McKnight coined the term “redlining” to describe the discriminatory practice of fencing off areas where banks would avoid investments based on community demographics. During the heyday of redlining, the areas most frequently discriminated against were black inner city neighborhoods…’

Redlining is clearly unfair, since the decision to invest was not based on an individual homeowner’s ability to repay the loan, but rather on location; and that basis systematically denied loans to one racial group, black people. In fact, part 1 of a Pulitzer Prize-winning series in the Atlanta Journal-Constitution in 1988 suggests that location was more important than income: “Among stable neighborhoods of the same income [in metro Atlanta], white neighborhoods always received the most bank loans per 1,000 single-family homes. Integrated neighborhoods always received fewer. Black neighborhoods — including the mayor’s neighborhood — always received the fewest.”

Legislation such as the 1968 Fair Housing Act and the 1977 Community Reinvestment Act were passed to combat these sorts of unfair practices in housing and lending.

More recently, in 2018, WUNC reported that blacks and latinos in some cities in North Carolina were denied mortgages at higher rates than whites:

“Lenders and their trade organizations do not dispute the fact that they turn away people of color at rates far greater than whites. But they maintain that the disparity can be explained by two factors the industry has fought to keep hidden: the prospective borrowers’ credit history and overall debt-to-income ratio. They singled out the three-digit credit score — which banks use to determine whether a borrower is likely to repay a loan — as especially important in lending decisions.”

The WUNC example raises an interesting point: it is possible to look unfair via one measure (loan rates by demographic), but not by another (ability to pay as judged by credit history and debt-to-income ratio). Measuring fairness is complicated. In this case, we can’t tell if the lending practices are fair because the data on credit history and debt-to-income ratio for these particular groups are not available to us to evaluate lenders’ explanations of the disparity.

In 2007, the Federal Reserve Board (FRB) reported on credit scoring and its effects on the availability and affordability of credit. They concluded that the credit characteristics included in credit history scoring models are not a proxy for race, although different demographic groups have substantially different credit scores on average, and “for given credit scores, credit outcomes — including measures of loan performance, availability, and affordability — differ for different demographic groups.” This FRB study supports the lenders’ claims that credit score might explain disparity in mortgage denial rates (since demographic groups have different credit scores), while also pointing out that credit outcomes are different for different groups.

Is this fair or not?

Defining fairness

As machine learning (ML) becomes widespread, there is growing interest in fairness, accountability, and transparency in ML (e.g., the FAT* conference and FAT/ML workshops).

Some researchers say that fairness is not a statistical concept, and no statistic will fully capture it. There are many statistical definitions that people try to relate to (if not define) fairness.

First, here are two legal concepts that come up in many discussions on fairness:

  1. Disparate treatment: “unequal behavior toward someone because of a protected characteristic (e.g., race or gender) under Title VII of the United States Civil Rights Act.” Redlining is disparate treatment if the intent is to deny black people loans.
  2. Disparate impact: “practices .. that adversely affect one group of people of a protected characteristic more than another, even though rules applied .. are formally neutral.” (“The disparate impact doctrine was formalized in the landmark U.S. Supreme Court case Griggs v. Duke Power Co. (1971). In 1955, the Duke Power Company instituted a policy that mandated employees have a high school diploma to be considered for promotion, which had the effect of drastically limiting the eligibility of black employees. The Court found that this requirement had little relation to job performance, and thus deemed it to have an unjustified — and illegal — disparate impact.” [Corb2018])

[Lipt2017] points out that these are legal concepts of disparity, and creates corresponding terms for technical concepts of parity applied to machine learning classifiers:

  1. Treatment parity: a classifier should be blind to a given protected characteristic. Also called anti-classification in [Corb2018], or “fairness through unawareness.”
  2. Impact parity: the fraction of people given a positive decision should be equal across different groups. This is also called demographic parity, statistical parity, or independence of the protected class and the score [Fair2018].
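Impact parity is straightforward to measure. A minimal sketch, assuming made-up loan decisions and placeholder group names:

```python
from collections import defaultdict

# Made-up decisions: (group, approved). Group names are placeholders.
decisions = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]

def positive_rates(records):
    """Fraction of positive decisions per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, approved in records:
        totals[group] += 1
        positives[group] += int(approved)
    return {g: positives[g] / totals[g] for g in totals}

rates = positive_rates(decisions)
parity_gap = max(rates.values()) - min(rates.values())
```

Impact parity asks for `parity_gap` to be zero (or close to it). Treatment parity, by contrast, is a property of the classifier's inputs: the group attribute is simply never given to the model.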

There is a large body of literature on algorithmic fairness. From [Corb2018], two more definitions:

  1. Classification parity: some given measure of classification error is equal across groups defined by the protected attributes. [Hard2016] calls this equal opportunity if the measure is the true positive rate, and equalized odds if two measures are equalized, the true positive rate and the false positive rate.
  2. Calibration: outcomes are independent of protected attributes conditional on risk score. That is, reality conforms to risk score. For example, about 20% of all loans predicted to have a 20% chance of default actually do.
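Both definitions can be checked directly from prediction records. The sketch below assumes made-up loan records with placeholder group names; `error_rates` gives the quantities compared under classification parity, and `observed_rates` gives the per-group, per-score outcome rates compared under calibration.

```python
from collections import defaultdict

# Made-up records: (group, risk_score, predicted_default, actual_default).
records = [
    ("A", 0.2, False, False), ("A", 0.2, False, False),
    ("A", 0.8, True, True), ("A", 0.8, True, False),
    ("B", 0.2, False, True), ("B", 0.2, False, False),
    ("B", 0.8, True, True), ("B", 0.8, False, True),
]

def error_rates(rows):
    """True positive rate and false positive rate per group."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for group, _score, predicted, actual in rows:
        if actual:
            key = "tp" if predicted else "fn"
        else:
            key = "fp" if predicted else "tn"
        counts[group][key] += 1
    return {
        g: {
            "tpr": c["tp"] / (c["tp"] + c["fn"]),
            "fpr": c["fp"] / (c["fp"] + c["tn"]),
        }
        for g, c in counts.items()
    }

def observed_rates(rows):
    """Observed default rate for each (group, risk score) bucket."""
    buckets = defaultdict(list)
    for group, score, _predicted, actual in rows:
        buckets[(group, score)].append(actual)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

rates = error_rates(records)
calibration = observed_rates(records)
```

Equal opportunity would require the `tpr` values to match across groups, and equalized odds would require both `tpr` and `fpr` to match. Calibration would require each observed rate in `calibration` to be close to its risk score in every group. This toy data satisfies none of them.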

There is a lack of consensus in the research community on an ideal statistical definition of fairness. In fact, there are impossibility results on achieving multiple fairness notions simultaneously ([Klei2016] [Chou2017]). As we noted previously, some researchers say that fairness is not a statistical concept.

No definition is perfect

Each statistical definition described above has counterarguments.

Treatment parity unfairly ignores real differences. [Corb2018] describes the case of the COMPAS score used to predict recidivism (whether someone will commit a crime if released from jail). After controlling for COMPAS score and other factors, women are less likely to recidivate. Thus, ignoring sex in this prediction might unfairly punish women. Note that the Equal Credit Opportunity Act legally mandates treatment parity: “Creditors may ask you for [protected class information like race] in certain situations, but they may not use it when deciding whether to give you credit or when setting the terms of your credit.” Thus, [Corb2018] implies that this sort of unfairness is enshrined in law.

Impact parity doesn’t ensure fairness (people argue against quotas), and can cripple a model’s accuracy, harming the model’s utility to society. [Hard2016] discusses this issue (using the term “demographic parity”) in its introduction.

Corbett et al. [Corb2018] argue at length that classification parity is naturally violated: “when base rates of violent recidivism differ across groups, the true risk distributions will necessarily differ as well — and this difference persists regardless of which features are used in the prediction.”

They also argue that calibration is not sufficient to prevent unfairness. Their hypothetical example is a bank that gives loans based solely on the default rate within a zip code, ignoring other attributes like income. Suppose that (1) within zip code, white and black applicants have similar default rates; and (2) black applicants live in zip codes with relatively high default rates. Then the bank’s plan would unfairly punish creditworthy black applicants, but still be calibrated.
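The zip code example can be worked through with numbers. Everything below is hypothetical: the applicant counts, the default counts, and the 0.25 approval cutoff are assumptions chosen so that the two conditions in the example hold.

```python
# Hypothetical counts: (zip_code, group) -> (applicants, defaults).
population = {
    ("zip_low", "white"): (90, 9),    # 10% default
    ("zip_low", "black"): (10, 1),    # 10% default
    ("zip_high", "white"): (10, 4),   # 40% default
    ("zip_high", "black"): (90, 36),  # 40% default
}

def zip_default_rate(zip_code):
    """Overall default rate in a zip code, ignoring group."""
    applicants = sum(n for (z, _g), (n, _d) in population.items() if z == zip_code)
    defaults = sum(d for (z, _g), (_n, d) in population.items() if z == zip_code)
    return defaults / applicants

# The bank's risk score depends only on zip code.
score = {z: zip_default_rate(z) for z in ("zip_low", "zip_high")}

# Calibration check: for every group, the default rate matches the score.
calibrated = all(
    abs(d / n - score[z]) < 1e-9 for (z, _g), (n, d) in population.items()
)

def denied_non_defaulters(group, cutoff=0.25):
    """Share of a group's applicants who would have repaid but are denied."""
    total = sum(n for (_z, g), (n, _d) in population.items() if g == group)
    denied_good = sum(
        n - d
        for (z, g), (n, d) in population.items()
        if g == group and score[z] >= cutoff
    )
    return denied_good / total

white_denied = denied_non_defaulters("white")
black_denied = denied_non_defaulters("black")
```

In this made-up population the score is calibrated for both groups, yet 54% of black applicants are creditworthy and denied, compared with 6% of white applicants.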

Conclusion

In summary, fairness likely has no single measure. We took a whirlwind tour of four statistical definitions, two motivated by history and two more recently motivated by machine learning, and summarized the counterarguments to each.

This also means it is challenging to automatically decide if an algorithm is fair. Open-source fairness-measuring packages reflect this by offering many different measures.

However, this doesn’t mean we should ignore statistical measures. They can give us an idea of whether we should look more carefully. Food for thought. We should feed our brain well, it being the most likely to make the final call.

(A note: this subject is rightfully contentious. Our intention is to add to the conversation in a productive, respectful way. We welcome feedback of any kind.)

Thanks to Krishnaram Kenthapadi, Zack Lipton, Luke Merrick, Amit Paka, and Krishna Gade for their feedback.

References

  • [Chou2017] Chouldechova, Alexandra. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” Big Data 5, no. 2 (June 1, 2017): 153–63. https://doi.org/10.1089/big.2016.0047.
  • [Corb2018] Corbett-Davies, Sam, and Sharad Goel. “The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning.” ArXiv:1808.00023 [Cs], July 31, 2018. http://arxiv.org/abs/1808.00023.
  • [Fair2018] “Fairness and Machine Learning.” Accessed April 9, 2019. https://fairmlbook.org/.
  • [Hard2016] Hardt, Moritz, Eric Price, and Nathan Srebro. “Equality of Opportunity in Supervised Learning.” ArXiv:1610.02413 [Cs], October 7, 2016. http://arxiv.org/abs/1610.02413.
  • [Klei2016] Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” ArXiv:1609.05807 [Cs, Stat], September 19, 2016. http://arxiv.org/abs/1609.05807.
  • [Lipt2017] Lipton, Zachary C., Alexandra Chouldechova, and Julian McAuley. “Does Mitigating ML’s Impact Disparity Require Treatment Disparity?” ArXiv:1711.07076 [Cs, Stat], November 19, 2017. http://arxiv.org/abs/1711.07076.