Counterfactual Explanations vs. Attribution-based Explanations

This post is co-authored by Aalok Shanbhag and Ankur Taly.

As “black box” machine learning models spread to high stakes domains (e.g., lending, hiring, and healthcare), there is a growing need for explaining their predictions from end-user, regulatory, operations, and societal perspectives. Consequently, practical and scalable explainability approaches are being developed at a rapid pace. 


At Fiddler, we’re helping enterprises get peace of mind with Explainable AI by bringing transparency and thereby trust to all stakeholders using AI. Explainability of AI is at the core of our model analytics, model monitoring and model governance offerings that provide this visibility and insight.

A well-known family of explainability methods is based on attributing a prediction to input features. SHAP, LIME, and Integrated Gradients are some of the popular methods from this family.  A second emerging family of methods is counterfactual explanations. These methods explain a prediction by examining feature perturbations (counterfactuals) that lead to a different outcome. Google’s What-If tool is an interactive tool for constructing counterfactual explanations.

At a high level, both approaches look similar as they internally examine counterfactuals. Yet the two yield very different explanations. This has been a cause of confusion. In the rest of this post, we address this confusion by contrasting the two approaches as well as some of the practical challenges associated with them. We will begin with counterfactual explanations.

Counterfactual Explanations

Such explanations answer the question: how should the input change to obtain a different (more favorable) prediction? For instance, one could explain a credit rejection by saying: ‘Had you earned $5,000 more, your request for credit would have been approved.’ They were brought to the fore by Wachter et al. in 2017, who argued that counterfactual explanations are well aligned with the requirements of the European Union’s General Data Protection Regulation (GDPR).

Counterfactual explanations are attractive as they are easy to comprehend, and can be used to offer a path of recourse to end-users receiving unfavorable decisions. For this reason, some researchers suggest that counterfactual explanations may better serve the intended purpose of adverse action notices.

Obtaining a counterfactual explanation involves identifying the closest point to the input at hand that results in a different prediction. While this sounds simple, there are several challenges in setting up and solving this optimization problem.


The first challenge is defining “closest”. This is tricky because features vary on different scales, and the cost of a change may vary non-linearly with the feature value. For instance, ‘Income’ may vary in the tens of thousands, while ‘FICO’ varies only in the tens or hundreds. While some approaches (e.g., this and this) suggest measuring cost in terms of the shifts over a data distribution, other approaches (e.g., this) suggest relying on an expert to supply domain-specific distance functions.
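To make the effect of feature scales concrete, here is a minimal sketch of a brute-force counterfactual search over a small grid. The scoring rule, the grids, and the per-feature scales are all hypothetical and chosen only for illustration; in practice the scales might come from the spread of each feature in the training data.

```python
from itertools import product

# Hypothetical toy model: approve when a linear score crosses a threshold.
def approve(income, fico):
    return income + 100 * fico >= 110000

def scaled_l1(x, y, scale):
    # L1 distance where each feature's change is divided by its scale.
    return sum(abs(x[k] - y[k]) / scale[k] for k in x)

def closest_counterfactual(x, income_grid, fico_grid, scale):
    # Exhaustively scan the grid and keep the cheapest approved point.
    best, best_cost = None, float("inf")
    for income, fico in product(income_grid, fico_grid):
        if approve(income, fico):
            candidate = {"income": income, "fico": fico}
            cost = scaled_l1(x, candidate, scale)
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost

applicant = {"income": 40000, "fico": 640}   # currently rejected
income_grid = range(40000, 80001, 1000)
fico_grid = range(640, 801, 10)

# Assumed scales: how much each feature typically varies in the data.
data_scale = {"income": 15000.0, "fico": 60.0}
cf, cost = closest_counterfactual(applicant, income_grid, fico_grid, data_scale)

# Raw units: a one-point FICO change counts the same as one dollar.
raw_scale = {"income": 1.0, "fico": 1.0}
raw_cf, raw_cost = closest_counterfactual(applicant, income_grid, fico_grid, raw_scale)

print(cf, cost)          # suggests raising income to 46000
print(raw_cf, raw_cost)  # suggests raising FICO to 700
```

The same model and the same applicant yield two different "closest" counterfactuals depending only on how distance is measured, which is why the choice of distance function deserves care.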

Second, solving this optimization problem is computationally challenging, especially in the presence of categorical features which turn the problem into a combinatorial optimization problem. For instance, even in the case of linear models, optimally solving the problem requires integer programming. For tree ensemble models (e.g., boosted trees, random forests), a result from 2016 shows that identifying any perturbation, minimal or otherwise, that results in a certain outcome is NP-complete. Furthermore, for “black-box” models, for which the mathematical relationship between the prediction and input is hidden, one may only be able to afford approximate solutions.

Third, for the suggested recourse to be practical, the perturbation must be feasible. Some approaches (e.g., this) combat this problem by modeling the data manifold and restricting perturbation to lie on it. However, a recent paper by Barocas et al. shows that this may still be insufficient. For recourse to be practical, one must take into account the real-world feasibility of the suggested feature changes as well as the causal dependencies between features. For instance, acting on a suggested recourse of “increasing income” almost always results in a change to “job tenure”. (Waiting for a raise at the current job increases tenure while obtaining a new high-paying job resets tenure.) This unforeseen change may adversely affect the prediction despite following the suggested recourse.
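The income and tenure example can be sketched in a few lines. The scoring rule and the numbers below are hypothetical; the point is only that a recourse computed with all other features held fixed can fail once a dependent feature moves.

```python
# Hypothetical scoring rule: both income and job tenure (in years) help.
def approve(income, tenure_years):
    return income + 5000 * tenure_years >= 70000

income, tenure_years = 50000, 3
currently_approved = approve(income, tenure_years)        # rejected

# Recourse computed with tenure held fixed: "earn 5,000 more".
recourse_on_paper = approve(income + 5000, tenure_years)  # approved

# Path 1: a new, higher-paying job resets tenure to zero.
after_job_switch = approve(income + 5000, 0)              # rejected again

# Path 2: waiting a year for a raise also increases tenure.
after_raise = approve(income + 5000, tenure_years + 1)    # approved

print(currently_approved, recourse_on_paper, after_job_switch, after_raise)
```

Both paths follow the suggested recourse to the letter, yet only one of them leads to approval.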

Attribution-based Explanations

Such explanations answer the question: what were the dominant features that contributed to this prediction? The explanation quantifies the impact (called attribution) of each feature on the prediction. For instance, for a lending model, the explanation may point out that a “reject” prediction was due to income being low and the number of past delinquencies being high. There are a variety of well-known attribution methods; SHAP, LIME, and Integrated Gradients are some of the popular ones.

Similar to counterfactual explanation methods, most attribution methods also rely on comparing the input at hand to one or more counterfactual inputs (which are often referred to as “reference points” or “baselines”). However, the role of counterfactuals here is to tease apart the relative importance of features rather than to identify new instances with favorable predictions.

SHAP, which is based on the concept of Shapley Values from game theory, operates by considering counterfactuals that “turn off” features and noting the marginal effect on the prediction. In other words, we note the change in prediction when a feature is made absent. This is done for all combinations of features, and a certain weighted average of the marginal effects is computed. Integrated Gradients examines the gradients at all counterfactual points that interpolate between the input at hand and a certain “all off” input (i.e., one where features are turned off). LIME operates by examining counterfactuals that randomly perturb features in the vicinity of the input.
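The weighted average over feature combinations can be written out directly for a small model. The sketch below computes exact Shapley values by brute force, "turning off" a feature by replacing it with a single baseline value. It is not how the SHAP package is implemented (brute force is exponential in the number of features), and the model and baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    n = len(x)

    def value(present):
        # Features outside `present` are "turned off" by using the baseline.
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Standard Shapley weight for a coalition of this size.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                present = set(subset)
                phi[i] += weight * (value(present | {i}) - value(present))
    return phi

# Hypothetical model with an interaction between its two features.
def model(z):
    income, delinquencies = z
    return 2 * income - 3 * delinquencies + income * delinquencies

x = [4, 2]
baseline = [1, 0]
phi = shapley_values(model, x, baseline)
print(phi)  # attributions sum to model(x) - model(baseline)
```

For this input the attributions are 9 for the first feature and -1 for the second, which together account for the full difference of 8 between the prediction at the input and at the baseline.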


Attribution methods come with challenges of their own. First, defining counterfactuals that “turn off” a feature is tricky. For instance, what does it mean to turn off the “income” feature? Setting it to a zero value creates a very atypical counterfactual. Some approaches (e.g., this) suggest using the training distribution median while others (including the SHAP package) suggest randomly drawing samples from a distribution (more on this below).

Second, attributions are highly sensitive to the choice of counterfactuals. In a recent preprint, we showed how various Shapley-value-based attribution methods choose different reference distributions, which leads to drastically different (and sometimes misleading) attributions; see also: The many Shapley values for model explanation. A similar point is noted in this paper about the sensitivity of Integrated Gradients to the choice of baseline. A suggested alternative is to average across randomly sampled baselines.
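Baseline sensitivity is easy to reproduce on a toy function. The sketch below approximates Integrated Gradients with a midpoint Riemann sum along the straight line from the baseline to the input. The model, its hand-written gradient, and both baselines are hypothetical; a real implementation would obtain gradients from an automatic differentiation framework.

```python
def integrated_gradients(grad, x, baseline, steps=100):
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        # Midpoint of the k-th segment of the path from baseline to x.
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]

# Hypothetical model with a pure interaction term, and its gradient.
def model(z):
    return z[0] * z[1]

def model_grad(z):
    return [z[1], z[0]]

x = [3.0, 2.0]
ig_zero = integrated_gradients(model_grad, x, [0.0, 0.0])
ig_other = integrated_gradients(model_grad, x, [0.0, 2.0])
print(ig_zero)   # roughly [3, 3]
print(ig_other)  # roughly [6, 0]
```

Both baselines have a prediction of zero, so the attributions sum to the same total in each case, yet one baseline splits the credit evenly while the other assigns all of it to the first feature.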

Third, attribution-based methods are known to be highly sensitive to perturbations to the input. For instance, a small perturbation that does not affect the prediction may still alter the attribution. While this may arguably just be an artifact of how the model “reasons” internally, a recent paper points this out as a flaw of the attribution method.

Contrasting the two

The two explanations are fundamentally different and complementary.

Attributions quantify the importance of features for the current prediction, while counterfactual explanations show how the features should change to obtain a different prediction.

A feature highlighted by a counterfactual explanation may not always have a large attribution. For instance, if most candidates in the accepted class have zero capital gains income, then a candidate with zero capital gains income will have most of the attribution fall on features besides “capital gains income”. However, increasing capital gains income may be a valid recourse to obtaining a favorable prediction.

Similarly, a feature that has a large attribution may not be highlighted by a counterfactual explanation. First, several features may be immutable and therefore inapplicable for recourse, e.g., number of past delinquencies. Second, when features interact, changing a single feature alone may not suffice for recourse. For instance, to get a loan, the model may require a credit score > 650 and income > 20k. For a candidate satisfying neither condition, perturbing these features one at a time will not yield a favorable outcome.
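The interaction case can be checked directly. The rule below mirrors the thresholds in the example above; the applicant's values and the size of each change are hypothetical.

```python
# Loan rule from the example: both conditions must hold.
def approve(credit_score, income):
    return credit_score > 650 and income > 20000

credit_score, income = 600, 15000   # fails both conditions

only_score_changed = approve(700, income)         # still rejected
only_income_changed = approve(credit_score, 25000)  # still rejected
both_changed = approve(700, 25000)                # approved

print(only_score_changed, only_income_changed, both_changed)
```

A search that perturbs one feature at a time would report no recourse here, even though a joint change to both features flips the prediction.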

Given their complementary nature, we support both kinds of explanations in Fiddler. Together they offer a much more complete picture of the model-data relationship at a datum, one giving an actionable insight that can help with recourse, and the other providing visibility into the decision making process of the model.