It has been a year since we founded Fiddler Labs, and the journey so far has been incredible. I’m very excited to announce that we’ve raised our Series A funding round at $10.2 million led by Lightspeed Venture Partners and Lux Capital, with participation from Haystack Ventures and Bloomberg Beta. Our first year in business has been super awesome, to say the least. We’ve built a unique Explainable AI Engine, put together a rock-solid team, and we’ve brought in customers both from the Enterprise and startup worlds.
As we ramped up Fiddler over the last year, the one thing that stood out was how many Enterprises choose not to deploy AI solutions today due to the lack of explainability. They understand the value AI provides, but they struggle with the ‘why’ and the ‘how’ when using traditional black-box AI because most applications today are not equipped with Explainable AI. It’s the missing link when it comes to AI solutions making it to production.
Why Explainable AI? We get this a lot since Explainable AI is still not a household term and not many companies understand what this actually means. So I wanted to ‘explain’ Explainable AI with a couple of examples.
Let’s consider the case where an older customer (age 65+) wants a credit line increase and reaches out to the bank to request this. The bank wants to use an AI credit lending model to run this query. The model returns a low score and the customer’s request is denied. The bank representative dealing with the customer has no idea why the model denied the request. And when they follow up internally, they might find that there is a proxy bias built into the model because of the lack of examples in the training data representing older customers. Before you get alarmed, this is a hypothetical situation, as banks go through a very diligent Model Risk Management process as per guidelines specified in SR 11-7, ECOA, and FCRA to vet their models before they launch them for usage. However, the tools and processes have been built for the much simpler quantitative models that they have been using for decades to process these requests. As banks and other financial institutions look to move towards AI-based underwriting and lending models, they need tools like Fiddler. If the same AI model were run through Fiddler’s Explainable AI Engine, the team would quickly realize that the loan was denied because this older customer is considered an outlier. Explainability shows that the training data used in the model was age-constrained: limited to an age range of 20-60 year olds.
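To make the outlier idea concrete, here is a minimal sketch of one simple check: flagging any feature of an incoming request that falls outside the range seen in the training data. The function name, feature names, and sample values are all hypothetical, and this is an illustration of the concept rather than how Fiddler’s engine works.

```python
import numpy as np

def out_of_range_features(train_rows, query_row, feature_names):
    """Return the features of query_row that fall outside the training min/max."""
    train = np.asarray(train_rows, dtype=float)
    query = np.asarray(query_row, dtype=float)
    lows, highs = train.min(axis=0), train.max(axis=0)
    flagged = {}
    for name, value, low, high in zip(feature_names, query, lows, highs):
        if value < low or value > high:
            # Record the offending value alongside the training range.
            flagged[name] = (float(value), float(low), float(high))
    return flagged

# Hypothetical training data: columns are age and annual income (in $1000s).
train_rows = [[20, 35], [34, 60], [47, 82], [60, 75]]
applicant = [68, 70]  # hypothetical 68-year-old applicant

flags = out_of_range_features(train_rows, applicant, ["age", "income"])
print(flags)
```

Here only the age is flagged, since 68 lies outside the 20-60 range in the training rows while the income is within range. A production check would likely look at the joint distribution of features rather than one feature at a time.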
Let’s consider the case where a deep neural network AI model is used to make cancer predictions with the data from chest X-rays. Using this trained data, the model predicts that certain X-Rays are cancerous. We can use an Explainability method that can highlight regions in the X-ray, to ‘explain’ why an X-ray was flagged as cancerous. What was discovered here is very interesting – the model predicted that the image was cancerous because of the radiologist’s pen markings rather than the actual pathology in the image. This shows just how important it is to have explanations in any AI model. It tells you the why behind any prediction so you, as a human, know exactly why that prediction was made and can course-correct when needed. In this case, because of explanations, we realized that the prediction was based on something completely irrelevant to actual cancer.
Explainability gives teams visibility into the inner workings of the model, which in turn allows them to fix things that are incorrect. As AI penetrates our lives more deeply, there is a growing ask to make its decisions understandable by humans, which is why we’re seeing regulations like GDPR stating that customers have ‘a right to explanations for any automated decision’. Similar laws are being introduced in countries like the United States. All these regulations are meant to help increase trust in AI models, and explainable AI platforms like Fiddler can provide this visibility so that companies accelerate their adoption of AI that is not only efficient but also trustworthy.
What we do at Fiddler
At Fiddler, our mission is to unlock trust, visibility, and insights for the Enterprise by grounding our AI Engine in Explainability. This ensures anyone who is affected by AI technology can understand why decisions were made and ensure AI outputs are ethical, responsible, and fair.
We do this by providing
AI Awareness: understand, explain, analyze, and validate the why and how behind AI predictions to deliver explainable decisions to the business and its end users
AI Performance: continuously monitor production, test, or training AI models to ensure high-performance while iterating and improving models based on explanations
AI Compliance: with AI regulations becoming more common, ensure industry compliance with the ability to audit AI predictions and track and remove inherent bias
Fiddler is a Pluggable Explainable AI Engine – what does this mean? Fiddler works across multiple datasets and custom-built models. Customers bring in their data that is stored in any platform – Salesforce, HDFS, Amazon, Snowflake, and more – and/or their custom-built models built using Scikit-Learn, XGBoost, Spark, TensorFlow, PyTorch, Sagemaker, and more, to the Fiddler Engine. Fiddler works on top of the models and data to provide explainability on how these models are working with trusted insights through our APIs, reports, and dashboards.
Our Explainable AI Engine meets the needs of multiple stakeholders in the AI-lifecycle: from data scientists and business analysts to model regulators and business operations teams.
Our rock-solid team
Our team comes from companies and universities like Facebook, Google Brain, Lyft, Twitter, Pinterest, Microsoft, Nutanix, Samsung, Georgia Tech, and Stanford. We’re working with experts from industry and academia to create the first true Explainable AI Engine to solve business risks for companies around user safety, non-compliance, AI black-box, and brand risk.
As the Engineering Lead on Facebook’s AI-driven News Feed, I know just how useful explanations were for Facebook users as well as internal teams: from engineering all the way up to senior leadership. My co-founder, Amit Paka, had a similar vision when he was working on AI-driven product recommendations in shopping apps at Samsung.
Since our inception back in October 2018, we’ve significantly grown the team to include other Explainability experts like Ankur Taly, who was one of the co-creators of the popular Integrated Gradients explainability method when he was at Google Brain.
As we continue our hyper-growth trajectory, we’re continuing to expand both the engineering and business teams and are hiring more experts to ensure the Fiddler Explainable AI Engine continues to be the best in its category.
We’re super excited to continue this mission and ensure that AI is explainable in every enterprise!
Want to join our team? We’re hiring! Check our open positions here.
As artificial intelligence (AI) adoption grows, so do the risks of today’s typical black-box AI. These risks include customer mistrust, brand risk and compliance risk. As recently as last month, concerns about AI-driven facial recognition that was biased against certain demographics resulted in a PR backlash.
With customer protection in mind, regulators are staying ahead of this technology and introducing the first wave of AI regulations meant to address AI transparency. This is a step in the right direction in terms of helping customers trust AI-driven experiences while enabling businesses to reap the benefits of AI adoption.
This first group of regulations relates to the understanding of an AI-driven, automated decision by a customer. This is especially important for key decisions like lending, insurance and health care but is also applicable to personalization, recommendations, etc.
The General Data Protection Regulation (GDPR), specifically Articles 13 and 22, was the first regulation about automated decision-making that states anyone given an automated decision has the right to be informed and the right to a meaningful explanation. According to clause 2(f) of Article 13:
“[Information about] the existence of automated decision-making, including profiling … and … meaningful information about the logic involved [is needed] to ensure fair and transparent processing.”
One of the most frequently asked questions is what the “right to explanation” means in the context of AI. Does “meaningful information about the logic involved” mean that companies have to disclose the actual algorithm or source code? Would explaining the mechanics of the algorithm be really helpful for the individuals? It might make more sense to provide information on what inputs were used and how they influenced the output of the algorithm.
For example, if a loan application or insurance claim is denied using an algorithm or machine learning model, under Articles 13 and 22, the loan or insurance officer would need to provide specific details about the impact of the user’s data to the decision. Or, they could provide general parameters of the algorithm or model used to make that decision.
Similar laws working their way through the U.S. state legislatures of Washington, Illinois and Massachusetts are:
WA House Bill 1655, which establishes guidelines for “the use of automated decision systems in order to protect consumers, improve transparency, and create more market predictability.”
MA Bill H.2701, which establishes a commission on “automated decision-making, artificial intelligence, transparency, fairness, and individual rights.”
IL HB3415, which states that “predictive data analytics in determining creditworthiness or in making hiring decisions…may not include information that correlates with the race or zip code of the applicant.”
Fortunately, advances in AI have kept pace with these needs. Recent research in machine learning (ML) model interpretability makes compliance with these regulations feasible. Cutting-edge techniques like Integrated Gradients from Google Brain along with SHAP and LIME from the University of Washington enable unlocking the AI black box to get meaningful explanations for consumers.
Ensuring fair automated decisions is another related area of upcoming regulations. While there is no consensus in the research community on the right set of fairness metrics, some approaches like equality of opportunity are already required by law in use cases like hiring. Integrating AI explainability in the ML lifecycle can also help provide insights for fair and unbiased automated decisions. Assessing and monitoring these biases, along with data quality and model interpretability approaches, provides a good playbook towards developing fair and ethical AI.
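As a rough illustration of what monitoring such a metric could look like, the sketch below uses one common formalization of equality of opportunity from the ML research literature: comparing true positive rates across groups. Treating that as the relevant definition is an assumption, and the labels, predictions, and group names are made up for the example.

```python
import numpy as np

def equal_opportunity_gap(y_true, y_pred, group):
    """Return per-group true positive rates and the largest gap between them.

    Assumes binary labels/predictions and at least one positive per group.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rates = {}
    for g in np.unique(group):
        # Among truly qualified members of group g, how many were approved?
        mask = (group == g) & (y_true == 1)
        rates[str(g)] = float((y_pred[mask] == 1).mean())
    gap = max(rates.values()) - min(rates.values())
    return rates, gap

# Hypothetical data: 1 = qualified (y_true) or approved (y_pred).
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]
group = ["A"] * 5 + ["B"] * 5

rates, gap = equal_opportunity_gap(y_true, y_pred, group)
print(rates, gap)
```

In this toy data, qualified members of group A are approved 75% of the time and those of group B 25% of the time, a gap of 0.5 that a monitoring process would want to surface.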
The recent June 26 US House Committee hearing is a sign that financial services need to get ready for upcoming regulations that ensure transparent AI systems. All these regulations will help increase trust in AI models and accelerate their adoption across industries toward the longer-term goal of trustworthy AI.
We recently chatted with Ganesh Nagarathnam, Director of Analytics and Machine Learning Engineering, at S&P Global. Take a listen to the podcast below or read the transcript. (Transcript edited for clarity and length.)
Fiddler: Welcome to Fiddler’s explainable AI podcast. I’m Anusha Sethuraman. And today I have with me on the podcast, Ganesh Nagarathnam, from S&P Global. He’s the director for analytics and machine learning engineering. Ganesh, thank you so much for joining us. We’re super excited to have you. Could you please tell us a little bit about yourself and what you do at S&P Global?
Ganesh: Thank you, Anusha for inviting me to the podcast. I’m currently working with S&P Global Market Intelligence line of business as a director for analytics and machine learning engineering. I have 20 plus years of experience in building our distributed and scalable software systems on a variety of technologies. From the likes of Java2, Java3 all the way to Java9. And right now, I’m heavily into the big data ecosystem on the public cloud. I have had opportunities to work with great firms from Verizon, Verizon Wireless, Goldman Sachs, JP Morgan, and now with S&P Global Market Intelligence.
Fiddler: Wonderful. Pulling up on that point of big data a little bit. How do you use Data and AI in your organization today?
Ganesh: So, at S&P Global I work with the innovation team and the product team where we work on an idea, or an innovation gets germinated by our interactions with our customers. Then the idea gets into our analytics team, which is my team, where we try to build the MVP. Our job is to wire the necessary technology stack in accordance with corporate standards and get the product out to market quickly. My primary focus is to get the machine learning models that are being developed into production as quickly as possible. So, having said that we use AI extensively. We build an idea; we build a model from simple to complex and we try to get that out. At S&P Global, it’s all about data. We have a humongous amount of data and we think about how we can provide actionable intelligence to our clients with the right amount of information at the right time. That’s our primary goal.
Fiddler: I’m curious: what is a typical process for your team for creating AI solutions? You mentioned you come up with an idea and innovate on it. Can you touch upon a few of the details there in terms of how you go through that entire process of getting it into production?
Ganesh: So, as we discussed, we have lots of data and that’s a sufficient reason for us to explore or go down the AI route. Right from predictive analytics to interactive analytics and from visual analytics to simple data visualization, all we’re trying to do here is we’ll have to leverage that momentum to get to market quickly.
So, the typical process would be once we identify an innovative idea, we go to the drawing board and we discuss the product needs and we try to figure out what the appropriate technology stack is like. And then we invest 20% of our effort to deliver 80% value for our clients.
This means that we don’t want to iterate for too long and we involve our customers end-to-end when innovating. We then get this out to the product team and they take it to their customers to validate and ask for customer feedback. Then the process gets funneled with appropriate funding. So, the 20% of effort we’ve put into it doesn’t go to waste.
Fiddler: Great. Thank you for that insight. You did mention technology stack. So, I wanted to dig into that. What are the core AI and ML tools you use today and what are the main reasons why you use them?
Ganesh: That’s a great question. To begin with we are migrating into the public cloud and we have a lot of home-grown tools and external tools like Domino and AWS. On the AWS side, we use ML pipeline. We also use the Spark ML pipeline to do our preliminary feature engineering and then build the entire stack. Historically – if you look at Gartner’s report – around 68 to 70% of models being developed don’t get to see the production phase, meaning that they are sitting somewhere as Jupyter notebooks on desktops. So there is no set of well-defined processes around how you take an idea, develop a model, and then how you deliver it into production. That was the missing piece there.
Fiddler: I’m curious – you mentioned a lot of these models are just sitting there and that’s part of the challenge. What are some of the core challenges that your team is facing when you’re taking this AI solution all the way from inception to production?
Ganesh: The main focus for us is around how quickly we can show the dollar value by building the MVP. What we do is when we identify a solution, we remove our organizational hats (this organization or that organization) and we try to address the problem with a holistic approach. We figure out the appropriate solution with an open-minded approach. Once we find the right solution, we look at the boundaries or the bounding boxes in which we operate. Every organization has their specific set of boundaries. Then we look at those boundaries and see how we can factor in the solution which we are planning to build into the existing boundaries. We also take a closer look at these boundaries – are they legacy boundaries and is there something that can be tweaked so that the solution can be implemented seamlessly? That to me is a big challenge. On one side you’ve got to get to market pretty quickly, and on the other side, you have to work with the boundaries that you have within an organization. So how do you balance these two? That’s a challenge for us.
Fiddler: What tools do you think are lacking today to fill these gaps in the process?
Ganesh: We use Scrum in our day to day project stability. When it comes down to machine learning you have to be truly agile when building machine learning products. The reason why I’m saying that is suddenly with machine learning coming in and meeting software engineering, everybody is talking about ML Ops. How do you get to show the value by involving the product team right from the outset? How do you iterate faster? But the more important thing is how do we iterate smarter? That is the key to me.
To me the data science team should also be empowered to get the model from inception to production. If I really look at it, the half-life of a model is determined by its north star metric. The moment these metrics go off track, you will have to retrain the model within weeks if not days. So, do we have that edge? Are we ready? That’s the key thing to me. I wouldn’t call it as a gap – it is something which we are working on to streamline the process. And that is why we as an organization are marching full steam into ML Ops. We have defined our core set of drivers which are key to achieve a successful ML Ops culture within the organization.
Fiddler: Ganesh we didn’t spend a lot of time talking about exactly what S&P Global does for your clients. Can you tell me a bit about that in terms of things like what sort of risk and trust and safety issues you’re dealing with?
Ganesh: Risk – that brings us to the core concept of explainability. Right now, we haven’t seen any adverse effect by not being able to explain our models, but eventually we’ll all get there. S&P Global has four lines of business. One is Platts, then we have Indices, and then Market Intelligence and the Ratings division. I am part of the S&P Global Market Intelligence team. Our main focus is to gather raw data transcripts and generate sentiment scores and provide actionable intelligence to our clients.
But when it really comes down to the risks in building these machine learning models, I don’t think organizations, and not just S&P Global, I don’t think organizations across the board are ready to take that leap. With all of the regulations coming up like GDPR regulations, it is so important for explainability to be a key factor in your AI. Think about it – if you are making a prediction with your AI, and the customer is going to ask you why, and you’re not in a position to explain that, then that would cause the trust to go whacky. And on the other hand, you don’t want your models to introduce any bias. Right now, as part of the ML Ops framework and design thinking, we wanted to incorporate explainability right in the design phase, and not at the end of the machine learning model’s lifecycle. So, you don’t want the machine learning model to go into production and then figure out explainability there.
Fiddler: I’m sure not too many people have heard this concept of explainable AI – XAI – as you mentioned. So, can you tell us a little bit about this black box AI model as it exists today and the need for something like explainable AI?
Ganesh: Eventually, whenever we build systems in traditional software engineering, we have people – as a software developer when I started my career and get queries from the client, I go and then look into the database or look into the code and then figure out what was the reason – as simple as that. To me the same principle holds good when we go into the machine learning and AI world. Why did the AI system make a specific prediction or a decision or why didn’t the AI system do something else? When did the AI system that we built succeed or fail, or how can the AI systems correct the errors that are coming out of it? Those are some things which resonated with me.
To me there are traditional models like for example the classic random forest models or any of the Bayesian algorithms – these can be explained, but if you look at the core neural networks, they’re a little bit difficult. When you talk about deep, layered neural networks with more than a million parameters – even ResNet-50 or VGG-16 have tens of millions to over 100 million parameters. There’s hope that sufficient progress can be made so that we can have both power and accuracy for our machine learning models to predict something, and at the same time we don’t lose the required transparency and explainability.
And that to me is very important – it’s good for the business. The community has already started talking about it in one form or another. They are visualizing what-if scenarios during the design phase. That’s what we do. And that has become our core element for our ML Ops journey. We know that explainability is important and we might decide to work on that later. Sometimes customers might need XAI upfront – we don’t know. So, this is where we need to have a tradeoff. It’s a balance between the fine art of the power and accuracy of your model predictions and transparency.
Fiddler: What do you feel about how you might have to build some of these things? How do you do it – are you building these things or are you looking for external solutions to help you include explainable AI in the design phase?
Ganesh: There have been some interesting conversations around this but we haven’t given serious thought about it. When we iterate on new projects, we’re engaging product owners and the customers and then asking them if this needs to be explainable. Not every model needs to be explainable. You don’t want to invest in explainability just for the sake of it. But if it really comes down to a project which has strict GDPR regulations, it’s better to ask all the right questions upfront during the design phase. You may not have answers but startups like Fiddler might have answers to explainability. As data scientists and engineers representing these bigger firms, it is so important for us to ask those questions upfront in the design phase and then if needed, put in the right thought process and engage the right people in a discussion. And think about how you would explain it – do you want some kind of visual dashboard? If customers were to ask ‘why did my loan get rejected’ because of this or what are the important parameters that may go into this prediction. You have to go back and then explain it. You don’t want to lose your customer because of the time it takes for you to explain it. We are not there yet, but eventually we’ll get there.
Fiddler: It’s getting important especially with all these regulations you mentioned. I’m curious: it seems like you might not have come across a situation yet where this black box AI has negatively impacted your organization or have you already come across a situation like this?
Ganesh: No, not really, but I’m thinking ahead. The reason is when you really look at credit risk or taking a step away from the financial industry – let’s talk about the health industry. If you’re going to make serious predictions which have a human impact, it can become extremely problematic not only for the lack of transparency, but also for possible biases which are inherited by the algorithms. This could come from human prejudices or artifacts hidden in the training data that can lead to unfair or wrong decisions. How do you uncover this? Right now, every organization, every line of business, every sub project in these businesses has some amount of data science going on. But they might not get to see the bigger picture. So, as a technology leader, my job is to ask these questions upfront – how do we learn about explainability? It’s through my interactions and attendance in industry conferences. And that’s when you get to understand what’s going on in the space.
Fiddler: As we come to the close of this episode what do you think are some of the core things that teams will need to think about?
Ganesh: The core things I would say an organization should think right now – we need to be thinking about ML Ops. That’s where the heart is right now. We have machine learning models and human brains and we need to figure out how to take this idea, iterate quickly and get to market. The third piece is explainability. That’s where we have to be upfront in asking the right set of questions during the design phase and then take it forward from there. Let’s try to get better with ML Ops so that people are able to see value in ideas that are generated and involve the customer at every phase. Needless to say, explainability will kick in around the corner with regulations coming up, and then you won’t have a choice.
Fiddler: Well thank you so much Ganesh for sharing all your insights on this. We really appreciate your time. Thanks for joining us today.
Ganesh: Thank you so much, Anusha. It’s been a pleasure.
Fiddler will be at the very first TwiML conference next week on October 1 & 2! It’s a new conference hosted by the amazing folks at TwiML, and we can’t wait to explore and learn about the latest and greatest for AI in the enterprise.
At Fiddler, our mission is to enable businesses to deliver trustworthy and responsible AI experiences by unlocking the AI black box.
Where to find us
1) October 1, 11:20-11:45am, Robertson 2
Session: Why and how to build Explainability into your ML workflow
Join our CEO & Founder, Krishna Gade to learn how Explainable AI is the best way for companies to deal with business risks associated with deploying AI – especially in regulation and compliance heavy industries. Krishna comes from a data and explainability background having led the team that built Explainable AI at Facebook.
2) October 1 & 2, Community Hall, Booth #6 (see our location on the map below)
Come chat with us about:
Why it’s important to provide transparent, reliable and accountable AI experiences
Risks associated with lack of visibility into AI behavior
How to understand, manage, analyze & validate models using explainability
Schedule a time to connect with us
If you’d like to set up a meeting beforehand, fill out this meeting form and we’ll be in touch to finalize dates & times. We’re excited to chat with you!
Fiddler’s very own Ankur Taly, Head of Data Science, will be speaking on September 12 on Explaining Machine Learning Models. Ankur is well-known for his contribution to developing and applying Integrated Gradients — a new interpretability algorithm for Deep Neural Networks. He has a broad research background and has published in several areas including Computer Security and Machine Learning. We hope to see you at his session!
At Fiddler, our mission is to enable businesses of all sizes to unlock the AI black box and deliver trustworthy and responsible AI experiences. Come chat with us about:
Risks associated with not having visibility into model outputs
Most innovative ways to understand, manage, and analyze your ML models
Importance of Explainable AI and providing transparent and reliable experiences to end users
Schedule a time to connect with us
If you’d like to set up a meeting beforehand, then fill out this meeting form and we’ll be in touch. We’re excited to chat with you!
Where to find us
September 11 & 12
We’ll be in the Innovator Pavilion: Booth #K10, so stop by and say hi!
As machine learning models get deployed to high stakes tasks like medical diagnosis, credit scoring, and fraud detection, an overarching question that arises is – why did the model make this prediction? This talk will discuss techniques for answering this question, and applications of the techniques in interpreting, debugging, and evaluating machine learning models.
We’re excited to introduce Anusha Sethuraman, the newest member of our team. Anusha joins us as our Head of Product Marketing.
Anusha comes from a diverse product marketing background across startups and enterprises, most recently on Microsoft’s AI Thought Leadership team where she spearheaded the team’s storytelling strategy with stories being featured in CEO and exec-level Keynotes. Before this, she was at Xamarin (acquired by Microsoft) leading enterprise product marketing where she launched Xamarin’s first decision-maker event and was instrumental in creating the integrated Microsoft + Xamarin story. And prior to that, she was at New Relic (pre-IPO) leading product marketing for New Relic’s mobile monitoring product.
Anusha believes in a world where AI is responsible, ethical, and understandable. In her own words:
“The idea of democratizing AI is great, but even better – democratizing AI that has ethics and responsibility inbuilt. Today’s AI-powered world is nowhere close to being trustworthy: we still run into everyday instances of not knowing the why and how behind the decisions AI generates. Fiddler’s bold ambitions to create a world where technology is built responsibly, where humanity is not only putting AI to the best use possible across all industries and scenarios but creating this ethically and responsibly right from the start is something I care about deeply. I’m very excited to be joining Fiddler to lead Product Marketing and work towards building an AI-powered world that is understandable, transparent, explainable, and secure.”
Last week, the Explainable AI Summit, hosted by Fiddler Labs, returned to discuss top-of-mind issues that leaders face when implementing AI. Over eighty data scientists and business leaders joined us at Galvanize to hear from the keynote speaker and Fiddler’s head of data science, Ankur Taly, and our distinguished panelists moderated by our CEO Krishna Gade:
Manasi Joshi, Engineering Director of ML Productivity, Google Brain
Reprising some topics from our summit in February, the H2 summit focused on explainability techniques and industry-specific challenges.
Takeaway #1: In financial services, companies are working through regulatory and technical hurdles to integrate machine learning techniques into their business model.
Financial services understand the potential of AI and want to adopt machine learning techniques. But they are reasonably wary of running afoul of regulations. If someone suspects a creditor has been discriminatory, the Federal Trade Commission explicitly suggests that he or she consider suing the creditor in federal district court.
Banks and insurance companies already subject models to strict, months-long validation by legal teams and risk teams. But if using opaque deep learning methods means forgoing certainty around model fairness, then these methods cannot be a priority.
However, some use cases are less regulated or not regulated at all, allowing financial services to explore AI integration selectively. Especially if regulators continue to accept that AI may never reach full explainability, AI usage in financial services will increase.
Takeaway #2: And across all industries, leaders are prioritizing trustworthiness of their models.
Most companies understand the risk to their brand and consumer trust if models go awry and make poor decisions. So leaders are implementing more checks before models are promoted to production.
Once in production, externally facing models generate questions and concerns from customers themselves. Business leaders are seeing the need to build explainability tools to address these questions surrounding content selection. Fortunately, many explainability tools are available in open source, like Google’s TCAV and TensorFlow Model Analysis.
And as automated ML platforms attract hundreds of thousands of users, platform developers are ramping up education about incorrect usage. Education is necessary but not sufficient: ML platforms should also let modelers inspect model behavior on subgroups of their choice, so they can see whether their models exhibit potential bias.
Takeaway #3: Integrated Gradients is a best-in-class technique to attribute a deep network’s (or any differentiable model’s) prediction to its input features.
A major component of explaining an AI model is the attribution problem. For any given prediction by a model, how can we attribute that prediction to the model’s features?
Currently, several approaches in use are ineffective. For example, an ablation study (dropping a feature and observing the change in prediction) is computationally expensive and misleading when features interact.
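To make the interaction problem concrete, here is a minimal sketch. The `or_model` function, the feature names `a` and `b`, and the choice of 0 as the "absent" value are all hypothetical and chosen only for illustration.

```python
def or_model(features):
    """Hypothetical model: predicts 1.0 if either feature is switched on."""
    return 1.0 if (features["a"] or features["b"]) else 0.0


def ablation_importance(model, x, absent_value=0):
    """Drop one feature at a time and record how much the prediction falls."""
    base = model(x)
    return {name: base - model({**x, name: absent_value}) for name in x}


x = {"a": 1, "b": 1}
importance = ablation_importance(or_model, x)
print(importance)
```

Ablating either feature alone leaves the prediction unchanged, so both get zero importance even though the prediction depends entirely on them. It also takes one extra model evaluation per feature.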
To define a better attribution approach, Ankur Taly and his co-creators first established the desirable criteria, or axioms. One axiom, for instance, is insensitivity: a variable that has no effect on the output should get no attribution. These axioms then uniquely define the Integrated Gradients method, which is described by the equation below.
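For reference, this is the usual statement of Integrated Gradients from the original paper. The notation is that paper's rather than anything taken from the talk: F is the model, x the input, x' the baseline input, and i indexes a feature.

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1}
  \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha
```

In practice the integral is approximated by a finite sum of gradients taken at evenly spaced points between the baseline and the input.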
Integrated gradients is easy to apply and widely applicable to differentiable models. Data science teams should consider this method to evaluate feature attribution inexpensively and accurately.
Thank you to Galvanize for hosting the event, to our panelists, and to our engaged audience! We look forward to our next in-person event, and in the meantime, stay tuned for our first webinar. For more information, please email email@example.com.
Two different explanation algorithm types, best in different situations.
Some of the most accurate predictive models today are black box models, meaning it is hard to really understand how they work. To address this problem, techniques have arisen to understand feature importance: for a given prediction, how important is each input feature value to that prediction? Two well-known techniques are SHapley Additive exPlanations (SHAP) and Integrated Gradients (IG). In fact, they each represent a different type of explanation algorithm: a Shapley-value-based algorithm (SHAP) and a gradient-based algorithm (IG).
There is a fundamental difference between these two algorithm types. This post describes that difference. First, we need some background. Below, we review Shapley values, Shapley-value-based methods (including SHAP), and gradient-based methods (including IG). Finally, we get back to our central question: When should you use a Shapley-value-based algorithm (like SHAP) versus a gradient-based explanation algorithm (like IG)?
What are Shapley values?
The Shapley value (proposed by Lloyd Shapley in 1953) is a classic method to distribute the total gains of a collaborative game to a coalition of cooperating players. It is provably the only distribution with certain desirable properties (fully listed on Wikipedia).
In our case, we formulate a game for the prediction at each instance. We consider the “total gains” to be the prediction value for that instance, and the “players” to be the model features of that instance. The collaborative game is all of the model features cooperating to form a prediction value. The Shapley value efficiency property says the feature attributions should sum to the prediction value. The attributions can be negative or positive, since a feature can lower or raise a predicted value.
There is a variant called the Aumann-Shapley value, which extends the definition of the Shapley value to games with many (or infinitely many) players, each of whom plays only a minor role. It applies when the worth function (the gains from including a coalition of players) is differentiable.
What is a Shapley-value-based explanation method?
A Shapley-value-based explanation method tries to approximate Shapley values of a given prediction by examining the effect of removing a feature under all possible combinations of presence or absence of the other features. In other words, this method looks at function values over subsets of features like F(x1, <absent>, x3, x4, …, <absent>, …, xn). How to evaluate a function F with one or more absent features is subtle.
For example, SHAP (SHapley Additive exPlanations) estimates the model’s behavior on an input with certain features absent by averaging over values of those features sampled from the training set. In other words, F(x1, <absent>, x3, …, xn) is estimated by the expected prediction when the missing feature x2 is sampled from the dataset.
Exactly how that sample is chosen is important (for example marginal versus conditional distribution versus cluster centers of background data), but I will skip the fine details here.
Once we define the model function (F) for all subsets of the features, we can apply the Shapley values algorithm to compute feature attributions. Each feature’s Shapley value is a weighted average of that feature’s marginal contribution over all possible subsets of the other features.
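As a rough illustration of the steps above, here is a brute-force sketch in plain Python. It is not the SHAP library's implementation. The `toy_model`, the two-row `background` set, and the choice to fill in absent features from background rows are assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def value(model, x, background, present):
    """Average prediction when features outside `present` come from background rows."""
    total = 0.0
    for row in background:
        mixed = [x[i] if i in present else row[i] for i in range(len(x))]
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Classic Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = value(model, x, background, s | {i}) - value(model, x, background, s)
                phi[i] += weight * gain
    return phi


def toy_model(features):
    """Hypothetical model with one additive term and one interaction term."""
    x1, x2, x3 = features
    return 2.0 * x1 + x2 * x3


background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]  # made-up "training" rows
x = [3.0, 2.0, 4.0]
phi = shapley_values(toy_model, x, background)
baseline_value = value(toy_model, x, background, set())
print(phi, sum(phi), toy_model(x) - baseline_value)
```

With this formulation the attributions sum to the prediction minus the average prediction over the background rows, which is the efficiency property measured relative to a baseline. The nested loops visit every subset of features, which is why this only works for a handful of features.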
The “kernel SHAP” method from the SHAP paper computes the Shapley values of all features simultaneously by defining a weighted least squares regression whose solution is the Shapley values for all the features.
The high-level point is that all these methods rely on taking subsets of features. This makes the theoretical version exponential in runtime: for N features, there are 2^N combinations of presence and absence. That is too expensive for most N, so these methods approximate. Even with approximations, kernel SHAP can be slow. Also, we don’t know of any systematic study of how good the approximation is.
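One simple way to approximate without visiting every subset is to sample random feature orderings and average each feature's marginal contribution. The sketch below shows that idea only. It is not kernel SHAP's weighted regression, and it assumes absent features are replaced by a single hypothetical `baseline` vector rather than by samples from the training data.

```python
import random


def sampled_shapley(model, x, baseline, num_permutations=200, seed=0):
    """Monte Carlo Shapley estimate from random feature orderings."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]  # switch feature i from "absent" to "present"
            updated = model(current)
            phi[i] += updated - previous
            previous = updated
    return [p / num_permutations for p in phi]


def toy_model(features):
    """Hypothetical model with one additive term and one interaction term."""
    x1, x2, x3 = features
    return 2.0 * x1 + x2 * x3


x = [3.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
estimate = sampled_shapley(toy_model, x, baseline)
print(estimate)
```

This costs `num_permutations * (N + 1)` model evaluations instead of a number that grows as 2^N. The price is sampling noise in the attributions of features that interact with each other.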
There are versions of SHAP specialized to different model architectures for speed. For example, Tree SHAP computes all the subsets by cleverly keeping track of what proportion of all possible subsets flow down into each of the leaves of the tree. However, if your model architecture does not have a specialized algorithm like this, you have to fall back on kernel SHAP, or another naive (unoptimized) Shapley-value-based method.
A Shapley-value-based method is attractive as it only requires black box access to the model (i.e. computing outputs from inputs), and there is a version agnostic to the model architecture. For instance, it does not matter whether the model function is discrete or continuous. The downside is that exactly computing the subsets is exponential in the number of features.
What is a gradient-based explanation method?
A gradient-based explanation method tries to explain a given prediction by using the gradient of (i.e. change in) the output with respect to the input features. Some methods like Integrated Gradients (IG), GradCAM, and SmoothGrad literally apply the gradient operator. Other methods like DeepLIFT and LRP apply “discrete gradients.”
Let me describe IG, which has the advantage that it tries to approximate Aumann-Shapley values, which are axiomatically justified. IG operates by considering a straight line path, in feature space, from the input at hand (e.g., an image from a training set) to a certain baseline input (e.g., a black image), and integrating the gradient of the prediction with respect to input features (e.g., image pixels) along this path.
This paper explains the intuition of the IG algorithm as follows. As the input varies along the straight line path between the baseline and the input at hand, the prediction moves along a trajectory from uncertainty to certainty (the final prediction probability). At each point on this trajectory, one can use the gradient with respect to the input features to attribute the change in the prediction probability back to the input features. IG aggregates these gradients along the trajectory using a path integral.
IG (roughly) requires the prediction to be a continuous and piecewise differentiable function of the input features. (More precisely, it requires that the function be continuous everywhere and that the partial derivative along each input dimension satisfy Lebesgue’s integrability condition, i.e., the set of points of discontinuity has measure zero.)
Note it is important to choose a good baseline for IG to make sensible feature attributions. For example, if a black image is chosen as baseline, IG won’t attribute importance to a completely black pixel in an actual image. The baseline value should both have a near-zero prediction, and also faithfully represent a complete absence of signal.
IG is attractive as it is broadly applicable to all differentiable models, easy to implement in most machine learning frameworks (e.g., TensorFlow, PyTorch, Caffe), and computationally scalable to massive deep networks like Inception and ResNet with millions of neurons.
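The path integral described above can be sketched in a few lines of plain Python. This is a sketch under stated assumptions, not a production implementation: the `logistic_model` and its weights are hypothetical, and the gradient is estimated by finite differences so the example has no dependencies. A real implementation would get gradients from the framework's automatic differentiation.

```python
import math


def numerical_gradient(f, point, eps=1e-6):
    """Central-difference gradient (a stand-in for framework autodiff)."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Approximate IG with a midpoint Riemann sum along the straight-line path."""
    n = len(x)
    gradient_sum = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        for i in range(n):
            gradient_sum[i] += grad[i]
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, gradient_sum)]


def logistic_model(features):
    """Hypothetical differentiable model; the third feature has zero weight."""
    z = 1.5 * features[0] - 2.0 * features[1] + 0.0 * features[2]
    return 1.0 / (1.0 + math.exp(-z))


x = [2.0, 0.5, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(logistic_model, x, baseline)
print(attributions)
print(sum(attributions), logistic_model(x) - logistic_model(baseline))
```

The attributions should add up (approximately) to the difference between the prediction at the input and the prediction at the baseline, and the feature the model ignores gets zero attribution. The number of gradient evaluations is set by `steps`, not by the number of features.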
When should you use a Shapley-value-based versus a gradient-based explanation method?
Finally, the payoff! Our advice: If the model function is piecewise differentiable and you have access to the model gradient, use IG. Otherwise, use a Shapley-value-based method.
Any model trained using gradient descent is differentiable. For example: neural networks, logistic regression, support vector machines. You can use IG with these. The major class of non-differentiable models is trees: boosted trees, random forests. They encode discrete values at the leaves. These require a Shapley-value-based method, like Tree SHAP.
The IG algorithm is faster than a naive Shapley-value-based method like kernel SHAP, as it only requires computing the gradients of the model output on a few different inputs (typically 50). In contrast, a Shapley-value-based method requires computing the model output on a large number of inputs sampled from the exponentially huge subspace of all possible combinations of feature values. Computing gradients of differentiable models is efficient and well supported in most machine learning frameworks. However, a differentiable model is a prerequisite for IG. By contrast, a Shapley-value-based method makes no such assumptions.
Several types of input features that look discrete (hence might require a Shapley-value-based method) actually can be mapped to differentiable model types (which let us use IG). Let us walk through one example: text sentiment. Suppose we wish to attribute the sentiment prediction to the words in some input text. At first, it seems that such models may be non-differentiable as the input is discrete (a collection of words). However, differentiable models like deep neural networks can handle words by first mapping them to a high-dimensional continuous space using word embeddings. The model’s prediction is a differentiable function of these embeddings. This makes it amenable to IG. Specifically, we attribute the prediction score to the embedding vectors. Since attributions are additive, we sum the attributions (retaining the sign) along the fields of each embedding vector and map it to the specific input word that the embedding corresponds to.
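The final aggregation step can be sketched as follows. The tokens and the per-dimension attribution numbers are invented for illustration. In practice they would come from running IG on the model's embedding layer.

```python
def word_attributions(tokens, embedding_attributions):
    """Sum each token's attribution over its embedding dimensions, keeping the sign."""
    return [(token, sum(vector)) for token, vector in zip(tokens, embedding_attributions)]


tokens = ["the", "movie", "was", "great"]
# One made-up attribution vector per token (3 embedding dimensions each).
embedding_attributions = [
    [0.01, -0.02, 0.00],
    [0.05, 0.02, -0.01],
    [0.00, 0.01, -0.01],
    [0.30, 0.25, -0.05],
]

per_word = word_attributions(tokens, embedding_attributions)
for token, score in per_word:
    print(f"{token:>6}: {score:+.2f}")
```

Because the sums keep their sign, a word can end up with a negative score even if some of its embedding dimensions contribute positively.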
A crucial question for IG is: what is the baseline input? For this text example, one option is to use the embedding vector corresponding to empty text. Some models take fixed length inputs by padding short sentences with a special “no word” token. In such cases, we can take the baseline as the embedding of a sentence with just “no word” tokens.
In many cases (a differentiable model with a gradient), you can use integrated gradients (IG) to get a more certain and possibly faster explanation of feature importance for a prediction. However, a Shapley-value-based method is required for other (non-differentiable) model types.
At Fiddler, we support both SHAP and IG. (Full disclosure: Ankur Taly, a co-author of IG, works at Fiddler, and is a co-author of this post.) Feel free to email firstname.lastname@example.org for more information, or just to say hi!
You can’t always change a human’s input to see the output.
At Fiddler Labs, we place great emphasis on model explanations being faithful to the model’s behavior. Ideally, feature importance explanations should surface and appropriately quantify all and only those factors that are causally responsible for the prediction. This is especially important if we want explanations to be legally compliant (e.g., GDPR, article 13 section 2f, people have a right to ‘[information about] the existence of automated decision-making, including profiling .. and .. meaningful information about the logic involved’), and actionable. Even when making post-processing explanations human-intelligible, we must preserve faithfulness to the model.
How do we differentiate between features that are correlated with the outcome, and those that cause the outcome? In other words, how do we think about the causality of a feature to a model output, or to a real-world task? Let’s take those one at a time.
Explaining causality in models is hard
When explaining a model prediction, we’d like to quantify the contribution of each (causal) feature to the prediction.
For example, in a credit risk model, we might like to know how important income or zip code is to the prediction.
Note that zip code may be causal to a model’s prediction (i.e. changing zip code may change the model prediction) even though it may not be causal to the underlying task (i.e. changing zip code may not change the decision of whether to grant a loan). However, these two things may be related if this model’s output is used in the real-world decision process.
The good news is that since we have input-output access to the model, we can probe it with arbitrary inputs. This allows examining counterfactuals, inputs that are different from those of the prediction being explained. These counterfactuals might be elsewhere in the dataset, or they might not.
Shapley values (a classic result from game theory) offer an elegant, axiomatic approach to quantify feature contributions.
One challenge is they rely on probing with an exponentially large set of counterfactuals, too large to compute. Hence, there are several papers on approximating Shapley values, especially for specific classes of model functions.
However, a more fundamental challenge is that when features are correlated, not all counterfactuals may be realistic. There is no clear consensus on how to address this issue, and existing approaches differ on the exact set of counterfactuals to be considered.
To overcome these challenges, it is tempting to rely on observational data: for instance, using the observed data to define the counterfactuals for applying Shapley values or, more simply, fitting an interpretable model to the observed data to mimic the main model’s predictions and then explaining the interpretable model in lieu of the main model. But this can be dangerous.
Consider a credit risk model with features including the applicant’s income and zip code. Say the model internally only relies on the zip code (i.e., it redlines applicants). Explanations based on observational data might reveal that the applicant’s income, by virtue of being correlated to zip code, is just as predictive of the model’s output as zip code is. This may mislead us to explain the model’s output in terms of the applicant’s income. In fact, a naive explanation algorithm will split attributions equally between two perfectly correlated features.
To learn more, we can intervene in features. One counterfactual changing zip code but not income will reveal that zip code causes the model’s prediction to change. A second counterfactual that changes income but not zip code will reveal that income does not. These two together will allow us to conclude that zip code is causal to the model’s prediction, and income is not.
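The two interventions can be sketched like this. The `redlining_model`, its scores, and the zip code strings are hypothetical and only stand in for a model that ignores income.

```python
def redlining_model(applicant):
    """Hypothetical model that looks only at the zip code, never at income."""
    return 0.9 if applicant["zip_code"].startswith("9") else 0.2


def intervene(model, applicant, feature, new_value):
    """Change a single feature and report how much the prediction moves."""
    counterfactual = {**applicant, feature: new_value}
    return model(counterfactual) - model(applicant)


applicant = {"income": 40000, "zip_code": "90000"}

zip_effect = intervene(redlining_model, applicant, "zip_code", "10000")
income_effect = intervene(redlining_model, applicant, "income", 120000)

print("changing zip code moves the prediction by", zip_effect)
print("changing income moves the prediction by", income_effect)
```

Only the zip code intervention moves the prediction, no matter how strongly income and zip code are correlated in the observed data.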
Explaining causality requires the right counterfactuals.
Explaining causality in the real world is harder
Above we outlined a method to try to explain causality in models: study what happens when features change. To do so in the real world, you have to be able to apply interventions. This is commonly called a “randomized controlled trial” (also known as “A/B testing” when there are two variants, especially in the tech industry). You divide a population into two or more groups randomly, and apply different interventions to each group. The randomization ensures that the only differences among the groups are your intervention. Therefore, you can conclude that your intervention causes the measurable differences in the groups.
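A small simulation shows why randomization works. All the numbers here (population size, outcome distribution, and a true effect of 2.0) are made up for the sketch.

```python
import random
from statistics import mean


def run_trial(population_size=10000, true_effect=2.0, seed=0):
    """Randomly assign each person to treatment or control and compare group means."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(population_size):
        # Each person's outcome without any intervention (synthetic).
        untreated_outcome = rng.gauss(10.0, 3.0)
        if rng.random() < 0.5:
            treated.append(untreated_outcome + true_effect)
        else:
            control.append(untreated_outcome)
    return mean(treated) - mean(control)


estimated_effect = run_trial()
print("estimated effect of the intervention:", round(estimated_effect, 2))
```

Because assignment is random, the two groups have the same distribution of untreated outcomes on average, so the difference in means estimates the effect of the intervention itself.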
The challenge in applying this method to real-world tasks is that not all interventions are feasible. You can’t ethically ask someone to take up smoking. In the real world, you may not be able to get the data you need to properly examine causality.
We can probe models as we wish, but not people.
Natural experiments can provide us an opportunity to examine situations where we would not normally intervene, like in epidemiology and economics. However, these provide us a limited toolkit, leaving many questions in these fields up for debate.
There are proposals for other theories that allow us to use domain knowledge to separate correlation from causation. These are subject to ongoing debate and research.
Now you know why explaining causality in models is hard, and explaining it in the real world is even harder.
To learn more about explaining models, email us at email@example.com. (Photo credit: pixabay.) This post was co-written with Ankur Taly.
What does debugging look like in the new world of machine learning models? One way uses model explanations.
Machine learning (ML) models are popping up everywhere. There is a lot of technical innovation (e.g., deep learning, explainable AI) that has made them more accurate, more broadly applicable, and usable by more people in more business applications. The lists are everywhere: banking, healthcare, tech, all of the above.
In a deep learning neural network, instead of lines of code written by people, we are looking at possibly millions of weights linked together into an incomprehensible network.
So how do we find bugs in this network? One way is to explain your model predictions. Let’s look at two types of bugs we can find through explanations (data leakage and data bias), illustrated with examples from predicting loan default. Both of these are actually data bugs, but a model summarizes the data, so they show up in the model.
Bug #1: data leakage
Most ML models are supervised. You choose a precise prediction goal (also called the “prediction target”), gather a dataset with features, and label each example with the target. Then you train a model to use the features to predict the target. Surprisingly often there are features in the dataset that relate to the prediction target but are not useful for prediction. For example, they might be added from the future (i.e. long after prediction time), or otherwise unavailable at prediction time.
Here is an example from the Lending Club dataset. We can use this dataset to model loan default, with the loan_status field as our prediction target. It takes the values “Fully Paid” (okay) or “Charged Off” (bank declared a loss, i.e. the borrower defaulted). In this dataset, there are also fields such as total_pymnt (the payments received) and loan_amnt (amount borrowed). Here are a few example values:
Notice anything? Whenever the loan has defaulted (“Charged Off”), the total payments are less than the loan amount, and delta (=loan_amnt-total_pymnt) is positive. Well, that’s not terribly surprising. Rather, it’s nearly the definition of default: by the end of the loan term, the borrower paid less than what was loaned. Now, delta doesn’t have to be positive for a default: you could default after paying back the entire loan principal amount but not all of the interest. But, in this data, 98% of the time that delta is negative, the loan was fully paid; and 100% of the time that delta is positive, the loan was charged off. Including total_pymnt gives us nearly perfect information, but we don’t get total_pymnt until after the entire loan term (3 years)!
Including both loan_amnt and total_pymnt in the data potentially allows nearly perfect prediction, but we won’t really have total_pymnt for the real prediction task. Including them both in the training data is data leakage of the prediction target.
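A quick sanity check for this kind of leakage is to see how well a one-line rule on the suspect fields predicts the target. In the sketch below the field names follow the dataset, but the four rows are invented for illustration and are not taken from Lending Club.

```python
# Invented rows; only the field names come from the dataset.
loans = [
    {"loan_amnt": 10000, "total_pymnt": 11500.0, "loan_status": "Fully Paid"},
    {"loan_amnt": 8000, "total_pymnt": 3200.0, "loan_status": "Charged Off"},
    {"loan_amnt": 15000, "total_pymnt": 16900.0, "loan_status": "Fully Paid"},
    {"loan_amnt": 5000, "total_pymnt": 1100.0, "loan_status": "Charged Off"},
]


def leak_rule_accuracy(rows):
    """Accuracy of predicting default purely from delta = loan_amnt - total_pymnt."""
    correct = 0
    for row in rows:
        delta = row["loan_amnt"] - row["total_pymnt"]
        predicted = "Charged Off" if delta > 0 else "Fully Paid"
        correct += predicted == row["loan_status"]
    return correct / len(rows)


accuracy = leak_rule_accuracy(loans)
print("accuracy of the delta rule:", accuracy)
```

If a trivial rule like this scores close to 100% on real data, the fields involved probably encode the outcome rather than predict it.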
If we make a (cheating) model, it will perform very well. Too well. And, if we run a feature importance algorithm on some predictions (a common form of model explanation), we’ll see these two variables come up as important, and with any luck realize this data leakage.
Below, the Fiddler explanation UI shows “delta” stands out as a huge factor in raising this example prediction.
There are other, more subtle potential data leakages in this dataset. For example, the grade and sub_grade are assigned by a Lending Club proprietary model, which almost completely determines the interest rate. So, if you want to build your own risk scoring model without Lending Club, then grade, sub_grade, and int_rate are all data leakage. They wouldn’t allow you to perfectly predict default, but presumably they would help, or Lending Club would not use their own model. Moreover, for their model, they include FICO score, yet another proprietary risk score, but one that most financial institutions buy and use. If you don’t want to use FICO score, then that is also data leakage.
Data leakage is any predictive data that you can’t or won’t use for prediction. A model built on data with leakage is buggy.
Bug #2: data bias
Suppose through poor data collection or a bug in preprocessing, our data is biased. More specifically, there is a spurious correlation between a feature and the prediction target. In that case, explaining predictions will show an unexpected feature often being important.
We can simulate a data processing bug in our lending data by dropping all the charged off loans from zip codes starting with 1 through 5. Before this bug, zip code is not very predictive of chargeoff (an AUC of 0.54, only slightly above random). After this bug, any zip code starting with 1 through 5 will never be charged off, and the AUC jumps to 0.78. So, zip code will show up as an important feature in predicting (no) loan default from data examples in those zip codes. In this example, we could investigate by looking at predictions where zip code was important. If we are observant, we might notice the pattern, and realize the bias.
Below is what charge-off rate would look like if summarized by the first digit of zip code. Some zips would have no charge-offs, while the rest would have a rate similar to the dataset overall.
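The same summary can be reproduced on synthetic data. This sketch does not use the Lending Club data. The 15% charge-off rate, the uniformly random zip prefixes, and the function names are assumptions chosen for illustration.

```python
import random


def make_loans(n=20000, charge_off_rate=0.15, seed=0):
    """Synthetic loans where zip prefix is unrelated to charge-off."""
    rng = random.Random(seed)
    return [
        {"zip_prefix": str(rng.randint(0, 9)), "charged_off": rng.random() < charge_off_rate}
        for _ in range(n)
    ]


def buggy_preprocess(loans):
    """Simulated bug: drop charged-off loans whose zip code starts with 1 through 5."""
    dropped_prefixes = {"1", "2", "3", "4", "5"}
    return [
        loan for loan in loans
        if not (loan["charged_off"] and loan["zip_prefix"] in dropped_prefixes)
    ]


def charge_off_rate_by_prefix(loans):
    """Charge-off rate grouped by the first digit of the zip code."""
    totals, charged_off = {}, {}
    for loan in loans:
        prefix = loan["zip_prefix"]
        totals[prefix] = totals.get(prefix, 0) + 1
        charged_off[prefix] = charged_off.get(prefix, 0) + loan["charged_off"]
    return {prefix: charged_off[prefix] / totals[prefix] for prefix in sorted(totals)}


rates = charge_off_rate_by_prefix(buggy_preprocess(make_loans()))
for prefix, rate in rates.items():
    print(prefix, f"{rate:.3f}")
```

A grouped summary like this is cheap to compute, and a block of exact zeros next to otherwise similar rates is a strong hint that the data pipeline, not the world, produced the pattern.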
Below, the Fiddler explanation UI shows zip code prefix stands out as a huge factor in lowering this example prediction.
A model built from this biased data is not useful for making predictions on (unbiased) data we haven’t seen yet. It is only accurate in the biased data. Thus, a model built on biased data is buggy.
Other model debugging methods
There are many other possibilities for model debugging that don’t involve model explanations. For example:
Look for overfitting or underfitting. If your model architecture is too simple, it will underfit. If it is too complex, it will overfit.
Regression tests on a golden set of predictions that you understand. If these fail, you might be able to narrow down which scenarios are broken.
Since explanations aren’t involved with these methods, I won’t say more here.
If you are not sure your model is using your data appropriately, use explanations of feature importance to examine its behavior. You might see data leakage or data bias. Then, you can fix your data, which is the best way to fix your model.