A gentle introduction to GA2Ms, a white box model

A gentle introduction to a white box machine learning model called a GA2M, a Generalized Additive Model (GAM) with interaction terms.

This post is a gentle introduction to a white box machine learning model called a GA2M.

We’ll walk through:

  • What is a white box model, and why would you want one?
  • A classic example white box model: logistic regression
  • What is a GAM, and why would you want one?
  • What is a GA2M, and why would you want one?
  • When should you choose a GAM, a GA2M, or something else?

The purpose of all these machine learning models is to make a prediction towards a goal specified by a human. Think of a model that can predict loan default, or the presence of someone’s face in a picture.

The short story: A generalized additive model (GAM) is a white box model that is more flexible than logistic regression, but still interpretable. A GA2M is a GAM with interaction terms, which allows it to be more flexible still, but with a more complicated interpretation. GAMs and GA2Ms are an intriguing addition to your toolbox, interpretable at the expense of not fitting every kind of data. A picture:

For more about what that all means, read on.

White box models

The term “white box” comes from software engineering. It means software whose internals you can view, compared to a “black box” whose internals you cannot view. By this definition, a neural network could be a white box model if you can see the weights (picture credit):

However, by white box people really mean something they can understand. A white box model is a model whose internals a person can see and reason about. This is subjective, but most people would agree the weights shown above don’t tell us how the model works in a way that lets us usefully describe it, or predict what the model is going to do in the future.

Compare the picture above to this one about risk of death from pneumonia by age from [1]:

Now that isn’t a whole model. Rather, it’s the impact of one feature (age) on the risk score. The green lines are error bars (±1 standard deviation in 100 rounds of bagging). The red line in the middle of them is the best estimate. In the paper, they observe:

  1. Risk score is flat until about age 50. Risk score here is negative, meaning less risk of death than the average in the dataset.
  2. Risk score rises sharply at 65. This could be due to retirement. In future, it might be interesting to gather data about retirement.
  3. The error bars are narrowest in ages 66-85. Perhaps that is where the most data is.
  4. Risk score rises again at 85. The error bars also widen again. Maybe this jump is not real.
  5. Risk score drops above 100. This may be due to lack of data, or something else. In the paper, they suggested one might wish to “fix” this region of the model by changing it to predict at the same level as ages 85-100 instead of dropping. This fix is using domain knowledge (“risk of pneumonia likely doesn’t go down after age 85”) to address possible model artifacts.
  6. Risk score between 66 and 85 is relatively flat.

All this from one graph of one model feature. There are facts, like the shape of the graph, and then speculation about why the graph might behave that way. The facts are useful to understand the data. The speculation cannot be answered by any tool, but may be useful to suggest further actions, like collecting new features (say, about retirement) or new instances (like points below age 50 or above 100), or new analyses (like looking carefully at data instances around ages 85-86 for differences).

These aren’t simulations of what the model would do. These are the internals of the model itself, so that graph is accurately describing the exact effect of age on risk score. There are 55 other components to this model, but each can be examined and reasoned about.

This is the power of a white box model.

This example also shows the dangers. By seeing everything, we may believe we understand everything, and speculate wildly or “fix” inappropriately. As always, we have to exercise judgment to use data properly.

In summary: make a white box model to

  • learn about your model, not from simulations or approximations, but the actual internals
  • improve your model, by giving you ideas of directions to pursue
  • “fix” your model, i.e., align it with your intuition or domain knowledge

One final possibility: regulations dictate that you need to fully describe your model. In that case, it could be useful to have human-readable internals for reference.

Here are some examples of white box and black box models:

White box models:

  • Logistic regression
  • GAMs
  • GA2Ms
  • Decision trees (short and few trees)

Black box models:

  • Neural networks (including deep learning)
  • Boosted trees and random forests (many trees)
  • Support vector machines

Now let’s walk through three specific white box models.

A classic: logistic regression

The logistic function behind logistic regression was developed in the 1800s, and the regression model was popularized in the 1900s. It’s been around for a long time, for many reasons. It solves a common problem (predict the probability of an event), and it’s interpretable. Let’s explore what that means. Here is the logistic equation defining the model:
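
Written out in LaTeX, and reconstructed from the variable definitions and the log-odds discussion that follow, the equation is presumably the one below. The intercept $\beta_0$ and the feature count $m$ are notation assumed here.

```latex
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m
```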

There are three types of variables in this model equation:

  • p is the probability of an event we’re predicting. For example, defaulting on a loan.
  • The x’s are features. For example, loan amount.
  • The 𝛽’s (betas) are the coefficients, which we fit using a computer.

The betas are fit once to the entire dataset. The x’s are different for each instance in the dataset. The p represents an aggregate of dataset behavior: any dataset instance either happened (1) or didn’t (0), but in aggregate, we’d like the right-hand side and the left-hand side to be as close as possible.

The “log(p/(1-p))” is the log odds, also called the “logit of the probability”. The odds are (probability the event happens)/(probability the event doesn’t happen), or p/(1-p). Taking the natural logarithm of the odds translates p, which ranges from 0 to 1, into a quantity that can range from -∞ to +∞, suitable for a linear model.
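
As a small illustration of that mapping, here is a sketch in Python. The function names `logit` and `inverse_logit` are chosen for this example and are not from any particular library.

```python
import math


def logit(p):
    """Map a probability strictly between 0 and 1 to log odds."""
    return math.log(p / (1.0 - p))


def inverse_logit(log_odds):
    """Map log odds back to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Probabilities near 0 give large negative log odds, 0.5 gives zero,
# and probabilities near 1 give large positive log odds.
for p in (0.01, 0.5, 0.99):
    print(f"p={p:.2f}  odds={p / (1 - p):.4f}  log odds={logit(p):+.4f}")
```

The inverse function is what turns a fitted model’s log-odds output back into a probability.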

This model is linear, but for the log odds. That is, the right-hand side is a linear equation, but it is fit to the log odds, not the probability of an event.

This model is interpretable as follows: a unit increase in xi corresponds to a log-odds increase of 𝛽i.

For example, suppose we’re predicting the probability of loan default, and our model has a feature coefficient 𝛽1=0.15 for the loan amount feature x1. That means a unit increase in the feature corresponds to an increase of 0.15 in the log odds of default. We can exponentiate to get the odds ratio, exp(0.15)=1.1618. That means:

for this model, a unit increase (of say, a thousand dollars) in loan amount corresponds to a 16% increase in the odds of loan default, holding all other factors constant.

This statement is what people mean when they say logistic regression is interpretable.
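
The arithmetic can be checked in a few lines of Python. The coefficient 0.15 comes from the example above; the 10% baseline probability is a made-up value used only to show that a 16% increase in odds is not the same as a 16% increase in probability.

```python
import math

beta_loan_amount = 0.15  # coefficient from the example, in log odds per unit

# Exponentiating a log-odds coefficient gives the odds ratio per unit increase.
odds_ratio = math.exp(beta_loan_amount)
percent_change_in_odds = (odds_ratio - 1.0) * 100.0
print(f"odds ratio: {odds_ratio:.4f}")
print(f"change in odds per unit: {percent_change_in_odds:.1f}%")

# Hypothetical baseline: a borrower with a 10% probability of default.
baseline_p = 0.10
baseline_odds = baseline_p / (1.0 - baseline_p)
new_odds = baseline_odds * odds_ratio
new_p = new_odds / (1.0 + new_odds)
print(f"probability moves from {baseline_p:.3f} to {new_p:.3f}")
```

The odds ratio is the same for every borrower, while the change in probability depends on where the borrower starts.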

To summarize why logistic regression is a white box model:

  • The input response terms (𝛽ixi terms) can be interpreted independently of each other
  • The terms are in interpretable units: the coefficients (betas) are in units of log odds.

So why would we use anything other than the friendly, venerable model of logistic regression?

Well, if the features and log odds don’t have a linear relationship, this model won’t fit well. I always think of trying to fit a line to a parabola:

If you have non-linear data (the black parabola), a linear fit (the blue dashed line) will never be great. No line fits the curve.
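
To make that concrete, here is a minimal sketch that fits an ordinary least-squares line to points on y = x². The data and the helper `fit_line` are illustrative assumptions, not part of any model discussed here.

```python
# Points on the parabola y = x^2, spaced evenly and symmetric around zero.
xs = [i / 10.0 for i in range(-20, 21)]
ys = [x * x for x in xs]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)
worst_error = max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))
print(f"slope={slope:.3f} intercept={intercept:.3f} worst error={worst_error:.3f}")
```

On symmetric data the best line is flat at the mean of y, so it misses both the bottom of the curve and its ends.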

Generalized Additive Models (GAMs)

Generalized Additive Models (GAMs) were introduced in the 1980s by Hastie and Tibshirani, who published a book on them in 1990. (See also chapter 9 of “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman.) Here is the equation defining the model:
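
In LaTeX, reconstructed from the element definitions that follow, the equation is presumably the one below. The intercept $\beta_0$ is assumed notation; $m$ is the number of features.

```latex
g\bigl(E(Y)\bigr) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_m(x_m)
```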

This equation is quite similar to logistic regression. It has the same three types of elements:

  • E(Y) is an aggregate of dataset behavior, like the “p” in the equation above. In fact, it may well be the probability of an event, the same p.
  • g(.) is a link function, like the logit (or log odds) from the logistic equation above.
  • fi(xi) is a term for each dataset instance feature x1,…,xm.

The big difference is instead of a linear term 𝛽ixi for a feature, now we have a function fi(xi). In their book, Hastie and Tibshirani specify a “smooth” function like a cubic spline. Lou et al. [2] looked at other functions for the fi, which they call “shape functions.”

A GAM also has white box features:

  • The input response terms (fi(xi) terms) can be interpreted independently of each other
  • The terms are in interpretable units. For the logit link function, these are log odds.

Now a term, instead of being a constant (beta), is a function, so instead of reporting the log odds as a number, we visualize it with a graph. In fact, the graph above of pneumonia risk of death by age is one term (shape function) in a GAM.
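
A minimal sketch of how a fitted GAM with a logit link turns shape functions into a prediction is shown below. Everything in it is hypothetical: the feature names, bin edges, scores, and intercept are invented for illustration and are not taken from the pneumonia model, and storing each shape function as a piecewise-constant lookup table is an assumption of this sketch.

```python
import bisect
import math

# Hypothetical shape functions: bin edges plus one log-odds score per bin.
SHAPES = {
    "age": ([50, 65, 85], [-0.4, 0.0, 0.3, 0.8]),
    "heart_rate": ([60, 100], [0.1, 0.0, 0.5]),
}
INTERCEPT = -2.0  # hypothetical baseline log odds


def shape_term(feature, value):
    """Look up the log-odds contribution of one feature value."""
    edges, scores = SHAPES[feature]
    return scores[bisect.bisect_right(edges, value)]


def predict_probability(instance):
    """Sum the intercept and per-feature terms, then invert the logit link."""
    log_odds = INTERCEPT + sum(shape_term(f, v) for f, v in instance.items())
    return 1.0 / (1.0 + math.exp(-log_odds))


patient = {"age": 70, "heart_rate": 80}
for feature, value in patient.items():
    print(f"{feature}={value}: {shape_term(feature, value):+.2f} log odds")
print(f"predicted probability: {predict_probability(patient):.3f}")
```

Because the terms simply add, each feature’s contribution can be read off on its own, which is what the per-feature graph displays.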

So why would we use anything other than a GAM? It’s already flexible and interpretable. Same reason as before: it might not be accurate enough. In particular, we’ve assumed that each feature response can be modeled with its own function, independent of the others.

But what if there are interactions between the features? Several black box models (boosted trees, neural networks) can model interaction terms. Let us walk through a white box model that also can: GA2Ms.

GAMs with interaction terms (GA2Ms)

GA2Ms were investigated in 2013 by Lou et al. [3]. The authors pronounce them with the letters “gee ay two em”, but in house we’ve taken to calling them “interaction GAMs” because it’s more pronounceable. Here is the model equation:
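
In LaTeX, reconstructed from the description that follows, the equation is presumably the one below. The intercept $\beta_0$ and the index convention $i < j$ are assumptions of this reconstruction, and in practice only a selected subset of feature pairs gets an interaction term.

```latex
g\bigl(E(Y)\bigr) = \beta_0 + \sum_{i} f_i(x_i) + \sum_{i < j} f_{ij}(x_i, x_j)
```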

This equation is quite similar to the GAM equation from the previous section, except it adds functions that can account for two feature variables at once, i.e. interaction terms.

Microsoft just released a library, InterpretML, that implements GA2Ms in Python. In that library, they call them “Explainable Boosting Machines.”

Lou et al. say these are still white box models because the “shape function” for an interaction term is a heatmap. The two features are along the X and Y axes, and the color in the middle shows the function response. Here is an example from Microsoft’s library, fit to predict loan default on a dataset of loan performance from Lending Club:

For this example graph:

  • The upper right corner is the most red. That means the probability of default goes up the most when dti (debt to income ratio) and fico_range_midpoint (the FICO credit score) are both high.
  • The left strip is also red, but turns blue near the bottom. That means that very low dti is usually bad, except if fico_range_midpoint is also low.

This particular heatmap is hard to reason about. This is likely only the interaction effect without the single-feature terms. So, it could be that the probability of default overall isn’t higher at high-dti and high-fico, but rather just higher than either of the primary effects predict by themselves. To investigate further, we could probably look at some examples around the borders. But, for this blog post, we’ll skip the deep dive.
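
The distinction between the interaction effect and the overall prediction can be sketched with made-up numbers. The bin edges, main-effect scores, and interaction grid below are all hypothetical and are not read from the heatmap; they only show how the reddest interaction cell need not be the riskiest region overall.

```python
import bisect

# Hypothetical bin edges: low / medium / high for each feature.
DTI_EDGES = [10, 25]
FICO_EDGES = [670, 740]

# Hypothetical single-feature terms, in log odds of default.
DTI_MAIN = [-0.3, 0.0, 0.6]
FICO_MAIN = [0.8, 0.0, -0.9]

# Hypothetical interaction term, indexed as INTERACTION[dti_bin][fico_bin].
INTERACTION = [
    [-0.2, 0.1, 0.2],
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.3],
]


def contributions(dti, fico):
    """Return the additive log-odds pieces for one loan."""
    i = bisect.bisect_right(DTI_EDGES, dti)
    j = bisect.bisect_right(FICO_EDGES, fico)
    return {"dti": DTI_MAIN[i], "fico": FICO_MAIN[j], "dti x fico": INTERACTION[i][j]}


high_dti_high_fico = contributions(dti=30, fico=780)
high_dti_low_fico = contributions(dti=30, fico=600)
print(high_dti_high_fico, "total:", round(sum(high_dti_high_fico.values()), 2))
print(high_dti_low_fico, "total:", round(sum(high_dti_low_fico.values()), 2))
```

Here the high-dti, high-fico cell has the largest interaction value, yet its total is lower than the high-dti, low-fico total once the single-feature terms are added in.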

In practice, this library fits all single-feature functions, then N interaction terms, where you pick N. It is not easy to pick N. The interaction terms are worthwhile if they add enough accuracy to be worth the extra complexity of staring at heatmaps to interpret them. That is a judgement call that depends on your business situation.

When should we use GAMs or GA2Ms?

To perform machine learning, first pick a goal. Then pick a technology that will best use your data to meet the goal. There are thousands of books and millions of papers on that subject. But here is a drastically simplified way to think about how GA2Ms fit into the range of possible model technologies: they are on a spectrum from interpretability to modeling feature interactions.

  • Use GAMs if they are accurate enough. They give the advantages of a white box model: separable terms with interpretable units.
  • Use GA2Ms if they are significantly more accurate than GAMs, especially if you believe from your domain knowledge that there are real feature interactions, but they are not too complex. These also give the advantages of a white box model, with more effort to interpret.
  • Try boosted trees (xgboost or lightgbm) if you don’t know a lot about the data, since they are quite robust to quirks in data. These are black box models.
  • When features interact highly with each other, like pixels in images or the context in audio, you may well need neural networks or something else that can capture complex interactions. These are deeply black box.

In all cases, you may well need domain-specific data preprocessing, like squaring images, or standardizing features (subtracting the mean and dividing by the standard deviation). That is a topic for another day.
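
For reference, standardizing a single feature takes only a few lines. This is a sketch using Python’s standard library; the loan amounts are hypothetical, and using the population standard deviation rather than the sample one is a choice made for this example.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / sd for v in values]


loan_amounts = [5.0, 10.0, 15.0, 20.0, 50.0]  # hypothetical, in thousands of dollars
z_scores = standardize(loan_amounts)
print([round(z, 3) for z in z_scores])
```

After standardizing, a “unit increase” in the feature means one standard deviation, which changes how a logistic regression coefficient should be read.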

Now hopefully the diagram we started with makes more sense.

At Fiddler Labs, we help you explain your AI. Email us at info@fiddler.ai.

References

  1. Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-Day Readmission.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721–1730. KDD ’15. New York, NY, USA: ACM, 2015. https://doi.org/10.1145/2783258.2788613.
  2. Lou, Yin, Rich Caruana, and Johannes Gehrke. “Intelligible Models for Classification and Regression.” In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150–158. KDD ’12. New York, NY, USA: ACM, 2012. https://doi.org/10.1145/2339530.2339556.
  3. Lou, Yin, Rich Caruana, Giles Hooker, and Johannes Gehrke. Accurate Intelligible Models with Pairwise Interactions, 2017. https://www.microsoft.com/en-us/research/publication/accurate-intelligible-models-pairwise-interactions/.

Humans choose, AI does not

For the non-technical reader who sees scary headlines: every AI has a goal, usually a labeled dataset, precisely defined by a human.

Artificial intelligence isn’t human

“Artificial Intelligence Will Best Humans at Everything by 2060, Experts Say.” Well.

First, as Yogi Berra said, “It’s tough to make predictions, especially about the future.” Where is my flying car?

Second, the title reads like clickbait, but surprisingly it appears to be pretty close to the actual survey, which asked AI researchers when “high-level machine intelligence” will arrive, defined as “when unaided machines can accomplish every task better and more cheaply than human workers.” What is a ‘task’ in this definition? Does “every task” even make sense? Can we enumerate all tasks?

Third and most important, is high-level intelligence just accomplishing tasks? This is the real difference between artificial and human intelligence: humans define goals, AI tries to achieve them. Is the hammer going to displace the carpenter? They each have a purpose.

This difference between artificial and human intelligence is crucial to understand, both to interpret all the crazy headlines in the popular press, and more importantly, to make practical, informed judgements about the technology.

The rest of this post walks through some types of artificial intelligence, types of human intelligence, and given how different they are, plausible and implausible risks of artificial intelligence. The short story: unlike humans, every AI technology has a perfectly mathematically well-defined goal, often a labeled dataset.

Types of artificial intelligence

In supervised learning, you define a prediction goal and gather a training set with labels corresponding to the goal. Suppose you want to identify whether a picture has Denzel Washington in it. Then your training set is a set of pictures, each labeled as containing Denzel Washington or not. The label has to be applied outside of the system, most likely by people. If your goal is to do facial recognition, your labeled dataset is pictures along with a label (the person in the picture). Again, you have to gather the labels somehow, likely with people. If your goal is to match a face with another face, you need a label of whether the match was successful or not. Always labels.

Almost all the machine learning you read about is supervised learning. Deep learning, neural networks, decision trees, random forests, and logistic regression all train on labeled datasets.

In unsupervised learning, again you define a goal. A very common unsupervised learning technique is clustering (e.g., the well-known k-means clustering). Again, the goal is very well-defined: find clusters minimizing some mathematical cost function. For example, where the distance between points within the same cluster is small, and the distance between points not within the same cluster is large. All of these goals are so well-defined they have mathematical formalism:
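
For k-means, for instance, the standard objective is the within-cluster sum of squares. It is given here as an illustration and is not necessarily the exact formula the post originally displayed. The symbols are the usual ones: $S_1, \dots, S_k$ are the clusters and $\mu_i$ is the mean of the points in $S_i$.

```latex
\underset{S_1, \dots, S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
```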

This formula feels very different from how humans specify goals. Most humans don’t understand these symbols at all, and human goals are rarely this formal. Also, a “goal-oriented” mindset in a human is unusual enough that it has a special term.

In reinforcement learning, you define a reward function to reward (or penalize) actions that move towards a goal. This is the technology people have been using recently for games like chess and Go, where it may take many actions to reach a particular goal (like checkmate), so you need a reward function that gives hints along the way. Again, not only a well-defined goal, but even a well-defined on-the-way-to-goal reward function.

These are types of artificial intelligence (“machine learning”) that are currently hot because of recent huge gains in accuracy, but there are plenty of others that people have studied.

Genetic algorithms are another way of solving problems inspired by biology. One takes a population of mathematical constructs (essentially functions), and selects those that perform best on a problem. Although people get emotional about the biological analogy, still the fitness function that defines “best” is a concrete, completely specified mathematical function chosen by a human.
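
As a toy sketch of that idea, the Python below evolves bit strings toward a fitness function that simply counts ones. The problem, the parameters, and the helper names are all chosen for this illustration; real genetic algorithms vary widely in how they select, recombine, and mutate.

```python
import random

GENOME_LENGTH = 20


def fitness(genome):
    """The human-chosen goal: count the ones in the bit string."""
    return sum(genome)


def mutate(genome, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in genome]


def crossover(a, b, rng):
    """Splice the head of one parent onto the tail of another."""
    cut = rng.randrange(1, GENOME_LENGTH)
    return a[:cut] + b[cut:]


def evolve(generations=30, population_size=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(GENOME_LENGTH)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]  # keep the fitter half
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(population_size - 1)]
        population = [population[0]] + children  # carry the best one over unchanged
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
print("best fitness per generation:", history)
```

The only place the goal appears is `fitness`; swap in a different function and the same loop optimizes something else entirely.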

There is computer-generated art. For example, deep dream (gallery) is a way to generate images from deep learning neural networks. This would seem to be more human and less goal-oriented, but in fact people are still directing. The authors described the goal at a high level: “Whatever you see there, I want more of it!” Depending on which layer of the network is asked, the features amplified might be low level (like lines, shapes, edges, colors, see the addaxes below) or higher level (like objects).

Original photo of addaxes by Zachi Evenor, processed photo from Google

Expert systems are a way to make decisions using if-then rules on a formally expressed body of knowledge. They were somewhat popular in the 1980s. These are a type of “Good Old Fashioned Artificial Intelligence” (GOFAI), a term for AI based on manipulating symbols.

Another common difference between human and artificial intelligence is that humans learn over a long time, while AI is often retrained from the beginning for each problem. This difference, however, is being narrowed. Transfer learning is the process of training a model, and then using or tweaking the model for use in a different context. This is industry practice in computer vision, where deep learning neural networks are often trained using features from previous networks.

One interesting research project in long-term learning is NELL, Never-Ending Language Learning. NELL crawls the web collecting text, and tries to extract beliefs (facts) like “airtran is a transportation system”, along with a confidence. It’s been crawling since 2010, and as of July 2017 has accumulated over 117 million candidate beliefs, of which 3.8 million are high-confidence (at least 0.9 of 1.0).

In every case above, humans not only specify a goal, but have to specify it unambiguously, often even formally with mathematics.

Types of human intelligence

What are the types of human intelligence? It’s hard to even come up with a list. Psychologists have been studying this for decades. Philosophers have been wrestling with it for millennia.

IQ (the Intelligence Quotient) is measured with verbal and visual tests, sometimes abstract. It is predicated on the idea that there is a general intelligence (sometimes called the “g factor”) common to all cognitive ability. This idea is not accepted by everyone, and IQ itself is hotly debated. For example, some believe that people with the same latent ability but from different demographic groups might be measured differently, called Differential Item Functioning, or simply measurement bias.

People describe fluid intelligence (the ability to solve novel problems) and crystallized intelligence (the ability to use knowledge and experience).

The concept of emotional intelligence shows up in the popular press: the ability of a person to recognize their own emotions and those of others, and use emotional thinking to guide behavior. It is unclear how accepted this is by the academic community.

More widely accepted are the Big Five personality traits: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. This is not intelligence (or is it?), but illustrates a strong difference with computer intelligence. “Personality” is a set of stable traits or behavior patterns that predict a person’s behavior. What is the personality of an artificial intelligence? The notion doesn’t seem to apply.

With humor, art, or the search for meaning, we get farther and farther from well-defined problems, yet closer and closer to humanity.

Risks of artificial intelligence

Can artificial intelligence surpass human intelligence?

One risk that captures the popular press is The Singularity. The writer and mathematician Vernor Vinge gives a compelling description in an essay from 1993: “Within thirty years, we will have the technological means to create superhuman intelligence. Shortly after, the human era will be ended.”

There are at least two ways to interpret this risk. The common way is that some magical critical mass will cause a phase change in machine intelligence. I’ve never understood this argument. “More” doesn’t mean “different.” The argument is something like “As we mimic the human brain closely, something near-human (or super-human) will happen.” Maybe?

Yes, the availability of lots of computing power and lots of data has resulted in a phase change in AI results. Speech recognition, automatic translation, computer vision, and other problem domains have been completely transformed. In 2012, when researchers at Toronto re-applied neural networks to computer vision, the error rates on a well-known dataset started dropping fast, until within a few years the computers were beating the humans. But computers were doing the same well-defined task as before, only better.

ImageNet Large Scale Visual Recognition Challenge error rates. 0.1 is a 10% error rate. After neural networks were re-applied in 2012, error rates dropped fast. They beat Andrej Karpathy by 2015.

The more compelling observation is: “The chance of a singularity might be small, but the consequences are so serious we should think carefully.”

Another way to interpret the risk of the singularity is that the entire system will have a phase change. That system includes computers, technology, networks, and humans setting goals. This seems more correct and entirely plausible. The Internet was a phase change, as were mobile phones. Under this interpretation, there are plenty of plausible risks of AI.

One plausible risk is algorithmic bias. If algorithms are involved in important decisions, we’d like them to be trustworthy. (In a previous post, we discussed how to measure algorithmic fairness.)

Tay, a Microsoft chatbot, was taught by Twitter to be racist and woman-hating within 24 hours. But Tay didn’t really understand anything; it just “learned” and mimicked.

Tay, the offensive chatbot.

Amazon’s facial recognition software, Rekognition, falsely matched 28 U.S. Congresspeople (mostly people of color) with known criminals. Amazon’s response was that the ACLU (who conducted the test) used an unreliable cutoff of only 80 percent confidence. (They recommended 95 percent.)

MIT researcher Joy Buolamwini showed that gender identification error rates in several computer vision systems were much higher for people with dark skin.

All of these untrustworthy results arise at least partially from the training data. In Tay’s case, it was deliberately fed hateful data by Twitter users. In the computer vision systems, there may well have been less data for people of color.

Another plausible risk is automation. As artificial intelligence becomes more cost-efficient at solving problems like driving cars or weeding farm plots, the people who used to do those tasks may be thrown out of work. This is the risk of AI plus capitalism: businesses will each try to be cost effective. We can only address this at a societal level, which makes it very difficult.

One final risk is bad goals, possibly aggravated by single-mindedness. This is memorably illustrated by the paper-clip problem, first described by Nick Bostrom in 2003: “It also seems perfectly possible to have a superintelligence whose sole goal is something completely arbitrary, such as to manufacture as many paperclips as possible, and who would resist with all its might any attempt to alter this goal. For better or worse, artificial intellects need not share our human motivational tendencies.” There is even a web game inspired by this idea.

Understand your goals

How do we address some of the plausible risks above? A complete answer is another full post (or book, or lifetime). But let’s mention one piece: understand the goals you’ve given your AI. Since all AI is simply optimizing a well-defined mathematical function, that is the language you use to say what problem you want to solve.

Does that mean you should start reading up on integrals and gradient descent algorithms? I can feel your eyeballs closing. Not necessarily!

The goals are a negotiation between what your business needs (human language) and how it can be measured and optimized (AI language). You need people to speak to both sides. That is often a business or product owner in collaboration with a data scientist or quantitative researcher.

Let me give an example. Suppose you want to recommend content using a model. You choose to optimize the model to increase engagement with the content, as measured by clicks. Voila, now you understand one reason the Internet is full of clickbait: the goal is wrong. You actually care about more than clicks. Companies modify the goal without touching the AI by trying to filter out content that doesn’t meet policies. That is one reasonable strategy. Another strategy might be to add a heavy penalty to the training dataset if the AI recommends content later found to be against policy. Now we are starting to really think through how our goal affects the AI.
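
Here is one way the penalty idea could look in code. This is a hypothetical sketch: the function name, the reward of 1 per click, and the penalty of 10 are invented for illustration and do not describe any real recommender system.

```python
def training_target(clicked, violates_policy, penalty=10.0):
    """Score one recommendation: reward a click, heavily penalize a policy violation."""
    reward = 1.0 if clicked else 0.0
    if violates_policy:
        reward -= penalty
    return reward


# Hypothetical logged recommendations.
impressions = [
    {"item": "how-to video", "clicked": True, "violates_policy": False},
    {"item": "clickbait", "clicked": True, "violates_policy": True},
    {"item": "long read", "clicked": False, "violates_policy": False},
]

for row in impressions:
    clicks_only = 1.0 if row["clicked"] else 0.0
    with_penalty = training_target(row["clicked"], row["violates_policy"])
    print(f"{row['item']:>12}: clicks-only={clicks_only:+.1f}  with penalty={with_penalty:+.1f}")
```

Under the clicks-only goal the clickbait item looks as good as the how-to video; with the penalty it becomes the worst example in the set.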

This example also explains why content systems can be so jumpy: you click on a video on YouTube, or a pin on Pinterest, or a book on Amazon, and the system immediately recommends a big pile of things that are almost exactly the same. Why? The click is usually measured in the short-term, so the system optimizes for short-term engagement. This is a well-known recommender challenge, centered around mathematically defining a good goal. Perhaps a part of the goal should be whether the recommendation is irritating, or whether there is long-term engagement.

Another example: if your model is accurate, but your dataset or measurements don’t look at under-represented minorities in your business, you may be performing poorly for them. Your goal may really be to be accurate for all sorts of different people.
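
A quick way to check for this is to report accuracy per group as well as overall. The sketch below uses a tiny invented dataset; the group names and the `(group, true label, predicted label)` record layout are assumptions of this example.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical records: (group, true label, predicted label).
records = [("majority", 1, 1)] * 8 + [("minority", 1, 1), ("minority", 1, 0)]

overall = sum(y_true == y_pred for _, y_true, y_pred in records) / len(records)
per_group = accuracy_by_group(records)
print(f"overall accuracy: {overall:.2f}")
for group, accuracy in per_group.items():
    print(f"{group}: {accuracy:.2f}")
```

The overall number looks fine because the larger group dominates it, while the smaller group’s accuracy is much lower.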

A third example: if your model is accurate, but you don’t understand why, that might be a risk for some critical applications, like healthcare or finance. If you have to understand why, you might need to use a human-understandable (“white box”) model, or explanation technology for the model you have. Understandability can be a goal.

Conclusion: we need to understand AI

AI cannot fully replace humans, despite what you read in the popular press. The biggest difference between human and artificial intelligence is that only humans choose goals. So far, AIs do not.

If you can take away one thing about artificial intelligence: understand its goals. Any AI technology has a perfectly well-defined goal, often a labeled dataset. To the extent the definition (or dataset) is flawed, so too will be the results.

One way to understand your AI better is to explain your models. We formed Fiddler Labs to help. Feel free to reach us at info@fiddler.ai.