AI needs a new developer stack!

In today’s world, data has played a huge role in the success of technology giants like Google, Amazon, and Facebook. All of these companies have built massively scalable infrastructure to process data and provide great product experiences for their users. In the last five years, we’ve seen a real emergence of AI as a new technology stack. For example, Facebook built an end-to-end platform called FBLearner that enables an ML engineer or a data scientist to build machine learning pipelines, run lots of experiments, share model architectures and datasets with team members, and scale ML algorithms for billions of Facebook users worldwide. Since its inception, millions of models have been trained on FBLearner, and every day these models answer billions of real-time queries to personalize News Feed, show relevant ads, recommend friend connections, etc.

However, for most other companies, building AI applications remains extremely expensive. This is primarily due to a lack of systems and tools that support end-to-end machine learning (ML) application development — from data preparation and labeling to operationalization and monitoring [1][9][10][11].

The goal of this post is two-fold:

  1. List the challenges with adopting AI successfully: data management, model training, evaluation, deployment, and monitoring;
  2. List the tools I think we need to create to allow developers to meet these challenges: a data-centric IDE with capabilities like explainable recommendations, robust dataset management, model-aware testing, model deployment, measurement, and monitoring capabilities.

Challenges of adopting AI

To build an end-to-end ML application, a data scientist has to work through each stage of the following workflow [3].

End-to-End ML Workflow

A big challenge to building AI applications is that different stages of the workflow require new software abstractions that can accommodate complex interactions with the underlying data used in AI training or prediction. For example:

Data Management requires a data scientist to build and operate systems like Hive, Hadoop, Airflow, Kafka, and Spark to assemble data from different tables, clean datasets, procure labeled data, construct features, and make them ready for training. In most companies, data scientists rely on their data engineering teams to maintain this infrastructure and to help build the ETL pipelines that get feature datasets ready.
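
For illustration, here is a minimal sketch of such a feature-assembly step in PySpark; the table names, columns, and paths are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature_etl").getOrCreate()

users = spark.read.parquet("/warehouse/users")    # hypothetical table
events = spark.read.parquet("/warehouse/events")  # hypothetical table

# Aggregate raw events into per-user features and join with user profiles.
features = (
    events.groupBy("user_id")
          .agg(F.count("*").alias("event_count"),
               F.avg("session_length").alias("avg_session_length"))
          .join(users.select("user_id", "country", "signup_date"), "user_id")
          .fillna({"avg_session_length": 0.0})  # clean missing aggregates
)

# Persist as a training-ready feature table.
features.write.mode("overwrite").parquet("/warehouse/features/v1")
```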

Training models is more of an art than a science. It requires understanding which features work and which modeling algorithms are suitable for the problem at hand. Although there are libraries like PyTorch, TensorFlow, and Scikit-Learn, there is still a lot of manual work in feature selection, parameter optimization, and experimentation.
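
A minimal sketch of what this manual work often looks like in practice, using scikit-learn and assuming a hypothetical feature matrix X and labels y:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),  # keep the top-k features
    ("clf", LogisticRegression(max_iter=1000)),
])

# Jointly search over the number of features and regularization strength.
search = GridSearchCV(
    pipeline,
    param_grid={"select__k": [10, 20, 50], "clf__C": [0.01, 0.1, 1.0]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)  # X, y assumed to exist
print(search.best_params_, search.best_score_)
```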

Model evaluation is often performed as a team activity, since it requires other people to review model performance across a variety of metrics (AUC, ROC curves, precision/recall), ensure that the model is well calibrated, and so on. At Facebook, this was built into FBLearner: every model created on the platform would get an auto-generated dashboard showing all these statistics.
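
The metrics behind such a dashboard are straightforward to compute; a minimal sketch with scikit-learn, assuming held-out labels y_test and predicted scores y_score:

```python
from sklearn.calibration import calibration_curve
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

auc = roc_auc_score(y_test, y_score)
fpr, tpr, _ = roc_curve(y_test, y_score)
precision, recall, _ = precision_recall_curve(y_test, y_score)

# Calibration: how closely predicted probabilities track observed rates.
frac_positive, mean_predicted = calibration_curve(y_test, y_score, n_bins=10)

print(f"AUC: {auc:.3f}")
```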

Deploying models requires data scientists to first pick the optimal model and make it ready for production. If the model will impact the product’s business metrics and be consumed in real time, we deploy it to only a small percentage of traffic and run an A/B test against the existing production model. Once the A/B test is positive in terms of business metrics, the model gets rolled out to 100% of production traffic.

Inference is closely tied to deployment. There are two ways a model can be made available for consumption to make predictions (a minimal sketch of both follows the list below).

  • batch inference, where a data pipeline is built to scan through a dataset and make predictions on each record or a batch of records.
  • realtime inference, where a micro-service hosts the model and makes predictions in a low-latency manner.
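
A minimal sketch of both modes, assuming a pickled scikit-learn-style model with a predict_proba method (paths and ports are illustrative):

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

with open("model.pkl", "rb") as f:  # hypothetical model artifact
    model = pickle.load(f)

# Batch inference: scan a dataset in chunks and score each batch.
def batch_inference(path: str, out_path: str, chunk_size: int = 10_000):
    scored = []
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        chunk["score"] = model.predict_proba(chunk)[:, 1]
        scored.append(chunk)
    pd.concat(scored).to_csv(out_path, index=False)

# Realtime inference: a micro-service that scores one record per request.
app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = pd.DataFrame([request.get_json()])
    return jsonify({"score": float(model.predict_proba(features)[:, 1][0])})

if __name__ == "__main__":
    app.run(port=8080)
```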

Monitoring predictions is very important because, unlike traditional applications, model performance is non-deterministic and depends on various factors such as seasonality, new trends in user behavior, and data pipeline unreliability leading to broken features. For example, a perfectly functioning ads model might need to be updated when a new holiday season arrives, and a model trained to show content recommendations in the US may not do very well for users signing up internationally. There is also a need for alerts and notifications to detect model degradation quickly and take action.
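
One simple monitoring check is a statistical test for distribution drift in a feature; a minimal sketch using a two-sample Kolmogorov-Smirnov test (the alert threshold is illustrative):

```python
from scipy.stats import ks_2samp

def check_feature_drift(train_values, live_values, alpha=0.01):
    """Compare a feature's live distribution against its training baseline."""
    stat, p_value = ks_2samp(train_values, live_values)
    if p_value < alpha:
        # In production this would fire an alert or trigger retraining.
        print(f"Drift detected (KS={stat:.3f}, p={p_value:.4f})")
    return p_value
```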

As we can see, the workflow to build machine learning models is significantly different from building general software applications. If models are becoming first-class citizens in the modern enterprise stack, they need better tools. As Tesla’s Director of AI Andrej Karpathy succinctly puts it, AI is Software 2.0 and it needs new tools [2].

If we compare the stack of Software 1.0 with 2.0, I claim we require transformational thinking to build the new developer stack for AI.

We need new tools for AI engineering

In Software 1.0, we have seen a vast amount of tooling built in the past few decades to help developers write code, share it with other developers, get it reviewed, debug it, release it to production, and monitor its performance. If we try to map these tools onto the 2.0 stack, there is a big gap!

What would an ideal Developer Toolkit look like for an AI engineer?

To start with, we need to take a data-first approach as we build this toolkit because, unlike Software 1.0, the fundamental unit of input for 2.0 is data.

Integrated Development Environment (IDE): Traditional IDEs focus on helping developers write code, with features like syntax highlighting, code checkpointing, unit testing, code refactoring, etc.

For machine learning, we need an IDE that allows easy import and exploration of data, and cleaning and massaging of tables. Jupyter notebooks are somewhat useful, but they have their own problems, including a lack of versioning and review tools. A powerful 2.0 IDE would be more data-centric: it would let the data scientist slice and dice data, edit the model architecture via code or a UI, and debug the model on egregious cases where it performs poorly. I see traction in this space, with products like Streamlit [13] reimagining IDEs for ML.

Tools like Git, Jenkins, Puppet, and Docker have been very successful in traditional software development by taking care of continuous integration and deployment of software. When it comes to machine learning, the following steps would constitute the release process.

Model Versioning: As more models get into production, managing their various versions becomes important. Git can be reused for models; however, it won’t scale for large datasets. The reason to version datasets is that to reproduce a model, we need a snapshot of the data the model was trained on. Naive implementations of this could explode the amount of data we’re versioning: think one copy of the dataset per model version. DVC [12], an open-source version control system for ML, is a good start and is gaining momentum.
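
The core idea can be sketched as content-addressed snapshots: store each dataset once, keyed by its hash, and have each model version record the hash of its training data. A minimal sketch (the storage location is hypothetical; tools like DVC do this robustly at scale):

```python
import hashlib
import shutil
from pathlib import Path

SNAPSHOT_DIR = Path(".dataset_snapshots")  # hypothetical storage location

def snapshot_dataset(path: str) -> str:
    """Store a dataset under its content hash; return the hash."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    dest = SNAPSHOT_DIR / digest
    if not dest.exists():          # deduplicate: identical data stored once
        SNAPSHOT_DIR.mkdir(exist_ok=True)
        shutil.copy(path, dest)
    return digest                  # record this hash in the model's metadata
```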

Unit Testing is another important part of the build-and-release cycle. For ML, we need unit tests that catch not only code quality bugs but also data quality bugs.
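
A minimal sketch of such data-quality tests in pytest style, assuming a hypothetical load_features() helper that returns the training DataFrame:

```python
def test_no_missing_labels():
    df = load_features()
    assert df["label"].notna().all()

def test_feature_ranges():
    df = load_features()
    assert df["age"].between(0, 120).all()   # sanity bounds on a feature
    assert (df["loan_amount"] > 0).all()

def test_no_label_leakage():
    df = load_features()
    # A feature that is a near-perfect copy of the label is a data bug.
    assert abs(df["suspect_feature"].corr(df["label"])) < 0.95
```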

Canary Tests are minimal tests to quickly and automatically verify that everything we depend on is ready. We typically run canary tests before other time-consuming tests, and before wasting time investigating the code when the other tests are failing [8]. In machine learning, this means replaying a previous set of examples against the new model and ensuring that it meets a certain minimal set of conditions.
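
A minimal sketch of a model canary, replaying a small golden set against the candidate model (the threshold is illustrative):

```python
from sklearn.metrics import roc_auc_score

def canary_test(candidate_model, golden_X, golden_y, min_auc=0.70):
    """Fail fast if the candidate model breaks minimal conditions."""
    scores = candidate_model.predict_proba(golden_X)[:, 1]
    assert (scores >= 0).all() and (scores <= 1).all(), "invalid scores"
    auc = roc_auc_score(golden_y, scores)
    assert auc >= min_auc, f"canary failed: AUC {auc:.3f} < {min_auc}"
```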

A/B Testing is a method of comparing two versions of an application change to determine which one performs better [7]. For ML, an A/B test is an experiment where two or more variations of a model are exposed to users at random, and statistical analysis is used to determine which variation performs better for a given conversion goal. For example, the dashboard below, from an A/B experiment system my team built at Pinterest, measures click conversion and shows the performance of ML experiments against business metrics like repins and likes. CometML [14] lets data scientists keep track of ML experiments and collaborate with their team members.
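
The statistical analysis at the end of an A/B test can be as simple as a two-proportion z-test on conversions; a minimal sketch with statsmodels (the counts are made up):

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [1_320, 1_455]   # hypothetical conversions: control, treatment
exposures = [50_000, 50_000]   # users exposed to each variant

stat, p_value = proportions_ztest(conversions, exposures)
if p_value < 0.05:
    print(f"Significant difference (z={stat:.2f}, p={p_value:.4f})")
```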

Debugging: One of the main features of an IDE is the ability to debug code and find the exact line where an error occurred. For machine learning, this is a hard problem because models are often opaque, so pinpointing exactly why a particular example was misclassified is difficult. However, if we can understand the relationship between the feature variables and the target variable in a consistent manner, it goes a long way toward debugging models. This is also called model interpretability, an active area of research. At Fiddler, we’re working on a product offering that allows data scientists to debug any kind of model and perform root cause analysis.
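
Outside of a full debugging product, one common open-source approach is to inspect per-prediction feature attributions; a minimal sketch with SHAP, assuming a tree-based model and a feature DataFrame X (misclassified_idx is a hypothetical index of a bad prediction):

```python
import shap

explainer = shap.TreeExplainer(model)
# Note: for some model types shap_values returns one array per class.
shap_values = explainer.shap_values(X.iloc[[misclassified_idx]])

# Rank features by how strongly they pushed this particular prediction.
contributions = sorted(
    zip(X.columns, shap_values[0]), key=lambda kv: abs(kv[1]), reverse=True
)
for feature, value in contributions[:5]:
    print(f"{feature}: {value:+.3f}")
```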

Profiling: Performance analysis is an important part of the SDLC in 1.0, and profiling tools allow engineers to figure out where an application is slow and improve it. For models, it is also about improving performance metrics like AUC, log loss, etc. Oftentimes, a model can have a high score on an aggregate metric but perform poorly on certain instances or subsets of the dataset. This is where tools like Manifold [5] can enhance the capabilities of traditional performance analysis.
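
A minimal sketch of this kind of slice-level analysis, assuming a scored DataFrame with label and score columns:

```python
from sklearn.metrics import roc_auc_score

def auc_by_slice(df, slice_col, label_col="label", score_col="score"):
    """An aggregate AUC can hide weak subsets; compute the metric per slice."""
    return {
        value: roc_auc_score(group[label_col], group[score_col])
        for value, group in df.groupby(slice_col)
        if group[label_col].nunique() == 2   # AUC needs both classes present
    }

# e.g. auc_by_slice(scored_df, "country") might reveal a weak segment.
```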

Monitoring: While application monitoring might superficially seem similar to model monitoring, and could actually be a good place to start, we need to track a different class of metrics for machine learning. Monitoring is crucial for models that automatically incorporate new data in a continual or ongoing fashion at training time, and is always needed for models that serve predictions in an on-demand fashion. We can categorize monitoring into four broad classes:

  • Feature Monitoring: This is to ensure that features are stable over time, that certain data invariants are upheld, and that privacy checks can be made, with continuous insight into statistics like feature correlations (a minimal drift-check sketch follows the figure below).
  • Model Ops Monitoring: Staleness, regressions in serving latency, throughput, RAM usage, etc.
  • Model Performance Monitoring: Regressions in prediction quality at inference time.
  • Model Bias Monitoring: Unknown introductions of bias both direct and latent.
Annual Income vs Probability of defaulting on a granted Loan in a Credit Risk Model trained on a public lending dataset.
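
As a concrete illustration of feature monitoring, here is a minimal sketch of the Population Stability Index (PSI), a common drift statistic, comparing a feature’s live distribution against its training baseline (the function name and threshold are illustrative):

```python
import numpy as np

def psi(baseline, live, bins=10):
    """Population Stability Index between a baseline and a live sample."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    # Clip to avoid log(0) when a bin is empty in either sample.
    base_pct = np.clip(base_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - base_pct) * np.log(live_pct / base_pct)))

# A common rule of thumb: PSI above ~0.2 suggests the feature has shifted
# enough to warrant investigation or retraining.
```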

Conclusion

I walked through 1) some challenges to successfully deploying AI (data management, model training, evaluation, deployment, and monitoring), and 2) some tools I propose we need to meet these challenges (a data-centric IDE with capabilities like slicing and dicing of data, robust dataset management, model-aware testing, and model deployment, measurement, and monitoring). If you are interested in using some of these tools, we’re working on them at Fiddler Labs. And if you’re interested in building these tools, we would love to hear from you at https://angel.co/fiddler-labs.

References

  1. https://arxiv.org/pdf/1705.07538.pdf
  2. https://medium.com/@karpathy/software-2-0-a64152b37c35
  3. https://towardsdatascience.com/technology-fridays-how-michelangelo-horovod-and-pyro-are-helping-build-machine-learning-at-uber-28f49fea55a6
  4. https://christophm.github.io/interpretable-ml-book/
  5. https://eng.uber.com/manifold/
  6. http://www.fiddler.ai
  7. https://www.optimizely.com/optimization-glossary/ab-testing/
  8. https://dzone.com/articles/canary-tests
  9. https://research.fb.com/wp-content/uploads/2017/12/hpca-2018-facebook.pdf
  10. http://stevenwhang.com/tfx_paper.pdf
  11. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
  12. https://dvc.org/
  13. https://streamlit.io/
  14. https://www.comet.ml/

A gentle introduction to GA2Ms, a white box model

This post is a gentle introduction to a white box machine learning model called a GA2M: a Generalized Additive Model (GAM) with interaction terms.

We’ll walk through:

  • What is a white box model, and why would you want one?
  • A classic example white box model: logistic regression
  • What is a GAM, and why would you want one?
  • What is a GA2M, and why would you want one?
  • When should you choose a GAM, a GA2M, or something else?

The purpose of all these machine learning models is to make a prediction towards a goal specified by a human. Think of a model that can predict loan default, or the presence of someone’s face in a picture.

The short story: A generalized additive model (GAM) is a white box model that is more flexible than logistic regression, but still interpretable. A GA2M is a GAM with interaction terms, which allows it to be more flexible still, but with a more complicated interpretation. GAMs and GA2Ms are an intriguing addition to your toolbox, interpretable at the expense of not fitting every kind of data. A picture:

For more about what that all means, read on.

White box models

The term “white box” comes from software engineering. It means software whose internals you can view, as opposed to a “black box” whose internals you cannot. By this definition, a neural network could be a white box model if you can see the weights:

However, by white box people really mean something they can understand. A white box model is a model whose internals a person can see and reason about. This is subjective, but most people would agree the weights shown above don’t tell us how the model works in a way we could usefully describe, or use to predict what the model will do in the future.

Compare the picture above to this one about risk of death from pneumonia by age from [1]:

Now that isn’t a whole model. Rather, it’s the impact of one feature (age) on the risk score. The green lines are error bars (±1 standard deviation in 100 rounds of bagging). The red line in the middle of them is the best estimate. In the paper, they observe:

  1. Risk score is flat until about age 50. Risk score here is negative, meaning less risk of death than the average in the dataset.
  2. Risk score rises sharply at 65. This could be due to retirement. In the future, it might be interesting to gather data about retirement.
  3. The error bars are narrowest in ages 66-85. Perhaps that is where the most data is.
  4. Risk score rises again at 85. The error bars also widen again. Maybe this jump is not real.
  5. Risk score drops above 100. This may be due to lack of data, or something else. In the paper, they suggested one might wish to “fix” this region of the model by changing it to predict at the same level as ages 85-100 instead of dropping. This fix is using domain knowledge (“risk of pneumonia likely doesn’t go down after age 85”) to address possible model artifacts.
  6. Risk score between 66 and 85 is relatively flat.

All this from one graph of one model feature. There are facts, like the shape of the graph, and then speculation about why the graph might behave that way. The facts are useful to understand the data. The speculation cannot be answered by any tool, but may be useful to suggest further actions, like collecting new features (say, about retirement) or new instances (like points below age 50 or above 100), or new analyses (like looking carefully at data instances around ages 85-86 for differences).

These aren’t simulations of what the model would do. These are the internals of the model itself, so that graph is accurately describing the exact effect of age on risk score. There are 55 other components to this model, but each can be examined and reasoned about.

This is the power of a white box model.

This example also shows the dangers. By seeing everything, we may believe we understand everything, and speculate wildly or “fix” inappropriately. As always, we have to exercise judgment to use data properly.

In summary: make a white box model to

  • learn about your model, not from simulations or approximations, but the actual internals
  • improve your model, by giving you ideas of directions to pursue
  • “fix” your model, i.e., align it with your intuition or domain knowledge

One final possibility: regulations dictate that you need to fully describe your model. In that case, it could be useful to have human-readable internals for reference.

Here are some examples of white box and black box models:

White box models:
  • Logistic regression
  • GAMs
  • GA2Ms
  • Decision trees (short and few trees)

Black box models:
  • Neural networks (including deep learning)
  • Boosted trees and random forests (many trees)
  • Support vector machines

Now let’s walk through three specific white box models.

A classic: logistic regression

Logistic regression was developed in the 1800s and re-popularized in the 1900s. It’s been around for a long time for many reasons: it solves a common problem (predicting the probability of an event), and it’s interpretable. Let’s explore what that means. Here is the logistic equation defining the model:
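
log(p/(1-p)) = 𝛽0 + 𝛽1x1 + 𝛽2x2 + … + 𝛽mxm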

There are three types of variables in this model equation:

  • p is the probability of the event we’re predicting, for example defaulting on a loan.
  • The x’s are features, for example the loan amount.
  • The 𝛽’s (betas) are the coefficients, which we fit using a computer.

The betas are fit once to the entire dataset. The x’s are different for each instance in the dataset. The p represents an aggregate of dataset behavior: any dataset instance either happened (1) or didn’t (0), but in aggregate, we’d like the right-hand side and the left-hand side to be as close as possible.

The “log(p/(1-p))” term is the log odds, also called the “logit” of the probability. The odds are (probability the event happens)/(probability the event doesn’t happen), or p/(1-p). Applying the natural logarithm to the odds translates p, which ranges from 0 to 1, into a quantity that can range from -∞ to +∞, suitable for a linear model.

This model is linear, but for the log odds. That is, the right-hand side is a linear equation, but it is fit to the log odds, not the probability of an event.

This model is interpretable as follows: a unit increase in xi corresponds to an increase of 𝛽i in the log odds.

For example, suppose we’re predicting the probability of loan default, and our model has a coefficient 𝛽1 = 0.15 for the loan amount feature x1. That means a unit increase in the feature corresponds to a log odds increase of 0.15 in default. We can take the natural exponent to get the odds ratio: exp(0.15) ≈ 1.1618. That means:

for this model, a unit increase (of say, a thousand dollars) in loan amount corresponds to a 16% increase in the odds of loan default, holding all other factors constant.

This statement is what people mean when they say logistic regression is interpretable.

To summarize why logistic regression is a white box model:

  • The input response terms (𝛽ixi terms) can be interpreted independently of each other
  • The terms are in interpretable units: the coefficients (betas) are in units of log odds.

So why would we use anything other than the friendly, venerable model of logistic regression?

Well, if the features and log odds don’t have a linear relationship, this model won’t fit well. I always think of trying to fit a line to a parabola:

If you have non-linear data (the black parabola), a linear fit (the blue dashed line) will never be great. No line fits the curve.

Generalized Additive Models (GAMs)

Generalized Additive Models (GAMs) were developed in the 1980s by Hastie and Tibshirani. (See also chapter 9 of their book “The Elements of Statistical Learning.”) Here is the equation defining the model:
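
g(E(Y)) = f1(x1) + f2(x2) + … + fm(xm)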

This equation is quite similar to logistic regression. It has the same three types of elements:

  • E(Y) is an aggregate of dataset behavior, like the “p” in the equation above. In fact, it may well be the probability of an event, the same p.
  • g(.) is a link function, like the logit (or log odds) from the logistic equation above.
  • fi(xi) is a term for each dataset instance feature x1,…,xm.

The big difference is instead of a linear term 𝛽ixi for a feature, now we have a function fi(xi). In their book, Hastie and Tibshirani specify a “smooth” function like a cubic spline. Lou et al. [2] looked at other functions for the fi, which they call “shape functions.”

A GAM also has white box features:

  • The input response terms (f(xi) terms) can be interpreted independently of each other
  • The terms are in interpretable units. For the logit link function, these are log odds.

Now a term, instead of being a constant (beta), is a function, so instead of reporting the log odds as a number, we visualize it with a graph. In fact, the graph above of pneumonia risk of death by age is one term (shape function) in a GAM.

So why would we use anything other than a GAM? It’s already flexible and interpretable. Same reason as before: it might not be accurate enough. In particular, we’ve assumed that each feature response can be modeled with its own function, independent of the others.

But what if there are interactions between the features? Several black box models (boosted trees, neural networks) can model interaction terms. Let us walk through a white box model that also can: GA2Ms.

GAMs with interaction terms (GA2Ms)

GA2Ms were investigated in 2013 by Lou et al. [3]. The authors pronounce the name letter by letter (“gee ay two em”), but in-house we’ve taken to calling them “interaction GAMs” because it’s more pronounceable. Here is the model equation:
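
g(E(Y)) = Σi fi(xi) + Σij fij(xi, xj)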

This equation is quite similar to the GAM equation from the previous section, except it adds functions that can account for two feature variables at once, i.e. interaction terms.

Microsoft just released a library, InterpretML, that implements GA2Ms in Python. In that library, they are called “Explainable Boosting Machines.”
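
A minimal sketch of fitting one with that library, assuming a feature DataFrame X and binary labels y (e.g. loan default); the interactions count is illustrative:

```python
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(interactions=10)  # N pairwise terms
ebm.fit(X, y)

# Global explanations: one shape-function graph per feature, plus one
# heatmap per interaction term, like the example discussed below.
show(ebm.explain_global())
```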

Lou et al. say these are still white box models because the “shape function” for an interaction term is a heatmap: the two features run along the x and y axes, and the color shows the function response. Here is an example from Microsoft’s library, fit to predict loan default on a dataset of loan performance from Lending Club:

For this example graph:

  • The upper right corner is the most red. That means the probability of default goes up the most when dti (debt to income ratio) and fico_range_midpoint (the FICO credit score) are both high.
  • The left strip is also red, but turns blue near the bottom. That means that very low dti is usually bad, except if fico_range_midpoint is also low.

This particular heatmap is hard to reason about. It likely shows only the interaction effect, without the single-feature terms. So it could be that the probability of default overall isn’t higher at high dti and high fico, but rather just higher than the primary effects predict by themselves. To investigate further, we could look at some examples around the borders. But for this blog post, we’ll skip the deep dive.

In practice, this library fits all single-feature functions, then N interaction terms, where you pick N. It is not easy to pick N. The interaction terms are worthwhile if they add enough accuracy to be worth the extra complexity of staring at heatmaps to interpret them. That is a judgment call that depends on your business situation.

When should we use GAMs or GA2Ms?

To perform machine learning, first pick a goal. Then pick a technology that will best use your data to meet the goal. There are thousands of books and millions of papers on that subject. But here is a drastically simplified way to think about where GA2Ms fit among possible model technologies: they are on a spectrum from interpretability to modeling feature interactions.

  • Use GAMs if they are accurate enough. They give the advantages of a white box model: separable terms with interpretable units.
  • Use GA2Ms if they are significantly more accurate than GAMs, especially if you believe from your domain knowledge that there are real feature interactions, but they are not too complex. This also gives the advantages of a white box model, with more effort to interpret.
  • Try boosted trees (xgboost or lightgbm) if you don’t know a lot about the data, since it is quite robust to quirks in data. These are black box models.
  • When features interact highly with each other, like pixels in images or the context in audio, you may well need neural networks or something else that can capture complex interactions. These are deeply black box.

In all cases, you may well need domain-specific data preprocessing, like squaring images, or standardizing features (subtracting the mean and dividing by the standard deviation). That is a topic for another day.

Now hopefully the diagram we started with makes more sense.

At Fiddler Labs, we help you explain your AI. Email us at info@fiddler.ai.

References

  1. Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-Day Readmission.” In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721–1730. KDD ’15. New York, NY, USA: ACM, 2015. https://doi.org/10.1145/2783258.2788613.
  2. Lou, Yin, Rich Caruana, and Johannes Gehrke. “Intelligible Models for Classification and Regression.” In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150–158. KDD ’12. New York, NY, USA: ACM, 2012. https://doi.org/10.1145/2339530.2339556.
  3. Lou, Yin, Rich Caruana, Giles Hooker, and Johannes Gehrke. “Accurate Intelligible Models with Pairwise Interactions.” In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’13. New York, NY, USA: ACM, 2013. https://www.microsoft.com/en-us/research/publication/accurate-intelligible-models-pairwise-interactions/.