How and Why to Interpret Black Box Models


Grant Fleming

Date Published:
March 27, 2020

Demand for data science services continues to accelerate, which has fueled the rapid development of ever more complex models. That complexity has contributed to the poor application of models and thus to controversy surrounding the true value of data science. It is vital for us as data scientists to ensure that, while our models continue to improve in performance, we can also interpret how they function, and thereby diagnose any harms that they might cause through biased or unfair predictions.


Top data science models now exceed doctors’ performance at detecting multiple medical issues, surpass human reading comprehension, and generate photorealistic images of imaginary people. Model performance will keep improving, and the scope of solvable problems will widen, as corporations and research groups continue to expend vast resources on developing innovative models and techniques. This relentless progress, along with the associated increases in model complexity, is a primary force shaping the future of data science.


Progress in data science has also led to controversy, with new reports of governments and companies intentionally deploying harmful models appearing regularly. While most data science practitioners operate with good intentions, that is not enough to avoid the unintended consequences of bad modeling practices. Misapplied models within healthcare, the legal system, hiring processes, and home loan offerings have harmed the people and organizations that they were built to serve. Such cases have understandably led to calls for stronger regulation around algorithmic data collection, transparency, and fairness. If relentless progress is one important force shaping data science’s future, then another is the increased public skepticism of data science and demands for increased regulation of it.

If data science continues to grow in popularity and the models deployed become ever more complex, how can the industry avoid further model misapplications and comply with current and future regulations? How can we as data scientists make sure that we are doing data science ethically by balancing high modeling performance with accountability to our clients and wider society? Employing interpretability methods is one way to better explain how our models function and diagnose issues of bias or fairness that might otherwise go undetected.

Data science practitioners must strive to ethically balance model performance with the ability to interpret how a model’s predictions are generated in order to make their models more accountable.

Popular Ways to Explain Black Box Models

The most effective modeling approaches today are “black boxes”: models whose mathematical behavior is too complicated for us to directly interpret the effect that individual input features have on the model’s output. Fortunately, several interpretability methods exist that can be layered on top of any black box model to approximate how:

  1. Individual features impact predictions globally across the model (global interpretability)
  2. Individual predictions are locally influenced by feature values (local interpretability)

Let’s survey the most used methods in these two categories.

Global Interpretability Methods

As a rule, global interpretability methods are model agnostic: they can work with any model. They identify important features in models by calculating which ones most impact modeling error when modified. Permutation feature importance, for instance, identifies a model’s key features by randomly shuffling the values of each feature and ranking the features (in descending order) by how much the modeling error increases as a consequence. Figure 1 shows that, when predicting home price within the Ames Housing dataset, the most important input features are main living area square footage (gr_liv_area), basement square footage (total_bsmt_sf), and garage square footage (garage_area).
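The shuffling procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind Figure 1: the data and the `toy_model` function are made-up stand-ins for a fitted black box, and mean squared error is assumed as the modeling error.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Hypothetical toy data: feature 0 matters most, feature 1 a little, feature 2 not at all.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1]

def toy_model(X):
    # Stand-in for a fitted black box model (an assumption for this sketch).
    return 5.0 * X[:, 0] + 1.0 * X[:, 1]

importances = permutation_importance(toy_model, X, y)
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
print(ranking, importances)
```

In practice, a maintained implementation such as scikit-learn’s `sklearn.inspection.permutation_importance` is likely a better choice than hand-rolled code.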

Unfortunately, feature importance methods don’t explain how a feature contributes to the model’s predictions. They can’t identify cases where a feature’s values have a non-linear impact on a model’s predictions or show whether increased feature values have a positive or negative effect on the output. For these cases, practitioners can use Individual Conditional Expectation (ICE) plots and Partial Dependence Plots (PDPs). Figure 2 demonstrates how changing values of square footage affect the sale price of a home. Each black line represents an ICE plot for an individual observation. Only one point on each line comes from real data; the rest are created by changing the input feature to be the associated value on the x-axis and calculating the new prediction. We can tell by the curvature of the ICE plots that the feature (square footage) has a non-linear impact on the model’s predictions (home price). Because all of the ICE plots in Figure 2 show roughly the same degree of curvature, we can also see that there is little interaction between square footage and the other input features.

Figure 2: ICE plots and PDP of the impact of square footage (x-axis) on the sale price of homes (y-axis). The PDP (red line) is the average of all of the ICE plots (black lines). The non-linear relationship between square footage and sale price is clearly shown by both plots.

Datasets with many observations can produce cluttered ICE plots that are hard to read. In these cases, overlaying a PDP (the red line in Figure 2) makes the aggregate trends of the ICE plots easier to interpret. The PDP is a simple average of all of the ICE plots. In Figure 2, it shows that changes in square footage have a larger impact on sale price in a narrow region. Between 1,000 and 3,000 square feet the predicted sale price changes by tens of thousands of dollars, while moving from 3,000 square feet to 6,000 square feet results in almost no price change. Such relationships between the features and the output are hard to discover without global interpretability methods.
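To make the construction of these plots concrete, here is a rough sketch of how ICE curves and a PDP could be computed by hand. The `toy_price_model` function and its numbers are invented for illustration (they are not the Ames Housing model); it simply saturates with square footage so that the curves flatten out at large values.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """Return one row per observation and one column per grid value."""
    curves = np.empty((X.shape[0], len(grid)))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every observation
        curves[:, k] = predict(X_mod)
    return curves

def toy_price_model(X):
    # Hypothetical black box: price saturates with square footage (column 0)
    # and rises linearly with a second feature (column 1).
    return 100_000 + 150_000 * np.tanh(X[:, 0] / 2_000) + 10_000 * X[:, 1]

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(500, 6_000, size=200),  # square footage
    rng.integers(1, 5, size=200),       # some other feature
])
grid = np.linspace(500, 6_000, 12)      # 500, 1000, ..., 6000

ice = ice_curves(toy_price_model, X, feature=0, grid=grid)
pdp = ice.mean(axis=0)                  # the PDP is the average of the ICE curves
print(pdp.round(-3))
```

Because this toy model has no interaction between its two features, every ICE curve has the same shape and differs only by a vertical shift, which is the pattern the text describes for Figure 2.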

Local Interpretability Methods

Global interpretability methods help us understand how our models behave generally. However, they are not capable of producing the justifications behind individual predictions required by regulations like GDPR, or by clients in cases where erroneous predictions are especially harmful or costly (e.g., falsely accusing someone of fraud). To provide more precise explanations for an individual observation, practitioners should use methods capable of explaining the local contribution of feature values at a given spot in feature space rather than methods that approximate global feature contributions across all observations. Methods such as Local Interpretable Model-agnostic Explanations (LIME) and Shapley values can generate these local feature contributions.

Local interpretability methods can be divided into approximate methods and exact methods. The most popular of the approximate methods, LIME, assigns local feature contributions to specific observations by linearly approximating the behavior of the black box model within a small region around an observation of interest. LIME describes the black box model’s behavior within these regions by first generating artificial observations around the one of interest to ensure that the region is populated by observations with similar feature values. Afterwards, an interpretable model (usually linear or logistic regression) is fit using all of the observations within the region as input. Finally, the LIME model’s fitted coefficients are used as approximate feature contributions for the underlying black box model within that region. Figure 3 shows an example of the LIME outputs for observation 122 of the Ames Housing dataset. Each of the horizontal bars shows how much a feature value contributed to the LIME model’s output for that observation. The working assumption is that these feature contributions approximate those of the underlying black box for observations with similar feature values.
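The three steps above (generate nearby observations, fit an interpretable model, read off its coefficients) can be sketched as follows. This is a simplified, LIME-style local surrogate and not the `lime` package itself: the Gaussian perturbations, the exponential proximity kernel, and the `black_box` function are all assumptions made for illustration.

```python
import numpy as np

def local_linear_surrogate(predict, x0, scale, n_samples=2000,
                           kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model around x0.

    Returns (intercept, slopes); the slopes act as local feature contributions.
    """
    rng = np.random.default_rng(seed)
    # 1. Generate artificial observations around the one of interest.
    Z = x0 + rng.normal(size=(n_samples, x0.size)) * scale
    # 2. Weight each artificial observation by its closeness to x0.
    dist = np.linalg.norm((Z - x0) / scale, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 3. Weighted least squares on the black box predictions.
    A = np.column_stack([np.ones(n_samples), Z - x0])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], predict(Z) * sw, rcond=None)
    return coef[0], coef[1:]

def black_box(X):
    # Hypothetical non-linear model standing in for a real black box.
    return X[:, 0] ** 2 + 3.0 * X[:, 1]

x0 = np.array([2.0, 1.0])
intercept, slopes = local_linear_surrogate(black_box, x0, scale=np.array([0.1, 0.1]))
print(intercept, slopes)
```

Near `x0`, the slope for the first feature should sit close to the black box’s local derivative (about 4 here) even though the model is quadratic overall. That is the sense in which the surrogate is only trusted for observations with similar feature values.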

The exact methods for calculating local feature contributions arose out of a game theory concept called the Shapley value. A feature value’s Shapley value is its average marginal contribution to a specific prediction, taken across all possible combinations of the other features; together, the Shapley values for an observation sum to the difference between its prediction and the average prediction. Because computing Shapley values is very expensive when the input dataset has many features, sampling variants like SHAP are often used to generate reasonable approximations of Shapley values while using much less compute time. Similar to Figure 3, Figure 4 illustrates a plot of feature attributions for observation 122 of the Ames Housing dataset, though I employed a different underlying model here (Random Forest) to provide a greater range of feature contributions. Interestingly, the garage_area variable, which had the second highest positive feature contribution according to LIME, now has the most negative feature contribution according to Shapley values. I will publish a blog post in the future that will explore cases such as this, where interpretability methods appear contradictory.
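For a handful of features, exact Shapley values can be computed by brute force, which also makes the cost visible: the loop below visits every subset of features. The sketch assumes one common convention for "removing" a feature, namely replacing its value with values drawn from a background dataset and averaging the predictions. The `toy_model`, the observation `x`, and the `background` rows are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values for one observation x (cost grows as 2**n features)."""
    n = len(x)

    def value(subset):
        # Average prediction when features in `subset` take x's values and
        # the remaining features come from each background row.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(row):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * row[0] + row[1] * row[2]

x = [3.0, 4.0, 1.0]
background = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
phi = shapley_values(toy_model, x, background)
average_prediction = sum(toy_model(r) for r in background) / len(background)
print(phi, toy_model(x) - average_prediction)
```

The printed contributions add up to the gap between this observation’s prediction and the average background prediction. With dozens of features the number of subsets becomes unmanageable, which is why approximations such as SHAP are used in practice.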

Interpreting Deep Neural Networks (DNN)

Local interpretability methods like LIME and Shapley values have also proven successful as diagnostic tools in cases where Deep Neural Networks (DNNs) misclassify observations. Figure 5 shows an example of LIME being used with a Convolutional Neural Network (CNN) to explore why an image of a husky might have been misclassified as a wolf. The model is revealed to have picked up on the snow in the background of the image to make its wolf classification, alerting us that the original training data contained that spurious correlation. In this case, the modelers clearly need to augment their training data to make their model more robust.

The issue of DNNs being caught off-guard by insufficient testing can be addressed by several DNN-specific local interpretability methods. In general, each method relies on some function of the gradients that are backpropagated through the network after its predictions are generated. For example, Google Research’s Integrated Gradients method attributes a value to each input feature of the network. Typically, each input feature is an image pixel or an individual word. The value attributed to each input feature is the integral of the gradient of the network’s output with respect to that feature, taken along a straight-line path from a user-defined baseline input (e.g., a black image) to the input of interest, and scaled by the difference between the input and the baseline. Figure 6 illustrates these attributions for image data by making the more important features (pixels with higher attributed values) appear brighter.
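As a rough numerical illustration of Integrated Gradients, the sketch below approximates the path integral with a midpoint sum. To stay self-contained it uses a tiny logistic model with a hand-written gradient in place of a real DNN; the weights `w`, the input `x`, and the all-zeros baseline are assumptions chosen for the example. With a real network, the gradient would come from a deep learning framework’s automatic differentiation.

```python
import numpy as np

# Hypothetical "network": a single logistic unit with fixed weights.
w = np.array([1.5, -2.0, 0.5])

def model(x):
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def model_grad(x):
    # Analytic gradient of the logistic output with respect to the input.
    p = model(x)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, grad_fn, steps=200):
    """Approximate the path integral of gradients from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps           # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)  # straight-line path
    grads = np.array([grad_fn(point) for point in path])
    # Average gradient along the path, scaled by (input - baseline).
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, 0.5, 2.0])
baseline = np.zeros(3)  # plays the role of the "black image"
attributions = integrated_gradients(x, baseline, model_grad)
print(attributions, attributions.sum(), model(x) - model(baseline))
```

A useful sanity check is that the attributions sum (up to numerical error) to the difference between the model’s output at the input and at the baseline.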


Data scientists will continue to develop more complex models as they chase higher model performance. Similarly, individuals, organizations, and governments will only intensify their demands for standards of model accountability. The application of interpretability methods addresses calls for model accountability by improving model transparency and issue diagnostics, ethically serving the sometimes divergent needs of data scientists, regulators, and the public.