In my previous blog, Ensembles and Regularization – Analytics Superheroes, I reviewed the many advantages of model ensembles including removing “noise” variables, generalizing better than single component models, and reducing sensitivity to outliers.
In this article I take a deeper dive into the attributes and applications of model ensembles, and explore potential downsides to provide context for when to use them.
Ensembles Are Sophisticated
Elder Research defines ten increasingly sophisticated levels of analytics in the Ten Levels of Analytics¹. Model ensembling is Level 9, just below causal modeling, because it is a highly regarded tool for improving predictive value.
Still, don’t choose ensembles just because of their sophistication. Have them prove their worth for your modeling project. To paraphrase the star ball player in Jerry Maguire, “Show me the money!”
First, define the performance metrics you will use to evaluate candidate models. Standard measures such as accuracy, sensitivity, and precision may not fully address the business problem you are trying to solve. Other measures, including business and practical constraints, may need to be weighed to establish the winning model. How will the winning model be deployed in real-world operating conditions? Will the model function at a specific workload, or range of workloads, where the most interesting cases are reviewed first? If so, we only care about specific points or ranges along the Receiver Operating Characteristic (ROC) curve (assuming binary classification). Metrics such as area under the curve (AUC) can be misleading in such scenarios because AUC weights performance equally along the whole ROC curve, whereas you only want to evaluate accuracy on the top-scoring (most interesting) cases.
Another consideration is the trade-off between model performance, model interpretability, and model scoring speed. As the data grow, scoring hundreds of decision trees in a model ensemble could become too slow in situations where near-real-time evaluation is required.
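To make the workload point concrete, here is a minimal Python sketch using made-up scores for two hypothetical models. It assumes reviewers can only look at the top three cases. The helper names `precision_at_k` and `auc`, and all of the data, are illustrative assumptions rather than anything taken from a real project.

```python
def precision_at_k(scores, labels, k):
    """Fraction of actual positives among the k highest-scoring cases."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def auc(scores, labels):
    """AUC as the chance a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: 4 interesting cases (1) and 6 uninteresting ones (0).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores_a = [0.80, 0.70, 0.60, 0.50, 0.90, 0.40, 0.30, 0.20, 0.10, 0.05]
scores_b = [0.90, 0.80, 0.70, 0.01, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

for name, scores in (("A", scores_a), ("B", scores_b)):
    print(name, round(auc(scores, labels), 3), round(precision_at_k(scores, labels, 3), 3))
```

With these numbers, model A has the higher AUC (about 0.83 versus 0.75), yet model B places three true positives in the top three while A places only two. If the review workload is three cases, B is the better choice despite its lower AUC.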
One Goal, Many Paths
In the 1990s, Dr. John Elder discovered the marvel of ensembles while working on the challenge of predicting species of bats. He obtained diverse results from base learners (single models that contribute to making predictions within an ensemble) by using completely different types of modeling algorithms, and called the combination “model fusion” or “bundling”. It worked best when the base learners, such as decision trees, linear discriminant analysis (LDA), and neural networks (models 1, 2, and 3 in Figure 1), each performed sufficiently well on its own. John found that the base learners could be somewhat overfit and the ensembling (by averaging or voting, say) would compensate. In fact, the ensemble would, paradoxically, be statistically less complex than even a single component model², ³.
Figure 1. Model bundling
In cases where one or more of the candidate base learners performs poorly relative to the others, bundling a subset of relatively “good” base learners may result in better overall performance. Permutation testing is required to find the best (or nearly optimal) model subset for bundling. One drawback is that custom code may be required to build the bundled model ensemble and to perform the necessary accounting. This includes making sure every base learner uses the same data for training and the same data for out-of-sample evaluation, and generating the overall ensemble score or class for each case via averaging, voting, or some other method.
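The accounting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are made up, the three learner names are hypothetical, and the subset search is a plain exhaustive loop rather than the procedure used in the original work.

```python
from itertools import combinations


def average_scores(score_lists):
    """Average each case's score across base learners."""
    return [sum(case) / len(case) for case in zip(*score_lists)]


def majority_vote(score_lists, threshold=0.5):
    """Each learner votes 1 when its score clears the threshold; ties go to class 0."""
    votes = [[int(s >= threshold) for s in scores] for scores in score_lists]
    return [int(2 * sum(case) > len(case)) for case in zip(*votes)]


def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def best_bundle(named_scores, labels, threshold=0.5):
    """Score every non-empty subset of base learners; return (accuracy, subset)."""
    best = None
    names = sorted(named_scores)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            averaged = average_scores([named_scores[n] for n in subset])
            acc = accuracy([int(s >= threshold) for s in averaged], labels)
            if best is None or acc > best[0]:
                best = (acc, subset)
    return best


# Made-up out-of-sample scores from three hypothetical base learners.
labels = [1, 0, 1, 0, 1, 0]
named_scores = {
    "tree": [0.90, 0.20, 0.35, 0.10, 0.80, 0.65],
    "lda":  [0.60, 0.40, 0.70, 0.65, 0.30, 0.20],
    "net":  [0.30, 0.85, 0.40, 0.90, 0.70, 0.30],  # deliberately weak
}
all_three = list(named_scores.values())
print(accuracy(majority_vote(all_three), labels))
print(best_bundle(named_scores, labels))
```

In this toy data the weak learner drags the three-model vote down to 4 of 6 correct, while averaging just the tree and LDA scores gets all six right. Exhaustive search doubles in cost with each added learner, and the subset should be chosen on validation data kept separate from the final evaluation set, or the reported lift will be optimistic.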
Another way to create diversity in the base learners is to vary the cases (rather than the model type) through bootstrap aggregation (aka bagging). Each model is trained on a different random sample of the training data, drawn with replacement (see “sampling strategy” in Figure 2).
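A minimal sketch of the sampling step, assuming a toy base learner that simply memorizes the mean of its sample. The function names and data are hypothetical, and a real project would plug in an actual model-fitting routine for `fit`.

```python
import random


def bootstrap_sample(cases, rng):
    """Draw len(cases) cases with replacement; also return the out-of-bag cases."""
    n = len(cases)
    picked = [rng.randrange(n) for _ in range(n)]
    chosen = set(picked)
    in_bag = [cases[i] for i in picked]
    out_of_bag = [cases[i] for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def bagged_models(cases, fit, n_models=25, seed=0):
    """Fit one base learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(cases, rng)[0]) for _ in range(n_models)]


# Toy base learner (an assumption): "training" just memorizes the sample mean.
cases = [2.0, 4.0, 5.0, 7.0, 9.0, 11.0, 12.0, 15.0, 18.0, 21.0]
models = bagged_models(cases, fit=lambda sample: sum(sample) / len(sample))
ensemble_prediction = sum(models) / len(models)  # combine by averaging

in_bag, out_of_bag = bootstrap_sample(cases, random.Random(1))
print(len(in_bag), sorted(out_of_bag), round(ensemble_prediction, 2))
```

Because sampling is with replacement, some cases appear more than once in a sample and others not at all. The left-out (out-of-bag) cases give each base learner a small built-in holdout set.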
Figure 2. Ensemble building with the same model type
Modern ensemble techniques rely on more than bagging to generate variability in their base learners. For instance, random forests, an ensemble of decision trees, employ bagging as well as a random selection of candidate input features at each tree split to increase variation in base learners. Even though the structure of individual trees can be quite variable (different variables selected in splits, different parameter values, and variation in the location and depth of splits), the underlying structure of the model never changes. We’re restricted to the same structural entities, decision trees in this example, without knowing whether trees are the best way to represent our problem. By depending on a single model type in our ensemble, we risk building many weak base learners that share common limitations and misclassify the same cases, or subsets of cases with similar properties. Boosting is a technique that can overcome these shortcomings. With boosting, models are built in sequence, and each later model corrects the errors of earlier ones by upweighting the training cases that were previously misclassified. Ensemble strategies are summarized in Table 1.
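The upweighting idea can be illustrated with a single round of an AdaBoost-style update. The text above does not name a specific boosting algorithm, so treating it as AdaBoost is an assumption made only for this sketch, and `reweight` is a hypothetical helper name.

```python
import math


def reweight(weights, misclassified):
    """One AdaBoost-style round: return the model's vote weight and updated case weights."""
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0 < err < 0.5:
        raise ValueError("weighted error must be between 0 and 0.5")
    alpha = 0.5 * math.log((1 - err) / err)  # vote weight of the current model
    raw = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


weights = [0.2] * 5                                 # five cases, equal weight
misclassified = [False, False, True, False, False]  # current model misses the third case
alpha, new_weights = reweight(weights, misclassified)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

The missed case goes from 20% to 50% of the total weight, so the next model in the sequence is pushed hard to get it right. Gradient boosting reaches a similar effect by fitting each new model to the current residuals instead of reweighting cases.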
Table 1. Ensemble building strategies and key attributes
Ensembles Gone Wrong
There are conditions under which ensembles fail or underperform and a single model can outperform an ensemble of that same type. For instance, a gradient boosted model with a bad shrinkage parameter can perform worse on a simple classification problem than a single decision tree. Consider Figure 3, with two classes, red and blue, located inside and outside of a circle, respectively.
Figure 3. Effect of shrinkage hyperparameter on GBM ensemble performance. Decision boundaries shown for a single CART tree (green), random forest model (black, thick line), and GBM (purple) along with the true circular decision boundary (black, thin line) separating the two classes (blue and red points).
The shrinkage hyperparameter controls the impact each additional base learner has on the developing ensemble. With shrinkage set to 0.2, the generalized boosted regression model ensemble (GBM, in purple) fits the true circular boundary as precisely as a random forest with default settings (black line). Both the GBM and the random forest fit the circle better than a single CART decision tree (green line). However, if the shrinkage is 0.01, the GBM decision boundary is nearly square and worse than those of the random forest and the single CART tree.
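One plausible reading, which the text does not confirm, is that the number of trees was held fixed while shrinkage changed, so the low-shrinkage ensemble never took enough steps to fit the boundary. The sketch below illustrates that mechanism on a much simpler problem (1-D regression with single-split stumps, not the 2-D classification task in Figure 3). All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (least squares)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= cut]
        right = [r for x, r in zip(xs, residuals) if x > cut]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, cut, left_mean, right_mean)
    _, best_cut, best_left, best_right = best
    return lambda x: best_left if x <= best_cut else best_right


def boost(xs, ys, n_trees, shrinkage):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + shrinkage * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + shrinkage * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
for shrinkage in (0.2, 0.01):
    model = boost(xs, ys, n_trees=20, shrinkage=shrinkage)
    print(shrinkage, round(mse(model, xs, ys), 4))
```

With only 20 trees, the 0.01 model has barely moved away from the mean prediction, while the 0.2 model fits the curve far more closely. On this toy problem, a larger tree budget lets the low-shrinkage model keep improving, which is why shrinkage and the number of trees are usually tuned together.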
A Recipe For Success
Ensembles can be more challenging to build than single models, but the performance benefits are usually worth the effort. For classification, I recommend starting by building a classification and regression tree (CART) with N-fold (N≥5) cross-validation. The CART model’s out-of-sample performance will provide a valuable baseline for evaluating an ensemble. Next, build a random forest model with default settings, including hyperparameters such as the number of trees and the fraction of input variables selected at each split. Then assess whether the random forest adds predictive value based on the evaluation criteria established before starting your experiment. Be sure to follow best practices in building and evaluating the random forest model, including:
- Dividing your data into training and evaluation sets
- Performing cross-validation on training data
- Comparing performance metrics (accuracy, precision, recall) across all folds and evaluation data
Then ask: is overfit an issue, and if so, how much overfit is acceptable?
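The comparison in this recipe comes down to scoring a baseline and a candidate on exactly the same folds. Below is a minimal harness for that bookkeeping. The two learners are toy stand-ins (a majority-class rule and a one-split rule) used only so the sketch runs on its own; in practice `fit` would wrap your CART and random forest training calls. All names here are hypothetical.

```python
import random


def k_fold_splits(n_cases, k=5, seed=0):
    """Yield (train, test) index lists; every case is held out exactly once."""
    order = list(range(n_cases))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


def cv_accuracy(fit, xs, ys, k=5, seed=0):
    """Out-of-sample accuracy per fold; the same seed gives every model the same folds."""
    scores = []
    for train, test in k_fold_splits(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(sum(model(xs[i]) == ys[i] for i in test) / len(test))
    return scores


def fit_majority(xs, ys):
    """Toy baseline: always predict the most common training class."""
    majority = max(set(ys), key=ys.count)
    return lambda x: majority


def fit_threshold(xs, ys):
    """Toy candidate: one split on a single input, chosen to minimize training errors."""
    values = sorted(set(xs))
    cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = min(cuts, key=lambda c: sum((x > c) != bool(y) for x, y in zip(xs, ys)))
    return lambda x: int(x > best)


# Made-up data: class 1 whenever x is 15 or more.
xs = list(range(30))
ys = [int(x >= 15) for x in xs]
baseline = cv_accuracy(fit_majority, xs, ys)
candidate = cv_accuracy(fit_threshold, xs, ys)
print("baseline per fold: ", [round(s, 2) for s in baseline])
print("candidate per fold:", [round(s, 2) for s in candidate])
```

Looking at the per-fold scores, rather than only their mean, shows how stable each model is. A large gap between fold scores and the held-out evaluation score is a sign of overfit.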
From our tests against data sets posted on the UCI machine learning repository (e.g., annual earnings, housing value, land cover type, contraceptive use, and letter recognition), almost all ensemble variants using decision trees performed well out of the box. Models included random forests and variants (e.g., extraTrees in R) as well as gradient boosted trees. These model ensembles required very little data preparation or cleaning prior to model building. One surprise was that CART (the rpart model in R) performed on par with tree ensembles with minimal tuning. Setting only one parameter, the number of cases in a node required for additional splitting (minsplit), to 20 prevented model overfit. In addition, setting the complexity parameter (cp) to zero resulted in the best cross-validation and evaluation performance. This finding reinforced the notion that regularization (of single models) and ensembles are complementary tools for achieving the goals of model quality and generalizability. To understand whether an ensemble is adding value and predictive power, compare its performance against a tuned and regularized single model.
My experience with model ensembles and regularization has led me to ask several questions:
Frontier 1: What is the interplay between regularized base learners and ensemble-building? Is there any advantage to these analytics superheroes teaming up or should they go solo (or compete against one another for dominance)? Usually, regularization and model ensembling are considered separately. One challenge in ensemble building with regularization is the search space – that is, simultaneously selecting the best base learner regularization parameters along with the ensemble hyperparameters.
Frontier 2: Are there better ways to combine base learners in our ensemble? In this blog, we have focused on combining base learners in static ways; that is, the decision of how to combine models does not depend on the input signal. This is true even in boosted models. A little-discussed alternative is to combine base learners dynamically. In this “mixture of experts” method (Bullinaria 2004), model outputs are combined non-linearly using a gating system such as a neural network. In a classification model, the decision for each case can therefore draw on a different set of weights across base learners.
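A minimal sketch of input-dependent combination, assuming the simplest possible gate: a single linear layer followed by a softmax. The experts and the gate parameters are hand-set and hypothetical. In a real mixture of experts, the gate would be trained, typically together with the base learners.

```python
import math


def softmax(zs):
    """Turn raw gate scores into weights that are positive and sum to 1."""
    top = max(zs)
    exps = [math.exp(z - top) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_predict(x, experts, gate_params):
    """Blend expert outputs using weights that depend on the input x."""
    weights = softmax([w * x + b for w, b in gate_params])
    return sum(g * expert(x) for g, expert in zip(weights, experts)), weights


# Hypothetical base learners: each is only reliable on one side of x = 0.
experts = [lambda x: -x, lambda x: x]     # together they approximate |x|
gate_params = [(-2.0, 0.0), (2.0, 0.0)]   # hand-set (slope, bias) per expert, not trained

for x in (-3.0, 0.0, 3.0):
    prediction, weights = mixture_predict(x, experts, gate_params)
    print(x, round(prediction, 3), [round(w, 3) for w in weights])
```

For negative inputs the gate puts nearly all weight on the first expert, and for positive inputs on the second, so each case is scored with its own blend. A static average of the same two experts would predict zero everywhere.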
Still, before trying either of these approaches (combining regularization and ensembling, or mixture of experts), consider how much added lift in predictive power is needed to justify the extra effort. In many situations, you may meet your analytic objectives with either a single well-tuned and regularized model or an ensemble with minimal hyperparameter tuning.
Is the ensemble model defensible? Does your final model ensemble meet the performance objectives set forth for the project? Does it beat (or at least match) the performance of a single regularized model? If not, just use the single model and not the ensemble. Is it explainable? The most important features should make business and logical sense for inclusion in the model. And, most importantly, does the model perform well on truly out-of-sample data? For time series data, does the model perform reasonably well on data time-stamped after all data used for training and cross-validation? If you can answer yes to these questions, you are on the path to success.