Ensemble algorithms and regularization techniques lie at the heart of many predictive analytics and forecasting projects. When should one be used in favor of the other? Which technique wins — ensembles or regularization?
The answer depends on several factors:
- How will the model be deployed?
- What resources are available?
- Who is consuming the model results?
These questions should be considered at the outset and at each phase of the project, as defined by the Cross Industry Standard Process for Data Mining (CRISP-DM). As consulting data scientists at Elder Research, we use Agile Data Science methods to build and deploy models, and we find it valuable to engage our clients in that agile journey as well. For the data scientist, being agile means building models flexible and complex enough to solve the problem, yet simple enough to perform up to expectations on new data. Too much complexity tends to result in overfitting, that is, degraded performance on new and unseen data. A way is needed to automatically strike a balance between simplicity and accuracy.
In years past, the most popular way to control complexity was to penalize the number of terms in a model. One would choose the model which minimized the weighted sum of training error and complexity, as defined by a penalty such as Akaike’s Information Criterion, Minimum Description Length, or Predicted Squared Error. But, as John Elder hopes to explain in a future blog, this works well only for linear models. With nonlinear models, there is a very weak relation between a model’s true complexity (expressive power and flexibility) and its apparent complexity (how much ink is needed to describe it).
In more recent years, emphasis has moved away from deleting terms – that is, minimizing their number – and instead toward reducing their influence. (With ensembles, in fact, more terms or inputs are brought in, compared with a single model, although each term has much less individual influence.) This has become known as regularization.
What is Regularization?
Regularization is broadly defined as any part of the model-building process that accounts for the finiteness, imperfections, and limited information in data (Rosset, 2003; Seni and Elder, 2010). An overfit model performs well on the training data but much more poorly on new data not seen during training. Yet test performance is what matters; it best indicates how the model is likely to do when deployed in the production environment. Andrey Tikhonov recognized the importance of regularization in 1943, hence the term Tikhonov regularization. Still, widespread acceptance by the technical community began only when Arthur E. Hoerl and Robert W. Kennard introduced ridge regression in 1970. Adoption accelerated in the 1990s and 2000s with more advanced and targeted procedures such as the Lasso, the non-negative garrote, and the LARS algorithm, with Stanford University professors and their protégés leading the charge. Because these procedures are computationally expensive, their popularity also grew as computers became exponentially faster.
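To make the idea concrete, here is a minimal sketch of ridge regression, the earliest widely adopted regularization procedure, using its closed-form solution. The data and the penalty value are illustrative assumptions, not drawn from any particular project:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 informative coefficients, 7 irrelevant inputs, plus noise.
n, p = 50, 10
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0], dtype=float)
y = X @ true_beta + rng.normal(scale=1.0, size=n)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols_beta = ridge(X, y, lam=0.0)     # lam=0 recovers ordinary least squares
ridge_beta = ridge(X, y, lam=10.0)  # lam>0 shrinks coefficients toward zero

# Regularization in action: the penalized coefficient vector is smaller in norm,
# trading a little training-set fit for stability on new data.
assert np.linalg.norm(ridge_beta) < np.linalg.norm(ols_beta)
```

Rather than deleting terms outright, the penalty reduces every coefficient's influence, which is exactly the shift in emphasis described above.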
What are Model Ensembles?
An ensemble combines different models to arrive at a single set, or vector, of predictions. In the words of Seni and Elder in their book Ensemble Methods in Data Mining, “…it was discovered to be possible to break through [the] asymptotic performance ceiling of an individual algorithm by employing the estimates of multiple algorithms.” The basic recipe is to construct varied models and combine their estimates, though the exact “flavors” of ensembles vary considerably. For instance, Dr. Elder’s approach was termed “bundling” or “model fusion”: he initially combined diverse types of models, each with different strengths, such as neural networks, decision trees, and polynomial networks. Others took a different approach, combining models of the same type and creating diversity through bagging (randomly resampling training cases) or boosting (building models sequentially, with training weights adjusted according to earlier results).
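The bagging recipe can be sketched in a few lines: resample the training cases with replacement, fit one model per sample, and average the predictions. The noisy sine-curve data and the decision-tree learners here are illustrative assumptions, chosen because a single deep tree badly overfits noise:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Noisy sine curve: a single fully grown tree memorizes the noise.
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

def bagged_predict(X_train, y_train, X_test, n_models=50):
    """Fit one tree per bootstrap sample and average the predictions."""
    preds = []
    for i in range(n_models):
        idx = rng.integers(0, len(y_train), size=len(y_train))  # sample with replacement
        tree = DecisionTreeRegressor(random_state=i).fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)

X_test = np.linspace(0, 6, 100).reshape(-1, 1)
ensemble_pred = bagged_predict(X, y, X_test)
single_pred = DecisionTreeRegressor(random_state=0).fit(X, y).predict(X_test)

# Against the noiseless truth, the averaged ensemble is far more accurate
# than any one of its overfit members.
truth = np.sin(X_test[:, 0])
assert np.mean((ensemble_pred - truth) ** 2) < np.mean((single_pred - truth) ** 2)
```

Each tree is individually a poor, high-variance model; averaging many of them cancels much of that variance, which is the source of the performance ceiling breakthrough quoted above.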
Analytics Dynamic Duo
For a recent fraud detection project, our team developed and deployed a regularized regression model. The ratio of fraudulent transactions to false positives (i.e., cases the model identified as fraud that were actually valid transactions) was triple that of the client’s existing rules-based system, a great result. The model was highly interpretable and easy to deploy given its structural simplicity. However, a complex journey was required to arrive at this seemingly simple solution. For instance, the team sifted through hundreds of candidate variables, many of which were weakly predictive of fraud and highly correlated with one another. Building random forests (ensembles built entirely from decision trees) was a great place to start. They require limited investment of time and resources to wrangle the data, and their outputs include diagnostics such as ranked variable importance, providing insight into variable selection for other models. This enabled us to generate hypotheses for deriving new variables, such as counts and aggregations, which captured historical patterns beyond the current transaction. As variables were added and refined, we could rapidly iterate the model-building process.
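As a sketch of how a random forest surfaces a ranked variable importance list, here is the pattern on simulated data standing in for the (confidential) transactions, with two genuinely predictive inputs hidden among noise:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# 400 simulated "transactions": inputs 0 and 1 drive the outcome,
# the other 8 candidate variables are pure noise.
n = 400
X = rng.normal(size=(n, 10))
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank candidate inputs by impurity-based importance, best first.
ranking = np.argsort(forest.feature_importances_)[::-1]
print("Most important inputs:", ranking[:2])

# The two informative inputs should dominate the ranking.
assert set(ranking[:2]) == {0, 1}
```

In practice, this ranking is the diagnostic we mined for hypotheses: inputs that score surprisingly high (or low) suggest new variables worth deriving.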
A Faster, Leaner Model
Although the final solution did not include random forests, ensembles helped the team with variable reduction and selection. Boruta, a package in the R statistical software language, uses many random forests and random permutations of model input features to determine the value of candidate variables. Boruta reduced our candidate variables by a factor of ten. We used a Lasso regression model to further reduce variables and to shrink the magnitude of coefficients, which increased resilience to overfitting. But we needed to answer a critical question: was our model fit for deployment? Cross-validation answered this question. We built many regression models using random samples of the training data while testing model performance on a hold-out sample: transactions not used to build the model. Yet there was one more challenge to address. Some of the transactions in the hold-out sample occurred prior to those used in training. Perhaps the underlying processes were transient and evolving, in which case patterns uncovered by the model would no longer be valid. To address this concern, we held out a final data set whose transactions all occurred after those used to build and test our models. The rate of fraud uncovered in these “future” transactions was equal to, or better than, predicted. Our final regression model could now be deployed.
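Two of these steps, Lasso shrinkage and an out-of-time hold-out, compress into a short sketch. The data are simulated and the penalty value is a fixed illustrative assumption; in a real project it would be tuned by cross-validation:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)

# Simulated transactions ordered by time: 3 real signals, 17 nuisance inputs.
n, p = 400, 20
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(size=n)

# Out-of-time split: train strictly on earlier transactions and
# evaluate on later ones, mimicking deployment conditions.
X_train, X_future = X[:300], X[300:]
y_train, y_future = y[:300], y[300:]

# Fixed penalty keeps the sketch short; tune alpha by cross-validation in practice.
model = Lasso(alpha=0.1).fit(X_train, y_train)

kept = np.flatnonzero(model.coef_)
future_r2 = model.score(X_future, y_future)

assert {0, 1, 2}.issubset(set(kept))  # the true signals survive shrinkage
assert len(kept) < p                  # many nuisance coefficients go to exactly zero
assert future_r2 > 0.8                # performance holds up on "future" data
```

The final assertion is the deployment question in miniature: if accuracy on strictly later transactions collapses, the patterns were transient and the model is not fit to ship.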
A Recipe for Success
Are you new to ensembles and regularization? Do you want to be more agile in building models and meeting the analytics goals for your organization? Here are some guidelines for moving forward with these techniques and avoiding “analysis paralysis”.
First, construct an analytic base table (ABT) where example cases (e.g., individual transactions) and candidate model input variables are the rows and columns, respectively. Do you have a labeled outcome (i.e., target) for each example case? Are all of your candidate inputs available at deployment? These questions should be addressed before devoting any time to modeling. However, limit the time budgeted to variable selection and data cleaning at this stage: think agile! In your preferred language (R, Python, etc.), build a random forest (an ensemble). What are the most important variables? Is this what you expected? Can you derive any other predictors from your top candidate variables? Also, how will you evaluate the “goodness” of your model? Perhaps you will use a receiver operating characteristic (ROC) curve, or the true positive rate (“hits”) at some workload or resource constraint. Perhaps you only have resources to review the highest-scoring 5% of transactions? In that case, score a model on how well it does at that point on the list.
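That workload-constrained metric is simple to code. Here is a sketch with a hypothetical `precision_at_top` helper, run on simulated scores and labels (both illustrative assumptions):

```python
import numpy as np

def precision_at_top(scores, labels, fraction=0.05):
    """Fraction of true positives among the highest-scoring cases,
    for a review team that can only work the top `fraction` of the list."""
    k = max(1, int(len(scores) * fraction))
    top = np.argsort(scores)[::-1][:k]
    return labels[top].mean()

rng = np.random.default_rng(4)
labels = (rng.uniform(size=1000) < 0.02).astype(int)  # ~2% fraud base rate
scores = labels * 2.0 + rng.normal(size=1000)         # informative but noisy model scores

hit_rate = precision_at_top(scores, labels, fraction=0.05)
print(f"Fraud rate in the top 5%: {hit_rate:.0%}")

# A useful model concentrates fraud far above the base rate in its top 5%.
assert hit_rate > labels.mean()
```

Scoring the model exactly at the operating point your reviewers will use keeps model selection aligned with the business constraint, rather than with an average over thresholds you will never deploy at.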
A logical next step is to employ a variable reduction technique such as Boruta in R (based on decision tree ensembles) and compare the selected variables to those of a regularized Lasso regression model. Again, our main concern is performance on unseen data. Regularization is a balancing act: we want to shrink some coefficients to zero to avoid overfitting, while retaining non-zero coefficients for those inputs that are truly predictive of the outcome (e.g., fraud). Do you want even better performance, or are decision trees and regression models too plain for your tastes? Get creative and revel in the journey. Run gradient boosted models (e.g., xgboost), support vector machines, and neural networks, and have them compete for inclusion in your final model. By our earlier definition, these candidate models are regularized; we have done the hard work of considering the finiteness of the data and have limited the models to a useful subset of candidate inputs. Finally, follow the advice of Dr. Elder by bundling the best-of-class candidate models to extract extra performance and propel your “good” project to a “great” one.
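A minimal sketch of bundling: average the predictions of two diverse candidates, here a gradient boosted model and a ridge regression on simulated data (both models and the data are illustrative assumptions, not a prescription):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)

# A target mixing a linear trend (regression's strength) with a
# sharp step nonlinearity (the tree ensemble's strength).
n = 500
X = rng.uniform(-3, 3, size=(n, 3))
y = 1.5 * X[:, 0] + np.where(X[:, 1] > 0, 2.0, -2.0) + rng.normal(scale=0.5, size=n)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

gbm = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

# Bundle: a simple average of the two candidate models' predictions.
bundle_pred = (gbm.predict(X_test) + ridge.predict(X_test)) / 2

def mse(pred):
    return np.mean((pred - y_test) ** 2)

# By convexity of squared error, the averaged prediction can never do
# worse than the mean of its members' errors.
assert mse(bundle_pred) <= (mse(gbm.predict(X_test)) + mse(ridge.predict(X_test))) / 2
```

A plain average is the simplest bundle; weighting the members or stacking a small model on top of their outputs are natural next refinements.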
Ensembles and regularization are critical tools in any deployed predictive analytics project. These tools help develop models with few surprises, such as degraded performance on new data. Although the structure of an ensemble seems more complex than that of a single model, ensembles offer important benefits such as:
- Helping remove “noise” variables
- Generalizing better than single component models
- Reducing sensitivity to outliers
Regularization performs similar functions through variable reduction (or “shrinkage” of less important variables) and by providing better generalization. And the best part is that these tools play well together. You can build ensembles from regularized models or use regularization to select variables included in your model ensembles. Together, regularization and model ensembling form a beautiful partnership, and we hope you will try them in all of your future analytics projects.