For a new data scientist, the first real project can be challenging. Real-world engagements aren’t as cleanly set up as academic assignments! One usually experiences many pitfalls but can learn many valuable lessons. I’ve found that a sensible choice of modeling method can help alleviate many headaches, and strongly recommend considering Random Forests.
Here are four reasons. Random Forests are:
- Not (excessively) complex, mathematically
- Very helpful with thorny data wrangling challenges
- Scalable across large datasets (big data)
- Widely available in data science toolkits
Figure: training and classification with a Random Forest. A) Each decision tree in the ensemble is built upon a random bootstrap sample of the original data, which contains positive (green labels) and negative (red labels) examples. B) Class prediction for new instances using a Random Forest model is based on a majority voting procedure among all individual trees.
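The two steps described above, bootstrap sampling and majority voting, can be sketched in a few lines of plain Python. This is only an illustration: the toy data, the helper names `bootstrap_sample` and `majority_vote`, and the three tree votes are assumptions made for the example, and no actual trees are grown here.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the label that most trees voted for."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy data: (feature value, label). Purely illustrative.
data = [(1.0, "pos"), (2.0, "pos"), (3.0, "neg"), (4.0, "neg"), (5.0, "neg")]

# A) One bootstrap sample per tree; rows may repeat or be left out.
samples = [bootstrap_sample(data, rng) for _ in range(3)]

# B) Hypothetical votes from three trained trees for one new instance.
tree_votes = ["pos", "neg", "pos"]
forest_prediction = majority_vote(tree_votes)
```

Using an odd number of trees, as here, avoids tied votes in a two-class problem.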
Not Mathematically Complex
Unlike Support Vector Machines and Neural Networks, Random Forests are not very mathematically complex. You don’t need a PhD to understand how they work. Many algorithms require tuning parameters that have little or no intuitive explanation. The parameters for Random Forests, such as the number of trees to grow or the number of features to consider at each split, have clear implications and can be reliably modified, even by a novice. Also, Random Forests are non-parametric: the algorithm makes no strong assumptions about the underlying statistical nature of the data. This is desirable, since using a parametric algorithm whose assumptions do not match the data can lead to very poor results.
Can Assist With Your Most Vexing Data Challenges
The most time-consuming part of data analysis is the data wrangling stage, which consists of gathering, validating, cleaning, and preparing the data to effectively transform it for use by analytical algorithms. Many estimates put data wrangling at 80% of the workload for a data scientist, though it seems like 80% of the discussion in data science focuses on the models and algorithms. Random Forests simplify the data wrangling portion of analytic work through such properties as their:
Robustness to Outliers
Random Forests use trees, which repeatedly split the data into groups according to whether a case is above or below a selected threshold value on a selected feature variable. It doesn’t matter how far above the threshold a value is, only that it is above. Thus, input outliers don’t have extra influence the way they do in regression, where they are known as leverage points. Output outliers will affect the estimate of the leaf node they are in, but not the values of any other leaf node. Again, this is different from the methods John Elder calls “consensus” methods, like regression or neural networks, where every data point affects the estimate at every other data point. Tree methods are instead “contributory” methods, where only local points (those in the same leaf node) affect a given point’s estimate. So output outliers have a “quarantined” effect, and outliers that would wildly distort the accuracy of some algorithms have less of an effect on the prediction of a Random Forest.
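A small sketch can make the two outlier cases concrete. It assumes a single tree with one split at a fixed, hypothetical threshold and predicts the mean of the target in each leaf; the data and the `leaf_means` helper are invented for illustration. In real training the threshold is chosen from the data, but that choice depends on the ordering of the values rather than their magnitude.

```python
from statistics import mean

def leaf_means(xs, ys, threshold):
    """One-split tree: predict the mean of y in each of the two leaves."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return mean(left), mean(right)

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold = 6.5  # hypothetical split point

clean_left, clean_right = leaf_means(xs, ys, threshold)

# Input outlier: the largest x becomes extreme, but it is still just "above".
xs_outlier = [1, 2, 3, 10, 11, 12000]
in_left, in_right = leaf_means(xs_outlier, ys, threshold)

# Output outlier: one corrupted y value in the right-hand leaf.
ys_outlier = [5.0, 6.0, 7.0, 20.0, 21.0, 2200.0]
out_left, out_right = leaf_means(xs, ys_outlier, threshold)

print(clean_left, clean_right)  # baseline leaf estimates
print(in_left, in_right)        # unchanged by the input outlier
print(out_left, out_right)      # only the right leaf moves
```

The input outlier leaves both leaf estimates untouched, while the output outlier changes only the leaf it falls in, which is the “quarantined” effect described above.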
Features often have different scales, and some algorithms, such as k-Nearest Neighbors, only do well if features are first transformed to have the same range. While the best implementations of k-NN do that automatically, naïve implementations don’t, and one can get very misleading results (where the variables with the largest ranges dominate the distance, and thus the decisions). One never has to worry about scaling, though, with Random Forests. As they are based on decision trees, they only depend on rank; that is, only a data point’s relative value within a feature, not its magnitude, matters during training.
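To illustrate the dependence on rank alone, the sketch below finds the best single split on one feature by Gini impurity, then repeats the search after rescaling and after a log transform. The `income` values, labels, and helper names are assumptions made for the example; it is a bare-bones split search, not a full tree learner.

```python
import math

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split_partition(values, labels):
    """Return the set of row indices sent left by the lowest-impurity split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), None
    for k in range(1, len(order)):
        if values[order[k - 1]] == values[order[k]]:
            continue  # cannot place a threshold between tied values
        left, right = order[:k], order[k:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

# Illustrative feature with a very wide range, plus binary labels.
income = [20_000, 35_000, 50_000, 80_000, 150_000, 2_000_000]
label = [0, 0, 0, 1, 1, 1]

raw = best_split_partition(income, label)
rescaled = best_split_partition([v / 1000 for v in income], label)
logged = best_split_partition([math.log(v) for v in income], label)
```

All three searches send the same rows left and right, because rescaling and taking logs preserve the ordering of the values.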
Ability to Handle Missing Data
One of the more difficult challenges of data wrangling is deciding how to handle observations with missing data. The crudest way is to remove all rows (or columns) with any missing data. One could instead try to impute (fill in) the missing value, but this involves tough judgment calls: should you use the column mean, median, or mode? Or should you develop another model to predict the missing values? These are tough questions even for an experienced data scientist. Answering them really requires considering whether the data is missing at random or missing for a systematic reason, which adds another layer of complexity before modeling can even start. Random Forests can’t solve this problem, but fortunately they can allow you to jump past it and get going, by treating “missing-ness” as another feature level to take into account when making predictions. This means we do not need to remove potentially useful data from the model, and Random Forests can make reasonable decisions with missing data without guessing what the missing value might be.
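Not every Random Forest implementation accepts missing values directly, so one common workaround is to encode “missing-ness” explicitly before training. The sketch below is one way to do that, under stated assumptions: the `encode_missing` helper, the column names, and the sentinel value of -1.0 (chosen to lie outside the real range of the features) are all hypothetical.

```python
def encode_missing(rows, numeric_cols, sentinel=-1.0):
    """Add an is-missing indicator per column and fill gaps with a sentinel."""
    encoded = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows are left untouched
        for col in numeric_cols:
            missing = row.get(col) is None
            new_row[col + "_missing"] = 1 if missing else 0
            if missing:
                new_row[col] = sentinel
        encoded.append(new_row)
    return encoded

# Illustrative records with gaps (None marks a missing value).
customers = [
    {"age": 34, "income": 52_000},
    {"age": None, "income": 61_000},
    {"age": 45, "income": None},
]
prepared = encode_missing(customers, ["age", "income"])
```

No rows are dropped and no value is guessed: a tree can isolate the missing cases with a single split on the indicator or on the sentinel.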
Ability to Select Features
Deciding which features to use is key to building effective predictive models. Excluding important features is bad, of course, because information the model needs is lost. But adding too many features can unnecessarily increase the complexity and computational cost of the model; it also increases the risk that the model and the data scientist will listen to noise in the data instead of the signal we are trying to capture. Random Forests build many, many trees, each of which ruthlessly winnows features away. The collection of features employed is very useful to examine. And it is typically not harmful to the modeling process at all to include extraneous variables in the set of candidate inputs, unlike with, say, neural networks, which have to use all the variables presented, whether they help or not.
Ability to Rank Features
Not only can Random Forests differentiate between relevant and irrelevant features, but they can also rank the features by how important they are to the model. A typical Decision Tree will focus only on the most important features in its greedy search to make predictions. This can result in secondary, but still useful, features being left out. A Random Forest solves this by creating many Decision Trees, each built with a different subset of the feature space and data. This encourages secondary features to appear. By evaluating how often a feature is used in the Decision Trees and what proportion of the data it splits, we can rank feature importance.
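The usage-based ranking described above can be sketched as follows. The nested-dictionary tree format, the feature names, and the `rank_features` helper are hypothetical, invented for this example. Toolkits often compute importance differently (for instance, from the impurity reduction at each split), so treat this as an illustration of the idea rather than any library’s method.

```python
from collections import defaultdict

def rank_features(trees, n_rows):
    """Score features by the share of rows their splits handle, averaged over trees."""
    scores = defaultdict(float)

    def walk(node):
        if node is None:  # None marks a leaf in this toy format
            return
        scores[node["feature"]] += node["n_rows"] / n_rows
        walk(node.get("left"))
        walk(node.get("right"))

    for tree in trees:
        walk(tree)
    averaged = ((name, total / len(trees)) for name, total in scores.items())
    return sorted(averaged, key=lambda item: item[1], reverse=True)

# Hypothetical forest: two small trees, each trained on 100 rows.
forest = [
    {"feature": "income", "n_rows": 100,
     "left": {"feature": "age", "n_rows": 60, "left": None, "right": None},
     "right": None},
    {"feature": "tenure", "n_rows": 100,
     "left": None,
     "right": {"feature": "income", "n_rows": 40, "left": None, "right": None}},
]
ranking = rank_features(forest, n_rows=100)
```

Here `income` ranks first because it appears in both trees, while `age` still earns a score even though it only ever appears as a secondary split.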
While Random Forests cannot solve every problem a data scientist will encounter in a project, they can simplify several vexing challenges. They are especially helpful in the data wrangling stage of analysis. Many algorithms need to be primed and prepped, while Random Forests say, “come as you are, warts and all”.
Love Big Data
While most datasets are not big and do not require parallelization, it is an important consideration for some organizations. Because each tree in a Random Forest is built independently of the others, training can be run in parallel across many different computers working towards a common goal. This allows the modeling technique to be used on data that is terabytes or petabytes in size. Some of the most famous algorithms are harder to scale this way. Gradient Boosting, for example, which is consistently used to win competitions on sites such as Kaggle, builds its trees sequentially, with each tree depending on the ones before it, so the trees cannot simply be trained independently. In fact, the winning algorithm of the famous $1 million Netflix challenge was never put into production. One reason was that the model was too complex and cumbersome to be engineered at the large scale that Netflix required. (Another was that the contest took so long that the business situation changed; for instance, the models were built for predicting “households” and discs, and the business was increasingly focusing on individuals and streaming.)
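The reason a Random Forest parallelizes so easily is that no tree needs anything from any other tree. The sketch below shows only that structure: `train_tree` is a hypothetical stand-in that draws a bootstrap sample and returns a summary instead of growing a real tree, and a thread pool stands in for what would be separate processes or machines at scale.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_tree(seed, rows):
    """Stand-in for fitting one tree: draw its bootstrap sample, return a summary."""
    rng = random.Random(seed)  # each task gets its own independent generator
    sample = [rng.choice(rows) for _ in rows]
    # A real tree would be grown on `sample` here.
    return {"seed": seed, "n_rows": len(sample)}

rows = list(range(1000))  # illustrative dataset of 1,000 row ids
seeds = range(8)          # one independent task per tree

# Tasks share no state, so they can be handed to workers in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    forest = list(pool.map(lambda seed: train_tree(seed, rows), seeds))
```

Giving each tree its own seed keeps the result repeatable regardless of which worker finishes first.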
Available Practically Everywhere
Random Forests are so powerful and frequently used that they appear in virtually every commercial and open source software package that supports predictive analytics. They can be easily implemented in R, Python, SPSS Modeler, Statistica, SAS, and more. Since Elder Research is tool agnostic when it comes to implementing models, we have experience implementing Random Forests on each of these platforms and can attest to their utility, flexibility, and ease of use.
All Things in Moderation
As a parting caveat, remember John Elder’s advice in his Top 10 Data Mining Mistakes about relying on just one technique. Although Random Forests are a great algorithm to start with, remember that “every dog has its day”. Different algorithms will prove useful in different scenarios. As your team grows in experience and sophistication, you will want to experiment with other state-of-the-art methods. But you will never regret starting with Random Forests!