Random forests can be applied to many different industries to help classify data. This supervised machine learning algorithm can be digested by breaking it down into a few simple steps. In this brief video, we will walk you through an example where we first randomly sample our data many times (bootstrapping), then build a decision tree on each subset (hundreds of times) by randomly sampling features, and finally, reach an outcome by aggregating the results from each tree.
Video Transcript
You’ve just been invited to compete in the next Gladiator Games. But wait, how do you know if you have what it takes to beat Gerard Butler? Don’t worry. I’m gonna share with you a technique to assess if you have the right physique.
Decision Trees
Let’s start with a decision tree. A decision tree is used to classify data. This supervised machine-learning technique is a map of possible outcomes based on how a previous set of questions was answered. You can think of it simply as a flow chart: we’ll start at one point and follow a path until we reach a specific outcome. I’ll show you an example shortly, but if we think back to earlier in the video series, Shaylee Davis taught us that decision trees are prone to overfit. Head to Elder Research’s YouTube page if you wanna learn more.
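That flow-chart idea can be written as a few nested questions. Here is a minimal sketch in plain Python; the features, thresholds, and the `tree_predict` function are hypothetical illustrations, not a trained model:

```python
# A decision tree is just a flow chart: answer questions until you reach an outcome.
# Hypothetical features and thresholds chosen for illustration only.

def tree_predict(gladiator):
    """Classify a gladiator as 'win' or 'lose' by walking a fixed decision tree."""
    if gladiator["strength"] > 3:     # first question in the flow chart
        if gladiator["speed"] > 3:    # second question
            return "win"
        return "lose"                 # strong but slow
    return "lose"                     # not strong enough

# A strong, fast gladiator follows the path to the 'win' leaf.
print(tree_predict({"strength": 5, "speed": 4}))  # -> win
```

A real tree learns its questions and thresholds from data; this one is hard-coded just to show how a single data point flows to a leaf.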
But for now, to avoid overfitting, we’re gonna need more than one tree. In fact, I’m gonna show you how to build a random forest so that you can determine if you have what it takes to win the Gladiator Games.
Building a Random Forest
Let me first introduce you to the lineup. Here we have five gladiators, including Gerard Butler, and with each gladiator are specific characteristics and their score. We’re gonna be evaluating gladiators based on their strengths, their speed, if they’re injured or not, and what their age is. With this data, we can build a random forest to determine what makes a gladiator victorious.
So let’s imagine that we have an arena. We’re gonna randomly select gladiators from our data set to put ’em in this arena. We’ll take gladiator 1, we’ll take Gerard, and let’s take gladiator 4. We’ll go through this process of randomly sampling with replacement many, many times. And this process of creating random subsets is called bootstrapping.
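Sampling with replacement is a one-liner with Python’s standard library. The roster below is a hypothetical stand-in for the five gladiators; `bootstrap` is an illustrative helper, not part of any library:

```python
import random

# Hypothetical roster: (name, strength, speed, injured, age)
gladiators = [
    ("gladiator_1", 4, 2, False, 28),
    ("gerard",      5, 4, False, 52),
    ("gladiator_3", 3, 5, True,  24),
    ("gladiator_4", 2, 3, False, 31),
    ("gladiator_5", 4, 4, False, 26),
]

def bootstrap(data, n_subsets):
    """Bootstrapping: draw n_subsets random samples, each the same size
    as the data, sampled WITH replacement."""
    return [random.choices(data, k=len(data)) for _ in range(n_subsets)]

subsets = bootstrap(gladiators, n_subsets=3)
# The same gladiator can land in a subset more than once -- that's expected.
```

Because each draw is with replacement, some gladiators repeat within a subset and others are left out; that variation between subsets is what makes each tree in the forest different.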
Let’s focus on our first set. With each subset, we’re gonna create independent decision trees that each gladiator will be evaluated on. What I mean by that is that this random sample will be evaluated on this decision tree, and this decision tree was created by randomly selecting features from our data set. So here we’re looking at strength and speed. Now to evaluate the gladiators, let’s go through the process together.
For gladiator 1, we’ll ask ourselves, does gladiator 1 have a strength greater than three? Yes. Okay, next we’ll ask, does gladiator 1 have a speed greater than three? No. So it doesn’t look too good for gladiator 1. So how about Gerard? Okay, does Gerard have a strength greater than three? Yes. Does Gerard have a speed greater than three? Yes. So it looks like he’ll move on to win this round.
And I know what you’re thinking, you’re thinking, okay, but what about me? How am I gonna compare? What characteristics do I need?
Don’t worry. Let’s assume you’re like me and you’re pretty strong, you’re kind of fast, you’re not injured, and you’re in the prime of your life. We’re gonna take this new data point and we’re going to evaluate it on each tree that’s been created.
So again, is the strength greater than three? Yes. Is the speed greater than three? No. Ah, well, you can’t win ’em all. Okay, how about this next tree? Is the speed greater than two? Yes. Am I injured? No. My first victory. Lastly, what about this last tree? Is the strength less than four? No. I have a strength of four, so I’ll win again.
Aggregation: Determining the Final Outcome
Once this new data point has gone through each of the decision trees, we’ll use a process called aggregation to assess the final outcome. What I mean by this is we’ll take the majority of the outputs from the decision trees evaluated, and that will be the final outcome. So I didn’t win the first one, but I won the second two. To me, it sounds like I have a pretty good chance in the Gladiator Games. Now you can assess yourself.
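The majority vote in the transcript can be sketched in a couple of lines; `aggregate` is an illustrative helper name:

```python
from collections import Counter

def aggregate(votes):
    """Aggregation: return the most common prediction among the trees' outputs."""
    return Counter(votes).most_common(1)[0][0]

# The transcript's example: lost the first tree, won the next two.
print(aggregate(["lose", "win", "win"]))  # -> win
```

With hundreds of trees the vote is rarely close; for regression problems, forests average the trees’ outputs instead of voting.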
Takeaways
But if you’re thinking that this application sounds a little bit like ancient history, you’re not wrong. In fact, a random forest can be used in a multitude of industries. Take the medical field, for example. If we wanna know which characteristics classify an individual as having a disease or not, we can use a random forest. How about in finance? Is Gerard Butler gonna default on a loan?
With these next steps, you can apply a random forest to your own problem.
1. Again, we’re gonna take our data set and randomly sample to create many subsets with the process called bootstrapping.
2. We’ll then fit a tree to each of our subsets, ideally hundreds of them, with each tree built by randomly selecting features from our data set.
3. Once we’ve done this, we’ll then use the process of aggregation to determine the final output.
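The three steps above can be sketched end to end. To keep it short, this assumes one-question "stump" trees in place of full decision trees, and the data, labels, and function names are all hypothetical:

```python
import random
from collections import Counter

# Hypothetical data: rows are (strength, speed); labels are 1 for a winner, 0 otherwise.
data   = [(5, 4), (4, 2), (3, 5), (2, 3), (4, 4), (1, 2)]
labels = [1, 0, 1, 0, 1, 0]

def fit_stump(rows, ys):
    """Fit a one-split 'tree': pick a random feature, split at its mean,
    and predict the majority label on each side of the split."""
    feature = random.randrange(2)                          # random feature selection
    threshold = sum(r[feature] for r in rows) / len(rows)
    left  = [y for r, y in zip(rows, ys) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, ys) if r[feature] > threshold]
    def vote(side, default):
        return Counter(side).most_common(1)[0][0] if side else default
    overall = vote(ys, 0)
    return lambda row: (vote(right, overall) if row[feature] > threshold
                        else vote(left, overall))

def random_forest(rows, ys, n_trees=101):
    trees = []
    for _ in range(n_trees):
        idx = random.choices(range(len(rows)), k=len(rows))        # 1. bootstrap
        trees.append(fit_stump([rows[i] for i in idx],             # 2. fit a tree
                               [ys[i] for i in idx]))
    return lambda row: Counter(t(row) for t in trees).most_common(1)[0][0]  # 3. aggregate

random.seed(0)
predict = random_forest(data, labels)
result = predict((4, 4))  # a strong, fast newcomer, like the narrator
```

Real forests grow deep trees and resample features at every split; the structure, though, is exactly the three steps listed above: bootstrap, fit, aggregate.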
Now, do you have what it takes to win the Gladiator Games?