Making accurate predictions is difficult. In data analytics, we can strengthen our predictions by ensuring our models strike the right balance between underfitting and overfitting. Understanding the pitfalls of each and finding the right trade-off is essential to building robust predictions.
More on Modeling: White Paper
There is a method for improving model accuracy that is more powerful than tailoring the algorithm itself: bundling models into ensembles.
This white paper explores how bundling competing models into ensembles almost always improves generalization.
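As a taste of that idea, here is a minimal sketch of an ensemble in Python with scikit-learn: several competing models are combined by soft voting, and each model's cross-validated accuracy is compared against the ensemble's. The dataset and the particular models are illustrative assumptions, not the white paper's method.

```python
# A minimal sketch of the ensembling idea: combine several competing models
# by voting and compare cross-validated accuracy against each model alone.
# The dataset and model choices are illustrative assumptions, not the
# white paper's actual method.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

models = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
]
ensemble = VotingClassifier(models, voting="soft")  # average predicted probabilities

for name, model in models + [("ensemble", ensemble)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```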
Video Transcript
Congratulations, your team made it to the NBA Finals. It can be hard to wait and see whether your team is going to win or lose.
Let's make a prediction. It doesn't matter what sport or team we're talking about; it's always fun to make a prediction and see whether you earn those bragging rights or need to do a little tweaking. You can base a prediction on team loyalty, regular-season performance, playoff performance, or even sports superstition.
In data analytics, we like to build machine learning models to make predictions. And one way to make sure your model is performing to the best of its ability is to check for overfitting and underfitting and find the right trade-off between them.
An NBA Example: Underfit
Let’s look at some NBA data and see what we mean by underfit and overfit.
Here you can see we've plotted the regular-season wins of each Western Conference finalist against the regular-season wins of the Eastern Conference finalist that year, and color-coded each point by which team actually won the NBA Finals.
If we fed this information to a simplistic model, it might find that splitting the data on which team had more regular-season wins predicts the winner with fairly high accuracy.
So if the Western Conference finalist had more regular-season wins, we predict them as the Finals winner. And in the same vein, we predict the Eastern Conference finalist as the Finals winner if they had more regular-season wins.
For the most part, this model performs pretty well. There are only two instances in which the team with fewer regular-season wins won the Finals. Overall, that's about 80% accuracy, and the model holds up when we pick any 5, 10, or 20 years of NBA history: you'll get about 75% accuracy, which isn't terrible.
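To make that rule concrete, here is a minimal sketch of the threshold model in Python. The win totals are hypothetical stand-ins, not real NBA records.

```python
# A sketch of the simple threshold model described above: predict that the
# finalist with more regular-season wins takes the title. The win totals
# below are made up for illustration, not real NBA records.

def predict_finals_winner(west_wins: int, east_wins: int) -> str:
    """Predict the Finals winner as whichever finalist won more regular-season games."""
    return "West" if west_wins >= east_wins else "East"

# Hypothetical seasons: (West finalist wins, East finalist wins, actual winner)
seasons = [
    (67, 57, "West"),
    (58, 60, "East"),
    (65, 51, "West"),
    (53, 59, "West"),  # an upset: the team with fewer wins took the Finals
    (61, 48, "West"),
]

correct = sum(predict_finals_winner(w, e) == actual for w, e, actual in seasons)
print(f"Accuracy: {correct / len(seasons):.0%}")  # 4 of 5 -> 80%
```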
A More Nuanced Model: Overfit
But what if we want to up our predictions and dig into why those two upset years turned out differently than we expected? We could add more features and more nuance to our model.
One way to do so would be to build a decision tree. Let's pick five years of NBA history, say 2015 through 2019. If we gave a decision tree this information and let it split on whatever features it found important in the data, it could reach 100% accuracy.
One factor it might pick up on is whether the Golden State Warriors are playing. It could pick up on whether Steph Curry is playing. It could keep folding in information about who is injured and who isn't until it arrives at a seemingly perfect set of rules for predicting the winner of the NBA Finals.
Now, 100% accuracy is better than the 75% accuracy we were talking about before, correct? Yes, but if we took this model and applied it to a different set of years from NBA history, say five years from the '90s, would the Golden State Warriors be a great prediction factor in the early '90s? If it picked up on Steph Curry's performance, was Steph Curry an NBA player in the '90s, or was he just a baby?
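Here is a quick sketch of that failure mode using scikit-learn's DecisionTreeClassifier on synthetic stand-in data (the features and outcomes are random placeholders, not real NBA stats): an unconstrained tree can memorize its five training seasons perfectly, then do no better than a coin flip on seasons it has never seen.

```python
# A sketch of the overfitting failure mode: an unconstrained decision tree
# memorizes its training seasons, then flops on seasons it has never seen.
# Features and labels here are random stand-ins, not real NBA data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical per-season features, e.g. [west_wins, east_wins,
# star_player_active, key_injuries] -- the kind of nuance the tree latches onto.
X_train = rng.integers(0, 82, size=(5, 4))  # five training seasons (say, 2015-2019)
y_train = rng.integers(0, 2, size=5)        # 1 = West won the Finals, 0 = East

X_test = rng.integers(0, 82, size=(5, 4))   # five other seasons (say, the '90s)
y_test = rng.integers(0, 2, size=5)

tree = DecisionTreeClassifier()  # no depth limit: free to memorize every split
tree.fit(X_train, y_train)

print("Train accuracy:", tree.score(X_train, y_train))  # typically a perfect 1.0
print("Test accuracy:", tree.score(X_test, y_test))     # roughly coin-flip level
```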
Overfit vs. Underfit: Finding a Balance
We can see that while this model has great accuracy, unlike the other model it doesn't generalize well to other years of NBA history. We would call this model overfit, while we'd call the other model underfit. So when we're building a model to make the best predictions year after year, an underfit model will at least generalize well to any time period we give it.
Our underfit model might be like this (green line). It comes close to picking up the pattern, but not quite.
An overfit model will find reasons to fit the data and predict perfectly (red line). The strength of the underfit model is that if we present it with a new 10 or 20 years of data, it will have about the same accuracy; it follows about the same pattern. The overfit model, though, is so specific to the data set it was trained on that it will have far lower accuracy on new data than it had on its training data.
Instead of an overfit or underfit model, we want to blend them together and try to get the best fit possible, something more like this (pink line).
It isn't perfect, but it comes closer to capturing the patterns in the data and adds a little more nuance. You want that trade-off: something that generalizes well to different periods of time but can also pick up on the more important factors and add depth to your predictions.
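One common way to dial in that trade-off, sketched below with scikit-learn on synthetic data, is to cap a decision tree's depth and compare training accuracy against cross-validated accuracy. The data-generating setup here is an assumption for illustration; depth capping is a standard regularization move, not something specific to this video.

```python
# A sketch of tuning the underfit/overfit trade-off by capping tree depth.
# Seasons and noise here are synthetic assumptions, not real NBA results.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic seasons: the winner loosely follows the win-count gap, plus noise.
X = rng.integers(30, 70, size=(200, 2)).astype(float)  # [west_wins, east_wins]
y = (X[:, 0] + rng.normal(0, 5, size=200) > X[:, 1]).astype(int)

# Shallow trees underfit, an unconstrained tree overfits; compare training
# accuracy with cross-validated accuracy to see where each one lands.
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_acc = cross_val_score(tree, X, y, cv=5).mean()
    train_acc = tree.fit(X, y).score(X, y)
    print(f"max_depth={depth}: train={train_acc:.2f}, cross-val={cv_acc:.2f}")
```

Reading the output, the unconstrained tree should score near 1.0 on training data but lower on cross-validation, while a modest depth typically holds up better out of sample.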
If you can find the trade-off between an underfit model and overfit model, you can have some of the best predictions year after year. And hopefully, you’ll be predicting your team as the winner this coming season.