(Almost) Ignore Machine Learning Model Accuracy

In this whiteboard video, John Elder V discusses the importance of custom fitting your machine learning model. He emphasizes that off-the-shelf evaluation metrics, such as accuracy, may not be suitable for all situations.

Instead, John highlights the need to consider the potential costs associated with different types of errors and how customizing the model’s threshold can optimize decision-making in various business scenarios.

Machine Learning Optimization

Why Optimization Is Critical to Machine Learning, but Often Overlooked

The key role that optimization plays in Machine Learning (ML) is often overlooked. ML models use optimization to determine how their intermediate predictions can best be improved. Understanding this concept leads to better modeling, because often the objective being optimized must be tailored to the problem.


Video Transcript

Custom Fitting Your Machine Learning Model

Think of a great all-purpose outfit—something you might reach for in the closet if you don’t know what the day is going to bring. Something that will serve most purposes well.

Maybe you reach for something like this: a light jacket and some matching shoes. This guy is prepared for a lot of kinds of days but not every kind of day. If you’re going to a job interview, he might want to at least add pants.

Similarly, when we spend a bunch of time building our predictive models on our data, we reach for an off-the-shelf evaluation metric—something like accuracy.

And in this talk, I’m going to try to convince you that you should probably put that back on the shelf and go for something custom. There’s a lot of value to be gained there.

Whiteboard Image - (Almost) Ignore Model Accuracy

Sounding the Alarm Avoids Costly Mistakes

When I was a freshman in college, I lived in a seven-story dorm with hundreds of other young men on their own for the first time.

What could go wrong? A lot went wrong. One of those things was a fire alarm that went off many, many nights and sometimes many times during the night—usually the night before midterms or final exams or when projects were due.

And we figured out people would stay up late making popcorn in their microwaves, burning the popcorn, and the fire alarm hated that smell. And so it would go off all the time. We would file out of the dorm—all seven stories, all hundreds of freshmen—and the fire department would have to be called in to go room by room and clear it before they let us back in.

Now, this fire alarm was very sensitive. I thought it was much too sensitive. In fact, I was losing a lot of sleep, and my academic performance probably went down because of that fire alarm. I moved off sophomore year—lived in an apartment after that.

However, years later I realized the fire alarm was probably tuned right. Because even though a lot of people lost a lot of sleep and their grades suffered, if it had made the other kind of error—not going off when it should have gone off—the cost would have been incalculable. So thank you, fire alarm.

Identifying the Right Problem to Address

And similarly, if we were to try to connect this idea to a business use case—say developing a fraud model at a bank where we look at transactions that customers submit and try to label them as benign or potentially fraudulent—the same phenomenon occurs where one kind of mistake is much, much more costly than the other.

So if we flag a transaction that should not have been flagged, a benign transaction gets halted, and the customer gets locked out of their account. That’s annoying to the customer. It costs them time, and it costs us at the bank time as well. But letting a fraudster get ahold of a customer’s account damages the company’s reputation and could potentially cause a much larger issue for the customer.

And so fraud is much more of a problem dollar-wise per transaction than flagging a benign transaction. So if we were to build a binary model, meaning a model that predicts zero or one—in this case not fraud or fraud—the model will go through all the cases that it sees, every transaction, and assign each one a probability between zero and one of being fraud.

Evaluating the Cost of a Mistake

And the default in that case, if you’re using that probability to decide whether to predict the event, is to take the threshold of 0.5. If a transaction has more than a 50 percent chance of being fraud, then we say it’s fraud. If it has less than a 50 percent chance of being fraud, then we say it’s not fraud.

However, there are a couple of problems with this approach. One is that a model trained on a large data set consisting of mostly non-fraudulent transactions will struggle to assign even a single record a fraud probability above 0.5. And so this threshold might not even come into play at all in your decision making.
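One way to see why (a back-of-the-envelope sketch with made-up numbers, not figures from the video): when the base rate of fraud is very low, Bayes’ rule says even strongly suspicious evidence can leave the posterior fraud probability far below 0.5.

```python
# Illustrative sketch: a rare-event base rate keeps fraud probabilities low.
# Both numbers below are assumptions chosen for the example.

prior_fraud = 0.002       # assume 0.2% of transactions are fraudulent
likelihood_ratio = 20     # assume the evidence is 20x more likely under fraud

prior_odds = prior_fraud / (1 - prior_fraud)
posterior_odds = prior_odds * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)

print(f"posterior fraud probability: {posterior:.3f}")  # well under 0.5
```

Even a transaction whose evidence is twenty times more likely under fraud ends up with a probability of only about 4 percent, so a fixed 0.5 cutoff would never flag it.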

The other problem is one we’ve already talked about: this kind of error here—freezing a benign transaction—carries a very different cost than letting through a fraudulent transaction. And so we need to take that into account.

What we’ll do is simulate what would happen to the business at each possible threshold between zero and one. Go ahead and try 10 or even 100 different thresholds within that range; for each one, count up the number of times it causes you to make the first kind of mistake, and multiply that by the cost of making that mistake.

Do the same with this kind of mistake and the costs associated with it. If we set the threshold very low, in this example it means that almost all transactions are going to be flagged as potentially fraudulent. We’re going to make many, many of these kinds of errors—many, many small errors.

And then on the other extreme, if we set our threshold very, very high, it means almost no records are going to be flagged as fraud. We’re going to start letting in some of these really large errors.
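The sweep described above can be sketched in a few lines of Python. This is a minimal sketch, not the method from the video: the two cost figures, the scored transactions, and their labels are all illustrative stand-ins you would replace with your own validation data and business costs.

```python
# Hypothetical costs (assumptions for illustration only).
COST_FALSE_ALARM = 50      # benign transaction wrongly frozen
COST_MISSED_FRAUD = 2000   # fraudulent transaction let through

# (model's fraud probability, actually fraud?) pairs — a stand-in for
# scored transactions from a labeled validation set.
scored = [
    (0.02, False), (0.05, False), (0.08, False), (0.10, False),
    (0.15, False), (0.20, True),  (0.25, False), (0.35, True),
    (0.40, False), (0.55, True),  (0.70, True),  (0.90, True),
]

def total_cost(threshold):
    """Simulate the business cost of using this decision threshold."""
    cost = 0
    for p, is_fraud in scored:
        flagged = p >= threshold
        if flagged and not is_fraud:
            cost += COST_FALSE_ALARM    # benign transaction frozen
        elif not flagged and is_fraud:
            cost += COST_MISSED_FRAUD   # fraud slipped through
    return cost

# Sweep 100 thresholds between zero and one and find the cheapest one.
thresholds = [i / 100 for i in range(101)]
costs = [(t, total_cost(t)) for t in thresholds]
best_t, best_cost = min(costs, key=lambda tc: tc[1])
print(f"cheapest threshold: {best_t:.2f} (total cost ${best_cost})")
```

At the extremes the sketch behaves exactly as the transcript describes: a threshold near zero racks up many small false-alarm costs, a threshold near one lets the large fraud costs through, and the minimum sits somewhere in between.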

Finding the Balance for Better Business Decisions

So somewhere in the middle, you’re going to find a region where the total cost is minimized. You might say anywhere in that region, we’re okay with the total cost of the kinds of errors we’re seeing.

You would choose a region in between those two points that is agreeable and acceptable for your business, depending on what else you might want to prioritize.

You would try to minimize the overall cost in that way. And so that’s probably going to look a lot different than just going with 0.5. And it requires some business knowledge and actually contextualizing your model within the parameters of what it’s actually going to be affecting.

So we believe that if you’ve gone to a lot of trouble to specify a model carefully—to train it and test it, and to make sure your data is clean and represents reality—then don’t forget the last step of customizing and contextualizing the way you evaluate that model’s performance. Make sure it’s helping you make a decision that’s actually grounded in the reality of what you’re applying it to.