Mining Your Own Business Podcast

Season 3 | Episode 3 - Debate: One Metric to Rule Them All? AUC for Classification Models

When considering evaluation metrics for classification models, is it possible for one metric to rule them all? Join us for a lively debate between Aric LaBarr, Associate Professor of Analytics at NC State’s Institute for Advanced Analytics, and Robert Robison, Elder Research Senior Data Scientist.

During the debate Robert champions AUC’s comprehensive measure of model performance, while Aric advocates for a broader perspective, emphasizing the importance of business context in metric selection. Tune in as host Evan Wimpey moderates the discussion, and gain valuable insight on what really matters when it comes to machine learning model evaluation. We hope you enjoy the conversation!

In this episode you will learn:

  • The importance of exploring various metrics to evaluate model performance
  • Why metrics should align with business objectives
  • The need for data science teams to invest time in feature engineering
  • Why a model’s success relies not only on its performance but also on stakeholders’ ability to understand and trust the insights it provides

Learn more about why we created the Mining Your Own Business podcast.

About This Episode's Participants

Aric LaBarr | Guest

An associate professor in the Institute for Advanced Analytics, Dr. Aric LaBarr is passionate about helping people solve challenges using their data. At the institute, home of the nation's first Master of Science in Analytics degree program, Aric helps design an innovative curriculum that prepares a modern workforce to wisely communicate and handle a data-driven future. He teaches courses in predictive modeling, forecasting, simulation, financial analytics, and risk management.

Previously, he was a director and senior scientist at Elder Research, where he mentored and led a team of data scientists and software engineers. As director of the Raleigh office, he worked closely with clients and partners to solve problems in the fields of banking, consumer packaged goods, healthcare, and government. Dr. LaBarr holds a B.S. in economics, as well as a B.S., M.S., and Ph.D. in statistics—all from NC State University.

Follow Aric on LinkedIn

 

Robert Robison | Guest

Robert is a senior data scientist at Elder Research with several years’ experience working in the Intelligence Community (IC) and more recently as a private sector consultant. He has experience using Python, R, and JavaScript to streamline workflows for other analysts, to analyze data and uncover actionable insights, and to build production machine learning models. Robert earned a BS in Aerospace Engineering from the University of Virginia, and an MS in Analytics through Georgia Tech.

Follow Robert on LinkedIn

Key Moments from This Episode

00:00 Evan introduces the debate topic and guests, Aric LaBarr and Robert Robison.
01:37 Robert begins his argument by defining AUC (Area Under the Curve) and its significance as a metric for classification models.
06:11 Aric begins his rebuttal, challenging the notion that AUC is the only metric to consider.
09:26 Robert provides a rebuttal to Aric’s points.
11:48 Aric starts his rebuttal, focusing on communicating models to business users.
14:41 Robert responds to Aric’s points.
16:18 Evan asks Robert if certain cases may require metrics other than AUC.
17:03 Robert responds to Evan’s question.
17:53 Aric weighs in on the question.
19:37 Evan asks Aric if focusing solely on AUC may save time and costs.
20:30 Aric responds to Evan’s question.
22:20 Evan gives time for the debaters to ask each other questions.
25:30 The debaters share closing remarks, summarizing their positions.
29:14 Evan wraps up the show.

Show Transcript

Evan Wimpey: Hello and welcome to the Mining Your Own Business podcast. I'm your host, Evan Wimpey. Today I'm very excited to introduce two guests for the first of what will hopefully be many debate episodes. We've got Robert Robison and Aric LaBarr here to debate the resolution: when building a model for classification purposes, AUC is the only metric that you need to consider.

So I'll do a quick introduction and let these two smart folks jump to it. Robert Robison is a senior data scientist at Elder Research—one of our best and brightest. He will be debating for the affirmative of the resolution. And then Dr. Aric LaBarr is an associate professor of analytics at NC State. Go Pack!

Okay, that's my grad school. I don't actually say Go Pack, but Aric was a teacher of mine, so he is also one of the best and brightest. And how can two smart people disagree on something like this? I will preface this for our audience: the hope for today's debate is not to have a winner and a loser; it is to inspire some thoughtful consideration and to show how non-personal this is.

Both Robert and Aric were eager to debate either side of this resolution, so there are no hard feelings either way. This is just where we ended up: Robert with the affirmative, and Aric with the negative. So gentlemen, thank you so much for joining. We will start out with you, Robert, for five minutes arguing in the affirmative.

Take it away, Robert.

Robert Robison: All right. Thanks, Evan. Thanks for having me on to do this. Super excited about it. So I’m arguing for the affirmative. The AUC is the only metric you need to consider. So I’ll start with defining what AUC is. AUC is a measure of how good a classification model is. Higher AUC is better.

Lower AUC, not as good. It's usually between 0.5 and 1. So 1 would be: your model is perfect. 0.5 means your model is just randomly guessing. And that's true regardless of the problem, regardless of how many ones or zeros you have. That's always true. It often helps me to use an example. So let's say we're working for an email company and we've got to decide, for incoming emails,

are they spam or are they not spam? We've got to decide which to send to the spam folder. And so what we do is we take a lot of past emails that we have users tag as spam or not. And then we train a few machine learning models to predict, for new emails, are those new emails spam or are they not spam?

Now, let's say in our data set we have far more not spam than we do spam. We might have, let's say, two percent of the emails be spam. You might be wondering, why do we need AUC? Why don't we just use accuracy? So the issue with that is, what if we just sent every single email to the inbox? We said there's no spam at all.

We would end up with 98 percent accuracy, which sounds like it's pretty good. But in reality our model did nothing. Our model is just passing everything through. So we have a useless model that gets a really good accuracy. So clearly, accuracy is not the metric for every use case, and certainly not this one. So what AUC does is it measures what directly comes out of the model.

Almost no machine learning models will directly predict spam or not spam. They'll predict a number that's between zero and one. And then after the fact, either the software you're using does it or you do it manually, but you take those numbers and you round them to zero or round them to one and assign spam or not spam.

And you can round based on what's above 0.5, or you can use any threshold you want. But to use accuracy, you have to do that step, and the model does not do that step. The model just gives you a number between zero and one. So with accuracy, you're not directly measuring the performance of the model, but with AUC, you are, because you're looking directly at those numbers that come out of the model.

The numbers between zero and one. And so it has the advantage of directly measuring the model itself and not any tasks or steps that come after the model. And so to give it some more context in our spam problem, the AUC is going to measure how well we separate the two classes, how well we separate the actual spam from the not spam, in terms of how high our predictions are between zero and one.

So, in the case I said earlier, let's say we sent nothing to spam; we sent everything to the inbox. It didn't separate anything, so that's no better than random. That's no better than 0.5. Let's say everything we sent to the inbox was perfect, or let's say every non-spam was predicted lower than every spam. Then that would be a perfect AUC.

It's perfectly separating the two. And so that leads to maybe another way to interpret the AUC that might make more sense: it's the probability that the model ranks a spam higher than a non-spam. So you have that interpretation.

You have that it's between 0.5 and one always, regardless of the application, and it's measured directly on what comes out of the model. And so I would argue that it's the only metric that you need to use.
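For readers who want to see Robert's two points concretely, here is a rough Python sketch; it is our own illustration, not code from the episode. The 2-percent spam rate, the score distributions, and the scikit-learn calls are all assumptions made for the example: a "do-nothing" filter looks great on accuracy but not on AUC, and AUC matches the pairwise ranking interpretation he describes.

```python
# Editorial sketch (synthetic data): accuracy rewards a do-nothing spam filter
# on imbalanced data, while AUC does not, and AUC equals the probability that
# a randomly chosen spam is scored above a randomly chosen non-spam.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(42)
y = (rng.random(10_000) < 0.02).astype(int)        # 1 = spam, ~2% positives

# "Do-nothing" model: score every email 0.0, so nothing crosses a 0.5 threshold.
do_nothing_scores = np.zeros_like(y, dtype=float)
do_nothing_labels = (do_nothing_scores >= 0.5).astype(int)
print(accuracy_score(y, do_nothing_labels))        # ~0.98, looks impressive
print(roc_auc_score(y, do_nothing_scores))         # 0.5, no better than random

# A model with some signal: scores between 0 and 1 that tend to be higher for spam.
scores = np.where(y == 1, rng.beta(4, 2, y.size), rng.beta(2, 4, y.size))

# Ranking interpretation: fraction of (spam, non-spam) pairs where the spam wins.
pos, neg = scores[y == 1], scores[y == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()
print(pairwise, roc_auc_score(y, scores))          # the two numbers agree
```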

Evan Wimpey: Fantastic. Thanks, Robert, for the overview and the argument. Now, Aric, you've got, give or take, five minutes for your opening statement.

Aric LaBarr: Sounds good. Perfect. Thank you so much, Evan and Robert. I love your point about accuracy not being a good metric. I agree.

Can we just turn this into bashing accuracy for the next 30 minutes? Because I think it's an absolutely horrible metric. But outside of that, let's go back to the original question. I love your example about email and the spam folder, and I think it actually helps my argument against something like AUC a little bit.

The reason being that, for me, a lot of times the way I think about a classification problem is not getting both the ones and the zeros correct. It's usually focusing on one of the classes. So take something like an email spam folder, where we're really focusing on the ones: I need to identify where the spams are while trying to flag as little non-spam as possible.

And if a couple of those non-spams sneak through, they sneak through. So I'd make an argument against something like AUC being the only metric, because AUC does a wonderful job of trying to balance sensitivity and specificity, focusing on the prediction of both the ones and the zeros.

When in reality, maybe we only care about predicting one of them really well. So something like lift, or something like a gain chart, for example, would do a really good job of saying: this model is really good at quickly identifying who your spam emails are, and for all the rest of them, whether they're ranked higher or lower, more or less spammy, maybe we don't care so much, but we've identified the definite spams.

And so that's where I would think something like a lift or a gain chart would be better, while also trying to balance things out in terms of not just predicting ones and zeros from a true positive and a true negative rate. Again, Robert did a wonderful job explaining AUC, so let me quickly define true positive and true negative for the audience.

True positive rate would be something along the lines of: if it really was spam, what percentage of the time did I actually say it was spam? True negative rate would be: if it really wasn't spam, what percentage of the time did I say it really wasn't spam?

And so that is what AUC, and the ROC curve it comes from, is trying to balance. However, something like precision, I think, would also be really good in this scenario, where we're trying to balance the idea of: of all the things that I predict as spam, how many of those did I get right?

It's not just, of all the actual spams, how many did I capture, but, of all the predicted spams, how many did I get right? So again, that's another metric we may want to balance in there outside of AUC: how good is your model in terms of precision across different rankings? Which actually ties directly to lift.

And so, much like you described how many different cutoffs help determine that overall balance for AUC, precision across many different cutoffs would give you something like lift, which again, I think would go a long way toward helping out. So those are my initial opening statements.

I'm keeping a couple just in my back pocket, just in case.
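As a companion to Aric's opening, here is a minimal sketch of the metrics he mentions, again on synthetic data; the `lift_at_k` and `precision_at_cutoff` helpers and the specific cutoffs are hypothetical names and numbers we introduce purely for illustration.

```python
# Editorial sketch: lift in the top-scoring slice of emails, and precision
# at a few different cutoffs, on a synthetic scored test set.
import numpy as np

rng = np.random.default_rng(1)
y = (rng.random(5_000) < 0.05).astype(int)                  # 1 = spam
scores = np.where(y == 1, rng.beta(4, 2, y.size), rng.beta(2, 4, y.size))

def lift_at_k(y_true, s, k=0.10):
    """Precision in the top k fraction of scores, divided by the base rate."""
    n_top = max(1, int(len(s) * k))
    top = np.argsort(s)[::-1][:n_top]
    return y_true[top].mean() / y_true.mean()

def precision_at_cutoff(y_true, s, cutoff):
    """Of everything flagged as spam at this cutoff, how much really was spam?"""
    flagged = s >= cutoff
    return y_true[flagged].mean() if flagged.any() else float("nan")

print("lift in top 10%:", round(lift_at_k(y, scores, 0.10), 2))
for c in (0.3, 0.5, 0.7):
    print(f"precision at cutoff {c}:", round(precision_at_cutoff(y, scores, c), 3))
```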

Evan Wimpey: Awesome. Thank you, Aric. Robert, a couple minutes for rebuttal.

Robert Robison: Sure. Yeah, that was really good, Aric. I actually—part of me agrees with a lot of what you're saying. But here comes the other part of me, I guess.

Aric LaBarr: It’s good to see you’re arguing with yourself. That’s always good. It’s always helpful.

Robert Robison: Yeah. So, I completely agree that we have to make a distinction between how we communicate the value of a model to the business, and how we actually measure how good a model is and compare how different models stack up against one another.

Okay. So you brought up a lot of good points. I think most of it can be condensed to: generally, you can narrow down what you really care about to either one class, or maybe one region of the data set. Maybe you only care about the high predictions if you're doing lift. And what I would say is, yes, while that is true in a lot of cases, those metrics can be very noisy, especially something like lift.

If you're looking at lift at, you know, the top one percent, you've now thrown away 99 percent of your data set in terms of sample size when you're evaluating. And in general, a model that discriminates and separates the classes well overall will tend to do so in all regions of your predictions moving forward.

There definitely are counterexamples to this, but I like to say: there are counterexamples, but your data set isn't one of them. So they're pretty rare, I would say, and I don't think we should base our common practice around the edge cases. Our common practice should be around what generally tends to be the case.

And if you have an AUC that's significantly higher in one model than the other, then I would rely on that model more, almost regardless of what the task is, unless, again, it's an edge case that's just completely out there.

Aric LaBarr: Sounds good. No, I agree. It's one of those things where, as you were describing it, we could come up with an example; we could all probably draw one right now where it's like, well, what about this? And you're exactly right. I love that notion of: in a real-world scenario, though, does this happen?

And so I think my rebuttal would focus in on exactly what you said about the communication of the model versus the model metrics we use to choose the model. You're exactly right: a lot of times, how we communicate models as data scientists may not be exactly how we judge models as data scientists. But I want to push back on that a little bit, because we're trying to reach the business user where they are and help them understand why we chose what we chose.

Being able to summarize our model in a way that they understand is helpful. But take the perspective of: what if I try to find the model that may save you the most money? Now, that model may not be the best at predicting all of my ones and zeros. It may not be the model with the highest AUC.

Maybe it has a much higher false positive rate, but in the end it costs me far fewer dollars, let's say, to make a mistake in one direction than the other. And with that being the case, instead of focusing on what we as data scientists would think of as a good model metric,

my argument would be that how the actual business user judges a model would be more important for us to pick models on. Think of a marketing case. I'm going to think of the spam sender instead of the spam receiver on your emails. Now, as a spam sender, I could imagine I would rather send out many more opportunities for people to click on my email, even though I know very few people will actually do it.

With the premise being: well, in the end, it costs me fractions of a penny to send an email, and if I send too many, I send too many. But if I miss one person who's going to send me all their money, as the bad guy, well, I want to make sure I capture that one person. And so from a cost perspective, a model's performance in terms of money saved or money earned could be the better measure, as compared to something like AUC, which might say, hey, this model over here is better.

But in the end, maybe it doesn't make me as much money. I can't believe it, Evan, you've made me argue for a bad guy in this scenario. I'm blaming you completely for this.
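For readers who want to see the cost framing in code, here is a hedged sketch; the dollar figures, the 1-percent response rate, and the synthetic scores are all invented for illustration and are not numbers from the episode. The point is only the mechanics of comparing operating points by expected dollars rather than by AUC alone.

```python
# Editorial sketch: when a missed responder costs far more than a wasted email,
# compare thresholds (and models) by expected cost rather than by AUC alone.
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FALSE_POSITIVE = 0.001   # sending one extra email: fractions of a penny
COST_FALSE_NEGATIVE = 50.0    # missing someone who would have responded

def expected_cost(y_true, scores, threshold):
    preds = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, preds, labels=[0, 1]).ravel()
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

rng = np.random.default_rng(7)
y = (rng.random(20_000) < 0.01).astype(int)                   # 1% responders
scores = np.where(y == 1, rng.beta(3, 2, y.size), rng.beta(2, 5, y.size))

# Sweep thresholds and report the cheapest operating point for this model;
# a second model could be compared on the same dollar scale.
thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y, scores, t) for t in thresholds]
best = int(np.argmin(costs))
print(f"cheapest threshold: {thresholds[best]:.2f}, expected cost: ${costs[best]:,.2f}")
```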

Evan Wimpey: Happy to advocate for our fellow bad guys. Although I'm not nearly as bad as a spam sender, please.

Aric LaBarr: I'll say, you said "fellow." I'm just making sure that that's recorded. That's recorded, no editing.

Evan Wimpey: Relative, relative. I don't know. Robert, do you care to respond to that? Or we could open it up to questions.

Robert Robison: I'm happy to give a quick response, sure, if that's allowed. So, I would say that I completely agree again with your notion of: we have to do whatever saves the business the most money, whatever makes them the most money. To me,

that's after you've chosen the model that does the best; you would then choose where to draw that threshold based on what makes the most money, if that makes sense. And it sounds like you're saying you would switch the order of that: you would kind of choose that threshold ahead of time and choose whatever model does best on that threshold,

and then go with that. The other comment I had is that I would think it's incredibly rare to see a case where you have a significantly higher AUC for one model, and a significantly higher lift metric, or a metric where you weighted false positives more than false negatives or something like that, in another model. I would think that almost always, in a real-world data set, a significantly higher AUC will be perfectly correlated with all of these other metrics. And in a less noisy sense, you're more likely to see AUC distinguish itself, whereas these other metrics might be more noisy.
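Here is a rough illustration of the noise argument Robert is making, again on synthetic data with invented names; the only point is that an estimate built from one percent of the rows tends to wobble more across resamples than one built from all of them.

```python
# Editorial sketch: bootstrap a test set and compare how much the AUC estimate
# moves around versus lift at the top one percent, which uses far fewer rows.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(11)
n = 10_000
y = (rng.random(n) < 0.02).astype(int)
scores = np.where(y == 1, rng.beta(4, 2, n), rng.beta(2, 4, n))

def lift_at_1pct(y_true, s):
    top = np.argsort(s)[::-1][: max(1, len(s) // 100)]
    return y_true[top].mean() / y_true.mean()

aucs, lifts = [], []
for _ in range(1_000):
    idx = rng.integers(0, n, n)                    # resample with replacement
    if y[idx].sum() == 0:
        continue                                   # need at least one spam
    aucs.append(roc_auc_score(y[idx], scores[idx]))
    lifts.append(lift_at_1pct(y[idx], scores[idx]))

# Compare relative spread; lift at 1% is typically far noisier than AUC.
print("AUC:     mean %.3f, sd %.3f" % (np.mean(aucs), np.std(aucs)))
print("lift@1%%: mean %.2f, sd %.2f" % (np.mean(lifts), np.std(lifts)))
```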

Evan Wimpey: If I could, Robert, I’ll jump in with moderator’s prerogative to ask a question.

I think most folks would generally agree that if you see significant divergence in AUC, you'd see significant divergence in these other metrics that might be more business relevant. But what if you see very, very similar AUCs? Then would it make sense to jump to another metric? Is it possible?

It's possible with anything, but is it more common, or more likely in the real world, to see very similar, indistinguishable AUCs where one model does a better job at the top of the data set or on the positive cases?

Robert Robison: Yeah, I mean, it's certainly possible. I'm generally talking about the more common cases.

If you have AUCs where there's a lot of overlap, and especially if you've plotted the uncertainty intervals for both and there's a lot of overlap in those, then what Aric's arguing, if I understand correctly, would most of the time be choosing a specific point, where you're narrowing in on one point

in that curve. And that's possible, but I would think, more often than not, if the AUCs don't have separation, then neither will the other metric at the point you're interested in.

Aric LaBarr: Yeah, and I agree with Robert's point. I think you're exactly right: focusing on a single point, at a cutoff threshold, is a little bit separate.

I would make an argument that you could have two separate models, much like what Evan said, where you have, let's say, an XGBoost model and something like a logistic regression. Let's say they have a similar AUC. In the case of a similar AUC, then, if these two models are similar in their predictive power, are there other metrics that we could potentially use to decipher between them a little bit better?

Now, you're right, I agree: in terms of lift and the overall model metrics, you'd probably see some similarities. But you might have one model that's quicker to evaluate in terms of scoring time. Or, trying to think of an example, what if you had a model that was more interpretable?

What if you had a model that would make more business sense? In those cases, again, the notion is: if I have models that are comparable on this metric, which model would I choose? I'd go beyond that metric and find another metric to help out. And it's always hard arguing that you can only use one thing; we all generally understand that idea. But that would be my argument, again going back to cost, where cost isn't necessarily just the cost of a false positive or false negative, but could now be a cost to the actual business.

Like, how do we understand the variables that come from this? How can I sell this idea to my boss? How quickly can I evaluate this model in a realistic scoring scenario? How well can I put it into production? All those costs may come into play as well.

Evan Wimpey: I very much appreciate that, Aric. And I'll ask you a question; it feels a little bit like a devil's advocate question here. We're thinking about cost. We're thinking about a data scientist or data analyst, somebody who's building a model, and what you're proposing is learning a lot of business context: what is important, how do we measure the false positives and the false negatives. There's some cost imposed on a model builder with that. Wouldn't it be easier if we just said, build us the model with the best AUC?

Then we reduce a lot of contextual knowledge that they have to have, we reduce a lot of startup costs, and we just say AUC is the metric. It's going to be really highly correlated, if not perfectly correlated, with these other metrics that are business important. So why can't the data scientist just build for AUC?

Aric LaBarr: To play devil's advocate to the devil's advocate: why do I need a data scientist at that point? I mean, seriously, if we're just going to throw something into a computer and let the computer pick the best number, then I don't need any context. It's just X1, X2, and X3. I'd better go ahead and just shut these doors.

I don't need to teach any more of you all. This would become much, much easier. But no, I think it's a fair point, Evan, that a lot of people do make, right? There's an argument for trying to step out of the context of the problem to see the data in a purer way. My counterargument to that is always the notion that the best models are the models where what goes into them is fully understood.

And this is a debate that maybe Robert and I will have another day, and I know we could both argue both sides of it. But for me, feature engineering is far more important than the model you select. If you can give me a good variable, I'll take a good variable nine times out of 10, 10 times out of 10, over a better model.

And it's because the variables are what drive models; models don't drive variables. And without context and understanding of the problem, why would I trust just what the model says? It says X1 is the best variable, and then you pull back the curtain and X1 is just something that doesn't make any intuitive business sense.

And the boss just laughs you out of the office. She's like, what are you even doing in here? Why did I hire you again? So that would be my counterargument to the devil's advocate. My angel's advocate? I don't know what it would be called.

Evan Wimpey: Aric, you're really speaking my language here.

First, empathizing with bad guys, then talking about the boss laughing you out of the office. You're living in my world. So, I've asked a few questions. I want to open it up to either of you. Do you have a question for your counterpart here?

Robert Robison: I can go first. I do have a list, but I would not like to debate you on what you just said, Aric, because I 100 percent agree.

I was actually going to bring that up before you even mentioned it. Because to me, when we talk about using resources, data science time, et cetera: if you're employing a data scientist, would you rather they spend their time calculating all of these metrics that correlate, and comparing them, and working out,

what's the uncertainty of lift at three percent, how do I compare that? Or would you rather they spend more time on feature engineering?

Aric LaBarr: Oh, no, I agree completely. Feature engineering, hands down, would be where I would spend all of my time. However, the downside is the whole context of the problem, right?

They can spend all their time spinning their wheels on the features, and then at the very end they can't communicate to me what's going to save me the most money or why it's going to be the most important. We still have to have some metric at the end. And so I agree, but then that would be my argument:

if I had to pick one metric, although I do love AUC (it's my baby, I love it, it's good), I'd go with something else that may be more business contextual. Because at the end of the day, although I spent all my time understanding this business, if I can't explain it to my boss in a way that they understand,

then all that time spent just goes out the window.

Evan Wimpey: Perfect. Before Robert goes down his list, Aric, do you have any questions for Robert?

Aric LaBarr: Oh man. Hmm, okay. I guess I wasn't prepared to ask questions. I didn't think this would be that kind of formal debate. All right. So I guess for me, Robert, a question I would have, outside of just AUC:

why would something like the ROC curve itself, and the overall idea of examining ROC curves, not be better than trying to summarize a ROC curve, which is a picture, into a single number?

Robert Robison: Just easier to communicate, easier to understand, takes less time. I think it can be difficult to interpret a ROC curve. I've actually spent probably a lot of time trying to get to the point where I can look at a ROC curve and immediately tell. I mean, in the extreme example, you have one model that goes up the vertical axis and then comes across, or you have one that goes kind of diagonally up and then rides the top axis, and getting to the point where you immediately understand the difference between those...

It has some value, but it takes so long, and it's so rarely seen where you have different models doing opposite things with the same features. It just seems maybe not worth the time and effort, from my point of view.
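For listeners who want to see what "examining the ROC curves" looks like in practice, here is a small sketch assuming scikit-learn and matplotlib; the dataset and the two model choices are placeholders for illustration, not anything the guests used.

```python
# Editorial sketch: plot ROC curves for two candidate models side by side
# and read the AUCs off the legend, rather than comparing single numbers.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_informative=5,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ax = plt.gca()
for name, model in [("logistic regression", LogisticRegression(max_iter=1_000)),
                    ("gradient boosting", GradientBoostingClassifier())]:
    model.fit(X_tr, y_tr)
    RocCurveDisplay.from_estimator(model, X_te, y_te, name=name, ax=ax)

ax.plot([0, 1], [0, 1], "k--", label="random guessing (AUC = 0.5)")
ax.set_title("ROC curves for two candidate spam models")
ax.legend()
plt.show()
```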

Evan Wimpey: All right. Perfect. Why look at a picture when there's a single number even somebody like Evan might be able to understand? Okay, in the interest of time, we're gonna close the Q&A, and let's take a couple of minutes, maybe two or three minutes, Robert, starting with you, to wrap up and state your case.

Robert Robison: Sure, sure. So, in summary, the AUC, for me, is the best single metric to use when evaluating a classification model. It summarizes how well a model can distinguish between the two classes it's been trained on and is trying to distinguish between. And because it uses the entire data set to make that inference, I would argue that when it concludes that one model is better, that is the most trustworthy thing you could use,

even if your business case focuses on one area over the other. In fact, we've started to do some research, and we haven't finished it yet, but the early returns are that sometimes, if your business metric is something like lift at one percent, the AUC is sometimes more highly correlated with the true lift at one percent

than measured lift at one percent is on your data set, at certain sample sizes and everything. So, to go back to what you said most recently, Aric, in your response: there's a difference to me between communicating the value of a model and distinguishing which model is better.

And while AUC can be communicated well to anyone (it's just a number between 0.5 and 1, higher is better, it's pretty intuitive), I think it's far better if, by the end, you're communicating it in dollars saved, or how much you're making off of this model.

What's the return on your investment if you use this model? However, in terms of which model is actually better, I would argue that using AUC will ultimately give you an answer that's more correlated with the business metric than if you had just used the business metric, because that uncertainty can be so high in a lot of real-world data sets.

Again, it's not always the case; there are edge cases for sure. But in the common use case, I think you'd be better off saving your time for things like feature engineering and such.

Evan Wimpey: Perfect. Thank you, Robert. Aric, a few minutes to summarize your position.

Aric LaBarr: Sure. Sounds good. I think, for me, the best summary of my position is that although I do think AUC is probably one of the best of a series of metrics to consider, making it

the only metric would not be a valuable way of looking at the entire picture. When it comes to explaining the context of something to a variety of different stakeholders, yes, maybe for us it could be something where we look at it as: this model may be best in terms of AUC. But throwing out those other metrics and not using them, even from a business context, would not be good practice on our end if we want to understand who we're eventually going to talk to.

And so I'm trying to make sure that we understand the full business problem, so that when we talk to the business user it's not just a "here, let me make this so you can understand it." It's "here, I actually understand your problem." And with understanding your problem came a different context for how I would have looked at this model.

And so I chose the model that best accounted for a variety of things, not just one that accounted only for a single metric.

Evan Wimpey: Perfect. Gentlemen, thank you so much. I’ve just gotten the results. They’re finalized for the debate.

Aric LaBarr: Wow. There must not have been a lot of viewers. That was really …

Evan Wimpey: We've got an AUC. I don't think that's too important, so I'll skip over that. Let's see, what's most important... This has been the friendliest debate I think possible. You guys have been great, very thoughtful. Really appreciate both of you coming on the show. We hope our listeners walk away from this able to thoughtfully implement classification models and aware of the metrics that they're using. So I really appreciate both of you guys for coming on the show today.

Robert and Aric, thank you guys.

Aric LaBarr: Thank you so much. It's a pleasure. Thanks, Evan.