How can we identify outliers that are often difficult to find in large, multidimensional data? In this video, Elder Research data scientist Garrett Pedersen demonstrates how anomaly detection methods like CADE help locate outliers. Anomaly detection tools also help data practitioners assess risk and identify potential cases of fraud across multiple industries.
More on Anomaly Detection: White Paper
This paper is a guide for anomaly detection as a tool in data exploration and modeling. The paper distinguishes between outliers and anomalies and provides five powerful methods for detecting outliers, which in turn may help identify anomalies.
Was your last credit card transaction fraudulent?
Can your smartwatch predict whether or not you’re going into cardiac arrest?
Is that Twitter user a real person or a bot?
These are some of the questions that data practitioners attempt to answer using a method known as anomaly detection.
What is Anomaly Detection?
We’re gonna get to that Twitter example in a moment, but first, let’s talk about what anomaly detection actually is. Anomalies are observations that deviate from what is normal or expected, and we might use anomaly detection to identify those outliers, the needle in a haystack. There are two main reasons we would want to identify those outliers:
1. To purge the data of observations that may have been recorded incorrectly, or that are having an outsized influence on our data even though the observation itself isn’t that significant.
2. To learn more about those observations. We wanna find that needle in a haystack. Could this potentially be fraudulent activity?
Classifier Adjusted Density Estimation: CADE
Now there are many different ways that we can practice anomaly detection. The one we’re gonna talk about today is Classifier Adjusted Density Estimation, also known as CADE.
We’re gonna go over that Twitter example using CADE and how we might be able to detect bots.
There are many different groups that research whether or not they can predict a bot, and they use hundreds if not thousands of variables to make their decisions. Here we only have two. The first is the following-to-followers ratio: how many people am I following versus how many people are following me? The second is the number of retweets in the last several days, which is just sharing posts from other users. How many times did this particular user retweet something? This is something that researchers look at to try and determine if these bots are spreading misinformation, and if we should go ahead and try and identify those.
So looking at this data, you can already probably tell that these three observations are likely anomalies because they’re kind of away from the majority of the data here. We’re gonna apply CADE to try and verify that these are more likely to be classified as anomalies.
The first step of CADE is to assume that all of your data, all of these blue points right here, even these seeming outliers, are not anomalies.
Second, we’re gonna overlay the data with uniformly distributed fake data. And we’re gonna assume that all of those fake data are anomalies. So if I go ahead and do this, I wanna make sure that the fake data stays within the bounds of the real data and is spread evenly across that range.
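To make those first two steps a little more concrete, here is a minimal Python sketch of the overlay. The function name `uniform_fake_points` and the six example accounts are made up for illustration and are not from the video. It finds the minimum and maximum of each variable in the real data and draws fake points uniformly inside that box.

```python
import random

def uniform_fake_points(real_points, n_fake, seed=0):
    """Draw n_fake points uniformly inside the bounding box of real_points."""
    rng = random.Random(seed)
    # Per-variable min and max define the bounds the fake data must stay inside.
    lows = [min(column) for column in zip(*real_points)]
    highs = [max(column) for column in zip(*real_points)]
    return [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        for _ in range(n_fake)
    ]

# Hypothetical accounts: (following-to-followers ratio, retweets in recent days)
real = [(0.8, 3), (1.1, 5), (0.9, 2), (1.3, 4), (1.0, 6), (9.5, 40)]
fake = uniform_fake_points(real, n_fake=200)

# Step one labels every real point "not anomaly" (0);
# step two labels every fake point "anomaly" (1).
labelled = [(p, 0) for p in real] + [(p, 1) for p in fake]
```

The fixed seed is only there so the sketch gives the same fake points every time you run it.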
So when I think of CADE I like to remember something that my mom told me when I was a kid, and that’s that if I spend a lot of time with kids that misbehave or get into trouble, then other people are gonna assume that I am getting into trouble and misbehaving. CADE kind of acts in the same way. Look at all these fake data, these red Xs here that we’re assuming are all anomalies. We’re going to use a classification model with this data to try and predict whether each of these observations is more likely to be an anomaly or not.
And we can see here, this observation is surrounded by a lot of red Xs. So a classification model like a logistic regression or a random forest might see that this point has similar attributes to all these anomalies. So this would be scored highly as an anomaly. Same, too, with these observations.
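Here is a rough end-to-end sketch of that idea in Python, using only the standard library. A few loud assumptions: the accounts are simulated, the name `cade_style_scores` is made up, and a simple k-nearest-neighbors vote stands in for the logistic regression or random forest mentioned above, just to keep the example dependency-free. It is a simplified illustration of scoring by "how many of my neighbors are fake," not a reference implementation of CADE.

```python
import math
import random

def cade_style_scores(real_points, n_fake=500, k=10, seed=0):
    """Score each real point by the share of fake points among its k nearest neighbors."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*real_points)]
    highs = [max(c) for c in zip(*real_points)]
    spans = [(hi - lo) or 1.0 for lo, hi in zip(lows, highs)]

    # Rescale every variable to [0, 1] so neither one dominates the distances.
    real_scaled = [
        tuple((v - lo) / s for v, lo, s in zip(p, lows, spans))
        for p in real_points
    ]
    # Uniform fake data over the same (rescaled) bounds, all labelled anomalous (1).
    fake_scaled = [tuple(rng.random() for _ in lows) for _ in range(n_fake)]
    labelled = [(p, 0) for p in real_scaled] + [(p, 1) for p in fake_scaled]

    scores = []
    for i, point in enumerate(real_scaled):
        # Leave the point itself out so it cannot vote for its own label.
        others = labelled[:i] + labelled[i + 1:]
        nearest = sorted(others, key=lambda item: math.dist(point, item[0]))[:k]
        scores.append(sum(label for _, label in nearest) / k)
    return scores

# Simulated accounts: (following-to-followers ratio, retweets in recent days)
rng = random.Random(42)
typical = [
    (max(0.1, rng.gauss(1.0, 0.3)), max(0.0, rng.gauss(5.0, 2.0)))
    for _ in range(60)
]
suspects = [(9.5, 40.0), (1.2, 38.0), (9.0, 4.0)]  # three points far from the crowd

scores = cade_style_scores(typical + suspects)
typical_scores, suspect_scores = scores[:60], scores[60:]
print("suspect scores:", suspect_scores)
print("highest typical score:", max(typical_scores))
```

A score near 1 means a point is surrounded almost entirely by fake data, so it looks like an anomaly. A score near 0 means it sits among plenty of real observations. In practice you would swap the neighbor vote for whatever classifier you prefer and use its predicted probability of the "fake" class as the score.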
Now, in the context of why these might be Twitter bots, Twitter bots tend to have a very high following-to-followers ratio. They’re looking for a lot of different content that they can just spread rapidly. So not many users are following their accounts, but they’re following, like, millions. And then similarly, they spread information quickly, much more quickly than a person could. This could be like, you know, 20 or 30 retweets in a second. So something like this, we might expect to be a bot. This one is unusual, but maybe it’s not a bot because they just happen to tweet a lot and they have a more regular following-to-followers ratio. And then same thing with this one. Not a lot of retweeting going on, but it’s following a lot more people than are following it. Maybe that’s a new user that’s just gotten their account set up.
So CADE, again, doesn’t necessarily answer the question of bot or not, but it helps us identify anomalies so that we can dig into those specific examples and try to figure that out. And the main takeaway I want you to walk away from this video with is that anomaly detection, particularly CADE, is useful for flagging anomalies and helping us find that needle in a haystack across many, many dimensions.