Fraud detection is about finding needles in haystacks and requires reliably labeled instances of fraudulent (needle) and non-fraudulent (straw) behavior. A predictive model can be trained using these labels to learn the underlying patterns in the input variables that best separate fraud from non-fraud cases, and thereby estimate the fraud-likeness of any future case. Typically, the interesting cases are very scarce, in which case we might have to carefully up-sample the rare class and/or down-sample the abundant class to help the model pay enough attention to the rare class to be useful. But what do we do when labels are not just rare, but are completely absent?
It is reasonable to assume that fraud cases will – in some way – be anomalies, or instances that do not conform to expectation. Anomalies are different enough from the bulk of cases to raise suspicion that they may have been generated by an entirely different process (e.g., fraud, data errors). How ‘different’ an instance is compared to its peers can be measured in a variety of ways. For example, a simple statistical method considers substantial deviations from the average to be anomalous while density-based methods assume that normal instances tend to cluster together, thereby marking fringe instances as anomalies.
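To make the "simple statistical method" concrete, here is a minimal sketch that flags values lying far from the average, measured in standard deviations (a z-score). The function name `zscore_anomalies` and the default threshold are illustrative assumptions, not something prescribed by any particular method; note that with very small samples a single extreme value inflates the standard deviation, so a lower threshold may be needed.

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=3.0):
    """Return (index, z_score) pairs for values far from the mean.

    The threshold is an assumption to tune for your data, not a rule.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical: nothing stands out
    scores = [(v - mu) / sigma for v in values]
    return [(i, z) for i, z in enumerate(scores) if abs(z) > threshold]

# Hypothetical transaction amounts; the last one is far from its peers.
amounts = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
flagged = zscore_anomalies(amounts, threshold=2.5)
```

This treats one variable at a time, so it cannot see cases that are unusual only in combination (for example, a value that is ordinary on its own but rare given the other inputs).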
But beware, not all anomalies are alike! One may exhibit radically different characteristics from another, and it isn’t always clear what makes each one anomalous. They remind a colleague of mine of that famous opening line of Tolstoy’s: “All happy families are alike; each unhappy family is unhappy in its own way.” (Anna Karenina, 1878).
Comparison by Example
Consider a scenario: You are interested in identifying NBA superstars. The supervised approach would involve creating a new label, superstar, to flag top-performing players. Using attributes of each player’s performance, you build a model to predict the likelihood of a player being a superstar. (This could be easy; of course, superstars have great stats, but the label directs our model to define the line between superstar and (merely) great pro players.)
Alternatively, an unsupervised approach can be used to find unusual cases in the data. The assumption here is that at least one type of anomaly will be a good match for superstar. A Mahalanobis distance approach assumes the data follow a multivariate Gaussian distribution and measures each case’s sigma-adjusted distance (that is, distance scaled by the spread and correlation of the input variables) to the mean of the data cloud. Using it on all NBA players in the 2017-2018 regular season results in the following list of top 10 anomalies:
What does NBA superstar LeBron James have in common with Chinanu Onuaku? LeBron was drafted early in the first round (a good indicator of an athlete’s superstar potential) and has had a storied career, earning multiple MVP awards and winning three league championships in 15 seasons as a professional. He is arguably the best player of his generation, an extremely positive anomaly. Chinanu Onuaku, on the other hand, just finished his second season, which culminated with a trade to the Dallas Mavericks, where he was waived (released) four days later. These two players could hardly be less similar. Yet our algorithm flags both LeBron and Chinanu as anomalies. A heat map of what qualifies each player chosen for the “Anomalous Top 10” is shown in Figure 1.
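For readers who want to experiment, the Mahalanobis scoring described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the NBA analysis: the function names are hypothetical, the data are synthetic, and the pseudo-inverse is used only as a guard against a singular covariance matrix. Two extreme rows are planted at opposite ends of the cloud to mirror the point that very different cases can both score as anomalies.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row from the data mean, scaled by the covariance."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates singular covariance
    diff = X - mu
    # Quadratic form diff_i^T * cov_inv * diff_i for every row i
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))

def top_anomalies(X, k=10):
    """Indices and distances of the k rows farthest from the mean."""
    d = mahalanobis_distances(X)
    order = np.argsort(d)[::-1][:k]
    return order, d[order]

# Synthetic stand-in for player statistics (assumption: 3 numeric features).
rng = np.random.default_rng(0)
typical = rng.normal(size=(200, 3))
extremes = np.array([[8.0, 8.0, 8.0], [-8.0, -8.0, -8.0]])  # "great" and "poor"
X = np.vstack([typical, extremes])
indices, distances = top_anomalies(X, k=2)
```

The distance carries no sign, so the score alone cannot tell a standout season from a single-game oddity; that distinction has to come from inspecting which variables drive each case.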
What Can We Learn From the Anomaly Detection Algorithm?
For our goal of identifying superstar players using anomaly as a proxy, our results are hit-or-miss. Several of the players identified in this list had remarkable seasons: LeBron James averaged 27 points and nine assists per game, and James Harden averaged 30 points and roughly 10 free throw attempts per game (due to his ability to drive to the basket and pick up fouls). Other players (Chinanu Onuaku, Aaron Jackson, Jeremy Lin) were deemed anomalous for other reasons entirely. Most notably, all three played only one regular season game last season and, in Lin’s case, it was a game in which he made every free throw he attempted!
We must remember that the objective of anomaly detection, particularly by Mahalanobis distance, is to identify players on the fringes of the data universe. Here, it identified players with great seasons (more assists, blocks, field goals, etc. than average) as well as those on the other end of the spectrum (or who stood out for playing time, say, whether they did well or not). A visual depiction of the data, aided by dimensionality reduction to a plane via principal components, is shown in Figure 2.
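A projection of this kind can be sketched as follows. The code is a rough illustration rather than the procedure behind Figure 2: standardizing each variable before the decomposition is an assumption (the original analysis may have scaled the data differently), and the function name `project_to_plane` is hypothetical. It expects at least two input variables.

```python
import numpy as np

def project_to_plane(X):
    """Project rows of X onto their first two principal components.

    Returns the 2-D coordinates and the share of variance each axis explains.
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled
    Z = (X - X.mean(axis=0)) / std  # standardize (assumed preprocessing)
    # SVD of the centered data: rows of Vt are the principal directions
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    coords = Z @ Vt[:2].T
    explained = (S ** 2) / np.sum(S ** 2)
    return coords, explained[:2]

# Synthetic stand-in for per-player statistics (assumption: 5 features).
rng = np.random.default_rng(1)
stats = rng.normal(size=(50, 5))
coords, explained = project_to_plane(stats)
```

Plotting `coords` as a scatter and highlighting the highest-scoring cases gives a quick view of where the anomalies sit relative to the bulk, keeping in mind that two components may capture only part of the total variance.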
Using anomaly detection yields an interesting and diverse set of results. However, it is easy to forget a key point: it might not reveal exactly what you were looking for (superstars).
Which Method Should You Use?
If you have plenty of labeled instances of what you’re looking for, supervised modeling is the most powerful approach to use. If you have none, you must try anomaly detection. If you have only a few labeled cases, try both techniques. If a cluster of odd points is near your labeled fraud cases, you may have found more fraud instances following that same scheme. If a cluster is somewhere else, you may have discovered a new scheme! That extremely valuable outcome is not at all easy to achieve with supervised learning alone. Just be sure to brainstorm alternative explanations for the different types of anomalies you find. More than once, we at Elder Research have discovered something even more valuable than what we were originally seeking by carefully studying the “unhappy families” uncovered.