Curbing Fraud by Leveraging Analytics


Jericho McLeod

Date Published:
February 16, 2023

Define the Problem

Fraud is an ever-growing problem worldwide, with the estimated losses for consumers in 2020 reaching billions of dollars, federal tax fraud as high as half a trillion dollars, and credit card fraud affecting more than half of Americans.


To fight back, businesses and government agencies are increasingly turning to analytics to help detect and prevent fraud. Organizations analyze data to identify suspicious patterns, trends, and anomalies that indicate potential fraud. They can also use analytics to create predictive models that flag potentially fraudulent transactions before they are completed. With the power of analytics, organizations can prevent future fraud and uncover past fraud that would otherwise have gone unnoticed. Analytic models are the best protection against financial fraud and are quickly becoming the industry standard.

How Do Analytics Help with Fraud Detection?

Fraud is caused by withheld or inaccurate information. Because we have limited resources to actively monitor the world around us, there are openings for attack. Fraud investigators have limited time and resources, and their effectiveness depends on maximizing the positive impact they can have within these constraints.

When the type of fraud being committed is well known, analytics can apply historical knowledge to new cases using statistics and supervised machine learning. That is, computers can use historical cases to find similar new ones much faster than investigators could by examining material manually. However, such techniques are much less useful against previously undiscovered types of fraud. For those, we rely on anomaly and outlier detection methods. In contrast with supervised machine learning, which aims to label cases as “fraud” or “not-fraud”, these methods don’t have labels but aim to find cases that stand out in some way. This is a type of “automated tip” where the computer locates unusual cases, allowing investigators to focus their limited resources on a subset of promising cases.

Figure: Fraud cases are shown in red, normal cases in grey, and investigator capacity and selection as a shaded blue rectangle. The right-hand panel shows a higher proportion of fraudulent cases within the investigator’s selection, illustrating the outcome of a fraud detection model.
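The idea of fitting the most promising cases into a fixed investigator capacity can be sketched in a few lines of Python. This is a minimal illustration, not the article's method: the case labels, risk scores, and the function names `select_cases` and `hit_rate` are all hypothetical, and the baseline simply takes cases in arrival order.

```python
def select_cases(scores, capacity):
    """Return the indices of the `capacity` highest-scoring cases."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:capacity]


def hit_rate(selected, is_fraud):
    """Share of the selected cases that are actually fraud."""
    return sum(is_fraud[i] for i in selected) / len(selected)


# Hypothetical data: ten cases, three of which are fraud (1 = fraud).
is_fraud = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
# Hypothetical model scores; the model is imperfect and misses case 8.
risk_scores = [0.10, 0.25, 0.90, 0.05, 0.40, 0.15, 0.70, 0.30, 0.35, 0.20]

capacity = 3  # investigators can only review three cases

arrival_order_selection = list(range(capacity))        # no model: first three cases
model_selection = select_cases(risk_scores, capacity)  # model: three riskiest cases

baseline_rate = hit_rate(arrival_order_selection, is_fraud)
model_rate = hit_rate(model_selection, is_fraud)
print(f"baseline hit rate: {baseline_rate:.2f}, model hit rate: {model_rate:.2f}")
```

With the same capacity, the scored selection contains two fraud cases instead of one, which is the effect the figure describes.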

What Kinds of Analytics Are Used for Fraud Detection?

Analyzing data for fraud without prior labeled data to learn from is an unsupervised machine learning task. We’re looking for cases set apart in some way from others, such as outliers. The term anomaly is often used interchangeably with outlier, but in this article, we are going to refer to outliers as extreme values, and anomalies as those cases found by more complex methods.

The Starting Point

The starting point in outlier and anomaly detection is frequently to look at each characteristic, or variable, one at a time, i.e., univariate metrics. This can reveal outliers, as in the accompanying example. A type of fraud that would show up in this kind of analysis is retail return fraud, where the quantity of returns for people committing fraud is likely to be unexpectedly high.
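A univariate check of this kind can be sketched with the Python standard library. The example below uses a modified z-score based on the median and the median absolute deviation, with the commonly cited cutoff of 3.5. That choice of statistic and cutoff is an assumption made for illustration, and the return counts are invented.

```python
from statistics import median


def modified_z_scores(values):
    """Robust z-scores based on the median and median absolute deviation (MAD)."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; values are too concentrated for this method")
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
    return [0.6745 * (v - center) / mad for v in values]


# Hypothetical number of returns per customer over some period.
returns_per_customer = [1, 0, 2, 1, 3, 2, 1, 0, 2, 1, 24, 2]

scores = modified_z_scores(returns_per_customer)
cutoff = 3.5  # rule-of-thumb threshold, an assumption for this sketch
flagged = [i for i, s in enumerate(scores) if abs(s) > cutoff]
print("customers flagged for review:", flagged)
```

Median-based statistics are used here because a single extreme value inflates the mean and standard deviation, which can hide the very outlier being sought.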


Expanding on this, we can examine two characteristics at the same time using bivariate statistics. In the first example, we can see an outlier that is apparent in one dimension but not the other. That is, if we flattened the data to the bottom of the plot and made a histogram like the previous example, the point would be an outlier. If we flattened the data to the left side and made a histogram, it would not be.

Bivariate or Multivariate Methods

In the second plot, it would not be an outlier for either dimension, but would be an outlier if both were considered together. This is where bivariate or multivariate methods are useful, such as Mahalanobis distance. Tax fraud that omits sources of income, but claims related deductions or expenses, can be highlighted using these methods.
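To make the Mahalanobis distance concrete, here is a small two-variable sketch in plain Python. The reference data, the two candidate points, and the helper names `fit_2d` and `mahalanobis_2d` are hypothetical. The distance is the square root of (x − μ)ᵀ Σ⁻¹ (x − μ), written out by hand for the 2×2 case.

```python
import math


def fit_2d(points):
    """Estimate means, sample variances, and sample covariance of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return mean_x, mean_y, var_x, var_y, cov_xy


def mahalanobis_2d(point, params):
    """Mahalanobis distance of a point from the fitted 2-D distribution."""
    mean_x, mean_y, var_x, var_y, cov_xy = params
    det = var_x * var_y - cov_xy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # Quadratic form using the closed-form inverse of a 2x2 covariance matrix.
    squared = (var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / det
    return math.sqrt(squared)


# Hypothetical reference cases where the two variables rise together.
reference = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 4.2),
             (5, 4.8), (6, 6.1), (7, 6.9), (8, 8.0)]
params = fit_2d(reference)

on_trend = (8, 8.0)     # large on both variables, but follows the pattern
off_trend = (2.5, 6.5)  # unremarkable on each variable alone, odd in combination

# Per-variable z-scores of the off-trend point, for comparison.
z_x = (off_trend[0] - params[0]) / math.sqrt(params[2])
z_y = (off_trend[1] - params[1]) / math.sqrt(params[3])

d_on = mahalanobis_2d(on_trend, params)
d_off = mahalanobis_2d(off_trend, params)
print(f"z-scores of off-trend point: {z_x:.2f}, {z_y:.2f}")
print(f"Mahalanobis distance: on-trend {d_on:.2f}, off-trend {d_off:.2f}")
```

The off-trend point sits within one standard deviation of the mean on each variable, yet its Mahalanobis distance is far larger than that of the on-trend point, because the distance accounts for how the two variables normally move together.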

Utilizing Shape and Location Characteristics

In the following example, even multivariate methods that consider distance would fail, even with additional information showing shape. However, with the collection of shape and location characteristics, the anomaly can still be identified. Methods for identifying this type of anomaly, with mixed data types, include clustering, isolation forests, and CADE (Classifier Assisted Density Estimation). See brief explainer videos on Isolation Forests and CADE below.
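The core mechanism of an isolation forest can be sketched in plain Python: build many trees from random axis-aligned splits and score each case by how few splits it takes to isolate it. This is a simplified illustration under stated assumptions, not a production implementation. It handles numeric features only (not the mixed shape and location data described above), the data are synthetic, and the function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively partition points using random features and random split values."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    low = min(p[dim] for p in points)
    high = max(p[dim] for p in points)
    if low == high:
        return {"size": len(points)}
    split = rng.uniform(low, high)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(point, node, depth=0):
    """Number of splits needed to reach the point's leaf, adjusted for leaf size."""
    if "size" in node:
        return depth + average_path_length(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    """Scores between 0 and 1; higher means easier to isolate, i.e. more anomalous."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = average_path_length(sample_size)
    return [2.0 ** (-(sum(path_length(p, t) for t in trees) / n_trees) / norm)
            for p in points]


# Synthetic data: 60 typical cases around the origin plus one unusual case.
data_rng = random.Random(42)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(60)]
points.append((6.0, -6.0))  # the planted anomaly, last in the list

scores = anomaly_scores(points)
most_anomalous = scores.index(max(scores))
print("index of most anomalous case:", most_anomalous)
```

Because the planted case lies far from the others, random splits tend to separate it within a few levels, which gives it the shortest average path and therefore the highest score.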

While complex methods will work on simpler cases, it is wise to use the simplest method that will solve a particular problem. Simpler models are easier to explain, understand, update, and maintain, and are usually faster and cheaper to operate.

Anomaly detection methods use the information available to understand what a typical case looks like, and then highlight the cases that veer away from those characteristics. In most problems, characteristics move together in common patterns; for instance, as income goes up on tax filings, expenses also tend to increase; as credit card charges spread out geographically, so do travel-related expenses, and so on. For example, in a study on healthcare fraud, the quantity of claims compared to total claim amounts was used to identify outliers as potential cases of fraud. Finding cases where variable combinations differ from expectations is how we prioritize cases to make the most of the limited investigative resources available.

Building Fraud Analytics

CRISP-DM Diagram

The search for fraud risk can be structured using the CRISP-DM process. This process is less a strict end-to-end description of steps that always happen in order than a general flow between the important phases of an analytics project. Feedback is a necessary component in fraud analytics and is represented by the counterclockwise arrows in the process diagram. New or unexpected discoveries during the modeling process, for instance, can lead to a better understanding of the subject matter, and in turn to improvements in the model design.

With fraud analytics in particular, information from the results of investigations can be fed back into the modeling process to improve models and thus better select cases for investigators. In early phases of this type of analytics, investigative results can be leveraged to refine models by pruning unnecessary features, refining useful features, and identifying gaps where features may be missing. Over longer periods, sufficient feedback may even enable supervised machine learning methods to be engaged. When this occurs, it can be important to continue using unsupervised methods to monitor for new types of fraud, as supervised learning only identifies cases that are similar to what has occurred in the past.

Moving Forward

Using Analytics Results

Deploying an analytics model is a significant milestone, but it is not the end of the journey! A model needs to contribute to desired outcomes to be successful. Frequently, this means providing model outputs to investigators in a useful manner, such as a report or dashboard, that can be integrated with existing case management systems to prioritize investigations. It can also mean using the results to identify processes and controls that can prevent future fraud.

If you have ever had a credit or debit card blocked from transactions while traveling, you have experienced this first-hand: the location of card-usage became an outlier, and the card issuer attempted to prevent future fraud by locking the account and requiring that the account owner verify whether the change was legitimate. For a good client this “false alarm” creates a cost (negative experience) which must be weighed against the opposite error, a “false dismissal” of actual fraud, to find the best operating balance.
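The balance between false alarms and false dismissals can be illustrated by attaching a cost to each kind of error and choosing the alert threshold with the lowest total cost. The scores, labels, candidate thresholds, and cost figures below are hypothetical, as are the function names `total_cost` and `best_threshold`.

```python
def total_cost(threshold, scores, is_fraud, cost_false_alarm, cost_missed_fraud):
    """Total cost of the errors made when flagging cases scoring at or above threshold."""
    cost = 0.0
    for score, fraud in zip(scores, is_fraud):
        flagged = score >= threshold
        if flagged and not fraud:
            cost += cost_false_alarm   # good client inconvenienced
        elif not flagged and fraud:
            cost += cost_missed_fraud  # actual fraud dismissed
    return cost


def best_threshold(candidates, scores, is_fraud, cost_false_alarm, cost_missed_fraud):
    """Candidate threshold with the lowest total cost on the given cases."""
    return min(candidates, key=lambda t: total_cost(
        t, scores, is_fraud, cost_false_alarm, cost_missed_fraud))


# Hypothetical scored transactions with known outcomes (1 = fraud).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
is_fraud = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]
candidates = [0.3, 0.5, 0.7, 0.9]

# When missed fraud is expensive, a low threshold (more alerts) wins.
cautious = best_threshold(candidates, scores, is_fraud,
                          cost_false_alarm=5, cost_missed_fraud=100)
# When false alarms are expensive, a higher threshold (fewer alerts) wins.
lenient = best_threshold(candidates, scores, is_fraud,
                         cost_false_alarm=100, cost_missed_fraud=5)
print("threshold if missed fraud is costly:", cautious)
print("threshold if false alarms are costly:", lenient)
```

The same model and the same scores lead to different operating points depending on the relative costs, which is why the cost estimates deserve as much attention as the model itself.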

Maintaining Models

Model maintenance is critical for generating long-term value from a fraud analytics model. Circumstances change over time, and models need to reflect these changes to remain effective. For credit cards, the quantity of online transactions has grown over time; models that have not been updated would not reflect current rates of online transactions accurately, and so might wrongly select cases for investigation. Similarly, inflation affects the average dollar amounts of transactions, so it needs to be monitored and models updated to preserve accuracy.
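One simple way to monitor for this kind of change is to compare the mix of a variable at training time with its recent mix. The sketch below uses the Population Stability Index (PSI), a technique chosen here for illustration and not named in the article. The transaction counts are invented, and the 0.25 alert level is a common rule of thumb, not a fixed standard.

```python
import math


def population_stability_index(expected_counts, actual_counts, floor=1e-6):
    """PSI between two categorical distributions given as {category: count}."""
    categories = set(expected_counts) | set(actual_counts)
    expected_total = sum(expected_counts.values())
    actual_total = sum(actual_counts.values())
    psi = 0.0
    for category in categories:
        # Floor the shares so an empty category does not cause log(0).
        e = max(expected_counts.get(category, 0) / expected_total, floor)
        a = max(actual_counts.get(category, 0) / actual_total, floor)
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical transaction channel mix when the model was built vs. today.
training_mix = {"in_store": 800, "online": 200}
recent_mix = {"in_store": 550, "online": 450}

psi = population_stability_index(training_mix, recent_mix)
needs_review = psi > 0.25  # rule-of-thumb alert level, an assumption
print(f"PSI = {psi:.3f}, review model: {needs_review}")
```

A check like this does not say how to update the model; it only signals that the data the model sees no longer resemble the data it was built on.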

Changes can also come from more direct sources, such as feedback from investigators, new business understanding, or a new fraud-prevention process that alters the landscape of risk assessment. For example, a new procedure that adds oversight to a previously high-risk activity, and effectively reduces its risk, calls for a model update so that events from this source are less likely to be flagged as fraud.

To improve the CRISP-DM framework, we need to include model maintenance as the connection between deployment and business understanding. Maintenance begins after model deployment and is initiated by understanding the environment in which the model operates.


The field of fraud detection and prevention is growing rapidly, thanks to analytics being increasingly present on the front lines. The amount of transaction data available makes full human review impossible, but analytics can make the first pass on all the data to score risk, enabling you to focus investigative efforts where the impact will be highest. Elder Research has experience in a wide range of analytics models used in fraud detection and prevention, and in managing the end-to-end life cycle for analytics models. Whether you are taking your first steps in fraud analytics or have complex systems in place that need updates, we can help you reach your goals.