Business Insights Meet Analytics Skills in Anomaly Detection

Author: Tom Shafer

Date Published: March 12, 2025

A survey of the commercial analytics landscape makes clear that organizations have worked hard over the last decade to collect, organize, and govern their data. Many have invested in business intelligence or similar tools to discover valuable insights and present them in intuitive ways. Others have gone further to build, deploy, and maintain collections of machine learning and AI applications, helping to identify reproducible and timely patterns in critical data and plan for future business.

Most analytics applications are built for regular data and normal processes—handling about 99% of use cases. But at business scale, the remaining 1% of irregular or anomalous cases can translate to millions or billions of data elements—each potentially connected to other data sets. This mixing of regular and irregular data can be a serious problem for machine learning or AI models that only expect to process typical data.

But an opportunity hides here as well. Sometimes anomalous data don’t stem from data-entry errors, storage corruption, or faulty manipulations. When anomalies correspond to fraudulent or otherwise unethical actions, worrisome sensor readings, or physically impossible entries, more value might come from identifying and remedying anomalies than from modeling the normal cases.

Anomalies can be positive discoveries, too: In one instance, we uncovered a drug compound that was likely to (and did) win FDA approval, even though the client’s in-house methods (and data preparation techniques) had assessed the compound as unremarkable. In (very) rare cases, we have found market-trading strategies that really do work out of sample; these unusual cases are extremely valuable!

Identifying irregular data is the job of anomaly detection, a set of techniques and algorithms that separate usual data from outliers—often without any prior examples of what unusual looks like. In simple data sets, where records are described by only one or two attributes, anomaly detection might not be very challenging at all. However, as the number of attributes grows, it becomes increasingly difficult to separate anomalies from normal data. As data set complexity grows, anomalies can look very typical from any single direction (or projection) but still possess characteristics that are completely different from the rest of the data.
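To make the last point concrete, here is a minimal sketch on synthetic data, not drawn from any client work. It assumes two strongly correlated attributes and a hypothetical record, `candidate`, whose values are ordinary one attribute at a time but contradict the relationship between them. The Mahalanobis distance is used only as one simple illustration of a multivariate score; all names and parameters are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data (an assumption for illustration):
# two attributes with correlation of roughly 0.95.
n = 1000
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.95 * x1 + rng.normal(0.0, np.sqrt(1 - 0.95**2), n)
normal = np.column_stack([x1, x2])

# Hypothetical record: each value is unremarkable alone,
# but the pair runs against the correlation.
candidate = np.array([1.5, -1.5])

# View 1: one attribute at a time (per-attribute z-scores).
mean = normal.mean(axis=0)
std = normal.std(axis=0)
z_scores = np.abs((candidate - mean) / std)

# View 2: both attributes jointly (Mahalanobis distance).
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))


def mahalanobis(points, mean, cov_inv):
    """Distance of each row of `points` from `mean`, scaled by covariance."""
    diff = np.atleast_2d(points) - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))


candidate_distance = mahalanobis(candidate, mean, cov_inv)[0]
# Distance exceeded by only 1% of the normal records.
typical_distance = np.quantile(mahalanobis(normal, mean, cov_inv), 0.99)

print("per-attribute z-scores:", z_scores.round(2))
print("joint distance:", round(candidate_distance, 2),
      "vs. 99th percentile:", round(typical_distance, 2))
```

Each single-attribute view shows a value well within two standard deviations, while the joint distance lies far beyond what almost all normal records reach. With many attributes the same effect applies, but there are far more combinations in which it can occur.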


Working Together to Find Meaningful Anomalies

Because anomaly detection is targeted at unusual cases, it requires more nuance than typical analytics and is less amenable to plug-and-play solutions than other modeling tasks. This is a domain in which statistical practice and expert judgement must work together to identify what constitutes typical behavior in a data set, what is atypical, and, crucially, what atypical data are relevant to the business.

Looking through a large-enough data set, we’d expect to find all kinds of odd or interesting things, but only a fraction of these discoveries would turn out to be useful to the business.

If we don’t work to surface meaningful anomalies, we risk wasting time pursuing outliers of no value. This has real-world implications.

Our Blackmarker product, for example, employs quality assurance controls to guard against user-input errors and ensure that documents are correctly redacted. Anomaly detection plays an important role in this system, identifying those documents that don’t match expectations.

Simply deploying anomaly-detection models, however, isn’t good enough because there are many more ways to be different than there are to be typical, and not all of these are relevant to the problem. Providing targeted anomaly-detection models that raise alarms at the right times requires more data collection, more model tuning, and close work with product groups.


Different Kinds of Anomalies in Different Data

This business–analyst relationship is vital because different kinds of anomalies can hide, and hide in different ways, in different data sets. One-size-fits-all rarely applies in these situations; consider a few examples from our own experience:

Predictive Maintenance

One component of our Aegir predictive maintenance solution incorporates anomaly detection for sensor data collected over time. Time-series data can make detection more difficult because it provides several ways for anomalous points to hide. Anomalies can present as a break in the data’s pattern over time, even if the magnitudes of the individual anomalous points are unremarkable, so Aegir’s machine learning tools must be able to account for these kinds of patterns.
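The sketch below illustrates a pattern break of this kind on synthetic data. It is not a description of how Aegir works. It assumes a sensor with a known 24-step cycle and a hypothetical "stuck" reading pinned at a value that is perfectly ordinary in magnitude. The detector compares each point with the median value observed at the same position in the cycle; the period, the threshold, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sensor signal (assumption): a 24-step cycle plus noise.
period = 24
n_cycles = 30
t = np.arange(period * n_cycles)
signal = np.sin(2 * np.pi * t / period) + rng.normal(0.0, 0.1, t.size)

# Hypothetical fault: the sensor sticks at 0.0 near the top of a cycle.
start, stop = 387, 394
signal[start:stop] = 0.0

# Magnitude-only view: z-score against the whole series.
global_z = np.abs(signal - signal.mean()) / signal.std()

# Pattern-aware view: compare each point with the median value
# seen at the same phase of the cycle.
phase = t % period
profile = np.array([np.median(signal[phase == p]) for p in range(period)])
residual = signal - profile[phase]

# Robust noise scale from the median absolute deviation (MAD).
mad = np.median(np.abs(residual - np.median(residual)))
robust_sigma = 1.4826 * mad
flagged = np.flatnonzero(np.abs(residual) > 5 * robust_sigma)

print("largest magnitude z-score in the fault:", global_z[start:stop].max().round(3))
print("indices flagged by the pattern-aware check:", flagged)
```

The stuck readings sit near the middle of the signal's range, so a magnitude check sees nothing unusual, while the phase-aware residual exposes them. Real sensor data rarely has such a clean, known cycle, so a production detector would need a more careful model of expected behavior.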

Insider Threats

In another effort, augmenting internal protection and insider threat capabilities, we were able to use contextual information to design a simple, interpretable detector. Accounting for problem context and basic physics allowed us to provide an interpretable indicator for potential anomalies and avoided the need for advanced AI or machine learning (ML).

Fraud Detection

In many fraud-detection applications or others involving the identification of out-of-bounds behavior, the subjects we want to identify are the most motivated to cover their tracks. These active, adversarial relationships can complicate anomaly detection. Even in these cases, it’s usually possible to identify bad actors by applying ML and AI approaches guided by business input and delivering analysis tailored for business needs.

We Win by Matching Technique to Context

In all these examples, it is the combination of the right analytics and models with expert knowledge that produces a useful solution. As experts in machine learning, prediction, and AI, we bring techniques that can identify anomalies in all kinds of data, including tabular data, time-series, graph networks, images, and geospatial data. There are many models and algorithms that can be applied to these problems, ranging from complex, full-scale AI solutions to robust, low-maintenance techniques.

But without working hand in hand with the business, analytics risks matching the wrong solution to the problem.

By combining business expertise with state-of-the-art anomaly detection expertise, we provide tailored anomaly detection that generates real business value. The biggest successes come from matching ML expertise to business knowledge to find meaningful anomalies in the data, whether those anomalies are errors to be fixed, bad actors to be caught, or new opportunities to be explored.


Recommended Further Reading

If you’d like to dive deeper into technical details, we’ve curated a few articles, research papers, and software packages that we’ve found useful or interesting, both in our day-to-day work and while writing this overview.

Prof. Yue Zhao has compiled a list of more than 100 papers, presentations, benchmarks, and tools related to anomaly detection, making this GitHub repository a go-to for anomaly detection resources.

Chandola et al.’s article has helped to shape our own thinking and teaching of anomaly detection over the years. Han et al.’s more recent paper also goes to great lengths to organize the many kinds of anomalies and to measure the algorithms that try to detect them.

PyOD collects implementations of many popular anomaly detection algorithms into a single Python library. Version 2, released in December 2024, builds on the original to include many more deep-learning models. Meanwhile, scikit-learn is a standard Python machine learning library and includes several anomaly-detection algorithms.
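As a starting point for experimenting with these libraries, the following sketch uses scikit-learn's `IsolationForest`, one of its built-in anomaly detectors, on synthetic data. The data, the three planted outliers, and the parameter values are assumptions made for illustration, not recommendations for any particular problem.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data (assumption): one dense cluster plus three planted outliers.
inliers = rng.normal(0.0, 1.0, size=(500, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])
X = np.vstack([inliers, planted])

# Fit without labels; the model never sees which rows were planted.
model = IsolationForest(n_estimators=200, random_state=0)
model.fit(X)

# score_samples returns lower values for more anomalous rows.
scores = model.score_samples(X)
ranking = np.argsort(scores)  # most anomalous first

print("most anomalous rows:", ranking[:5])
print("planted outlier scores:", scores[500:].round(3))
```

Ranking by score, rather than relying only on the model's default inlier/outlier labels, leaves the decision about how many cases to review with the people who know the business context.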