Surveying the commercial analytics landscape, it is clear that organizations have worked hard over the last decade to collect, organize, and govern their data. Many have invested in business intelligence or similar tools to discover valuable insights and present them in intuitive ways. Others have gone further to build, deploy, and maintain collections of machine learning and AI applications, helping to identify reproducible and timely patterns in critical data and plan for future business.
Most analytics applications are built for regular data and normal processes, which cover about 99% of use cases. But at business scale, the remaining 1% of irregular or anomalous cases can translate to millions or billions of data elements, each potentially connected to other data sets. This mixing of regular and irregular data can be a serious problem for machine learning or AI models that expect to process only typical data.
But an opportunity hides here as well. Sometimes anomalous data don’t stem from data-entry errors, storage corruption, or faulty manipulations. When anomalies correspond to fraudulent or otherwise unethical actions, worrisome sensor readings, or physically impossible entries, more value might come from identifying and remedying anomalies than from modeling the normal cases.
Anomalies can be positive discoveries, too: In one instance, we uncovered a drug compound that was likely to win (and did win) FDA approval, even though the client’s in-house methods (and data preparation techniques) had assessed the compound as unremarkable. In (very) rare cases, we have found market-trading strategies that really do work out of sample; these unusual cases are extremely valuable!
Identifying irregular data is the job of anomaly detection, a set of techniques and algorithms that separate usual data from outliers—often without any prior examples of what unusual looks like. In simple data sets, where records are described by only one or two attributes, anomaly detection might not be very challenging at all. However, as the number of attributes grows, it becomes increasingly difficult to separate anomalies from normal data. As data set complexity grows, anomalies can look very typical from any single direction (or projection) but still possess characteristics that are completely different from the rest of the data.
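To make the last point concrete, here is a minimal sketch in Python with NumPy. It is an illustration, not the method used in any project described here. The data are synthetic, the variable names are hypothetical, and the Mahalanobis distance is used only as one simple example of a score that considers all attributes jointly. The sketch builds two strongly correlated attributes, then adds a record whose individual values are ordinary but whose combination breaks the correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" records (an assumption for illustration):
# two attributes that are strongly positively correlated.
true_cov = np.array([[1.0, 0.95],
                     [0.95, 1.0]])
normal = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=2000)

# Hypothetical anomaly: each value is unremarkable on its own,
# but the pair runs against the correlation seen in the normal data.
anomaly = np.array([1.5, -1.5])

# View 1: check each attribute separately (one projection at a time).
mu = normal.mean(axis=0)
sigma = normal.std(axis=0)
z_scores = np.abs((anomaly - mu) / sigma)

# View 2: score the record against both attributes jointly,
# using the Mahalanobis distance under the estimated covariance.
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))
diff = anomaly - mu
mahalanobis = float(np.sqrt(diff @ cov_inv @ diff))

# Same joint score for every normal record, for comparison.
normal_diffs = normal - mu
normal_d = np.sqrt(np.einsum("ij,jk,ik->i", normal_diffs, cov_inv, normal_diffs))

print("per-attribute |z| of anomaly:", np.round(z_scores, 2))
print("joint distance of anomaly:   ", round(mahalanobis, 2))
print("largest joint distance among normal records:", round(float(normal_d.max()), 2))
```

Under these assumptions, a per-attribute threshold would not flag the record, while the joint score places it well outside every normal record. With only two attributes a scatter plot would reveal the same thing; the difficulty described above comes from having many attributes, where no single plot or projection is guaranteed to expose the pattern.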