Mike Thurber, Lead Data Scientist and fraud specialist at Elder Research, presented Elder Research’s fraud detection methodology at Predictive Analytics World for Government last year. Consider the scenario of detecting fraudulent insurance claims, such as the audacious “accidental” death scheme in the 1944 noir film Double Indemnity.
A long-established firm with a long-running life insurance product, like the fictional Pacific All Risk Insurance Company of “Double Indemnity” fame, will probably have a set of known past fraudulent claims. The characteristics of those frauds can then be used to train statistical models that predict whether a given future claim is likely to be fraudulent.
In many cases, though, an organization does not have a large or well-organized set of “labeled” fraud cases – for example, if the firm or the product is new. This is where Mike’s expertise comes in. The first stage of developing a fraud detection capability is anomaly detection: if you do not have known fraud cases to start from, you can at least identify cases that differ from the rest (outliers) and merit further investigation.
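As a minimal sketch of that first stage, the snippet below flags unusual claims for review. The claim features and data are made up, and Isolation Forest (from scikit-learn) is just one common unsupervised anomaly detector – the article does not say which method Elder Research uses.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical claim features: [claim_amount, days_since_policy_start]
claims = rng.normal(loc=[10_000, 400], scale=[2_000, 100], size=(500, 2))
claims[:3] = [[95_000, 12], [80_000, 20], [70_000, 5]]  # plant a few unusual claims

# Fit an unsupervised detector; contamination is the assumed outlier rate.
detector = IsolationForest(contamination=0.01, random_state=0).fit(claims)
flags = detector.predict(claims)            # -1 = outlier, 1 = inlier
to_investigate = np.where(flags == -1)[0]   # case indices to send to investigators
print(f"{len(to_investigate)} claims flagged for investigation")
```

Note that the output here is not "fraud/not fraud" but a short queue of cases worth a human investigator's time – exactly the role anomaly detection plays before any labels exist.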
Enter Supervised Learning
As the organization gains maturity, the investigations of the anomalous cases yield labels – cases are confirmed as fraudulent or not. Domain expertise is used to refine the feature set that describes all cases. These newly labeled cases, and the improved feature set, can then start to be used in statistical models to predict whether a new case is fraudulent or not.
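Once investigations have produced labels, a supervised model can score new claims. The sketch below uses synthetic data; the engineered features and the choice of gradient boosting are illustrative assumptions, not the actual pipeline described in the talk.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1_000
# Hypothetical engineered features: claim amount, policy age in days,
# and a binary flag a domain expert might suggest (e.g. recent policy change).
X = np.column_stack([
    rng.normal(10_000, 3_000, n),
    rng.integers(1, 2_000, n),
    rng.integers(0, 2, n),
])
# Synthetic "investigation outcomes": large claims on young policies skew fraudulent.
y = ((X[:, 0] > 12_000) & (X[:, 1] < 600)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]  # fraud probability for each new claim
print(f"test accuracy: {model.score(X_te, y_te):.2f}")
```

In practice the scores would be used to rank incoming claims, with the highest-probability cases routed to investigators first.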
The investigations also yield “confirmed not fraud” labels, giving supervised learning both positive and negative examples to work with. So we end up with three categories:
- Investigated and confirmed fraud
- Investigated and confirmed not fraud
- Not investigated
The cases that were not investigated did not score strongly enough as outliers to merit investigation. At this point, the organization might refine its anomaly detection model to bring more cases into the labeling process, as more is learned about which features matter.
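The three-way split and the widening investigation loop can be sketched as bookkeeping. The anomaly scores, case names, and threshold schedule below are all invented; the point is how each round of investigation grows the labeled set and lets the cutoff be loosened.

```python
# Hypothetical anomaly scores from the unsupervised stage (higher = more unusual).
anomaly_scores = {f"case_{i}": s for i, s in
                  enumerate([0.95, 0.91, 0.72, 0.55, 0.40, 0.31, 0.12])}

labels = {}       # case_id -> "fraud" or "not_fraud", filled in by investigators
threshold = 0.9   # initial outlier cutoff: only extreme cases get investigated

def triage(scores, labels, threshold):
    """Partition cases into the three categories from the text."""
    fraud = [c for c, lab in labels.items() if lab == "fraud"]
    not_fraud = [c for c, lab in labels.items() if lab == "not_fraud"]
    queue = [c for c, s in scores.items()
             if s >= threshold and c not in labels]  # flagged, not yet labeled
    return fraud, not_fraud, queue

# Round 1: investigate the most extreme outliers; record the findings.
_, _, queue = triage(anomaly_scores, labels, threshold)
labels.update({"case_0": "fraud", "case_1": "not_fraud"})

# Round 2: with more ground truth, lower the cutoff to widen the queue.
threshold = 0.5
fraud, not_fraud, queue = triage(anomaly_scores, labels, threshold)
print(fraud, not_fraud, queue)  # ['case_0'] ['case_1'] ['case_2', 'case_3']
```

Everything below the moving threshold stays in the third category – not investigated – until a refined model or new features pull it into the queue.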
This is an excellent example of a holistic data science approach that combines feature engineering, input from domain experts, unsupervised and supervised learning working together, and iterated model improvement informed by a growing body of ground truth.