Automating Data Pipelines and Network Entity Detection

The Challenge

Elder Research was tasked by a major government enforcement agency to identify networks of document preparers who worked together in a given year in order to surface potential collusion and abusive behaviors. Documents with particular information in common indicated a possible network connection. The goal was to enhance the client’s capabilities in two ways:

Improve the efficiency of identifying and reassigning misidentified documents. Poor data quality led to poor identification, which severely hampered investigative efforts.
Enable enforcement agents and analysts to see and explore networks of preparers, rather than only individual preparers, in order to prioritize the deployment of investigative resources.

The Solution

The project required several interrelated stages of data analysis using multiple data sources and formats. First, the document data was extremely large, so Elder Research needed to create a methodical and cost-effective approach to data ingestion and transformation. Then, the document data needed to be fused and interrelated in order to create useful information and insight.

Typographical errors were very common in the documents. As a result, preparers could appear to be linked when they did not actually work together, and preparers could appear to be unrelated when in fact they were working closely together. Elder Research used extensive data validation and entity resolution procedures to account for missing data and to ensure that documents and preparers were correctly identified and grouped together. To achieve this, our data science teams employed a novel network identification approach that quantified the strength of the relationship between each pair of document preparers. The idea behind this technique came from Elder Research’s extensive experience with text mining, as exemplified by the award-winning book, Practical Text Mining, co-authored by Dr. John Elder and five others (Elsevier, 2012).
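The case study does not publish the scoring method, so the following is only a minimal sketch of how a pairwise relationship strength could be computed with a text-mining-style weighting. It assumes each document reduces to a (preparer, shared attribute value) pair, that light formatting cleanup stands in for the much more extensive entity resolution described above, and that rarer shared values deserve more weight (an inverse-document-frequency idea). The names normalize and pairwise_strength and the sample values are hypothetical.

```python
import math
import re
from collections import defaultdict
from itertools import combinations


def normalize(value):
    """Crude cleanup (assumption): lowercase and drop punctuation and spaces."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def pairwise_strength(documents):
    """Score each pair of preparers by the attribute values they share.

    documents: list of (preparer_id, attribute_value) tuples.
    Returns {(preparer_a, preparer_b): strength} with preparer_a < preparer_b.
    """
    # Map each normalized attribute value to the preparers who used it.
    preparers_by_value = defaultdict(set)
    for preparer, value in documents:
        preparers_by_value[normalize(value)].add(preparer)

    n_preparers = len({preparer for preparer, _ in documents})
    strength = defaultdict(float)
    for preparers in preparers_by_value.values():
        if len(preparers) < 2:
            continue  # a value used by one preparer links nobody
        # IDF-style weight: a value shared by few preparers is more telling
        # than one shared by almost everyone.
        weight = math.log(n_preparers / len(preparers))
        for a, b in combinations(sorted(preparers), 2):
            strength[(a, b)] += weight
    return dict(strength)


# Hypothetical sample: two formatting variants of one phone number,
# plus a mailing address shared by three of the four preparers.
sample_documents = [
    ("P1", "555-0100"),
    ("P2", "555 0100"),
    ("P1", "PO Box 9"),
    ("P2", "PO Box 9"),
    ("P3", "PO Box 9"),
    ("P4", "12 Elm St"),
]
sample_strength = pairwise_strength(sample_documents)
```

In this sample, P1 and P2 score highest because they share both the rare phone number and the more common address, while P4 shares nothing and receives no pair score at all.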

Network of Preparers

Further phases of network analysis were used to sort through the web of connections and boil the relationships down to the most interesting, risky, and suspicious networks of preparers who worked together, as shown in the example below. Our team leveraged a variety of metrics-based and machine learning approaches to quantify both the risk of individual actors and institutions and the risk carried by the connections between them.
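As a rough illustration of boiling a web of connections down to ranked networks, the sketch below keeps only sufficiently strong links, groups preparers into connected components, and orders those components by total member risk. The threshold, the per-preparer risk scores, and the function name rank_networks are all assumptions made for this example; the metrics and models actually used on the project are not described here.

```python
from collections import defaultdict


def rank_networks(strength, risk, min_strength=1.0):
    """Group preparers into networks and rank the networks by total risk.

    strength: {(preparer_a, preparer_b): link strength}
    risk: {preparer: individual risk score}; missing preparers count as 0.
    Returns a list of (total_risk, sorted_members), highest risk first.
    """
    # Keep only links at or above the (assumed) strength threshold.
    neighbors = defaultdict(set)
    for (a, b), score in strength.items():
        if score >= min_strength:
            neighbors[a].add(b)
            neighbors[b].add(a)
    neighbors = dict(neighbors)

    seen = set()
    networks = []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        total_risk = sum(risk.get(member, 0.0) for member in component)
        networks.append((total_risk, sorted(component)))

    # Highest total risk first; member list breaks ties deterministically.
    networks.sort(key=lambda item: (-item[0], item[1]))
    return networks


# Hypothetical link strengths and risk scores.
sample_links = {("A", "B"): 2.0, ("B", "C"): 1.5, ("D", "E"): 3.0, ("E", "F"): 0.2}
sample_risk = {"A": 0.1, "B": 0.2, "C": 0.1, "D": 0.9, "E": 0.8}
ranked = rank_networks(sample_links, sample_risk)
```

Summing member risk favors large networks; ranking by average or maximum member risk instead would be an equally reasonable choice, depending on what investigators want to see first.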

To make it easy for end users to search and explore preparer relationships, Elder Research deployed RADR, its proprietary browser-based network visualization tool. This enabled the client to interactively explore and visualize relationships among preparers, and to trace those connections back to the raw document data.

Results

Elder Research developed an automated data pipeline to cleanse, transform, resolve, and fuse the data. The result was an extensive, highly connected graph database with over 2 billion nodes and edges that could be explored interactively. Our team also created network-based risk modeling approaches that surfaced highly suspicious connections within the graph database. We finalized the data pipeline to feed a data visualization tool used to identify and explore document preparer network relationships. Advanced analytics and data visualization automated 40% of the cases being investigated for improper preparer identification and reduced investigation time from 20 minutes per case to less than a minute per case, significantly improving investigative resource allocation.
