Automating Data Pipelines and Network Entity Detection

The Challenge

Elder Research was tasked with identifying networks of document preparers who worked together in a given year. Documents that shared particular information indicated a possible network connection. The goals were to enhance the client’s capabilities in two ways:

  1. Improve the efficiency of finding documents with incorrect preparer identification and reassigning them to the correct preparer identification when possible.
  2. Enable analysts to consider preparer networks, not only individual preparers, so that investigative resources can be deployed more effectively.

The Solution

The project required several interrelated stages of data analysis using multiple data sources and formats. Because the documents commonly contained typographical and other errors, preparers could appear to be linked when they did not actually work together. Elder Research used extensive data validation procedures to account for missing data and ensure that each document identified the proper preparer.

For the preliminary phase of network identification, Elder Research quantified the strength of the relationship between each pair of document preparers. The idea behind this technique came from Elder Research’s extensive experience with text mining, as exemplified by the award-winning book Practical Text Mining, co-authored by Dr. John Elder and five others (Elsevier, 2012).

Further phases of network analysis sorted through the web of connections and boiled the links down to the most likely networks of preparers who worked together, as shown in the example below.

Network of Preparers

To make it easy for end users to search and explore preparer relationships, Elder Research deployed its proprietary browser-based network visualization tool. This enabled the client to interactively explore and visualize relationships among preparers, and also to explore preparer connections based on the raw document data (without the pre-processing analytics).
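
The case study does not publish the scoring method or the network algorithms, so the following is only a minimal sketch of how the two analysis phases described above (pairwise relationship strength, then reduction to likely networks) might look. Everything in it is an assumption made for illustration: the toy records, the rarity weighting loosely borrowed from inverse document frequency in text mining, the threshold value, and the function names `pair_strengths` and `networks`.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (document_id, preparer_id, shared_attribute_value).
DOCUMENTS = [
    ("d1", "P1", "phone:555-0100"),
    ("d2", "P2", "phone:555-0100"),
    ("d3", "P2", "addr:12 Oak St"),
    ("d4", "P3", "addr:12 Oak St"),
    ("d5", "P4", "addr:9 Elm Ave"),
    ("d6", "P5", "addr:9 Elm Ave"),
    ("d7", "P1", "software:common"),
    ("d8", "P2", "software:common"),
    ("d9", "P3", "software:common"),
    ("d10", "P4", "software:common"),
    ("d11", "P5", "software:common"),
]


def pair_strengths(documents):
    """Score each preparer pair by the rarity-weighted values they share."""
    preparers_by_value = defaultdict(set)
    for _doc_id, preparer, value in documents:
        preparers_by_value[value].add(preparer)
    n_preparers = len({preparer for _doc_id, preparer, _value in documents})

    strengths = defaultdict(float)
    for preparers in preparers_by_value.values():
        # Assumed weighting: a value shared by few preparers is strong
        # evidence; a value shared by every preparer gets weight 0.
        weight = math.log(n_preparers / len(preparers))
        for a, b in combinations(sorted(preparers), 2):
            strengths[(a, b)] += weight
    return dict(strengths)


def networks(strengths, threshold):
    """Group preparers into connected components over sufficiently strong links."""
    neighbors = defaultdict(set)
    for (a, b), strength in strengths.items():
        if strength >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, result = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


print(networks(pair_strengths(DOCUMENTS), threshold=0.5))
```

On this toy data the sketch prints `[['P1', 'P2', 'P3'], ['P4', 'P5']]`. The value shared by all five preparers contributes nothing to any pair, which illustrates why weighting by rarity can keep commonplace information from linking preparers who did not actually work together.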


Elder Research developed an automated data pipeline to cleanse data and feed a data visualization tool used to identify and explore document preparer network relationships. Advanced analytics and data visualization automated 40% of the cases being investigated for improper preparer identification, reducing investigation time from 20 minutes per case to less than a minute per case and significantly improving investigative asset utilization.
