The project required several interrelated stages of data analysis using multiple data sources and formats. First, the volume of document data was extremely large, so Elder Research needed a methodical and cost-effective approach to data ingestion and transformation. Next, the document data had to be fused and cross-referenced to produce useful information and insight. The documents often contained typographical errors, so preparers could appear to be linked when they did not actually work together and, conversely, could appear unrelated when they were in fact working closely together. Elder Research used extensive data validation and entity resolution procedures to account for missing data and to ensure that documents and preparers were correctly identified and grouped. Our data science teams employed a novel network identification approach to achieve this, quantifying the strength of the relationship between each pair of document preparers. The idea behind this technique came from Elder Research’s extensive experience with text mining, as exemplified by the award-winning book Practical Text Mining, co-authored by Dr. John Elder and five others (Elsevier, 2012).
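To make the entity resolution and pairwise-strength ideas concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Elder Research's actual method: the fuzzy name matching via `difflib`, the 0.85 similarity threshold, the Jaccard-based strength measure, and all function names and sample data are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(cleaned.split())


def resolve_entities(names, threshold=0.85):
    """Greedily map each raw name to the first canonical name it resembles.

    The threshold is an assumed value chosen for illustration only.
    """
    canonical = []  # normalized names accepted as distinct entities
    mapping = {}    # raw name -> canonical name
    for raw in names:
        norm = normalize(raw)
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, norm, c).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(norm)
            match = norm
        mapping[raw] = match
    return mapping


def pair_strength(documents, mapping):
    """Score each preparer pair by the Jaccard overlap of their documents."""
    docs_by_preparer = defaultdict(set)
    for doc_id, preparers in documents.items():
        for raw in preparers:
            docs_by_preparer[mapping[raw]].add(doc_id)
    strengths = {}
    for a, b in combinations(sorted(docs_by_preparer), 2):
        shared = docs_by_preparer[a] & docs_by_preparer[b]
        if shared:
            union = docs_by_preparer[a] | docs_by_preparer[b]
            strengths[(a, b)] = len(shared) / len(union)
    return strengths


# Hypothetical sample data: "Jane Smyth" is a typo of "Jane Smith".
documents = {
    "d1": ["Jane Smith", "Bob Jones"],
    "d2": ["Jane Smyth", "Bob Jones"],
    "d3": ["Jane Smith", "Carla Diaz"],
}
raw_names = [name for preparers in documents.values() for name in preparers]
mapping = resolve_entities(raw_names)
strengths = pair_strength(documents, mapping)
```

In this toy data the misspelled name is merged with its correct form before any relationships are scored, so the typo neither hides a real link nor creates a spurious extra preparer. A production pipeline would likely need blocking, additional identifying fields, and a tuned threshold.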
Further phases of network analysis were used to sort through the web of connections and distill the relationships down to the most interesting, risky, and suspicious networks of preparers who worked together, as shown in the example below. Our team used a variety of metrics-based and machine learning approaches to quantify risk and risky connectivity between actors and institutions in the network.
To make it easy for end users to search and explore preparer relationships, Elder Research deployed RADR, its proprietary browser-based network visualization tool. This enabled the client to interactively explore and visualize relationships among preparers and to explore preparer connections based on the raw document data.
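As a rough illustration of the network-filtering step described earlier, the sketch below groups preparers into networks and ranks those networks by risk. It is not the RADR tool or the team's actual metrics: the connected-components grouping, the `min_strength` cutoff, the mean-risk ranking, and the sample scores are all assumptions made for the example.

```python
from collections import defaultdict


def find_networks(strengths, min_strength=0.5):
    """Return connected components over edges at or above min_strength."""
    neighbors = defaultdict(set)
    for (a, b), strength in strengths.items():
        if strength >= min_strength:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, networks = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first traversal collects everyone reachable from start.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        networks.append(component)
    return networks


def rank_networks(networks, risk_scores):
    """Sort networks by the mean risk score of their members, highest first."""
    def mean_risk(network):
        return sum(risk_scores.get(p, 0.0) for p in network) / len(network)
    return sorted(networks, key=mean_risk, reverse=True)


# Hypothetical pairwise strengths and per-preparer risk scores.
strengths = {
    ("a", "b"): 0.9,
    ("b", "c"): 0.7,
    ("c", "d"): 0.2,  # below the cutoff, so it does not join the two groups
    ("d", "e"): 0.8,
}
risk_scores = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.9, "e": 0.7}
ranked = rank_networks(find_networks(strengths), risk_scores)
```

Dropping weak edges before grouping keeps incidental connections from merging unrelated preparers into one large network, and ranking by a summary score surfaces the groups most worth an analyst's attention first.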