Text Mining Unlocks Valuable Data from Scanned Insurance Documents

The Challenge

Each life insurance application includes many supporting documents that underwriters must analyze to decide whether to accept or decline the case, and that analysis is time-consuming. A leading insurance provider identified an opportunity to use text mining to unlock valuable applicant information from scanned images (PDF files) and use it to improve the underwriting risk model that informs whether new cases are accepted or declined.

The Solution

An Attending Physician Statement (APS) is a medical history summary from a physician, hospital, or medical facility that has treated the patient. It is one of the most sound and proven forms of additional background information for assessing medical risk. Mining text from digitized files that come in multiple formats and vary in document quality can be extremely challenging. Based on the specific challenges of the APS documents, a pipeline architecture was developed to maximize the yield of the important text features, as shown below.

The documents were processed by Percept, a tool developed by Elder Research that extracts text from documents in a variety of formats, including PDF, Word, and HTML, and converts that text into a format suitable for use by predictive models.
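
To make the idea of converting extracted text into model-ready inputs concrete, here is a minimal sketch in Python. It is not Percept and does not reflect its interface. The term list, function names, and sample text are hypothetical, chosen only to illustrate how noisy text from a scanned document might be cleaned and reduced to simple count features.

```python
import re
from collections import Counter

# Hypothetical terms of interest; the case study does not list the actual features.
TERMS = ("diabetes", "hypertension", "smoker")


def clean_ocr_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks, collapse whitespace, lowercase."""
    text = re.sub(r"-\s*\n\s*", "", raw)   # "hyper-\ntension" -> "hypertension"
    text = re.sub(r"\s+", " ", text)       # newlines and repeated spaces -> one space
    return text.strip().lower()


def term_features(raw: str, terms=TERMS) -> dict:
    """Count how often each term of interest appears in the cleaned text."""
    tokens = Counter(re.findall(r"[a-z]+", clean_ocr_text(raw)))
    return {f"count_{term}": tokens[term] for term in terms}


# Hypothetical text as it might come out of a scanned page.
sample = (
    "Patient has hyper-\ntension.  Type 2 DIABETES  diagnosed 2015.\n"
    "Diabetes is controlled."
)
features = term_features(sample)
print(features)
# {'count_diabetes': 2, 'count_hypertension': 1, 'count_smoker': 0}
```

Each document becomes one row of named numeric fields, which is the kind of structured output that can sit alongside an application's other structured data. A production pipeline would also need OCR, layout handling, and more robust term matching, none of which are shown here.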


The text extraction framework provided valuable inputs for the underwriter in an easy-to-scan structured data format that could be combined with other structured data to prioritize which applications are examined in depth. Prioritizing case decisions based on improved risk prediction increased efficiency and reduced cost by minimizing the time subject matter experts spend reviewing new applications.
