Natural Language Processing for RegTech: Uncovering Hidden Patterns in Regulatory Documents


Evan Mitchell

Date Published:
July 10, 2020

Natural language processing (NLP) is a branch of artificial intelligence aimed at giving computers the ability to use and understand human language and speech. Technology features we take for granted every day are a product of NLP. When you dictate a text message to Siri or ask Alexa the weather, that’s natural language processing. When our email services filter out spam, check our spelling and grammar, and even autocomplete entire messages, that’s NLP too. NLP is also a key part of Elder Research’s approach to RegTech.

Natural language processing has seven key technical capabilities:

  • Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques.
  • Topic modeling is a method based on statistical algorithms to help uncover hidden topics from large collections of documents.
  • Text categorization sorts texts into predefined taxonomies after being trained on examples labeled by humans.
  • Text clustering is a technique used to group texts or documents based on similarities in content.
  • Information extraction is used to automatically find meaningful information in unstructured text.
  • Named entity resolution extracts the names of people, places, organizations, and more, classifies them into predefined categories, and links each mention to the specific entity it refers to.
  • Relationship extraction is a capability that helps establish semantic relations between entities.
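As a toy illustration of the information extraction capability above, a few regular expressions can pull structured facts, such as dollar amounts and regulation citations, out of free text. The sample sentence and patterns below are invented for illustration; real extraction systems use far more robust methods:

```python
import re

# Toy information extraction: pull dollar amounts and CFR citations
# out of unstructured regulatory text with regular expressions.
text = (
    "The auditor identified questioned costs of $12,500 under "
    "2 CFR 200.516, and an additional $3,000 under 45 CFR 75.361."
)

# Dollar amounts: "$" followed by digits and commas, optional cents.
amounts = re.findall(r"\$[\d,]+(?:\.\d{2})?", text)
# Regulation citations of the form "<title> CFR <part>.<section>".
citations = re.findall(r"\d+\s+CFR\s+\d+(?:\.\d+)?", text)

print(amounts)    # → ['$12,500', '$3,000']
print(citations)  # → ['2 CFR 200.516', '45 CFR 75.361']
```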

Natural Language Processing for RegTech

When it comes to regulatory compliance, growing complexity has led to business leaders and teams of lawyers poring over paper and digital documents — sifting through dunes of regulatory text looking for single grains of data to answer, “What do we need to do to comply?” With NLP, organizations can generate summarized reports that answer what they must do to comply with the applicable rules and restrictions, without getting lost in the volume of information those documents contain.

Beyond its value in automating compliance for regulated enterprises, NLP is critical to at least two others out of the six major regulatory activities described in HData’s RegTech Manifesto. For example, Elder Research serves multiple U.S. federal regulatory agencies with analytics solutions that extract insights from unstructured reports that the agencies receive from the enterprises they regulate.

Elder Research has branched away from the traditional NLP landscape to leverage Deep Neural Networks (DNN) for applications such as text classification, image captioning, question answering, and language generation. DNNs enhance performance through their ability to learn more complex nonlinear functions, setting them apart from more basic NLP models.

Traditional NLP models use a technique called “Bag of Words” to represent text data for machine learning. Bag of Words counts how often each word appears in a document, relative to the other words in it, but ignores word order. The problem with ignoring word order is that two sentences can produce identical model input yet carry a positive, neutral, or negative meaning depending on how their words are arranged. DNNs, by contrast, allow us to distinguish meaning based on word order as well as word choice.
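A minimal sketch of that order-insensitivity problem: the two sentences below have roughly opposite meanings, yet produce the exact same bag-of-words representation (here modeled with Python's `collections.Counter`):

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    # Lowercase, split on whitespace, and count word occurrences;
    # word order is discarded entirely.
    return Counter(sentence.lower().split())

a = bag_of_words("the audit was complete not accurate")
b = bag_of_words("the audit was accurate not complete")

print(a == b)  # → True: identical bags despite opposite meanings
```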

Example: Extractive Summarization For Audit Findings

In one project, an extractive summarization method was applied to A-133 Single Audit documents, which report on the results of audits of recipients of federal grants. Because such audit documents are mostly unstructured, grant program managers must read through them, identify auditors’ key findings, and then analyze those findings manually. Extractive summarization can streamline the process.

The project focused on the “Findings” sections of about 20,000 Single Audit documents filed between 2016 and 2018. First, the text was preprocessed to remove irrelevant words like “the”, “a”, and “is”. Next, a geometric representation (a vector) was generated for each remaining word. Averaging the vectors of all the words in each sentence yielded a single vector per sentence. Plotting these sentence vectors on a graph revealed sentence clusters, as shown in the figure below.

The sentence closest to the centroid of each cluster was extracted. Finally, by stringing the centroid-proximate sentences together, extractive summaries were automatically created.
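The pipeline described above can be sketched in a few lines of Python. This is a hedged toy version: deterministic hash-derived vectors stand in for real word embeddings, a single cluster stands in for the full clustering step, and the stopword list, helper names, and findings text are all invented for illustration:

```python
import hashlib
import math

STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "for"}

def word_vector(word: str, dim: int = 8) -> list[float]:
    # Stand-in for a learned word embedding: a deterministic
    # pseudo-random vector derived from a hash of the word.
    digest = hashlib.sha256(word.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def sentence_vector(sentence: str) -> list[float]:
    # Average the vectors of the non-stopword tokens in a sentence.
    words = [w for w in sentence.lower().split() if w not in STOPWORDS]
    vecs = [word_vector(w) for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def closest_to_centroid(sentences: list[str]) -> str:
    # Treat all sentences as one cluster, compute its centroid,
    # and extract the sentence whose vector lies nearest to it.
    vecs = [sentence_vector(s) for s in sentences]
    centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
    return min(sentences,
               key=lambda s: math.dist(sentence_vector(s), centroid))

findings = [
    "The grantee did not maintain adequate time and effort records",
    "Payroll certifications were missing for three employees",
    "Records supporting payroll charges were incomplete",
]
print(closest_to_centroid(findings))  # the most "central" finding
```

In a real deployment, each cluster of sentence vectors would get its own centroid, and the extracted centroid-proximate sentences would be concatenated into the summary.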

The summaries were then judged by non-expert human evaluation and compared to human-generated summaries using the ROUGE metric. ROUGE is a set of metrics used for evaluating automatic summarization and machine translation software in natural language processing. The initial goal was to fully automate summarization of the A-133 Single Audits, but human input was needed at various stages due to the documents’ differing writing styles, content, and context.
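One of the simplest ROUGE variants, ROUGE-1 recall, scores a candidate summary by the fraction of the reference summary's unigrams it covers. A minimal sketch, with example sentences invented for illustration:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    # ROUGE-1 recall: fraction of the reference summary's unigrams
    # that also appear in the candidate summary (counts clipped to
    # the candidate's counts).
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())

reference = "payroll records were incomplete for three employees"
candidate = "payroll records for three employees were missing"

print(round(rouge1_recall(candidate, reference), 3))  # → 0.857
```

Production evaluations typically use an established ROUGE implementation and also report ROUGE-2 and ROUGE-L alongside ROUGE-1.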

This process resulted in a concise, understandable summary of every finding from each of the tens of thousands of audit documents.

Ultimately, NLP helps regulatory agencies, regulated enterprises, and markets understand unstructured regulatory documents without countless hours spent researching, reading, and analyzing. It helps analysts increase efficiency, derive actionable insights, and uncover hidden topics from large collections of rules, filings, or reports. RegTech is a growing industry, powered by advances like those Elder Research is pursuing to create efficiency for both the regulatory sector and the private sector.

Note: This blog was originally published by HData and is republished with permission.