AWS Data Pipeline Design Supports Natural Language Processing

The Challenge

A nationwide leader in the fast-food industry hired Elder Research to help design a cloud-based data pipeline supporting Natural Language Processing (NLP), allowing the company to gain business insights from its customer and organizational data assets. NLP is a branch of artificial intelligence (AI) in which computers analyze textual data to understand, interpret, and manipulate human language. NLP modeling is a difficult technical and statistical discipline in its own right, and it also demands familiarity with a wide variety of cloud services, along with the architectural judgment to stitch together cloud solutions that are cost-conscious, secure, fast, reliable, and, ultimately, consumable by business end users.

The Solution

After a strategic in-person discovery session to understand the relevant AWS infrastructure, data environment, and priority use cases, our team of data strategists, data scientists, and data engineers designed an end-to-end NLP pipeline to be deployed in the client’s Amazon Web Services (AWS) cloud environment. The integrated pipeline would tap into a wide variety of data sources across the enterprise, and automatically ingest, integrate, transform, and serve textual data for NLP. We recommended a phased, collaborative project plan to deploy the pipeline for two priority use cases.

The NLP data pipeline design enabled:

  • Data availability across the enterprise – Moving NLP data from vendor-driven silos to organization-curated central locations leads to shared, actionable business intelligence.
  • Data cleanliness and formatting – Defining and applying consistent categories across data platforms unifies insights for each segment; enabling multi-attribution lets data-rich engagements reflect every category referenced in the language; and transcribing audio from telephone interactions into text accurately captures each engagement.
  • Tool enhancements – Real-time transcription of audio to text, and the analytics it enables, can proactively identify potential problems and solutions while agents are on the line with an operator or customer. Using NLP to categorize web-based inputs will provide more nuanced and consistent data.
  • Future readiness – By harnessing the power of NLP and applying what is learned across a variety of business units, the client can continue their history of innovation and evolve to meet the consumer and owner-operator expectations of the future.
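The multi-attribution idea above can be sketched in a few lines: rather than forcing each piece of feedback into a single bucket, tag it with every category its language references. The category names and keyword lists below are illustrative assumptions, not the client's actual taxonomy (which in production would come from an NLP model, not keyword matching).

```python
# Minimal sketch of multi-attribution tagging for free-text feedback.
# Categories and keywords are hypothetical examples, not client data.
CATEGORY_KEYWORDS = {
    "food_quality": {"cold", "stale", "undercooked", "fresh"},
    "service_speed": {"slow", "wait", "line", "fast"},
    "staff": {"rude", "friendly", "helpful", "manager"},
}

def tag_categories(comment: str) -> list[str]:
    """Return every category whose keywords appear in the comment."""
    words = set(comment.lower().split())
    return sorted(cat for cat, keywords in CATEGORY_KEYWORDS.items()
                  if words & keywords)

print(tag_categories("The food was cold and the staff were rude"))
# ['food_quality', 'staff']
```

A single engagement that mentions both food quality and staff behavior is counted under both categories, so downstream frequency analysis reflects everything the customer actually said.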

Two priority use cases were identified for the NLP pipeline:

  • Customer feedback – Improve the current customer service and customer engagement survey data processes by: implementing a query engine that lets analysts nimbly extract and manipulate relevant data; auto-classifying customer compliments and complaints into consistent categories and definitions to capture more robust data; applying frequency analysis to identify patterns and prioritize solutions; and applying sentiment analysis to understand how customers are feeling and determine the best type of intervention.
  • Franchise owner support – Improve support processes for franchise owners who contact the internal help desk by integrating data across data stores, transcribing phone calls in real time, and using a Resolution Suggestion Model to surface better knowledge-base recommendations that resolve the problem during the call.

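The frequency and sentiment analyses described for the customer-feedback use case can be sketched with plain Python. The lexicons and sample comments below are illustrative assumptions; the actual pipeline would rely on trained NLP models rather than word lists.

```python
from collections import Counter

# Hypothetical sentiment lexicons for illustration only.
POSITIVE = {"great", "friendly", "fast", "clean"}
NEGATIVE = {"slow", "cold", "rude", "dirty"}

def sentiment(comment: str) -> str:
    """Label a comment positive/negative/neutral by lexicon hits."""
    words = comment.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def top_terms(comments: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Frequency analysis: most common sentiment-bearing terms."""
    counts = Counter(w for c in comments for w in c.lower().split()
                     if w in POSITIVE | NEGATIVE)
    return counts.most_common(n)

feedback = ["Service was slow and the food was cold",
            "Slow line, but friendly staff",
            "Great food, clean store"]
print([sentiment(c) for c in feedback])
# ['negative', 'neutral', 'positive']
```

Even this toy version shows how frequency analysis surfaces the dominant complaint ("slow" appears in two of three comments), which is the pattern-spotting behavior the pipeline delivers at scale.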
The NLP Data Pipeline design incorporated various AWS services:

  • Amazon Simple Storage Service (S3) – Saving raw data in S3 allows new analytics to be performed on data that was not previously leveraged. S3 integrates directly with many Amazon services, including Amazon Transcribe, making it a practical intermediate storage step before enriching the data.
  • Amazon Relational Database Service (RDS) – Stores the parsed and processed data for analytics. RDS scales readily to accommodate future growth without a prohibitive upfront cost. Housing all the data in RDS lets queries and analytics pull from one primary source, so more complicated and potentially interesting questions can be answered.
  • Amazon Transcribe – Automatically recognizes the speech in audio files and creates a transcription allowing for real-time insight and channel identification during support calls.
  • Amazon SageMaker – Offers real-time prediction endpoints during model deployment, which allows the Resolution Suggestion Model to recommend Knowledge Base articles to agents in real time.
  • AWS Glue – A serverless, fully managed extract, transform, and load (ETL) service used to reshape and enrich Voice of the Customer data. Glue crawls the data, builds a metadata catalog, and automatically generates Python code for recommended data transformations.
  • Amazon Kinesis – Using Kinesis Video Streams for help desk call audio provides real-time collection, processing, and analysis, allowing insights to be generated as quickly as possible. Amazon manages all the infrastructure underlying the streaming process, letting the client focus on using the data rather than operating the streams.
  • AWS Lambda – Runs code in a serverless, scalable way, eliminating the need to provision or manage servers and reducing cost, since the client pays only for active compute time.
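The way these services fit together can be sketched as a Lambda handler: an S3 upload event (a new call recording) triggers the function, which assembles the parameters for an Amazon Transcribe job. The bucket names, job-naming scheme, and media format are illustrative assumptions; in a real deployment the returned dict would be passed to `boto3.client("transcribe").start_transcription_job(**params)`.

```python
import json
import os

def build_transcribe_params(bucket: str, key: str) -> dict:
    """Build start_transcription_job parameters for a recording in S3.

    Output bucket and language code are hypothetical placeholders.
    """
    job_name = os.path.splitext(os.path.basename(key))[0]
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "MediaFormat": os.path.splitext(key)[1].lstrip("."),
        "LanguageCode": "en-US",
        "OutputBucketName": f"{bucket}-transcripts",
    }

def handler(event: dict, context=None) -> dict:
    # Standard S3 event-notification shape: Records[0].s3.bucket/object.
    record = event["Records"][0]["s3"]
    params = build_transcribe_params(record["bucket"]["name"],
                                     record["object"]["key"])
    return {"statusCode": 200, "body": json.dumps(params)}
```

Because Lambda is billed only for execution time and scales automatically, this glue code costs nothing while no new recordings arrive, which is the cost-consciousness the design calls for.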

The flow diagram below shows the AWS Cloud data pipeline design for the customer feedback use case.



Understanding nuanced situations through the voice and text transcription, translation, and sentiment analysis that NLP provides is a potential market differentiator for our client. This NLP pipeline enables our client to exceed customer and franchise-owner expectations by providing insights that can be applied across business units. The solution increased enterprise access to textual data and insights, streamlined text data processing so resources can be reallocated to new efforts, minimized the number of assumptions made about text data, and opened the door to using textual data in innovative applications.