Updating a Data Pipeline with AWS’s Latest Offerings

Todd Gerdy

February 14, 2020

In December I attended AWS re:Invent, Amazon Web Services' annual learning conference. It was five days filled with over 4,000 sessions, keynote announcements, a partner expo, and hands-on training and certification opportunities. I learned about a number of tools and services (some brand new) that will improve the data pipeline solutions we develop for clients. This article describes a production pipeline solution and several options for improving it using these tools and services.

Pipeline Project Overview

Elder Research developed a data pipeline for a client interested in analyzing software usage logs to provide insight to their research and development and sales teams. This pipeline has been in production for over 3 years (see Figure 1).


Figure 1: Overview of the Current Data Pipeline

The client receives logs from opt-in users for each software usage session. An onsite utility (Uploader) reads the logs and sends them to a REST API, which is hosted inside a Docker container on an Amazon EC2 instance. The API accepts the logs and sends them to an Amazon Elasticsearch Service (ES) domain that serves as our Source System of Record (SSoR). Each log is stored in ES in a zipped and encrypted format to save space and secure the session. The API also sends a session ID to an Amazon Simple Queue Service (SQS) standard queue. The Parser, another Docker container on the same EC2 instance, polls the SQS queue for new session IDs and retrieves the corresponding data from the SSoR to parse. The parsed data is stored in a second ES domain and in a MySQL database on Amazon Relational Database Service (RDS). Several analytics utilities, also hosted on the EC2 instance, analyze the parsed data and load an analytics-focused MySQL database on a schedule. Data from this database is consumed by the client’s Tableau server and a custom BI dashboard.
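To make the Parser's polling step concrete, here is a minimal Python sketch of one polling pass. It is not our production code: the function name `poll_once`, the `parse_session` callback, and the assumption that each message body is a bare session ID are all illustrative. The `sqs` argument is expected to behave like a boto3 SQS client.

```python
from typing import Any, Callable


def poll_once(sqs: Any, queue_url: str, parse_session: Callable[[str], None]) -> int:
    """Receive one batch of session IDs, parse each, and delete handled messages.

    `sqs` is assumed to be a boto3 SQS client (or any object with the same
    receive_message / delete_message methods). Returns the number handled.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per request
        WaitTimeSeconds=20,      # long polling, so an idle queue costs few requests
    )
    handled = 0
    # "Messages" is absent from the response when the queue is empty.
    for message in response.get("Messages", []):
        # Assumption: the message body is the session ID itself.
        parse_session(message["Body"])
        # Delete only after parsing succeeds, so a failed message is redelivered.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled
```

Because a standard queue delivers each message at least once, a parser written this way should tolerate seeing the same session ID twice.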

Moving Forward

At re:Invent, I learned about AWS offerings that could be used to revamp this pipeline to follow the AWS Well-Architected Framework. This framework is designed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for diverse applications and includes five pillars (see Figure 2):

  • Operational excellence pillar - automates changes, continually improves procedures and systems, and monitors systems to deliver business value.
  • Security pillar - protects data and systems by managing access to services, data integrity and confidentiality, and monitoring to detect security events.
  • Reliability pillar - prevents, and rapidly recovers from, failures and plans how to handle future changes to the system.
  • Performance efficiency pillar - uses the proper resources efficiently, including monitoring performance and planning around an evolving business.
  • Cost optimization pillar - controls spending on services by helping users understand cost details, use the most appropriate number of resources, and analyze spend over time.

Figure 2: AWS Well-Architected Framework

Awareness of the five pillars is very important when designing solutions using AWS services. While we have recently made changes to align the client's services with the AWS Well-Architected Framework, there are additional improvements that we can make.

Improvement #1 - Convert the SSoR

To streamline the service, we could convert the SSoR from an Elasticsearch domain to Amazon Simple Storage Service (S3). Using AWS Data Pipeline, a service that automates data movement, we would be able to upload directly to S3, eliminating the need for the onsite Uploader utility and reducing maintenance overhead (see Figure 3). The shift to S3 would provide additional benefits:

  • Improve Durability - S3 is designed for 99.999999999% (eleven 9’s) durability of objects.
  • Enhance Data Security - We would be able to secure the data so that it is only accessible from within the AWS Virtual Private Cloud (VPC) via S3 VPC Endpoints.
  • Automation - Every time an object is added to S3, we would be able to trigger an AWS Lambda function to add the new ID to the SQS queue.
  • Reduce Storage Cost - Older logs that are no longer used can be stored in S3 Glacier, an archival storage service, at a fraction of the cost of standard S3 storage.
  • Optimize Cost - S3 Intelligent-Tiering can be used to move data automatically between its frequent access and infrequent access tiers.
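The automation bullet above can be sketched as a small Lambda function in Python. This is an illustration under stated assumptions, not our deployed code: it assumes the S3 object key doubles as the session ID, and the environment variable name `SESSION_QUEUE_URL` and the optional `sqs` parameter (added so the function can be exercised without AWS access) are hypothetical.

```python
import os
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def object_keys_from_event(event: Dict[str, Any]) -> List[str]:
    """Return the object keys named in an S3 event notification."""
    keys = []
    for record in event.get("Records", []):
        # Keys arrive URL-encoded in S3 notifications (a space becomes "+").
        keys.append(unquote_plus(record["s3"]["object"]["key"]))
    return keys


def handler(event: Dict[str, Any], context: Any, sqs: Any = None) -> Dict[str, int]:
    """Lambda entry point: queue one SQS message per newly created S3 object."""
    if sqs is None:
        import boto3  # imported lazily so the helper above runs without boto3
        sqs = boto3.client("sqs")
    queue_url = os.environ["SESSION_QUEUE_URL"]  # hypothetical variable name
    keys = object_keys_from_event(event)
    for key in keys:
        # Assumption: the object key is what the Parser uses as the session ID.
        sqs.send_message(QueueUrl=queue_url, MessageBody=key)
    return {"queued": len(keys)}
```

Keeping the event parsing in its own function means it can be unit tested with a hand-written event, while the handler stays a thin wrapper around the SQS call.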

In addition to handling uploads, the REST API allows the client to query certain data from Elasticsearch and the MySQL table. We could replace the Docker-based API with Amazon API Gateway, a managed service for building reliable, efficient, and scalable RESTful APIs with built-in monitoring and security options. This would maintain the query functionality while reducing the amount of custom code we need to maintain.

Figure 3: Updating the SSoR    

Improvement #2 - Update the Parser

Another improvement would be to replace the Parser utility with AWS Glue. Glue is a fully managed, highly scalable, serverless Extract/Transform/Load (ETL) service that natively supports data sources like S3, Amazon Aurora, and RDS, and can reach non-native sources through custom Scala or Python code (see Figure 4). We can have S3 trigger a Lambda function that runs the log through Glue. While the Parser is constantly running on an EC2 instance, Glue is serverless and only uses compute power when it is called upon. Alternatively, Glue’s scheduling feature can check S3 as often as we want and pick up newer objects to parse.
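As a rough illustration of the Lambda-to-Glue hand-off, the Python sketch below starts a Glue job run for a single log object. The job name and the argument names `--source_bucket` and `--source_key` are hypothetical; the `glue` argument is assumed to behave like a boto3 Glue client, whose `start_job_run` call accepts a job name and a dictionary of job arguments.

```python
from typing import Any, Dict


def glue_arguments(bucket: str, key: str) -> Dict[str, str]:
    """Build job arguments for one log object.

    Glue job arguments are passed as "--name": "value" pairs; these two
    names are illustrative and would have to match what the job script reads.
    """
    return {"--source_bucket": bucket, "--source_key": key}


def start_parse_job(glue: Any, job_name: str, bucket: str, key: str) -> str:
    """Start one run of the parsing job and return its run ID.

    `glue` is assumed to be a boto3 Glue client or an object with the same
    start_job_run method.
    """
    response = glue.start_job_run(
        JobName=job_name,
        Arguments=glue_arguments(bucket, key),
    )
    return response["JobRunId"]
```

Starting one job run per object is the simplest mapping; for a high volume of small logs, a scheduled job that processes new objects in batches may be the better fit.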

Figure 4: Replacing the parser with AWS Glue

Improvement #3 - Enhance Monitoring and Fault Tolerance

Currently, the REST API, Parser, Analytics tools, and BI Dashboard all reside on a single EC2 instance, which is a reliability weakness. If the EC2 instance were to fail, the whole process would come to a halt. Monitoring the services on EC2 is also tricky. We have created a system of scheduled Python utilities on the EC2 instance that notify us via Slack when issues arise in our pipeline. However, if the EC2 instance fails, so does our notification system. Since our utilities are already Docker containers, separating the apps by using Amazon Elastic Container Service (ECS) and AWS Fargate would solve this issue (see Figure 5). To save compute time and costs, we can set up a separate Fargate task for each utility that runs only when called. Fargate also integrates with Amazon CloudWatch, making container logs easier to access and monitor.
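Running a utility on demand could look something like the following Python sketch, which launches a single Fargate task and surfaces any launch failures. The cluster, task definition, and subnet values are placeholders that the caller supplies, and `ecs` is assumed to behave like a boto3 ECS client.

```python
from typing import Any, List


def run_utility(ecs: Any, cluster: str, task_definition: str, subnets: List[str]) -> List[str]:
    """Launch one Fargate task for a utility and return the started task ARNs.

    `ecs` is assumed to be a boto3 ECS client or an object with the same
    run_task method.
    """
    response = ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        launchType="FARGATE",
        count=1,
        # Fargate tasks use awsvpc networking, so subnets must be supplied.
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": subnets,
                "assignPublicIp": "DISABLED",  # assumes private subnets with outbound access
            }
        },
    )
    failures = response.get("failures", [])
    if failures:
        # Raise so the caller (for example, an alerting hook) sees the problem.
        raise RuntimeError(f"Fargate task failed to start: {failures}")
    return [task["taskArn"] for task in response["tasks"]]
```

Because the launcher runs outside the container it starts, a failure of one utility no longer takes the alerting path down with it.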

Figure 5: Moving utilities from EC2 to AWS Fargate to lower usage costs

Additional Improvements

The AWS ecosystem offers over 175 different services. The following services are worth mentioning as they could also be beneficial to this or future projects.

Converting the Amazon RDS MySQL database to Amazon Aurora could improve performance and cost effectiveness. Aurora is a relational database engine in Amazon RDS that is designed for the cloud and has PostgreSQL- and MySQL-compatible variants. According to AWS, it can be up to five times faster than standard MySQL, approaching the performance of commercial databases while retaining open source cost efficiency. When using relational databases in Amazon RDS, you pay for storage, I/O, and data transfer. Aurora has slightly higher rates; however, it also stores the data across multiple AWS Availability Zones (AZs) to strengthen the durability of your data. With Amazon Aurora Serverless, you pay for compute power on an as-needed basis, and you do not need to pre-calculate your storage sizes: storage scales automatically, and the database can be configured to shut off when it isn’t being used.

UltraWarm nodes for Amazon Elasticsearch Service were announced at re:Invent. They use S3 and caching to improve the performance of indices that are not currently being written to and are queried less frequently.  We could store the older read-only Elasticsearch indices and their logs using UltraWarm, significantly lowering costs per GiB. UltraWarm is still in preview and is not yet recommended for production, but when ready, it would help this pipeline.

Conclusion

Going to re:Invent made clear how rapidly technology is evolving, and thus how critical it is to make sure projects are designed to handle future upgrades. Building pipelines as microservices as opposed to monolithic applications is a great way to accomplish this. By separating our utilities, we can improve each part of our pipeline independently using the latest AWS services, as I’ve shown here. When designing projects, whether you use Amazon’s services or not, I recommend you adhere to the AWS Well-Architected Framework to obtain the best solutions for your needs.


About the Author

Todd Gerdy, Data Engineer

Todd Gerdy brings his 10 years of problem-solving skills from the Audio/Visual industry to Elder Research. He actively uses these skills to help both the Data Science and the Software Engineering teams in implementing efficient software solutions for clients. He has the AWS Solutions Architect - Associate Certification.