Data Engineering with Discipline

Victor Diloreto

March 1, 2019

As a data science consultancy, we frequently run into difficult data infrastructure challenges at our clients across multiple industries. To solve a business problem or get decision-making insights from data, we often must start by helping to clean up and organize the data architecture so we can build data science and machine learning (ML) models. This process of getting the data ready for the application of the science is called data engineering.

Data engineering is critical to being able to deploy a data analytics system that is robust and reliable in solving a real business problem. Creating the right data architecture facilitates transforming data into actionable insights to improve the decision-making process and create a competitive advantage.
The first step is to work with peers in IT or systems roles to identify the needed immutable data store(s) (IDS) to act as the root or source information for the analytics to reference (Figure 1).

Figure 1. Data Engineering process flow

The IDS can be thought of as a read-only data source: the analytics rely on this information and do not change it during any part of the analytics process. Immutable data stores capture source or raw data, usually from external data providers (or at least, external to the analytics team), and are characterized by concerns such as provenance (the origin of the data), governance, and security.

The next phase, owned by the data engineer working in concert with the data scientist, is building the extractions and transforms that make the data operable for analytics. This Extract, Transform, and Load (ETL) phase is denoted by the yellow gears in Figure 1. An example transformation is down-sampling – where one removes a high proportion of common cases from the source data to better balance the proportion of rare classes represented. Part of this process is also called data wrangling, and it is often said to cost data science professionals as much as 80% of their time, leaving only 20% for exploration and modeling.
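
To make the down-sampling transformation concrete, here is a minimal sketch in plain Python. The function name `downsample_majority`, the labels, and the 5% keep fraction are all assumptions chosen for illustration; a real pipeline would pick the fraction together with the data scientist.

```python
import random

def downsample_majority(rows, label_key, majority_label, keep_fraction, seed=0):
    """Keep every rare-class row and a random fraction of the common class."""
    rng = random.Random(seed)  # fixed seed so the ETL step is repeatable
    kept = []
    for row in rows:
        if row[label_key] != majority_label:
            kept.append(row)  # rare cases are always kept
        elif rng.random() < keep_fraction:
            kept.append(row)  # common cases kept with probability keep_fraction
    return kept

# Hypothetical source data: 990 common "ok" rows and 10 rare "fraud" rows.
source = [{"id": i, "label": "fraud" if i % 100 == 0 else "ok"} for i in range(1000)]
sample = downsample_majority(source, "label", "ok", keep_fraction=0.05)
rare = sum(1 for r in sample if r["label"] == "fraud")
print(f"kept {len(sample)} of {len(source)} rows, {rare} of them rare")
```

All ten rare rows survive while most common rows are dropped, so the rare class makes up a much larger share of the sample than of the source.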

The relationship of the data engineer to the data scientist is the same as that of the software developer to a user or feature owner in an agile process. It is important to have an accurate user perspective when developing the appropriate ETL in order to make sure we address the right needs. A data engineer acting alone may make ETL decisions that are not required by the data science. That’s why at Elder Research we use an Agile Data Science process.

The output from the ETL forms a new data store called the analytic data store (ADS). This is a read/write store that the data scientist begins to work with to refine what data to use to build models (machine learning, text mining, etc.). The actual subset of the ADS used as model input is called the analytic base table (ABT). For the majority of use cases there are benefits to storing the data in an ADS. Very few applications are truly real-time; a single ETL process that runs nightly is enough for data that is updated from source daily, monthly or quarterly. Additionally, when you can design an ETL process for the data independently from its end use, you can optimize for the performance of the overall pipeline and cache the results for many uses. Multiple models can be trained from the same ABT. There are two results from this process that require oversight by the data engineer:
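
The relationship between the ETL, the ADS, and the ABT can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the ADS, and the table `customer_summary`, the sample orders, and the functions `run_etl` and `build_abt` are hypothetical names, not part of any particular system.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from the IDS.
raw_orders = [
    ("c1", "2019-01-03", 20.0),
    ("c1", "2019-01-09", 35.0),
    ("c2", "2019-01-04", 5.0),
]

def run_etl(ads, rows):
    """Rebuild the per-customer summary table in the ADS from raw rows."""
    ads.execute("DROP TABLE IF EXISTS customer_summary")
    ads.execute(
        "CREATE TABLE customer_summary "
        "(customer_id TEXT PRIMARY KEY, n_orders INTEGER, "
        "total_spend REAL, last_order TEXT)"
    )
    totals = {}
    for customer_id, order_date, amount in rows:
        n, spend, last = totals.get(customer_id, (0, 0.0, ""))
        totals[customer_id] = (n + 1, spend + amount, max(last, order_date))
    ads.executemany(
        "INSERT INTO customer_summary VALUES (?, ?, ?, ?)",
        [(cid, n, spend, last) for cid, (n, spend, last) in totals.items()],
    )
    ads.commit()

def build_abt(ads):
    """Select only the columns the model will consume."""
    return ads.execute(
        "SELECT customer_id, n_orders, total_spend "
        "FROM customer_summary ORDER BY customer_id"
    ).fetchall()

ads = sqlite3.connect(":memory:")  # stand-in for the analytic data store
run_etl(ads, raw_orders)           # in practice this could be scheduled nightly
abt = build_abt(ads)
print(abt)  # [('c1', 2, 55.0), ('c2', 1, 5.0)]
```

Because the ETL result is cached in the ADS, several different ABTs (and therefore several models) could be drawn from the same summary table without re-reading the source.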

  1. During modeling, any data that does not exist efficiently in the ADS must be re-worked into the ETL by the data engineer to be recycled into the ADS in its most efficient form.
  2. Results from the modeling process must find their way back into the ADS – denoted by the Scoring box and arrow in the diagram.
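
The second point, writing model results back into the ADS, might look like the sketch below. The `score` function is a made-up placeholder rule rather than a trained model, and the table and column names (`customer_summary`, `model_scores`, `model_version`) are hypothetical.

```python
import sqlite3

ads = sqlite3.connect(":memory:")  # stand-in for the analytic data store
ads.execute(
    "CREATE TABLE customer_summary (customer_id TEXT PRIMARY KEY, total_spend REAL)"
)
ads.executemany(
    "INSERT INTO customer_summary VALUES (?, ?)", [("c1", 55.0), ("c2", 5.0)]
)

def score(total_spend):
    """Placeholder model: a made-up threshold rule, not a trained model."""
    return 1.0 if total_spend > 50 else 0.0

# Scores live in their own table, keyed by customer and tagged with a version.
ads.execute(
    "CREATE TABLE model_scores "
    "(customer_id TEXT PRIMARY KEY, score REAL, model_version TEXT)"
)
rows = ads.execute("SELECT customer_id, total_spend FROM customer_summary").fetchall()
ads.executemany(
    "INSERT OR REPLACE INTO model_scores VALUES (?, ?, ?)",
    [(cid, score(spend), "v0-placeholder") for cid, spend in rows],
)
ads.commit()

scores = dict(ads.execute("SELECT customer_id, score FROM model_scores"))
print(scores)  # {'c1': 1.0, 'c2': 0.0}
```

Keeping scores in a separate, versioned table is one possible design choice; it lets visualization tools read results from the ADS without touching the features the model was built on.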

The data engineer works with the data scientist and the business owners to derive the best way to visualize the results of the model. The visualization phase can use multiple tools – spanning the spectrum from basic spreadsheets to dedicated software packages – to drive charts, geospatial heat maps, bar graphs, and much more.

Now that we have described the data engineering process dryly, let’s use a baking analogy to understand things a little better. “Baking a cake with data” might also clarify the differences between the IDS, ADS, and ABT.

Figure 2. Analogy illustrating the differences between the data stores

The Enterprise Data Warehouse, or EDW, team helps source and collect the ingredients like eggs, butter, sugar, and cocoa powder. Eggs come in a dozen, sugar in 5-pound (or 1 kilogram) bags – way too much for one cake. In total the raw ingredients represent the IDS.

The data engineer prepares the ingredients by measuring (filtering) and making things like caramel and ganache (joins) and puts all the ingredients and pre-prepped components ready for baking on the counter. The measured and prepped ingredients represent the ADS.

Lastly, the data scientist combines all ingredients into the baking pan (ABT) and bakes the cake to perfection (the model build).

This analogy also shows how the data engineer straddles the line between data science and traditional IT and data warehousing. Data engineers need to work closely with the warehousing team to make sure the ingredients needed for data science are in stock and of high quality. They also need to understand the needs of the data scientist and be familiar with the nuances of baking. In summary, the data engineer owns the data pipeline described by the ETL, ADS, and Visualization boxes in the diagram, and their work is critical to the success of data science.

Author's note: I would like to give a shout out to John Dimeo, Elder Research Software Architect, for his support on this blog.

About the Author

Vic Diloreto leads the software engineering group at Elder Research. In this role, Vic is chartered with the continuing support of our data science service to clients where software is needed in data preparation and/or visualization. Vic is also leading the efforts to convert select portions of Elder Research’s intellectual property library into standalone products. Prior to Elder Research, Vic was a Senior Director of Technology at MegaPath, where he led a team in charge of architecture, design, and operations of advanced communication solutions. Prior to MegaPath, Vic was VP of Engineering and CTO of telecommunication start-up sentitO Networks, where he led hardware and software development. In this capacity, he assisted in raising 60M in venture financing. Vic also held engineering and management positions at BNR/Northern Telecom and Pulsecom, most notably leading the technical aspects of the sale of Pulsecom’s Broadband division to ECI Telecom for 61M.