As a data science consultancy, we frequently run into difficult data infrastructure challenges with clients across multiple industries. To solve a business problem or draw decision-making insights from data, we often must start by helping to clean up and organize the data architecture so that we can build data science and machine learning (ML) models. This process of getting the data ready for the science to be applied is called data engineering.
Data engineering is critical to being able to deploy a data analytics system that is robust and reliable in solving a real business problem. Creating the right data architecture facilitates transforming data into actionable insights to improve the decision-making process and create a competitive advantage.
The first step is to work with peers in IT or systems roles to identify the needed immutable data store(s) (IDS) to act as the root or source information for the analytics to reference (Figure 1).
The IDS can be thought of as a read-only data source: the analytics rely on this information and do not change it during any part of the analytics process. Immutable data stores capture source or raw data, usually from external data providers (or at least, external to the analytics team), and are characterized by concerns such as provenance (the origin of the data), governance, and security.
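To make the read-only idea concrete, here is a minimal sketch that assumes, purely for illustration, that the IDS is a SQLite file. The table name `transactions`, the helper `open_ids_read_only`, and the sample rows are hypothetical. A real IDS would more likely enforce immutability through database permissions or storage policy.

```python
import os
import sqlite3
import tempfile


def open_ids_read_only(path):
    """Open a SQLite file in read-only mode so analytics code cannot alter it."""
    # mode=ro is a standard SQLite URI parameter; uri=True enables URI parsing.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


# Build a throwaway source database to stand in for the IDS (hypothetical data).
tmp_dir = tempfile.mkdtemp()
ids_path = os.path.join(tmp_dir, "ids.db")
writer = sqlite3.connect(ids_path)
writer.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
writer.executemany("INSERT INTO transactions (amount) VALUES (?)", [(10.0,), (25.5,)])
writer.commit()
writer.close()

# The analytics side only ever sees the read-only connection.
ids = open_ids_read_only(ids_path)
rows = ids.execute("SELECT id, amount FROM transactions ORDER BY id").fetchall()

# Any attempt to modify the source data is rejected by SQLite.
try:
    ids.execute("DELETE FROM transactions")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
ids.close()
```

Reads succeed as usual, while the attempted delete raises an error, so a mistake in analytics code cannot silently corrupt the source of record.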
The next phase, owned by the data engineer working in concert with the data scientist, is building the extractions and transforms that make the data usable for analytics. This Extract, Transform, and Load (ETL) phase is denoted by the yellow gears in Figure 1. An example transformation is down-sampling, where one removes a high proportion of the common cases from the source data so that rare classes are better represented. Part of this process is also called data wrangling, and it is often said to consume as much as 80% of a data science professional’s time, leaving only 20% for exploration and modeling.
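The down-sampling transformation mentioned above can be sketched in a few lines. This is an illustrative example rather than a production ETL step: the function name `downsample_majority`, the `label` field, and the fraud/normal records are all assumptions made for the demonstration.

```python
import random


def downsample_majority(rows, label_key, majority_label, keep_fraction, seed=0):
    """Keep every rare-class row and a random fraction of the common class."""
    rng = random.Random(seed)  # fixed seed so the ETL step is repeatable
    majority = [r for r in rows if r[label_key] == majority_label]
    minority = [r for r in rows if r[label_key] != majority_label]
    n_keep = round(len(majority) * keep_fraction)
    return minority + rng.sample(majority, n_keep)


# Hypothetical source data: 990 common cases and 10 rare ones.
source = [{"id": i, "label": "normal"} for i in range(990)]
source += [{"id": 990 + i, "label": "fraud"} for i in range(10)]

balanced = downsample_majority(source, "label", "normal", keep_fraction=0.1)
fraud_count = sum(1 for r in balanced if r["label"] == "fraud")
```

In this sketch the rare class rises from 1% of the source rows to roughly 9% of the output (10 of 109 rows). The right `keep_fraction` depends on the model and would normally be chosen together with the data scientist.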
The relationship of the data engineer to the data scientist is the same as that of the software developer to a user or feature owner in an agile process. It is important to have an accurate user perspective when developing the appropriate ETL in order to make sure we address the right needs. A data engineer acting alone may make ETL decisions that are not required by the data science. That’s why at Elder Research we use an Agile Data Science process.
The output from the ETL forms a new data store called the analytic data store (ADS). This is a read/write store that the data scientist works with to refine which data to use to build models (machine learning, text mining, etc.). The actual subset of the ADS used as model input is called the analytic base table (ABT). For the majority of use cases there are benefits to storing the data in an ADS. Very few applications are truly real-time; a single ETL process that runs nightly is enough for data that is updated from source daily, monthly, or quarterly. Additionally, when you can design an ETL process for the data independently of its end use, you can optimize for the performance of the overall pipeline and cache the results for many uses. For example, multiple models can be trained from the same ABT.
Two results of this process require oversight by the data engineer:
- During modeling, any data that is not stored efficiently in the ADS must be reworked in the ETL by the data engineer so that it flows back into the ADS in its most efficient form.
- Results from the modeling process must find their way back into the ADS – denoted by the Scoring box and arrow in the diagram.
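The relationship between the ADS, the ABT, and the scoring write-back can be sketched as follows. This assumes, only for illustration, an in-memory SQLite database standing in for the ADS. The `customers` table, the `abt` view, the `scores` table, and `toy_score` (a placeholder for a trained model) are all hypothetical names.

```python
import sqlite3

# In-memory database standing in for the read/write ADS.
ads = sqlite3.connect(":memory:")
ads.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, region TEXT, "
    "tenure_months INTEGER, monthly_spend REAL)"
)
ads.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "east", 12, 60.0), (2, "west", 1, 20.0), (3, "east", 5, 30.0)],
)

# The ABT is the subset of the ADS actually used as model input.
ads.execute(
    "CREATE VIEW abt AS "
    "SELECT customer_id, tenure_months, monthly_spend "
    "FROM customers WHERE tenure_months >= 3"
)


def toy_score(tenure_months, monthly_spend):
    """Placeholder for a trained model; not a real scoring method."""
    return round(monthly_spend / (tenure_months + 1), 2)


abt_rows = ads.execute(
    "SELECT customer_id, tenure_months, monthly_spend FROM abt ORDER BY customer_id"
).fetchall()
scores = [(cid, toy_score(tenure, spend)) for cid, tenure, spend in abt_rows]

# Scoring results find their way back into the ADS.
ads.execute("CREATE TABLE scores (customer_id INTEGER PRIMARY KEY, score REAL)")
ads.executemany("INSERT INTO scores VALUES (?, ?)", scores)
ads.commit()

stored = ads.execute(
    "SELECT customer_id, score FROM scores ORDER BY customer_id"
).fetchall()
```

Defining the ABT as a view over the ADS is one possible design choice: several models can read the same prepared subset, and the scores land next to the data they were computed from, ready for the visualization step.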
The data engineer works with the data scientist and the business owners to derive the best way to visualize the results of the model. The visualization phase can use multiple tools – spanning the spectrum from basic spreadsheets to dedicated software packages – to drive charts, geospatial heat maps, bar graphs, and much more.
Now that we have described the data engineering process dryly, let’s use a baking analogy to understand things a little better. “Baking a cake with data” might also clarify the differences between the IDS, ADS, and ABT.
The Enterprise Data Warehouse, or EDW, team helps source and collect the ingredients like eggs, butter, sugar, and cocoa powder. Eggs come by the dozen and sugar in 5-pound or 1-kilogram bags, which is way too much for one cake. Taken together, the raw ingredients represent the IDS.
The data engineer prepares the ingredients by measuring (filtering) and by making components like caramel and ganache (joins), then sets all the measured ingredients and pre-prepped components on the counter, ready for baking. The measured and prepped ingredients represent the ADS.
Lastly, the data scientist combines all ingredients into the baking pan (ABT) and bakes the cake to perfection (the model build).
This analogy also shows how the data engineer straddles the line between data science and traditional IT and data warehousing. Data engineers need to work closely with the warehousing team to make sure the ingredients needed for data science are in stock and of high quality. They also need to understand the needs of the data scientist and be familiar with the nuances of baking. In summary, the data engineer owns the data pipeline described by the ETL, ADS, and Visualization boxes in the diagram, and their work is critical to the success of data science.
Author’s note: I would like to give a shout out to John Dimeo, Elder Research Software Architect, for his support on this blog.