Data Engineering is the discipline of designing, building, and maintaining robust infrastructure for collecting, transforming, storing, and serving data for machine learning, analytic reporting, and decision management. Data Engineering is the enabler for efficient, operationalized Data Science.
A Key Component to Successful Data Science
Poor data organization and preparation for Data Science leads to long delays in, and possibly failed, analytics projects. Elder Research employs a disciplined multi-phase data engineering process to properly organize and prepare the data, giving the Data Science the highest potential for success.
Phases of Data Engineering
The phases of data engineering are represented by a data pipeline.
We refer to source data stores as Immutable Data Stores (IDS). From both a process and a technology standpoint, source data is usually not changed, or mutated, by the data engineer. Elder Research data engineers ensure the client’s IDS is developed appropriately. As organizations continue to grow in the size of data they collect and manage, non-traditional, non-relational “append only” data stores are often used. These technologies are often referred to as “NoSQL” despite many supporting SQL-style queries. These data stores are optimized to efficiently read and write large volumes of data, but sacrifice performance on, or in some cases don’t allow, updates and deletes.
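The append-only behavior described above can be illustrated with a minimal sketch. The class name and record fields below are illustrative assumptions, not part of any particular NoSQL product: records may be appended and read, but never updated or deleted in place, so a "correction" is itself a new record.

```python
import json


class AppendOnlyStore:
    """Minimal sketch of an immutable, append-only data store:
    records can be written and read, but never updated or deleted."""

    def __init__(self):
        self._log = []  # each entry is a frozen JSON string

    def append(self, record: dict) -> int:
        """Write a record and return its offset in the log."""
        self._log.append(json.dumps(record, sort_keys=True))
        return len(self._log) - 1

    def read(self, offset: int) -> dict:
        """Read one record by its log offset."""
        return json.loads(self._log[offset])

    def scan(self):
        """Read the full log in write order."""
        return [json.loads(entry) for entry in self._log]


ids = AppendOnlyStore()
ids.append({"id": 1, "amount": 120.0})
ids.append({"id": 2, "amount": 75.5})
# A "correction" is a new record referencing the old one,
# not an in-place update:
ids.append({"id": 1, "amount": 125.0, "corrects": 0})
```

Real append-only stores add durability, partitioning, and compaction on top of this idea; the sketch only shows the write-once contract.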
Extract, Transform, and Load (ETL) is a process to extract source data from the IDS, transform it, and then load it into the Analytic Data Store (ADS). If the data is too big for traditional data stores, it’s usually too big for most modeling and analytic algorithms. Elder Research data engineers use various tools and techniques such as down-sampling or aggregation to transform the IDS into a new mutable data store used for analysis.
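In miniature, the extract/transform/load steps might look like the sketch below. The transaction records, with hypothetical `customer` and `amount` fields, are assumptions for illustration; the transform step shows both techniques named above, down-sampling and per-customer aggregation.

```python
import random
from collections import defaultdict


def extract(ids_rows):
    """Extract: read raw rows from the immutable source (IDS)."""
    return list(ids_rows)


def transform(rows, sample_rate=1.0, seed=42):
    """Transform: optionally down-sample the raw rows, then
    aggregate transaction amounts per customer."""
    rng = random.Random(seed)  # seeded so sampling is repeatable
    sampled = [r for r in rows if rng.random() < sample_rate]
    totals = defaultdict(float)
    for r in sampled:
        totals[r["customer"]] += r["amount"]
    return [{"customer": c, "total": t} for c, t in sorted(totals.items())]


def load(ads, transformed_rows):
    """Load: write the transformed rows into the mutable ADS."""
    ads.extend(transformed_rows)
    return ads


ids_rows = [{"customer": "a", "amount": 10.0},
            {"customer": "b", "amount": 5.0},
            {"customer": "a", "amount": 2.5}]
ads = load([], transform(extract(ids_rows)))
```

A production pipeline would use an orchestration framework and real storage engines, but the three-stage shape is the same.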
Once the IDS is transformed, the resulting mutable data is loaded into a secondary store called the Analytic Data Store. This secondary store facilitates data fusion (enriching the data with other data sources) and rendering data for visualization and reporting. The ADS holds the Analytic Base Table (ABT) used to build the model and represents the dividing line between data engineering and data science. The ABT drives the modeling process and is owned by the data scientist.
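Data fusion into an ABT can be sketched as a simple join of two sources. The field names (`customer_id`, `total`, `region`) are illustrative assumptions:

```python
def build_abt(spend_rows, customer_rows):
    """Fuse per-customer spend aggregates with customer attributes
    into one row per customer: the Analytic Base Table (ABT)."""
    attrs = {c["customer_id"]: c for c in customer_rows}
    abt = []
    for s in spend_rows:
        c = attrs.get(s["customer_id"], {})
        abt.append({
            "customer_id": s["customer_id"],
            "total_spend": s["total"],
            # enrichment drawn from a second data source:
            "region": c.get("region", "unknown"),
        })
    return abt


abt = build_abt(
    [{"customer_id": 1, "total": 250.0},
     {"customer_id": 2, "total": 40.0}],
    [{"customer_id": 1, "region": "east"}],
)
```

Each ABT row is one unit of analysis (here, one customer), which is the shape most modeling algorithms expect.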
The modeling process continues to transform the data through a process called scoring. Examples include generating predictions, automating a business process, and “online learning,” in which subject matter experts correct the training data to improve the next model iteration. The modeling/scoring results are sent to the ADS, where they are stored for use in the visualization and reporting phase.
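The scoring loop that writes results back to the ADS can be sketched as follows. The stand-in model is a hypothetical threshold rule, not a trained model, and the field names are assumptions carried over for illustration:

```python
def score_and_store(abt_rows, ads_scores, model):
    """Score each ABT row with the model and append the results
    to the ADS score table for visualization and reporting."""
    for row in abt_rows:
        ads_scores.append({"customer_id": row["customer_id"],
                           "score": model(row)})
    return ads_scores


def high_spend_model(row):
    """Hypothetical stand-in model: flag customers whose total
    spend exceeds a threshold (a real model would be trained
    on the ABT)."""
    return 1.0 if row["total_spend"] > 100 else 0.0


abt_rows = [{"customer_id": 1, "total_spend": 250.0},
            {"customer_id": 2, "total_spend": 40.0}]
scores = score_and_store(abt_rows, [], high_spend_model)
```

Storing scores back in the ADS keeps a single place of record for both inputs and model outputs, which is what the reporting phase reads from.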
Another feedback loop exists between the modeling process managed by the data scientist and the data pipeline managed by the data engineer. When additional data is required from the IDS to support the modeling process, the data scientist provides feedback to the data engineer to enhance the ETL process in order to make the new data available in the ADS.
Visualization and Reporting
The data engineer uses the data and model results stored in the ADS to build focused reports and visuals that enable end users to make informed business decisions. Elder Research works with clients to make results accessible and understandable to stakeholders so they can take action and make more informed decisions. Delivery can range from spreadsheets sent by email to enterprise-class visualization tools fed directly from the ADS. We have experience delivering results in most visualization tools. Examples include:
- RADR, Elder Research’s own visualization tool