Data Engineering is the discipline of designing, building, and maintaining robust infrastructure for collecting, transforming, storing, and serving data for use in machine learning, analytic reporting, and decision management. Data Engineering enables efficient, operationalized Data Science.
A Key Component to Successful Data Science
Data that is not properly organized and processed leads to long delays and failed analytics projects. Our disciplined multi-phase data engineering process organizes and prepares your data to provide the highest potential for success from data science initiatives.
Phases of Data Engineering
The phases of data engineering are represented by a data pipeline. Our team supports you during each step in the process, from IDS Assessment to Data Visualization and Reporting.
We refer to source data stores as Immutable Data Stores (IDS). From both a process and a technology standpoint, source data is usually not changed, or mutated, by the data engineer. Our data engineers ensure your IDS is developed appropriately. Non-traditional, non-relational “append only” data stores are often used as the size of the data you collect and manage grows. These technologies, often referred to as NoSQL, are optimized to efficiently read and write large volumes of data, but sacrifice performance on, or in some cases don’t allow, updates and deletes.
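The append-only pattern described above can be sketched in a few lines. This is a minimal illustration using an in-memory list as a stand-in for a real NoSQL store; the class and field names are hypothetical, not part of any specific product.

```python
# Minimal sketch of an append-only Immutable Data Store (IDS).
# An in-memory list stands in for a real append-only NoSQL store;
# all names here are illustrative.

class ImmutableDataStore:
    def __init__(self):
        self._records = []

    def append(self, record: dict) -> None:
        """Writes are allowed: new records are only ever appended."""
        self._records.append(dict(record))

    def scan(self) -> list:
        """Reads return copies; stored records are never mutated."""
        return [dict(r) for r in self._records]

ids = ImmutableDataStore()
ids.append({"sensor": "a1", "temp": 21.5})
ids.append({"sensor": "a1", "temp": 22.5})
# Note: no update() or delete() methods exist -- history is preserved.
```

Because the store only grows, reads and writes stay fast at scale, which is the trade-off such stores make against in-place updates and deletes.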
Extract, Transform, and Load (ETL) is a process to extract source data from the IDS, transform it for use in analytics, and then load it to the Analytic Data Store (ADS). Data that is too big for traditional data stores is usually too big for most modeling and analytic algorithms. Our data engineers use tools and techniques such as down-sampling or aggregation to transform the IDS into a new mutable data store used for analytics.
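The aggregation technique mentioned above can be sketched as follows. This is a simplified illustration in plain Python; the sensor-reading field names are hypothetical, chosen only to show how many raw rows collapse into a few analytics-ready summaries.

```python
# Sketch of the Transform step: aggregating raw IDS readings into a
# smaller, analytics-ready form. Field names are illustrative.
from collections import defaultdict

raw_ids_rows = [
    {"sensor": "a1", "temp": 21.5},
    {"sensor": "a1", "temp": 22.5},
    {"sensor": "b2", "temp": 19.0},
]

def aggregate_by_sensor(rows):
    """Collapse many raw readings into one average per sensor."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row["sensor"]] += row["temp"]
        counts[row["sensor"]] += 1
    return {s: sums[s] / counts[s] for s in sums}

ads_rows = aggregate_by_sensor(raw_ids_rows)
# ads_rows -> {"a1": 22.0, "b2": 19.0}
```

In practice the same idea is applied with distributed tooling rather than in-memory loops, but the shape of the transformation is the same: a large immutable source is reduced to a mutable store sized for analytics.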
Once the IDS is transformed, the resultant mutable data is loaded to the Analytic Data Store. This secondary store facilitates data fusion (enriching data with other data sources) and rendering data for visualization and reporting. The ADS holds the Analytic Base Table (ABT) used to build the model and represents the dividing line between data engineering and data science. The ABT drives the modeling process and is owned by the data scientist.
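Data fusion as described above can be sketched as a simple join. This example is hypothetical: the summary values, the metadata source, and the field names are stand-ins showing how an ABT row combines a measurement with enriching context.

```python
# Sketch of data fusion in the ADS: enriching aggregated readings with
# a second source to form an Analytic Base Table (ABT).
# All sources and field names are illustrative.

sensor_summaries = {"a1": 22.0, "b2": 19.0}  # output of the ETL step
sensor_metadata = {
    "a1": {"site": "north"},
    "b2": {"site": "south"},
}

analytic_base_table = [
    {"sensor": s, "avg_temp": t, **sensor_metadata.get(s, {})}
    for s, t in sensor_summaries.items()
]
# Each ABT row fuses a measurement with descriptive context,
# ready for the data scientist's modeling process.
```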
Modeling continues to transform the data through a process called scoring. Examples include generating predictions, automating a business process, or “online learning” where subject matter experts make corrections to the training data to improve the next model iteration. The modeling/scoring results are stored in the ADS for use in the visualization and reporting phase.
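The scoring step above can be sketched as applying a model to each ABT row and writing the results back to the ADS. The threshold rule here is a toy stand-in for a real trained model, and all names are illustrative.

```python
# Sketch of scoring: apply a (hypothetical) trained model to ABT rows
# and store the predictions back in the ADS.

abt_rows = [
    {"sensor": "a1", "avg_temp": 22.0},
    {"sensor": "b2", "avg_temp": 19.0},
]

def score(row):
    """Toy model: flag sensors whose average temperature exceeds 20."""
    return {"sensor": row["sensor"], "alert": row["avg_temp"] > 20.0}

scored_results = [score(r) for r in abt_rows]
# scored_results would be stored in the ADS for visualization and
# reporting, and corrections to it can feed the next model iteration.
```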
When additional data is required from the IDS to support the modeling process, the data scientist asks the data engineer to enhance the ETL process to make the new data available in the ADS.
Visualization and Reporting
Data engineers use the results stored in the ADS to build focused reports and visuals that enable end users to make informed business decisions. We work with your team to make results accessible and understandable to stakeholders so they can take action and inform decisions. This can range from spreadsheet or email delivery to enterprise-class visualization tools. We have experience delivering results using our own proprietary tools (such as RADR) and most off-the-shelf visualization tools and libraries.