Machine Learning Engineers & Operations: Where DevOps Meets Data Science

As the field of Data Science develops and matures, new branches emerge, and traditional roles are redefined and broken into new ones.
Author:

Raymond Eid

Date Published:
October 13, 2022

The role of Data Engineer broke off from Data Scientist as a distinct discipline, and now we are witnessing the emergence (and hype) of another new role and discipline:  Machine Learning Operations (ML Ops) Engineer (or ML Engineer).

This special breed of engineers connects business analytics to action. I would argue that ML Engineers will soon be the most critical technical role in your organization.

To explain, note that we face three critical realities:

1. Organizations are drowning in data.

2. Tools to manage and analyze data are emerging faster than ever.

3. ML Engineers who support ML Operations are the people responsible for resolving #1 and #2.

Drowning in Data

Data -> Information -> Intelligence.

The “->” is not equality; there is tremendous compression at each stage. I have noticed that tables never get smaller: new information is continually gathered with the possibility of a new insight. In practice, this is overwhelming for most organizations. We now measure data in zettabytes. (A what?) A zettabyte is a trillion gigabytes. Seagate expects there to be 100 ZB of data worldwide by 2023.

The Right Tools

Not long ago, most companies had a separate “database person,” “IT person,” and “reports person.” In the last 10 years, as demand for analytics has increased, the number of tools to manage and analyze data has grown by over an order of magnitude, and the pace is only accelerating. If you’re not asking for help yet, note that migrations to cloud-based storage and services add a further layer of complication, including privacy and security.

If your organization wishes to bring data from a public website to its internal databases, you will need at least a handful of tools to automate the process in an efficient, repeatable way.

To obtain CSV links from the website (without an API), a web scraping tool can automate the process of clicking and downloading the public data.
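As a rough sketch of this step (the page URL and link pattern here are hypothetical, and we parse a literal HTML snippet so the example runs offline), collecting CSV links with the common BeautifulSoup library might look like:

```python
# Sketch: collect CSV links from a public page's HTML.
# The base URL and snippet below are hypothetical placeholders.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def find_csv_links(html, base_url):
    """Return absolute URLs for every link ending in .csv."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base_url, a["href"])
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(".csv")
    ]

page = '<a href="/files/claims_2022.csv">Claims</a> <a href="/about">About</a>'
links = find_csv_links(page, "https://example.gov/open-data")
# links -> ["https://example.gov/files/claims_2022.csv"]
```

In practice you would fetch the page (e.g. with the requests library) and then download each discovered link in the same loop.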

Once the raw data is in hand, data cleaning and wrangling can be done in your coding language of choice.
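For example, a minimal cleaning pass with pandas (the column names and rules below are hypothetical) might standardize headers, coerce types, and drop duplicates and unusable rows:

```python
# Sketch: a minimal cleaning/wrangling pass with pandas.
# Column names and rules are hypothetical examples.
import pandas as pd

def clean(df):
    df = df.copy()
    # Standardize column names: lowercase, underscores instead of spaces.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Parse dates and coerce numerics; bad values become NaT/NaN.
    df["report_date"] = pd.to_datetime(df["report_date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # Drop exact duplicates and rows missing the key fields.
    return df.drop_duplicates().dropna(subset=["report_date", "amount"])

raw = pd.DataFrame({
    "Report Date": ["2022-01-01", "2022-01-01", "not a date"],
    "Amount": ["10.5", "10.5", "20"],
})
tidy = clean(raw)  # one valid, de-duplicated row remains
```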

After data preparation, your next challenge is to load this data into the database tables using SQL.
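A minimal sketch of the load step, using SQLite for illustration (the table and column names are hypothetical; a production pipeline would point at your actual warehouse connection):

```python
# Sketch: load cleaned rows into a database table with SQL.
# SQLite stands in for a real warehouse; names are hypothetical.
import sqlite3

rows = [("2022-01-01", 10.5), ("2022-01-08", 12.0)]

conn = sqlite3.connect(":memory:")  # swap for your warehouse connection
conn.execute(
    "CREATE TABLE IF NOT EXISTS weekly_totals (report_date TEXT, amount REAL)"
)
# Parameterized inserts avoid SQL injection and handle quoting.
conn.executemany(
    "INSERT INTO weekly_totals (report_date, amount) VALUES (?, ?)", rows
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM weekly_totals").fetchone()[0]
```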

Then, you’re all set to start making visualizations and comprehensive dashboards, if your group has a business license for a particular dashboard software; otherwise, you may experience choice paralysis with all the modern visualization tools at your disposal.

MLOps is at the crossroads of multiple disciplines.

You will want to refresh this data at a certain cadence, say every week, to keep downstream dashboards and models up to date. Again, there are many potential tools to choose from, such as the popular Databricks and Snowflake. Databricks enables you to create Workflows that package the associated scripts into a concise, repeatable job, while Snowflake lets you run scheduled Tasks defined with cron-style syntax.

ML Engineers are charged with staying on top of this rapidly-evolving stack of technology.

ML Engineers

ML-forward organizations seek to deploy their predictive models for a continuous cycle of data ingestion, training, and deployment, instead of relying on a static model that may become stale over time. From this need has emerged the ML Engineer, an individual with Data Science chops combined with technical expertise in DevOps – that is, the deployment, maintenance, and monitoring of a product.

They allow Data Scientists to focus on their expertise — developing a robust model for inference — while the ML Engineer focuses on continuously deploying and monitoring those predictive models.

The many roles of an AI Engineer.  Source: Statistics.com

 

The ML Engineer essentially keeps modeling assets in continuous deployment — monitoring and retraining them — much as a DevOps Engineer supports software systems and products. The Machine Learning Engineer adheres to the best practices of the more established DevOps Engineer field, including the Continuous Integration and Continuous Delivery (CI/CD) principles.

DevOps is a portmanteau of software development (Dev) and IT operations (Ops). DevOps employs the cultural philosophies, practices, and tools that increase an organization’s ability to deliver applications and services at high velocity (see AWS). DevOps Engineers create, integrate, and deploy software systems, and manage code releases.

ML Ops in Practice

Data Science consultants build models to address a client’s specific business need, such as a neural network to detect fraudulent applications for government aid. But the data relationships captured may only represent a snapshot in time, and the quality of predictions may degrade as the data changes. This could be due to changes in the underlying population generating the input data, new categories emerging in existing data, or macro-economic or regulatory changes. To ensure that the most accurate version of a model is in deployment, the ML Engineer employs a variety of strategies to avoid obstacles, such as data drift.

The challenge of drift highlights the advantages that the ML Ops framework provides via continuous model retraining and deployment. Drift is a shift away from the baseline: a model’s quality degrades over time, usually due to a divergence between the model’s training data and the data it serves in production.

Let’s look at an example of a model trained to predict fraud. To dramatize drift, we’ll assume it’s a new (hard) problem, where very little labeled data is available, so the model will need monitoring and maintenance more than most. The figure below depicts the quality of this deployed model over time, where the x-axis is time and the y-axis is the F1 score (the harmonic mean of precision and recall, a measure of prediction accuracy that balances the two types of errors: false alarms and false dismissals).

Fraud model.  Source: https://ml-ops.org/content/mlops-principles#monitoring
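For reference, the F1 score can be computed directly from raw prediction counts; a minimal sketch (the counts below are made-up numbers for illustration):

```python
# Sketch: computing the F1 score from raw prediction counts.
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall.

    false_pos = false alarms, false_neg = false dismissals.
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# A model that catches 80 of 100 fraud cases with 20 false alarms:
score = f1_score(true_pos=80, false_pos=20, false_neg=20)
# score -> 0.8
```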

The model is trained with early known fraud cases (and the vastly more common non-fraud cases), and is initially quite precise out of sample on new data. But its performance decays rapidly, due to many possible causes, such as data drift, fraudsters coming up with new ways to trick the system, or transactions arriving from companies previously unseen by the model triggering security errors.

The ML Operations framework anticipates drift as a fact of life and prepares for it; engineers set a performance threshold at τ (the green line in the graph), and if the score dips below the line, the model retraining process is triggered, using updated data which includes transactions learned about during the months in which the earlier model was deployed.
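A toy sketch of this trigger logic (the threshold value and the score history below are hypothetical placeholders; a real pipeline would hand off to an automated retraining job):

```python
# Sketch: a threshold-based retraining trigger, as in the figure.
# TAU and the score history are hypothetical placeholders.
TAU = 0.75  # minimum acceptable F1 score (the green line)

def needs_retraining(f1_history, tau=TAU):
    """Return True when the latest score dips below the threshold."""
    return bool(f1_history) and f1_history[-1] < tau

# Monthly F1 scores for the deployed model:
scores = [0.91, 0.88, 0.82, 0.74]
trigger = needs_retraining(scores)  # latest score 0.74 < 0.75
```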

Each organization has its own criteria for setting this performance threshold. Redeploying models can be resource-intensive and divert the Data Science team’s time from other worthwhile tasks. Yet this fraud model could be the centerpiece of a project, and an obsolete model means an obsolete solution to the problem the project is trying to address. What’s at stake? Depending on your task, it could be protecting private citizens’ assets by preventing bank fraud, or recovering millions (or more) in government money by identifying individuals who submitted fraudulent pandemic-related loan applications.

The ML Ops pipeline incorporates additional data and model steps into the loop.

The ML Ops framework treats a Machine Learning model as a first-class citizen worthy of constant testing, monitoring, and periodic re-deployment. The framework allows ML-forward organizations to rely on their Data Science products to conduct business by ensuring accurate results on data arriving at a fast pace and in large volumes.

Anti-fraud modeling is one widely adopted use case, but ML Ops can address a vast set of business challenges, including unemployment insurance, predictive maintenance, public health monitoring, sales forecasting, and e-commerce attribution.

Conclusion/Thoughts

For a wide range of organizations, from Government Agencies to Startups, integrating Data Science products into the Machine Learning Operations framework will save time and resources.

Thanks to multiple cloud environments supporting Data Science assets, less time is spent manually uploading models, repartitioning new data into training and testing subsets, or creating data pipelines. The sustained rhythm of monitoring, retraining, and deployment that ML Ops has adopted from the DevOps philosophy lets team members focus on their specialization, be it creating the ML model, maintaining the data pipeline, or monitoring deployed model metrics, while seamlessly handling the high-velocity data that arrives.

To Learn More

To meet increasing demand for ML Ops expertise, Elder Research has developed an ML Ops training course, available on edX. This series is offered on three platforms: AWS, Azure, and Google Cloud.