It is one thing to develop an analytics solution that is a technical success, but quite another to ensure that solution becomes a business success. It is a challenge to make an analytic model work in a production environment, as it requires teamwork from Information Technology, Data Management, Analytics, and Business units.
The goals of the process are threefold:
- Build models with repeatable, reliable results that do not depend on any single person or working environment to operate.
- Make model results available to end-users in a timely and useable manner.
- Monitor model performance on an ongoing basis to ensure quality and alert analysts to any degradation over time.
There are different strategies for achieving these goals. In this article, we briefly touch on the first of two required components – automating the model scoring process through model management and monitoring.
The ultimate solution depends on the organization’s goals and available resources.
What is Model Management and Monitoring?
Analysts need to create repeatable model runs in a timely manner. This is the most important requirement when operationalizing a model, because strategies for visualization and deployment depend on a strong foundation of model management. By monitoring models, managers know when a model used in production needs to be retrained.
Model Management involves:
- Versioning – maintaining an approved model across changes over time.
- Scheduling – running the model on an ongoing basis to obtain timely results.
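As a rough illustration of the versioning idea, the sketch below keeps a list of model versions, identifies each one by a hash of its serialized file, and allows exactly one version to be marked as approved. The function names and the in-memory list are assumptions made for this example; a real registry would normally persist its records in a database or a commercial model management tool.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(model_bytes: bytes) -> str:
    """Return a SHA-256 digest that identifies one exact model artifact."""
    return hashlib.sha256(model_bytes).hexdigest()


def register_version(registry: list, model_bytes: bytes, note: str) -> dict:
    """Append a new version record; version numbers start at 1."""
    record = {
        "version": len(registry) + 1,
        "sha256": fingerprint(model_bytes),
        "note": note,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "approved": False,
    }
    registry.append(record)
    return record


def approve(registry: list, version: int) -> None:
    """Mark exactly one version as the approved production model."""
    for record in registry:
        record["approved"] = record["version"] == version


def approved_version(registry: list):
    """Return the approved record, or None if nothing is approved yet."""
    return next((r for r in registry if r["approved"]), None)
```

Storing the hash alongside the version number makes it possible to confirm that the file being scored in production is the same one that was approved.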
Model Monitoring involves observing and reporting:
- Model Run Times – Is the model getting bogged down on large data?
- Model Performance – Is it consistent with the scores we observed during training? Changes in performance could indicate a shift in the underlying data population.
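The two monitoring questions above can be sketched in a few lines. The example times a scoring run, compares current accuracy to a training baseline, and computes a population stability index (PSI) as one possible way to quantify a shift in the score distribution. PSI, the 0.05 tolerance, and all function names are assumptions chosen for illustration, not something this article prescribes.

```python
import math
import time


def timed_run(score_fn, rows):
    """Score every row and return (scores, elapsed_seconds)."""
    start = time.perf_counter()
    scores = [score_fn(row) for row in rows]
    return scores, time.perf_counter() - start


def accuracy(predictions, actuals):
    """Fraction of predictions that match the observed outcomes."""
    matches = sum(p == a for p, a in zip(predictions, actuals))
    return matches / len(actuals)


def performance_alert(current, baseline, tolerance=0.05):
    """True when current performance is more than `tolerance` below baseline."""
    return (baseline - current) > tolerance


def bin_shares(values, edges, floor=1e-6):
    """Share of values in each bin defined by sorted `edges`.

    Empty bins are floored at a small positive value so the logarithm
    in the PSI formula stays defined.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v >= e for e in edges)] += 1
    return [max(c / len(values), floor) for c in counts]


def population_stability_index(expected, actual, edges):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    e = bin_shares(expected, edges)
    a = bin_shares(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI of zero means the binned score distributions match exactly. Larger values suggest the scored population has drifted away from the training population, and the threshold that should trigger retraining is a judgment call for each organization.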
What are Some Approaches for Model Management and Monitoring?
Baseline: Managing Models Manually
At minimum, an analyst runs a script manually each time results need to be updated. Though simple, this approach suffers when demand for the model increases, because it places undue responsibility on one person. This baseline approach can work, however, if the time between model runs is long (quarterly or annually) and the model code (including scripts, streams, etc.) is readily available for multiple people to use and run as needed.
Using Commercial Tools
Each of the Gartner Leaders in the Magic Quadrant for Analytics (SAS, IBM, KNIME, and RapidMiner) has a package for deploying and operationalizing models. SAS Model Manager is the most fully developed of the group, with capabilities for model management (versioning, scheduling, etc.) as well as model monitoring (champion vs. challenger, etc.). The other tools are strong on the model management side but have fewer capabilities for model monitoring. If an organization has a significant investment in a commercial analytics platform, using that tool to push models into operation often provides the best value for the effort.
In recent years, open-source analytics platforms have been increasing in popularity, driven by their much lower cost and somewhat greater flexibility. Many of these packages integrate with an open standard for models called Predictive Model Markup Language, or PMML. Zementis Adapa, for example, is a low-cost software solution for operationalizing models stored in the PMML format. One major caveat is that many commercial vendors do not fully support the PMML standard, and those that do often inject proprietary features in non-standard fields.
Custom Software Solutions
The most common solution (after manual model management) for operationalizing models is to employ custom software. By definition, these solutions can take on many different forms depending on the situation. We highlight three different strategies we have seen in multiple organizations.
Using Commercial Tools and Windows Scheduler
This solution could be dubbed “Model Management Lite” as it replaces the commercial model management tools with a collection of system scripts and tools. One common solution is to use Windows Scheduler to run a Windows batch process on a regular basis. This approach has many of the strengths of a commercial system due to its reliance on a commercial platform, but it lacks the integration and ease of use that a commercial tool provides. However, it is low in cost and can be implemented within almost every IT environment.
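A scheduled batch process of this kind usually wraps a small scoring script. The sketch below shows one possible shape for such a script: it reads a CSV file, scores each row, writes the results, and returns a non-zero exit code on failure so that the scheduler can flag the run. The column names (id, x1, x2) and the linear score_row function are hypothetical stand-ins for a real data layout and model.

```python
import csv
import logging
from pathlib import Path


def score_row(row: dict) -> float:
    """Hypothetical stand-in for a real model: a fixed linear score."""
    return 0.3 * float(row["x1"]) + 0.7 * float(row["x2"])


def run_batch(input_path: Path, output_path: Path) -> int:
    """Score every row of a CSV file, write id and score, return the row count."""
    with input_path.open(newline="") as src, \
            output_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow(["id", "score"])
        count = 0
        for row in reader:
            writer.writerow([row["id"], f"{score_row(row):.4f}"])
            count += 1
    return count


def main(argv) -> int:
    """Return 0 on success and 1 on failure, for use as a process exit code."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    try:
        n = run_batch(Path(argv[1]), Path(argv[2]))
    except Exception:
        # Log the full traceback so a failed unattended run can be diagnosed.
        logging.exception("Scoring run failed")
        return 1
    logging.info("Scored %d rows", n)
    return 0
```

To run this from a batch file, the script would end with the usual `if __name__ == "__main__":` guard calling `sys.exit(main(sys.argv))`, and the scheduled task would invoke the Python interpreter with the input and output paths as arguments.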
Database Stored Procedures
The most efficient way to score data automatically is to score it in the database as the data is updated. This can be achieved through a stored procedure or materialized view. Most database systems, whether a traditional relational database (RDBMS) or a “Big Data” NoSQL solution, provide some hook for integrating custom scoring logic. This approach makes sense for organizations with high-volume data and a mature IT capability.
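To make the idea concrete, the sketch below uses SQLite from Python as a small stand-in for a production database. SQLite has no stored procedures, so the example registers a user-defined SQL function and calls it from triggers, which gives the same effect of scoring rows as they are inserted or updated. The table layout, the function name model_score, and the linear formula are all assumptions for illustration; a production RDBMS would use its own procedural language instead.

```python
import sqlite3


def linear_score(x1: float, x2: float) -> float:
    """Hypothetical stand-in for a fitted model."""
    return 0.3 * x1 + 0.7 * x2


def open_scoring_db() -> sqlite3.Connection:
    """Create an in-memory database that scores rows as they change."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python function to SQL under the name model_score.
    conn.create_function("model_score", 2, linear_score)
    conn.executescript(
        """
        CREATE TABLE customers (
            id INTEGER PRIMARY KEY,
            x1 REAL NOT NULL,
            x2 REAL NOT NULL,
            score REAL
        );
        -- Score each new row as soon as it is inserted.
        CREATE TRIGGER score_on_insert AFTER INSERT ON customers
        BEGIN
            UPDATE customers SET score = model_score(NEW.x1, NEW.x2)
            WHERE id = NEW.id;
        END;
        -- Rescore a row whenever one of its inputs changes.
        CREATE TRIGGER score_on_update AFTER UPDATE OF x1, x2 ON customers
        BEGIN
            UPDATE customers SET score = model_score(NEW.x1, NEW.x2)
            WHERE id = NEW.id;
        END;
        """
    )
    return conn
```

Because the triggers only write the score column and the update trigger only fires on changes to x1 or x2, the scoring step does not trigger itself again.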
Custom Software Application
Custom scoring software can integrate model operationalization into a middleware software component (or other automated scoring mechanism) as a standalone tool or as part of a larger system. The specifics of this approach will depend on the goals of the system and the available IT environment.