When an organization invests in data science, it needs confidence that its predictive models will be robust; that is, that they will actually work when applied to future cases. However, this critical requirement is too often poorly addressed. Schoolbook answers are partly to blame. Consider this innocuous quote from Investopedia:
“Researchers use stratified random sampling to obtain a sample population that best represents the entire population being studied. Stratified random sampling involves first dividing a population into subpopulations and then applying random sampling methods to each subpopulation to form a test group.”
Surely, we would want stratified random training and test samples (partitions) to represent the population well for our predictive models. This compelling thought is referenced and repeated in textbooks and by the producers of machine learning tools, commercial and open-source alike. But this partitioning method is woefully inadequate in virtually all real-world predictive model builds, simply because the real-world future cases will not be “stratified and balanced” like the historical randomized partitions used to train the model. The “future cases” are the “population” that our test partitions need to represent, and we in fact have no such partition!
Let’s step back and reconsider…
Why Test at All?
We test something to make sure it works before we deploy it into a production environment. For example, students are tested to ensure they are adequately prepared to pass to the next grade. A chef tastes a dish before serving it to guests to make sure it will meet expectations. A car is tested at the end of the assembly line to add confidence it is ready for sale. A new predictive machine learning algorithm is tested to verify that it will make correct inferences about the relationship of the inputs to the future outputs. A predictive model is tested during the model build to ensure it will work consistently on future cases.
Testing in Machine Learning (ML)
There are two distinctly different testing cases for models produced by machine learning algorithms:
- General Use Case: Testing of a novel algorithm developed to enhance the arsenal of machine learning tools available to the data science community
- Specific Use Case: Testing of a new model for deployment in an organization to solve a business problem
The second is downstream from the first. The purpose of the first is to make sure the algorithm infers its models in a consistent way and is usually performed by the original developers of the technique. The purpose of the second is to make sure the implemented model will deliver future estimates, classifications, or recommendations in practice as well as we claim it will. Since these purposes are different, the tests required are also different.
Developing Novel ML Algorithms
If one aims to develop a new machine learning algorithm, such as a new type of neural network, one tests it to ensure that it will produce essentially equivalent models on similar but different data partitions that vary only by random chance. If it doesn’t, then it is not suitable for general use. This case requires that the test partition must be equivalent to, but separate from, the training partition:
- It should be a stratified random partition drawn from the same population as the training data
- Its observations must not appear in the training partition
To trust the algorithm, developers must confirm that each time they re-randomize the partitions, they end up with statistically equivalent models. Commonly used machine learning algorithms have all passed this critical test, at least when given a sufficiently large number of samples. As builders of practical models, we also use stratified random partitions to measure overfit when we have noisy or correlated features or small sample sizes, but this does not assure us of a model's performance in future time periods, products, or markets.
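As a sketch of this re-randomization test, the following pure-NumPy example (the data is synthetic and the split helper is illustrative, not a library API) repeatedly draws stratified random partitions and checks that the fitted coefficients barely vary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: one informative feature, binary target (~30% positives).
n = 2000
y = (rng.random(n) < 0.3).astype(int)
X = y[:, None] + rng.normal(0.0, 1.0, (n, 1))  # feature correlated with target

def stratified_split(X, y, test_frac, rng):
    """Random split that preserves the class ratio in both partitions."""
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        cut = int(len(idx) * test_frac)
        test_idx.extend(idx[:cut])
        train_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)

# Re-randomize the partitions several times; a stable algorithm should
# produce statistically equivalent models (similar coefficients) each time.
coefs = []
for _ in range(5):
    tr, te = stratified_split(X, y, 0.25, rng)
    w, *_ = np.linalg.lstsq(np.c_[np.ones(len(tr)), X[tr]], y[tr], rcond=None)
    coefs.append(w[1])

print(np.std(coefs))  # small spread across re-randomizations
```

A small spread in the coefficient across re-randomized partitions is exactly the "statistically equivalent models" property that algorithm developers verify; it says nothing yet about future time periods.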
Building a Practical Model
When developing a model for deployment in an organization to improve decision making, we must test the accuracy of results on data that the model has not seen before — on data that will “surprise” the model as harshly as the future will. It is impossible to create a stratified random test partition from the population to which the model will be applied, because it does not yet exist. Stratified random partitions from past populations will not be equivalent to the populations the future will produce. We are asking the model to extrapolate to a future population—which is inherently dangerous.
So, what can be done if stratified random partitioning (sampling by historical ratios) cannot assure future applicability? We can simulate the future from different snapshots of historical data. Simulate the passage of time over history, rather than drawing a homogeneous set of random partitions of the observations, and choose the model whose performance metrics are most stable over the known time series. When considering several machine learning algorithms or alternative sets of input features, we can rank them by performance at each time period and select the model that consistently outranks the others from one period to the next. In essence, we are testing how well the model will perform over the passage of unforgiving time, which is exactly what we need to do!
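A minimal sketch of this time-based backtest, using invented quarterly snapshots in which one feature is genuinely causal and one is a proxy whose relationship to the outcome drifts over time:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated history: 6 snapshots. One stable causal driver and a proxy
# whose link to the outcome weakens, then reverses (names are illustrative).
periods = 6
data = []
for t in range(periods):
    n = 500
    causal = rng.normal(0.0, 1.0, n)
    proxy = rng.normal(0.0, 1.0, n)
    drift = 1.0 - 0.3 * t          # proxy's relationship drifts each period
    y = 2.0 * causal + drift * proxy + rng.normal(0.0, 0.5, n)
    data.append((np.c_[causal, proxy], y))

def rolling_backtest(data, cols):
    """Train on all snapshots before period t, score on period t (MSE)."""
    scores = []
    for t in range(1, len(data)):
        Xtr = np.vstack([d[0][:, cols] for d in data[:t]])
        ytr = np.concatenate([d[1] for d in data[:t]])
        w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
        Xte, yte = data[t][0][:, cols], data[t][1]
        scores.append(np.mean((Xte @ w - yte) ** 2))
    return np.array(scores)

causal_only = rolling_backtest(data, [0])
with_proxy = rolling_backtest(data, [0, 1])

# The proxy helps early on, then hurts badly; the causal-only specification
# has the more stable error across the passage of time.
print(np.std(causal_only), np.std(with_proxy))
```

Ranking candidate specifications by the stability of their period-by-period scores, rather than by a single pooled random-holdout score, is the selection rule described above.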
More Anticipated Extrapolations Become Reasonable
This concept of testing by distinct consecutive time periods is powerful, but it is not the only useful data partitioning strategy that provides practical value for model deployment. Consider some use cases regarding the model’s ability to extrapolate to new types of observations that are different from historical ones:
- If an organization is planning to expand to a new geographical region, it can be helpful to measure model performance on each particular historical geographical region when that region is excluded from the training data. The same logic applies when we anticipate expansion to new market segments.
- If an organization is planning to develop a new product, it is valuable to estimate how well the model will extrapolate to new products. Partitioning and testing by current products provides early insight into how the model will perform on future ones.
- Our measurable objective is to test the model’s ability to extrapolate to difficulties we expect the future to bring. Random stratified partitioning does nothing to help us with this, but partitioning by the relevant segment does!
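The leave-one-segment-out idea above can be sketched as follows; region names, sample sizes, and effect sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Three historical regions; we anticipate expanding to a fourth.
regions = np.array(["north", "south", "west"])
frames = {}
for r in regions:
    n = 400
    x = rng.normal(0.0, 1.0, (n, 1))
    y = 1.5 * x[:, 0] + rng.normal(0.0, 0.5, n)  # same relationship everywhere
    frames[r] = (x, y)

# Leave-one-region-out: train on the other regions, test on the held-out one.
# Stable error across held-out regions suggests the model may extrapolate
# to a region it has never seen.
mses = {}
for held_out in regions:
    Xtr = np.vstack([frames[r][0] for r in regions if r != held_out])
    ytr = np.concatenate([frames[r][1] for r in regions if r != held_out])
    w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    Xte, yte = frames[held_out]
    mses[held_out] = float(np.mean((Xte @ w - yte) ** 2))
    print(f"held out {held_out}: MSE {mses[held_out]:.3f}")
```

The same loop works for market segments or products: replace the region key with whatever segment you expect the future to add.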
Impact on Model Applicability
Popular machine learning algorithms have already been tested by their creators (and the wider community, over time) to perform inferential learning consistently. However, our foremost objective in applied data science is to determine if the model for our specific use case will extrapolate to future cases.
Why don’t some models extend to future cases even though they work on past cases? The short answer is that the real world can be a very messy place, with relationships and entities changing seemingly spontaneously across many dimensions, especially time and space. In real-world problems, we never have the complete set of causal factors for the outcome we wish to predict. Instead, we have a plethora of proxy features (features that are only statistically correlated with the true causal factors), and those for only a subset of the true causal factors. The relationships of most proxy features to the actual causal factors can change chaotically.

Testing on explicitly different time periods of the data helps ferret out the factors whose relationships to the target outcome are unstable. If we favor features that work well in every time period and geography, the model becomes more attuned to causal factors: a proxy’s measured importance diminishes across time and place, while causal factors remain consistently important. A model developed and tested in this way will therefore be more reliable than one tested only on homogeneous random partitions.
Note that most of what is commonly called model build is in fact model specification, namely selecting:
- The best model inputs
- The best transformations of those inputs and the dependent variable
- The best model type, including the optimum hyper-parameters (model tuning parameters).
This is what model testing is all about: finding the optimum model specification. If we can demonstrate that the model specification consistently produces a well-performing model regardless of which natural partition of data it is trained on, then upon completion, we can confidently fit it to all the data, including the test data.
Cross-validation is an appropriate scheme for testing practical models. It should be configured to work through the folds (partitions) of the data iteratively, building and testing on each fold separately. The first set of folds should be distinctly different time periods, each anchored by an “as of” date, where:
- the inputs are the data that were available on that date
- the target outcome is what happened after that “as of” date.
This exercise will demonstrate how the model will perform on (extrapolate to) new time periods. Likewise, if we hope to extrapolate to new products or markets, we can partition by these criteria to prepare to meet this objective.
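A toy sketch of constructing such “as of” folds from an event log; the field names, dates, and the aggregate-spend feature are all illustrative:

```python
from datetime import date

# Each event has a timestamp; a fold anchored at an as-of date uses only
# earlier events to build features and only later events to define the target.
events = [
    {"customer": "a", "ts": date(2023, 1, 5), "spend": 120},
    {"customer": "a", "ts": date(2023, 4, 9), "spend": 80},
    {"customer": "b", "ts": date(2023, 2, 1), "spend": 200},
    {"customer": "b", "ts": date(2023, 5, 20), "spend": 50},
]

def make_fold(events, as_of):
    """Inputs: what was known on the as-of date. Target: what happened after."""
    inputs, targets = {}, {}
    for e in events:
        bucket = inputs if e["ts"] <= as_of else targets
        bucket[e["customer"]] = bucket.get(e["customer"], 0) + e["spend"]
    return inputs, targets

# One fold per quarter boundary; each fold simulates scoring on that date.
folds = [make_fold(events, d) for d in (date(2023, 3, 31), date(2023, 6, 30))]
```

Because each fold sees strictly pre-date inputs and strictly post-date outcomes, evaluating a model across the folds mimics the leakage-free conditions the deployed model will actually face.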
As a Danish politician once inadvertently quipped, “It is difficult to predict, especially the future.” Testing on random partitions of the data, stratified or not, cannot give you confidence that the model will work on future data. But testing on partitions that post-date the training partitions can.