In this series of short blog posts, we explore common biases that beset analytics projects. Bias can seriously impair the success of analytics in an organization, so understanding what to watch for is crucial. In this second post, we discuss a manifestation of one of the most prevalent and significant kinds of statistical biases: selection bias. We describe what it is, how pervasive it may be, some specific examples of how it may manifest, and how to mitigate it.
What is Selection Bias and Why Does it Occur?
Selection (or “sampling”) bias occurs in an “active” sense when the sample data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis.
Data Scientists focus great attention on fashioning samples of data that relate to a problem of interest. They then create models developed from these samples of “found” (or observational) data to make inferences and decisions about larger populations. Examples of these samples include customer segments, types of insurance claims, or selected sensor readings from IoT-connected devices. Almost always, one does not have the luxury of designing the experiment, that is, of deciding beforehand where the data samples will be located, as one does in a classical statistical experiment. So there is no guarantee that the space is covered evenly, or even that some parts are covered at all.
Why is Selection Bias Important?
To the modeling algorithm, the samples of data that it is shown are all that it knows about the world. Crafting subsets of data is necessary and fundamental to the practice of Data Science, since sampling will help a predictive modeling algorithm to “learn” the patterns associated with outcomes of interest (e.g., fraudulent claims, anomalous sensor readings).
Outside this active scoping, there is a much larger, “passive” variety of selection bias that undergirds many common cognitive and statistical biases. In brief, there is a limit to what can be known about a population and recorded in a database. However, discussing that “passive” selection bias is outside the bounds of this brief introduction. (We even have to make sampling decisions for blog posts!)
An innocent example of active selection bias occurs when the population of interest changes over time, so that the initial sample, however carefully crafted, no longer represents the broader population. This is why, for example, the US Government conducts a census at regular intervals: to provide government agencies with vital demographic information about the population at a given point in time. But that information becomes stale, as do the economic models built upon it. Continuing to use the out-of-date sample actively introduces bias into the data.
At other times, samples may be selected such that they knowingly (or unknowingly) exclude or discriminate against categories of people, as has been reported extensively in popular press regarding social (and other) algorithmic biases. Ultimately, if the samples used by Data Scientists are biased, then their conclusions will be inaccurate (and possibly even harmful).
Some Practical Examples of Active Selection Bias
As noted above, data sampling is fundamental to the practice of Data Science, which means active selection bias can creep in for entirely avoidable reasons. How can you know if you have encountered active selection bias in your analytics work? Here are some examples that we have seen at Elder Research with real clients:
- Sampling from the Top: while investigating the first and last few rows of a dataset can be useful for exploratory analyses, this relies on the assumption that the sample of data at the top of the file is distributed the same way as the entire dataset. Inferences drawn from the top of an ordered dataset can lead to false conclusions, due to selection bias introduced by the ordering. For example, SAS Enterprise Miner will calculate sample statistics and create exploratory charts on only the first 20,000 rows of data in a dataset. If the dataset is ordered and much larger than 20,000 rows, then the inferences drawn from exploring the first 20,000 rows will not be representative of the population. This can lead to inaccurate conclusions whose impacts may not become evident until model building has already begun. This truncation to the top “N” rows of a dataset is not unique to Enterprise Miner; similar examples can be found in exploratory solutions implemented in SPSS Modeler and even R.
- Sampling a Single Element of a Class: our founder, John Elder, tells the story of a direct-marketing contact who had a charitable-giving dataset of over a million rows that contained few examples of donors (about 1% of the total). Under the circumstances, the client kept all the donor examples and sampled every 10th non-donor until they had created a tractable model-building dataset of 100,000 rows. This downsampling is common practice in Data Science, and advisable, since it helps the algorithm to identify patterns related to the rare outcomes of interest. However, the large dataset was ordered by ZIP code, and the 100,000-row quota was filled before the sampling reached the final state. Since the donors came from across the entire country, the decision-tree model quickly picked up on the fact that records were likely to be donors if they came from a certain Arctic state!
- Sampling Downstream of Existing Processes: a recent client hired us to build a predictive model to enhance their fraud investigations. The client currently uses an extensive, expert-driven rules system to identify fraudulent transactions. They wanted to use our model as a second opinion to their rules engine, to help reduce the number of valid customers who were falsely classified as fraudulent. The data we received for modeling included transactions labeled as both Good and Bad. However, the results of the rules engine introduced selection bias into our modeling dataset: transactions that exceeded the rejection threshold in the rules engine were only ever labeled as fraudulent. Even if some of these rejections were “false alarms,” the data could not show it, and the model could only ever learn them as “Bad.”
- Sampling in an “Opt-In/Out” Environment: Elder Research was hired by a major software development company to analyze log files and identify usage patterns. Our customer hoped to understand workflows that may enhance their customers’ efficiency, or sequences that may have led to software crashes. Users either opted in to the program to always provide feedback logs on their sessions, or only provided data when their software crashed. We intentionally decided to sample data only from the opted-in user base, and excluded data from crash-only users. Although useful, this decision knowingly introduces sampling bias: inferences for the opted-in subset of users are technically only applicable to users who willingly opt in to provide feedback.
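The ZIP-code story above is easy to reproduce in miniature. The sketch below is a toy simulation (the state names, sizes, and rates are invented, not the client’s actual data): it builds a file ordered by state, applies the same “every 10th non-donor until the quota is met” plan, and shows that the states at the end of the file contribute no non-donor examples at all.

```python
import random

random.seed(0)

# Toy "charitable giving" file: 50 states x 2,000 rows, ~1% donors,
# sorted by state -- a stand-in for the client's ZIP-code ordering.
states = [f"state_{i:02d}" for i in range(50)]
rows = [(s, random.random() < 0.01) for s in states for _ in range(2_000)]

# The flawed plan: keep every donor, plus every 10th non-donor,
# until the sample reaches its 10,000-row quota.
sample, seen_nondonors = [], 0
for state, is_donor in rows:
    if is_donor:
        sample.append((state, is_donor))
    elif len(sample) < 10_000:
        seen_nondonors += 1
        if seen_nondonors % 10 == 0:
            sample.append((state, is_donor))

# States near the end of the ordering end up with donors only, so a
# decision tree would "learn" that those states guarantee a donor.
states_with_nondonors = {s for s, d in sample if not d}
missing = [s for s in states if s not in states_with_nondonors]
print(missing)  # the last few states in the ordering
```

Shuffling the rows (or sampling non-donors at random across the whole file) before applying the quota removes the artifact entirely.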
How can Selection Bias be Mitigated?
There are strategies that data scientists and business stakeholders can adopt to mitigate the effects of selection bias:
- Sampled data should closely represent the full population of interest, or at least represent the cases to which it will ultimately be applied. We recommend a stratified sample, where each important input category in the data is sampled separately and then those subsets are joined according to their appropriate proportions.
- If the prediction problem involves classification of an outcome (e.g., fraudulent or not fraudulent), then any data sample used for validating the model should have approximately the same balance of outcome classes as the population.
- Once the data sample is created, the sampling strategy should be documented, and any limitations of the strategy should be stated. This documentation will highlight the potential for selection bias once the model is built and deployed.
- Clearly articulate the business question the model will answer and secure agreement about this question from all stakeholders. This prevents misappropriation of model results to answer other questions that may be related, but are out of scope due to selection bias.
- Predictive model results should be actively monitored after deployment. Decreasing performance over time may indicate changes in the underlying data that will require model retraining.
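The first two recommendations can be sketched concretely. Below is a minimal stratified-sampling example in Python (the labels, sizes, and rates are invented for illustration): each outcome class is sampled separately, then the subsets are recombined so the sample preserves the population’s class proportions.

```python
import random
from collections import Counter

random.seed(1)

def stratified_sample(rows, key, n):
    """Draw about n rows so each stratum keeps its population proportion."""
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    total = len(rows)
    sample = []
    for members in strata.values():
        take = round(n * len(members) / total)  # proportional allocation
        sample.extend(random.sample(members, min(take, len(members))))
    return sample

# Toy population: ~95% "good" transactions, ~5% "fraud".
population = [{"label": "fraud" if random.random() < 0.05 else "good"}
              for _ in range(20_000)]

sample = stratified_sample(population, key=lambda r: r["label"], n=2_000)
print(Counter(r["label"] for r in sample))  # ~5% fraud, matching the population
```

Because each class is sampled independently, the fraud rate in the sample matches the population rate, so a validation set built this way gives an honest picture of deployed performance.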
In this post, we presented an overview of active selection/sampling bias that may result in the regular practice of Data Science. Creating samples is a necessary part of building predictive models, but modelers must use caution to construct those subsets without bias. These samples must represent the population as a whole, if inferences drawn on the sample are to be properly applied to cases outside of the sample. Through careful and intentional crafting of data subsets for model building, and communication about the limitations of these subsets, Data Scientists can mitigate the effects of this common form of bias.
In the next blog post in this series, we will look at Linearity Bias, a common cognitive bias that can lead to misunderstanding and costly errors in estimation.
Download the eBook Top 10 Data Mining Mistakes to learn about other mistakes to avoid in your analytics projects.
Read part one of this blog series Statistical & Cognitive Biases in Data Science: What is Bias?
Read the blog Avoid Reinforcement Bias When Fishing in the Same Pond.