This is the first in a series of short blog posts where we explore common varieties of bias that can beset analytics projects. Bias has serious ramifications for the success of analytics in any organization. Understanding the nature of bias is crucial for understanding the extent of a model’s accuracy. In this first post, we discuss what bias is, why it occurs, and why it matters (a lot).
What is Bias?
Bias has several definitions, and its common usage is decidedly negative. We typically use it to mean systematic favoritism of a group. Etymologically, “bias” is thought to derive from an ancient Greek word that describes an oblique line (i.e., a deviation from the horizontal). In Data Science, bias is a deviation from expectation in the data. More fundamentally, bias refers to an error in the data, but one that is often subtle and goes unnoticed. So, why does bias occur in the first place?
Over the next posts in this series, we will briefly define and describe common statistical and cognitive biases, as listed below:
- Selection (or sample) Bias
- Seasonal Bias
- Linearity Bias
- Confirmation Bias
- Recall Bias
- Survivor Bias
- Observer Bias
- Reinforcement Bias
We will also describe why each of these biases poses unique Data Science challenges.
Why does Bias Occur?
Bias occurs because of sampling and estimation. If we could know everything about all the entities in our data (e.g., customers, insurance claims, software sessions), and could store information on all possible entities, our data would have no bias. Additionally, humans are poor intuitive statisticians and their estimations are often inaccurate. These problems are so pernicious they are commonly found even in carefully constructed, controlled statistical experiments.
But Data Science is not conducted in carefully controlled conditions; it must work with “found data,” that is, data collected for a purpose other than modeling. Such data is very likely to contain biases.
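To make the sampling point concrete, here is a minimal sketch that compares a simple random sample with a "found" sample drawn from the same population. Everything in it is an assumption made for illustration: the customer-spend population, the rule that only customers spending more than 110 were recorded, and the function name compare_samples are all hypothetical.

```python
import random
import statistics


def compare_samples(seed=0, n_population=10_000, n_sample=500):
    """Return (population mean, random-sample mean, found-data mean)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable

    # Hypothetical population: monthly spend for every customer.
    population = [rng.gauss(100, 20) for _ in range(n_population)]

    # A simple random sample: every customer is equally likely to be chosen.
    random_sample = rng.sample(population, n_sample)

    # "Found data": assume only customers spending more than 110 were
    # ever recorded, e.g. through a loyalty program (an invented rule).
    found_sample = [spend for spend in population if spend > 110][:n_sample]

    return (
        statistics.mean(population),
        statistics.mean(random_sample),
        statistics.mean(found_sample),
    )


if __name__ == "__main__":
    pop_mean, random_mean, found_mean = compare_samples()
    print(f"population mean:    {pop_mean:.1f}")
    print(f"random-sample mean: {random_mean:.1f}")
    print(f"found-data mean:    {found_mean:.1f}")
```

By construction, the found-data mean must sit above 110 even though the population is centered near 100, and collecting more found data under the same rule would not close that gap. The random sample, by contrast, lands close to the population mean.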
Why does Bias Matter?
Predictive models only “see” the world through the data used for training. In fact, they “know” of no other reality. When those data are biased, model accuracy and fidelity are compromised. Biased models can limit credibility with important stakeholders. At worst, biased models will actively discriminate against certain groups of people. Being aware of these risks allows a Data Scientist to better eliminate bias. The resulting higher-quality models improve analytics adoption and enhance the value of analytics investment.
In the next installment, we will take a brief look at selection bias, and how your data may (or more likely, may not) represent what you think it does.
Kahneman, D. Thinking, Fast and Slow. Farrar, Straus and Giroux, New York, NY (2011), p. 112.