Statistical & Cognitive Biases in Data Science: What is Bias?

Will Goodrum

July 21, 2017

This is the first in a series of short blog posts in which we explore common varieties of bias that can beset analytics projects. Bias has serious ramifications for the success of analytics in any organization, and understanding its nature is crucial for judging how accurate a model really is. In this first post, we discuss what bias is, why it occurs, and why it matters (a lot).

What is Bias?

Bias has several definitions, and in common usage it is decidedly negative: we typically use it to mean systematic favoritism of a group. The word “bias” is thought to derive from an ancient Greek word that describes an oblique line (i.e., a deviation from the horizontal). In Data Science, bias is a systematic deviation from expectation in the data. More fundamentally, bias refers to an error in the data, but one that is often subtle or goes unnoticed. So, why does bias occur in the first place?

Over the next posts in this series, we will briefly define and describe common statistical and cognitive biases, as listed below:

  • Selection (or sample) Bias
  • Seasonal Bias
  • Linearity Bias
  • Confirmation Bias
  • Recall Bias
  • Survivor Bias
  • Observer Bias
  • Reinforcement Bias

We will also describe why each of these biases poses unique Data Science challenges.

Why does Bias Occur?

Bias occurs because of sampling and estimation. If we could know everything about all the entities in our data (e.g., customers, insurance claims, software sessions), and could store information on all possible entities, our data would have no bias. Additionally, humans are poor intuitive statisticians, and their estimates are often inaccurate[1]. These problems are so pernicious that they are commonly found even in carefully constructed, controlled statistical experiments.

But Data Science is rarely conducted in carefully controlled conditions; it must work with “found data” (data collected for a purpose other than modeling). That data is very likely to have biases.
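To make the sampling point concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the population of claim amounts is simulated, and the rule that only claims above 500 were ever recorded is a hypothetical stand-in for “found data.” It compares the error of a random sample with the error of the found data when both are used to estimate the average claim.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 simulated insurance claim amounts.
population = [random.expovariate(1 / 1000) for _ in range(100_000)]

# If we could observe every entity, the average would be exact.
true_mean = mean(population)

# A random sample is noisy, but it is not systematically off.
random_sample = random.sample(population, 1_000)

# Hypothetical "found data": only claims above 500 were ever recorded.
found_data = [amount for amount in population if amount > 500]

random_error = mean(random_sample) - true_mean
found_error = mean(found_data) - true_mean

print(f"True mean:           {true_mean:8.1f}")
print(f"Random-sample error: {random_error:8.1f}")
print(f"Found-data error:    {found_error:8.1f}")
```

The random sample misses the true average by a small amount that shrinks as the sample grows. The found data miss it in one direction, and collecting more of the same kind of data would not fix that.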

Why does Bias Matter?

Predictive models only “see” the world through the data used for training. In fact, they “know” of no other reality. When those data are biased, model accuracy and fidelity are compromised. Biased models can limit credibility with important stakeholders. At worst, biased models will actively discriminate against certain groups of people. Being aware of these risks allows a Data Scientist to better eliminate bias. The resulting higher-quality models improve analytics adoption and enhance the value gained from analytics investment.
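As a rough illustration of a model knowing no reality beyond its training data, the sketch below fits a straight line to data that only cover a narrow range and then asks it about cases outside that range. The quadratic “truth” and the ranges are assumptions chosen purely for illustration; they are not drawn from any real project.

```python
# Hypothetical true relationship, unknown to the model: y = x^2.
def truth(x):
    return x * x

# The training data only cover x between 0 and 1.
train_x = [i / 100 for i in range(101)]
train_y = [truth(x) for x in train_x]

# Ordinary least-squares fit of a straight line to the training data.
n = len(train_x)
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
variance = sum((x - mean_x) ** 2 for x in train_x)
slope = covariance / variance
intercept = mean_y - slope * mean_x

def predict(x):
    return slope * x + intercept

def worst_error(xs):
    """Largest absolute gap between the model and the truth over xs."""
    return max(abs(predict(x) - truth(x)) for x in xs)

# Error on the range the model has seen vs. a range it has never seen.
inside = worst_error(train_x)
outside = worst_error([1 + i / 100 for i in range(1, 201)])  # 1 < x <= 3

print(f"Worst error inside the training range:  {inside:.3f}")
print(f"Worst error outside the training range: {outside:.3f}")
```

Judged only on the data it was trained on, the line looks like a reasonable model. The damage shows up once it is applied to parts of the world the training data never represented.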

In the next installment, we will take a brief look at selection bias, and how your data may (or more likely, may not) represent what you think it does.


[1] Kahneman, D. Thinking, Fast and Slow. Farrar, Straus and Giroux, New York, NY (2011), p. 112.



About the Author

Will Goodrum, Data Scientist

Will Goodrum has a decade of experience applying numerical analysis and engineering to solve practical problems and generate value for customers. Previously, Dr. Goodrum worked in an engineering software firm, helping medium-to-large scale customers across industrial sectors develop superior products and reduce their time-to-market. As a graduate student, he applied statistical modeling and physics-based simulation to estimate the impact of policy decisions on lifetime maintenance costs for a regional transportation authority. Will holds a B.S. in Mechanical Engineering from the University of Virginia, and a PhD in Engineering from Cambridge University.