I have identified five primary reasons why analytical models fail:
- Poor Organizational Support
- Missing Causes
- Model Overfit
- Data Problems
- False Beliefs
In this post, we consider how and why causes missing from a model's training data can lead to incorrect inferences or outright failures.
Correlation, Not Causation
The data sets we use to build predictive analytics (machine learning) models contain information that relates to the outcomes or events of interest. What they may not contain is the full set of causes of those events. If a model is used strictly for prediction, and not to explain why an outcome occurs or to find a way to influence it, it may still be quite predictive. If it is used to guide intervention, however, it will likely not produce the desired result. For example, a machine learning model can predict heart failure rates in individuals from their physician visits related to cardiac care. If we try to reduce heart failure by removing cardiac care, our model will predict lower heart failure rates but will be badly wrong!
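To make the heart failure example concrete, here is a minimal simulation (the data and variable names are invented, not drawn from a real study) in which a hidden cause, disease severity, drives both the cardiac-care visits we observe and the heart failures we want to prevent. A model trained on visits alone predicts well, but "removing cardiac care" only changes the prediction, not the outcome:

```python
# A minimal simulation (hypothetical data) showing why intervening on a
# correlated proxy misleads a purely predictive model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Hidden cause: underlying cardiac disease severity (never observed directly).
severity = rng.normal(size=n)

# Observed proxy: cardiac-care visits rise with severity.
visits = np.clip(np.round(2 * severity + rng.normal(scale=0.5, size=n) + 3), 0, None)

# Outcome: heart failure is driven by severity, not by the visits themselves.
p_failure = 1 / (1 + np.exp(-(severity - 1)))
failure = rng.binomial(1, p_failure)

# A model trained on the proxy alone predicts the observed rate reasonably well.
model = LogisticRegression().fit(visits.reshape(-1, 1), failure)
print("Observed failure rate:           ", failure.mean().round(3))
print("Predicted rate (observed visits):", model.predict_proba(visits.reshape(-1, 1))[:, 1].mean().round(3))

# "Removing cardiac care" (setting visits to zero) only changes the prediction.
# The true failure rate, driven by severity, is unchanged.
no_care = np.zeros_like(visits).reshape(-1, 1)
print("Predicted rate (visits set to 0):", model.predict_proba(no_care)[:, 1].mean().round(3))
print("True rate after the intervention:", failure.mean().round(3))
```

When run, the predicted rate collapses once visits are set to zero while the true failure rate does not move, because the lever we pulled was never a cause.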
Consider some more realistic examples. If we are looking to:
- Stop illegal activity (e.g., fraud)
- Encourage desirable behavior (e.g., purchases, retention)
- Identify rare events of interest (e.g., diagnose disease or machine failures)
then we are hoping to find the motivations, factors, or events that cause those outcomes to occur. Causes are the levers that, if pulled (modified), will influence the outcome. If we have only correlated information, we may make changes in vain: they will miss the desired impact and may even do harm.
When causes are missing from our training data sets, the risk cuts in two directions:
- Incorrect Inferences: We assume that we have all the data we need and learn incorrect things with the model.
- Failure to Launch: Because we know we are missing causal information, we abandon the initiative, often when the model would still be accurate enough to produce substantial value.
The Four Quadrants of Missing Information
Information relevant to fitting a model can be grouped four ways, as shown in the figure below.
As this chart demonstrates, missingness may be assessed in terms of the availability of the information and our awareness of our need for it. The two major ways information may be missing are by input (feature) or by row (observed outcome). If observations are missing or are not representative of the target population, we refer to this as selection bias. The focus here, however, is on completeness of the inputs.
- Known Knowns — Retailers know that promotions and on-shelf availability impact sales. Engineers know that mechanical vibration and extreme heat cycles cause equipment failures. Epidemiologists know that cold viruses spread more readily in confined spaces than outdoors. When building models, it is important to ensure these known factors are included in the training data. Known direct causal factors should always be included in the model. Because some causal factors may not be available, correlated proxy features may be used in their place, provided they are recognized as correlates rather than causes.
- Known Unknowns — There is always information that we would like to have in our data set but that we can never capture. For example, we may infer information about upcoming major life events (e.g., marriage, birth, death), but we typically will not know them or their timing explicitly. Additional examples include religious and cultural commitments (e.g., an upcoming Hajj), which generally must be self-reported by participants. Without these “known unknowns,” model errors increase.
- Unknown Knowns — This is the information we can find if we are insightful enough to look for it. For instance, internally available causal factors such as policy adjustments, operational priority changes, or promotions are regrettably often not employed because they are messy, dispersed, and hard to integrate. If the hard work of feature gathering and engineering is done, they can be included in our training data set to good effect. Also, people often assume they do not have the data for something when in fact it is available. For example, on a recent engagement with a large, non-profit health foundation, we discovered that being older than 55 was predictive of a willingness to engage with the foundation. Similarly, on a recent online identity fraud detection project, we engineered a feature counting the number of unique devices associated with a single account, which proved more informative about fraudulent activity than device type (a sketch of this kind of feature appears after this list). Adding these harder-to-get features reduces the error of our models.
- Unknown Unknowns — These are our blind spots: the data we haven’t looked for and didn’t know we needed. Their omission stems from unexamined assumptions, and the resulting blind spots drive errors in our models that we falsely assume to be unexplainable and random. For decades, doctors believed stomach ulcers were caused by excessive stress. They were extremely skeptical when Drs. Barry Marshall and Robin Warren of Perth, Australia suggested that peptic ulcers were caused by the H. pylori bacterium. This skepticism delayed the proper treatment of ulcers in the US until the evidence became incontrovertible and the Nobel Prize was awarded to the two doctors. There will always be unknown unknowns. We need to be humble enough to recognize this reality, or we will delay progress on the predictive power of our models.
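As a concrete illustration of the device-count feature mentioned under Unknown Knowns, here is a hypothetical sketch of how such a feature might be engineered with pandas. The column names (account_id, device_id, device_type) and the toy data are invented for illustration:

```python
# Hypothetical sketch: count distinct devices seen on each account, a behavioral
# signal that can be more informative about fraud than device type alone.
import pandas as pd

events = pd.DataFrame({
    "account_id":  ["A1", "A1", "A1", "A2", "A2", "A3"],
    "device_id":   ["d01", "d02", "d03", "d10", "d10", "d20"],
    "device_type": ["phone", "tablet", "phone", "phone", "phone", "laptop"],
})

unique_devices = (
    events.groupby("account_id")["device_id"]
          .nunique()                      # number of distinct devices per account
          .rename("n_unique_devices")
          .reset_index()
)
print(unique_devices)
```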
An experiment is required to obtain objective evidence of a causal relationship. Too often we simply trust our tribal knowledge and intuition about causality, when only a well-designed experiment can positively confirm those suspicions. By conducting the experiment, new relationships may be uncovered, or at the very least we will know with certainty that important unknown unknowns remain. Experiments do require resources and permission to conduct, which can limit what is feasible.
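As a sketch of what such an experiment might look like analytically, the following simulation (the scenario, lift, and sample size are invented) randomizes a treatment and then tests whether it moved the outcome. Randomization is what breaks the link between the treatment and any hidden confounders:

```python
# A minimal randomized experiment (A/B test) sketch on synthetic data.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n = 5_000

# Random assignment to control (0) or treatment (1).
treated = rng.binomial(1, 0.5, size=n)

# Simulated outcome: an assumed true lift of 3 points for the treated group.
base_rate, lift = 0.10, 0.03
converted = rng.binomial(1, base_rate + lift * treated)

# 2x2 contingency table: rows = control/treatment, columns = not converted/converted.
table = [
    [np.sum((treated == 0) & (converted == 0)), np.sum((treated == 0) & (converted == 1))],
    [np.sum((treated == 1) & (converted == 0)), np.sum((treated == 1) & (converted == 1))],
]
chi2, p_value, _, _ = chi2_contingency(table)

print("Control rate:  ", converted[treated == 0].mean().round(3))
print("Treatment rate:", converted[treated == 1].mean().round(3))
print("p-value:       ", round(p_value, 4))
```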
If all causal factors were available in our model training data, our models could, in theory, be perfect. What we commonly refer to as “random error” in our models is in fact the result of causal factors that were not included and appropriately engineered. Typically, however, we must make predictions far in advance of the target outcome, and the causal chain between what we can measure then and the eventual outcome is long and uncertain. Indeed, if we had all the causal factors for a physical process, we might not build a statistical (machine learning) model at all, but instead deploy a deterministic equation or rule set! In practice, we should aspire to acquire and engineer every causal factor that is available at the time the prediction must be made.
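The point about “random error” can be illustrated with a small simulation (entirely synthetic data and invented coefficients): when a causal factor is left out, its influence shows up as irreducible-looking noise; adding it back collapses the error:

```python
# Synthetic illustration: "random error" shrinks when a missing cause is added.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
n = 5_000

cause_a = rng.normal(size=n)   # causal factor we do observe
cause_b = rng.normal(size=n)   # causal factor that is often "missing"
outcome = 2.0 * cause_a + 1.5 * cause_b + rng.normal(scale=0.1, size=n)

# Model 1: cause_b is missing, so its influence is absorbed into the residual.
m1 = LinearRegression().fit(cause_a.reshape(-1, 1), outcome)
err1 = mean_squared_error(outcome, m1.predict(cause_a.reshape(-1, 1)))

# Model 2: both causes included; the residual error nearly vanishes.
X = np.column_stack([cause_a, cause_b])
m2 = LinearRegression().fit(X, outcome)
err2 = mean_squared_error(outcome, m2.predict(X))

print(f"MSE with cause_b missing: {err1:.3f}")   # roughly the variance of 1.5 * cause_b
print(f"MSE with both causes:     {err2:.3f}")   # close to the small noise term
```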
Summary
We frequently hear from clients that their data are incomplete or that their models are weak. Some data will always be missing, and that void may include causal elements. Missing causal elements have direct implications for how the model should be used. For example, if a causal factor is missing but a correlated feature downstream of that factor is included, the model may treat the correlated feature as if it were causal. Manipulating that feature in an attempt to influence the target outcome will fail to move the outcome and will only increase model error. Stakeholders and data scientists must collaborate to understand the processes that produce both the predictive data and the outcome of interest. This dialog will reveal the relevant information and highlight what is missing. The result will be greater organizational understanding and more reliable models with lower error rates.