It is a Mistake to Listen Only to the Data

Author: John F. Elder

Date Published: August 11, 2017

In his Top 10 Data Science Mistakes, John Elder shares lessons learned from more than 20 years of data science consulting experience. Avoiding these mistakes is a cornerstone of any successful analytics project. In this blog about Mistake #4, you will learn that inducing models from data has the virtue of looking at the data afresh, unconstrained by old hypotheses. But while “letting the data speak”, you must be careful not to tune out received wisdom, because often nothing inside the data will protect you from significant, but wrong, conclusions.


Inducing models from data has the virtue of looking at the data afresh, not constrained by old hypotheses. But, while “letting the data speak”, don’t tune out received wisdom. Experience has taught this once brash analyst that those familiar with the domain are usually more vital to the solution of the problem than the technology we bring to bear.

Often, nothing inside the data will protect one from significant, but wrong, conclusions. Table 1 contains two variables summarized by state: high school spending per student and rank of average SAT score (from about 1994). Our task, say, is to model their relationship to advise the legislature of the costs of improving our educational standing relative to nearby states. Figure 1 illustrates how the relationship between the two is significant; the Linear Regression t-statistic is over 4, for example, suggesting that such a strong relationship occurs randomly only one time in 10,000.[1] However, the sign of the relationship is the opposite of what was expected. That is, to improve our standing (move to a lower, better SAT rank), the graph suggests we need to reduce school funding!

Table 1: Spending and Rank of Average SAT Score by State

Figure 1: Rank of a state (in average SAT score) vs. its spending per student (circa 1994), and the least-squares regression estimate of their relationship
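For readers who want to see the shape of such a fit, here is a minimal Python sketch. The spending and rank numbers below are made up for illustration; only the pattern, not the values, mirrors Table 1.

```python
# Minimal sketch of a least-squares fit like Figure 1's.
# The numbers are hypothetical, NOT the actual 1994 state data.
import numpy as np
from scipy import stats

spending = np.array([3.6, 4.2, 4.9, 5.5, 6.1, 6.8, 7.4, 8.0, 8.9, 9.6])  # $1,000s per student
sat_rank = np.array([5, 8, 3, 14, 20, 27, 31, 35, 41, 46])                # 1 = best rank

fit = stats.linregress(spending, sat_rank)
t_stat = fit.slope / fit.stderr  # the t-statistic on the slope, the quantity quoted in the text

print(f"slope = {fit.slope:.2f} rank positions per extra $1,000")
print(f"t-statistic = {t_stat:.2f}, p-value = {fit.pvalue:.5f}")
# A positive, highly significant slope here says "more spending, worse rank" --
# statistically strong, but (as the text explains) the wrong conclusion.
```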

Observers of this example will often suggest adding further data (perhaps local living costs, or the percent of the population in urban or rural settings) to help explain what is happening. But the real problem is one of self-selection. The high-SAT/low-cost states are clustered mainly in the Midwest, where the test required for state universities (the best deal for one’s dollar) is not the SAT but the ACT. Only those students aspiring to attend (presumably more prestigious) out-of-state schools go to the trouble of taking an extra standardized test, and their resulting average score is certainly higher than the larger population’s would be. In fact, additional variables in the database (other than the proportion of students taking the SAT) would only make the model more complex, and might obscure the fact that information external to the data is vital.

The above example employed typical “opportunistic”, or found, data. But even data generated by a designed experiment needs external information. A national defense project from the early days of Neural Networks attempted to distinguish aerial images of forests with and without tanks in them. Perfect performance was achieved on the training set, and then also on an out-of-sample set of data that had been gathered at the same time but not used for training. This was celebrated but, wisely, a confirming study was performed. New images were collected, on which the models performed extremely poorly. That drove an investigation into the features driving the models, which revealed them to be magnitude readings from specific locations of the images; i.e., background pixels. It turns out that the day the tanks had been photographed was sunny, and the day for the non-tanks, cloudy![2] Even resampling the original data wouldn’t have protected against this error, as the flaw was inherent in the generating experiment.
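The investigative step that exposed that flaw, asking which inputs actually drive the model, can be sketched with modern tools (the original project predates them). The example below is entirely synthetic: a fabricated “background brightness” column stands in for the sunny/cloudy leak, and permutation importance is just one way, not that project’s way, to surface it.

```python
# Synthetic illustration of hunting for a leaky feature.
# All data is fabricated; "background" mimics the sunny-vs-cloudy artifact.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, n)                    # 1 = "tank", 0 = "no tank"
shape = rng.normal(size=(n, 5))                   # genuine but weak signal
shape[:, 0] += 0.3 * labels
background = 1.0 - labels + rng.normal(scale=0.1, size=n)   # the leak
X = np.column_stack([shape, background])

model = RandomForestClassifier(random_state=0).fit(X, labels)
result = permutation_importance(model, X, labels, n_repeats=20, random_state=0)
names = [f"shape_{i}" for i in range(5)] + ["background"]
for name, imp in zip(names, result.importances_mean):
    print(f"{name:10s} importance = {imp:.3f}")
# "background" dominates -- a red flag that the model has learned the
# photographic conditions rather than anything about tanks.
```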

A second tanks-and-networks example comes from my good friend and former colleague, Dean Abbott (who has written an excellent book). Dean had worked at a San Diego defense contractor, where they sought to distinguish tanks from trucks at any aspect angle. Radars and mechanized vehicles are bulky and expensive to move around, so they fixed the radar installation and rotated a tank and a truck on separate large, rectangular platforms. Signals were beamed at different angles and the returns were extensively processed (using polynomial network models of subsets of principal components of Fourier transforms of the signals), and great accuracy in classification was achieved. However, seeking transparency (not easy for complex, multi-stage models), Dean discovered, much to his chagrin, that the source of the key features distinguishing vehicle type turned out to be the bushes beside one platform![3] Further, it is suspected that the accuracy of the angle estimates came from the signal reflecting off the platform corners, hardly a feature one will encounter in the field. Again, no modeling technology alone could correct for flaws in the data; it took careful study of how the model worked to discover its weakness.


[1] This theoretical result is confirmed by resampling using Target Shuffling; randomize the rankings and it takes about 10^4 tries before a correlation this strong is stumbled upon.
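That shuffling check is easy to sketch; the snippet below reuses the hypothetical arrays from the earlier regression sketch, not the real state data.

```python
# Rough sketch of Target Shuffling: permute the ranks many times and count
# how often a correlation as strong as the observed one appears by chance.
import numpy as np

rng = np.random.default_rng(0)
spending = np.array([3.6, 4.2, 4.9, 5.5, 6.1, 6.8, 7.4, 8.0, 8.9, 9.6])
sat_rank = np.array([5, 8, 3, 14, 20, 27, 31, 35, 41, 46])

observed_r = abs(np.corrcoef(spending, sat_rank)[0, 1])
n_shuffles = 100_000
hits = sum(
    abs(np.corrcoef(spending, rng.permutation(sat_rank))[0, 1]) >= observed_r
    for _ in range(n_shuffles)
)
print(f"empirical p-value ~ {hits / n_shuffles:.5f}")
# For a relationship as strong as Figure 1's, only about 1 shuffle in 10,000
# (or fewer) should match it -- confirming statistical significance, though
# saying nothing about the causal story.
```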

[2] PBS featured this project in a 1991 documentary series “The Machine that Changed the World”: Episode IV, “The Thinking Machine”.

[3] This excellent practice of trying to break one’s own work is so hard to do, even when one is convinced of its need, that managers should pit teams with opposing reward metrics against one another in order to proof-test solutions.