For many students, statistics is a troublesome subject, and the root of that trouble can be traced to the concept of **the null hypothesis**. In these days of big data, machine learning, and predictive analytics, formal hypothesis testing has receded in relative importance. Nonetheless, it retains considerable inertia and ability to cause difficulty – even in data science circles.

The first hypothesis test (or significance test) is often attributed to John Arbuthnot, physician to Queen Anne of England and satirical writer, in 1710. He created and popularized the figure of John Bull (pictured), the English equivalent of Uncle Sam.

Arbuthnot studied the numbers of male and female births in England, and remarked on the fact that in nearly every year, male births exceeded female births by a slight proportion. He calculated that the probability of this happening by chance was infinitesimal, and concluded that it was therefore due to “divine providence.”
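Arbuthnot's argument reduces to a fair-coin calculation. A minimal sketch (his christening records are commonly cited as spanning 82 years; that figure is assumed here for illustration):

```python
# Under a fair-coin model, each year is an independent 50/50 event, so
# the chance that male births exceed female births in every one of
# n years is (1/2)**n.
def prob_all_years_male_majority(n_years):
    return 0.5 ** n_years

# Arbuthnot's records are commonly cited as covering 82 years
p = prob_all_years_male_majority(82)
print(p)  # roughly 2e-25 -- "infinitesimal" indeed
```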

## Measurement Error

In the 1800s, probability theory began to be developed in the context of measurement error. Astronomers had long wrestled with the fact that multiple measurements of the position of a planet or star did not yield the same answer. They quite naturally came up with the idea of using the average, but still wondered about the variation in individual measurements. Carl Friedrich Gauss applied his theory of the normal distribution to measurement errors (indeed, the normal distribution was originally termed the “error distribution”).

Benjamin Peirce (1852) took this one step further, using the error distribution to determine whether a particular observation lies outside the range of normal measurement fluctuation and is, therefore, so erroneous that it should not be included in the average. In effect, he used a significance test to determine outliers.
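The spirit of the idea can be sketched with a simple z-score rule. This is a simplified modern analogue, not Peirce's actual criterion, and the measurements below are invented:

```python
def trimmed_average(values, z_cutoff=2.0):
    """Simplified analogue of Peirce's idea (not his exact criterion):
    flag measurements whose z-score exceeds a cutoff as outliers, then
    average only the remaining ones.

    Note: in a sample of size n the largest achievable z-score is about
    (n - 1) / sqrt(n), so the cutoff here is deliberately modest.
    """
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    kept = [v for v in values if abs(v - mean) / sd <= z_cutoff]
    return sum(kept) / len(kept), kept

# Five consistent position measurements plus one wild reading
avg, kept = trimmed_average([10.1, 9.9, 10.0, 10.2, 9.8, 25.0])
# The 25.0 reading is excluded; the average of the rest is 10.0
```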

In 1900, Karl Pearson published the chi-square test, which was used to determine whether a set of count data fit a particular theorized distribution – the departures from the theorized distribution should fit the chi-square distribution.
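Pearson's statistic itself is simple to compute. A minimal sketch, with made-up die-roll counts:

```python
def chi_square_stat(observed, expected):
    """Pearson's chi-square statistic: the sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts from 600 rolls of a die, against a fair-die
# expectation of 100 per face
observed = [104, 96, 101, 99, 103, 97]
stat = chi_square_stat(observed, [100] * 6)  # 0.52
# Compared against the chi-square distribution with 5 degrees of freedom,
# a value this small is entirely consistent with a fair die; values above
# ~11.07 would be significant at the 5% level.
```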

## Treatments and Effects

The formal consideration of measurement variability was extended to experiments and their results by William S. Gosset and R. A. Fisher. Fisher looked at the outcomes of agricultural experiments in which blocks of land were given different treatments. Noting that the yield within a block was not uniform, he wondered whether the differences in yield between blocks might simply be due to random variation in normal germination and growth patterns. At what point would the difference between blocks be large enough to say that it couldn’t have happened by chance?

Fisher addressed the problem with what is now called a permutation test:

- Repeatedly shuffle the data randomly across all blocks, and for each shuffle calculate the differences in average yield across blocks.
- If the random shuffles rarely produce differences as large as the one actually observed, then the observed difference is unlikely to be due to chance.

Fisher, somewhat arbitrarily, said that if random shuffling produces a difference as large as the observed one less than one time in twenty, random error is not the cause, and there must be a real treatment effect. This is the origin of the “p value less than 0.05” criterion.
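The shuffling procedure can be sketched directly. The yields below are invented for illustration, and the test here compares two groups by their difference in means:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_test(group_a, group_b, n_shuffles=10_000, seed=0):
    """Fisher-style permutation test for a difference in group means.

    Returns the fraction of random shuffles whose absolute difference
    in means is at least as large as the observed one (a p-value).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign every value to a group at random
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return hits / n_shuffles

# Invented yields for two blocks of land given different treatments
treated = [22.1, 23.4, 24.0, 25.2, 23.8]
control = [20.0, 21.1, 19.8, 20.7, 21.5]
p = permutation_test(treated, control)
# By Fisher's informal rule, p < 0.05 suggests a real treatment effect
```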

The “null hypothesis” is the state of the world if there were no treatment effect (or, in the case of the astronomers, no unusual measurement error) – all data are produced by the same data-generating mechanism.

## Formalization

The permutation test was a useful concept because it dovetailed so nicely with Fisher’s great contribution to the design of experiments: original treatment allocation should be randomized to eliminate even unconscious bias.

In the 1930s, hypothesis testing achieved a prominent place in the teaching of statistics that persists to this day. Ultimately, though, it became hobbled by excessive formalism and too much mathematics (as seen in the Wikipedia entry on the Neyman-Pearson proof, pictured). Losing its connection to real data analysis, hypothesis testing became a form of “silver scepter,” a poorly understood tool to lend legitimacy to research.

Its core ingredient, the p-value, became somewhat discredited. And yet, understanding variation remains central to all analysis, and the original concepts of resampling (permutation, and, later, the bootstrap) are as relevant as ever – especially in data science. Elder Research applies resampling to test model reliability and variation in the form of “target shuffling,” in which a model is repeatedly re-fit with the target variable shuffled. If seemingly useful models often emerge from the randomly shuffled data, you know that your original model must be taken with a grain of salt.
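The idea can be sketched in miniature. The scoring function and data below are invented stand-ins; real target shuffling would re-fit whatever model is actually under study:

```python
import random

def abs_correlation(x, y):
    """Absolute Pearson correlation, used here as a stand-in 'model score'."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov / (vx * vy) ** 0.5)

def target_shuffling(x, y, score=abs_correlation, n_shuffles=500, seed=0):
    """Score the real data, then score many copies with the target
    shuffled; return the real score and the fraction of shuffled runs
    that matched or beat it."""
    rng = random.Random(seed)
    real = score(x, y)
    shuffled = list(y)
    beat = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between x and the target
        if score(x, shuffled) >= real:
            beat += 1
    return real, beat / n_shuffles

x = list(range(30))
y = [2 * v + ((-1) ** v) * 3 for v in x]  # strong trend plus alternating noise
real, frac = target_shuffling(x, y)
# A tiny fraction means the real score is very unlikely to be a fluke;
# a large fraction would say the "useful" model fits noise just as well.
```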