Age 16 to 18 | Article by Julian Gilbey

What is a hypothesis test?



An introduction to hypotheses and their purpose can be found in Understanding Hypotheses.

How to draw conclusions from data is one of the most hotly debated questions in mathematics at present.  The arguments have raged for about a century and show no obvious signs of being resolved.  And this is not just a theoretical question: it is a matter of life and death.  Newspapers are full of stories such as "scientists have shown that eating three elephants a day will help you live longer" or similar (that one was made up!).  But the statistical basis of these claims is often suspect, and eating the elephants (or worrying that you should be) may cause more harm than good.  In medicine, should we give patients drug X or drug Y, or treatment A or treatment B, for their condition?  We can do experiments to explore these questions, but how do we interpret the results?


In this article, we explore some of the mathematics of hypothesis testing, asking what the results of a hypothesis test actually mean, and pointing out some of the fundamental difficulties involved.  At the end, we introduce two other approaches which are widely used.

The null hypothesis significance testing (NHST) framework



The general situation is this: we want to find out about some aspect of the real world, and we do this by performing an experiment.  From the data collected in the experiment, we want to make a deduction about reality, a process known as statistical inference.  In hypothesis testing we start with the following generic question:

Is this aspect of reality in a certain default state or a different state?


We call the default state our null hypothesis (usually denoted $H_0$) and the different state the alternative hypothesis ($H_1$).  So for example, we might ask "Is this drug ineffective (default state) or does it help (different state)?"; in this case, our null hypothesis would be "the drug is ineffective" and our alternative hypothesis would be "the drug helps".

We also choose a significance level for our test, which is typically 5% (or 0.05).  This specifies the probability of incorrectly rejecting the null hypothesis, that is, the probability that we reject the null hypothesis in the case that it is true.  Doing so is sometimes known as a "Type I error".

Let's take a simple (though unrealistic) scenario which captures all of the essential ideas to show what this means in practice.  This is the scenario used in Robin's Hypothesis Testing and Powerful Hypothesis Testing.  There are many other situations in which hypothesis testing is used, some of which we will mention below, but they all share key characteristics with this one.

Our simple scenario



Imagine that we have a bag with red and green balls in it.  The proportion of green balls in the bag is $\pi$ (for "proportion"), but we do not know this value.

For some good reason, we believe that this proportion ought to be $\frac{1}{2}$.  But we are concerned that somehow the actual proportion is different.  We'd like to know whether the proportion is actually $\frac{1}{2}$.

We capture these beliefs with a pair of hypotheses:

  • Our null hypothesis is $H_0\colon \pi=\frac{1}{2}$.  This says that the proportion is what we believe it should be.
  • Our alternative hypothesis is $H_1\colon \pi\ne\frac{1}{2}$.  This says that the proportion has changed.
In this scenario, we are not allowed to look inside the bag to find the true value of $\pi$.  We are only allowed to take a random ball out of the bag, note its colour and then replace it.  We may do this as many times as we wish.

Testing our hypotheses



To test our hypotheses, we perform an experiment, collecting data in the process, do some computations using it, and arrive at a number which we call a test statistic or just statistic.  In this case, our experiment will be to take a random ball out of the bag $n$ times (for some fixed number $n$ which we specify at the start), and our statistic will simply be the number of green balls observed, $X$.
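
This experiment is easy to simulate on a computer.  Here is a minimal sketch in Python (the function name and the particular values of $n$ and the true proportion used in the example are just illustrative):

```python
import random

def run_experiment(n, true_pi):
    """Draw a ball n times with replacement from a bag whose proportion of
    green balls is true_pi, and return the statistic X = number of greens.
    (true_pi is unknown to the experimenter; it is only used to simulate the bag.)"""
    return sum(random.random() < true_pi for _ in range(n))

# For example, 50 draws from a bag that really is half green:
X = run_experiment(50, 0.5)
print(X)
```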

We will reject our null hypothesis $H_0$ if the statistic takes a value so extreme that it would be very unlikely to occur if the null hypothesis were true; otherwise we do not reject $H_0$.  (In the latter case, some say "we accept $H_0$"; we will discuss this point later.)  What does "very unlikely" mean?  That is determined by the significance level of the test, which we are at liberty to choose.  It is typically taken to be 5%, meaning that the probability of rejecting $H_0$ when the null hypothesis is true is 5% (= 0.05).  (Sometimes other significance levels are used, but we will stick to 5% for the remainder of this article.)

There are two equivalent ways of deciding whether the statistic is this extreme:

  • We can work out the critical region for $X$, that is, those extreme values of $X$ which would lead us to reject the null hypothesis at 5% significance.  (This can be done even before performing the experiment.)  The probability of $X$ taking a value in this critical region, assuming that the null hypothesis is true, should be 5%, or as close as we can get to 5% without going over it.  In symbols, we can say: $$\mathrm{P}(\text{$X$ in critical region} | \text{$H_0$ is true}) \le 0.05.$$ Then we reject the null hypothesis if $X$ lies in that region.
  • We can work out the probability of $X$ taking the value it did or a more extreme value, assuming that the null hypothesis is true.  This is known as the p-value.  If the p-value is less than 0.05, then we will reject the null hypothesis at 5% significance. [note 1]  In symbols, we can write $$\text{p-value} = \mathrm{P}(\text{$X$ taking this or a more extreme value} | \text{$H_0$ is true}).$$
In practice, p-values are much more commonly used than critical regions, and so we will use them in this article.  We do not discuss how to work out the p-value or critical region here; that depends on the nature of the experiment and the null hypothesis.  Critical values can often be looked up in tables, while critical values and p-values can both be calculated using appropriate statistical software.
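
For our simple scenario, though, the p-value can be computed directly: under $H_0$, the number of greens $X$ in $n$ draws follows a binomial distribution with parameters $n$ and $\frac{1}{2}$.  Here is a minimal Python sketch (the function name and the example numbers are illustrative; the doubling of the tail probability follows note 1):

```python
from math import comb

def two_tailed_p_value(x, n, p0=0.5):
    """Two-tailed p-value for observing x greens in n draws under
    H0: proportion = p0, doubling the tail probability (see note 1)."""
    def pmf(k):
        return comb(n, k) * p0**k * (1 - p0)**(n - k)
    if x >= n * p0:
        tail = sum(pmf(k) for k in range(x, n + 1))   # P(X >= x | H0)
    else:
        tail = sum(pmf(k) for k in range(x + 1))      # P(X <= x | H0)
    return min(1.0, 2 * tail)

# For example, 32 greens in 50 draws:
print(two_tailed_p_value(32, 50))   # roughly 0.065, so we would not reject H0 at 5%
```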

Other types of scenario



In our scenario above, we were testing to see whether the proportion of something was as we expected or different.  We might also test for other things, for example:

  • Does this drug/treatment/intervention/... have any effect?
  • Which of these drugs/... is more effective, or are they equally effective?
  • Is the mean height/mass/intelligence/test score/... of this population equal to some predicted value?
  • Is the standard deviation of the height/mass/... equal to some predicted value?
  • For two distinct groups of people, is the mean height/mass/... of each group the same?
  • Does this group of people's heights/masses/... appear to be following the probability distribution we expect?
  • Do these two populations' heights/masses/... appear to have the same distribution as each other?
  • Do this population's heights and weights appear to be correlated?
Each of these can be expressed in the form of a null hypothesis ("they do follow the expected distribution", for example) and an alternative hypothesis ("they do not").  One can then perform an experiment to obtain a test statistic, and use that to work out a p-value.  The test involved could be, for example, a t-test, a chi-squared test, a Wilcoxon signed-rank test, a Mann-Whitney U test, and so on; each of the above scenarios has an appropriate test, and there are many others which are not listed here.

Interpreting the results



What does the result of a null hypothesis significance test mean?  What do "the null hypothesis is accepted" and "the null hypothesis is rejected" mean?  In this section, we look at some of the significant difficulties associated with this NHST approach; in the final section, we describe some alternative approaches.

The key question that hypothesis testing (NHST) answers



The question we have actually answered with our p-value is "Given that the null hypothesis is true, what is the probability of obtaining these results (or more extreme) by chance alone?"

Many would argue that the key question should actually be: "Given these results, what is the probability that the null hypothesis is true?"  This is a very different question indeed, though it is superficially similar.

We will bear this in mind as we go on, and consider some examples to show how different the answers can be.

What a hypothesis test does not tell us



You may be familiar with the idea that "correlation does not imply causation", in other words, just because two features are correlated does not mean that one causes the other.  A similar warning applies to hypothesis tests: just because a hypothesis test shows statistical significance, it does not necessarily mean that the results have any material significance.  They might, but the result could also be due to a statistical fluke in this particular set of data.  To be more confident that there is any reality to the results, one would want to perform more experiments or come up with an underlying explanation of why the results are as we see (or both).

We are also going to assume that other factors (such as bias or confounding) have already been addressed; these could otherwise influence the results.

A non-significant result



If the p-value is greater than 0.05, we do not reject the null hypothesis.  Does this mean that the null hypothesis is true or likely to be true?  Maybe, but maybe not.  We still have two possibilities: either the null hypothesis is true, or the null hypothesis is false.

  • It could be that the null hypothesis is true.  In this case, we would have to be unlucky to get a significant p-value, so most of the time, we will end up accepting the null hypothesis.  (If the null hypothesis is true, we would reject it with a probability of only 0.05.)

     
  • On the other hand, it could be that the alternative hypothesis is true, but we did not use a large enough sample to obtain a significant result (or we were just unlucky).  In such a case, we could say that our test was insensitive.  In this situation (the alternative hypothesis is true but we do not reject the null hypothesis), we say that we have made a Type II error.  The probability of this happening depends on the sample size and on how different the true $\pi$ is from $\frac{1}{2}$ (or whatever our null hypothesis says), as is explored in Powerful Hypothesis Testing.
Distinguishing between these two possibilities is not straightforward.  For this reason, some statisticians do not like saying that we accept the null hypothesis when the p-value is larger than 0.05, but prefer to say "we have not rejected the null hypothesis".  For a discussion of how we could distinguish between these possibilities, see Dienes (2014); this paper also includes a discussion on using confidence intervals instead of p-values to test hypotheses.  This again draws our attention back to the question that we presented earlier: "Given these results, what is the probability that the null hypothesis is true?"
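
To get a feel for how often a Type II error occurs, we can simulate the bag scenario with a true proportion that differs from $\frac{1}{2}$ and count how often the test fails to reject $H_0$.  The sketch below is in Python; the true proportion of 0.6, the sample size of 50 and the number of simulated experiments are all illustrative choices:

```python
import random
from math import comb

def two_tailed_p_value(x, n, p0=0.5):
    # Two-tailed binomial p-value, doubling the tail probability (see note 1).
    def pmf(k):
        return comb(n, k) * p0**k * (1 - p0)**(n - k)
    if x >= n * p0:
        tail = sum(pmf(k) for k in range(x, n + 1))
    else:
        tail = sum(pmf(k) for k in range(x + 1))
    return min(1.0, 2 * tail)

def type_ii_rate(true_pi, n, trials=10_000, alpha=0.05):
    """Estimate P(do not reject H0) when the true proportion is true_pi."""
    failures = 0
    for _ in range(trials):
        greens = sum(random.random() < true_pi for _ in range(n))
        if two_tailed_p_value(greens, n) >= alpha:
            failures += 1
    return failures / trials

# With 50 draws and a true proportion of 0.6, the estimate is typically around 0.75,
# so with this sample size the test will quite often fail to detect the difference.
print(type_ii_rate(0.6, 50))
```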

A significant result



If the p-value is less than 0.05, we reject the null hypothesis.  Does this mean that the alternative hypothesis is true?  Again, maybe, but maybe not.  We consider the same two possibilities as before.

  • It could be that the null hypothesis is true.  In this case, we reject the null hypothesis with a probability of $0.05=\frac{1}{20}$, that is, one time in 20 (at a significance level of 5%), so we were just unlucky.

     
  • On the other hand, the alternative hypothesis could indeed be true.  Either the sample was large enough to obtain a significant result, or the sample size wasn't that large, but we were just lucky.
So when we reject the null hypothesis, it means that we have evidence to support the alternative hypothesis, but does not guarantee that the alternative hypothesis is true: we are working in a world of probability, not of certainty.

Let's dive a little deeper into this uncertainty.  One common approach, which implicitly appears in many published articles in the sciences and social sciences, is to look at the size of the p-value: the smaller the p-value, the more significant the result is considered to be (and hence the more likely the alternative hypothesis is).  So a p-value of 0.003 would be considered strong evidence for the alternative hypothesis.

We could do better still by attempting to quantify our uncertainty.  We can ask:

Given these results, what is the probability that the alternative hypothesis is true?


This is almost the same as our earlier question, but we are now asking for the null hypothesis to be false; the answer to this is 1 minus the probability that the null hypothesis is true.  This is "obviously" the right question to ask: we really want to know how likely it is that a drug is effective, or that a proposed government policy will help rather than harm, and so on.  Knowing the probability of obtaining these results if the null hypothesis is true (the p-value) seems less important.

To calculate the probability, we can draw a tree diagram to represent this situation:

[Tree diagram representing this situation]

Using this tree diagram, we can work out the probabilities of $H_0$ being true or $H_1$ being true given our experimental results.  To avoid the expressions becoming unwieldy, we will write $H_0$ for "$\text{$H_0$ true}$", $H_1$ for "$\text{$H_1$ true}$" and "$\text{p}^+$" for "observed p-value or more extreme".  Then we can write (conditional) probabilities on the branches of the tree diagram leading to our observed p-value: [note 2]

[Tree diagram with the conditional probabilities written on its branches]

The two routes which give our observed p-value (or more extreme) have the following probabilities:

$$\begin{align*}
\mathrm{P}(H_0\cap \text{p}^+) &= \mathrm{P}(H_0) \times \mathrm{P}(\text{p}^+ | H_0) \\
\mathrm{P}(H_1\cap \text{p}^+) &= \mathrm{P}(H_1) \times \mathrm{P}(\text{p}^+ | H_1)
\end{align*}$$

(Recall that $\mathrm{P}(H_0\cap \text{p}^+)$ means "the probability of $H_0$ being true and the p-value being that observed or more extreme".)

We can therefore work out the probability of the alternative hypothesis being true given the observed p-value, using conditional probability:

$$\begin{align*}
\mathrm{P}(H_1|\text{p}^+) &= \frac{\mathrm{P}(H_1\cap \text{p}^+)}{\mathrm{P}(\text{p}^+)} \\
&= \frac{\mathrm{P}(H_1\cap \text{p}^+)}{\mathrm{P}(H_0\cap\text{p}^+)+\mathrm{P}(H_1\cap\text{p}^+)} \\
&= \frac{\mathrm{P}(H_1) \times \mathrm{P}(\text{p}^+ | H_1)}{\mathrm{P}(H_0) \times \mathrm{P}(\text{p}^+ | H_0) + \mathrm{P}(H_1) \times \mathrm{P}(\text{p}^+ | H_1)}
\end{align*}$$

Though this is a mouthful, it is a calculation which only involves the four probabilities on the above tree diagram.  (This is an example of Bayes' Theorem, discussed further in this resource.)

However, we immediately hit a big difficulty if we try to calculate this for a given experiment.  We know $\mathrm{P}(\text{p}^+ | H_0)$: this is just the p-value itself.  (The p-value tells us the probability of obtaining a result at least this extreme given that the null hypothesis is true.)  But we don't know the probability of the null hypothesis being true or false (that is, $\mathrm{P}(H_0)$ and $\mathrm{P}(H_1)=1-\mathrm{P}(H_0)$), nor do we know the probability of the observed result if the alternative hypothesis is true ($\mathrm{P}(\text{p}^+|H_1)$), as knowing that the proportion of greens is not $\frac{1}{2}$ does not tell us what it actually is.  (Similar issues apply to all the other contexts of hypothesis testing listed above.)  So we are quite stuck: in the null hypothesis significance testing model, it is impossible to give a numerical answer to our key question: "Given our results, what is the probability that the alternative hypothesis is true?"  This is because we don't know two of the three probabilities that we need in order to answer the question.

An example might highlight the issue a little better.  Let us suppose that we are trying to work out whether a coin is biased (alternative hypothesis), or whether the probability of heads is exactly $\frac{1}{2}$ (null hypothesis).  We toss the coin 50 times and obtain a p-value of 0.02.  Do we now believe that the coin is biased?  Most people believe that coins are not biased, and so are much more likely to attribute this result to chance or poor coin-tossing technique than to the coin being biased.

On the other hand, consider a case of a road planner who introduces a traffic-calming feature to reduce the number of fatalities along a certain stretch of road.  The null hypothesis is that there is no change in fatality rate, while the alternative hypothesis is that the fatality rate has decreased.  A hypothesis test is performed on data collected for 24 months before and 24 months after the feature is built.  Again, the p-value was 0.02.  Do we believe that the alternative hypothesis is true?  In this case, we are more likely to believe that the alternative hypothesis is true, because it makes a lot of sense that this feature will reduce the number of fatalities.

Our "instinctive" responses to these results are tied up with assigning values to the unknown probabilities in the formula above.  For the coin, we would probably take $\mathrm{P}(H_0)$ to be close to 1, say $0.99$, as we think it is very unlikely that the coin is biased, and $\mathrm{P}(\text{p}^+|H_1)$ will be, say, $0.1$: if the coin is biased, the bias is not likely to be very large, and so it is only a bit more likely that the result will be significant in this case.  Putting these figures into the formula above gives:

$$\mathrm{P}(H_1|\text{p}^+) = \frac{0.01 \times 0.1}{0.99 \times 0.02 + 0.01 \times 0.1} \approx 0.05,$$

that is, we are still very doubtful that this coin is biased, even after performing the experiment.  Note that in this case, the probability of these results given that the null hypothesis is true is 0.02, whereas the probability that the null hypothesis is true given these results is $1-0.05=0.95$, which is very different.  This shows how dramatically different the answers to the two questions can be.

On the other hand, for the fatalities situation, we might assume quite the opposite: we are pretty confident that the traffic-calming feature will help, so we might take $\mathrm{P}(H_0)$ to be $0.4$, and $\mathrm{P}(\text{p}^+|H_1)$ will be, say, $0.25$ (though the traffic-calming may help, the impact may be relatively small).  Putting these figures into the formula gives:

$$\mathrm{P}(H_1|\text{p}^+) = \frac{0.6 \times 0.25}{0.4 \times 0.02 + 0.6 \times 0.25} \approx 0.95,$$

so we are now much more convinced that the traffic-calming feature is helping than we were before we had the data.  This time, the probability of these results given that the null hypothesis is true is still 0.02, whereas the probability that the null hypothesis is true given these results is $1-0.95=0.05$, which is not that different.
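
Both of these calculations can be reproduced directly from the formula above.  The following Python sketch does so; the input numbers are the illustrative assumptions made in the text, not measured values:

```python
def prob_H1_given_results(p_H0, p_value, p_results_given_H1):
    """P(H1 | observed p-value or more extreme), computed from P(H0),
    the p-value P(p+ | H0) and P(p+ | H1), as in the tree-diagram formula."""
    p_H1 = 1 - p_H0
    numerator = p_H1 * p_results_given_H1
    denominator = p_H0 * p_value + numerator
    return numerator / denominator

# Coin example: P(H0) = 0.99, p-value = 0.02, P(p+ | H1) = 0.1
print(prob_H1_given_results(0.99, 0.02, 0.1))    # roughly 0.05

# Traffic-calming example: P(H0) = 0.4, p-value = 0.02, P(p+ | H1) = 0.25
print(prob_H1_given_results(0.4, 0.02, 0.25))    # roughly 0.95
```

With a function like this it is easy to experiment with different prior beliefs and see how strongly they affect the conclusion.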

This approach may seem very disturbing, as we have to make assumptions about what we believe before we do the hypothesis test.  But as we have seen, we cannot answer our key question without making such assumptions.

 

Other approaches and some warnings



The discussion in the last section leads to an approach to hypothesis testing and interpretation of data known as Bayesian inference, named after the 18th century mathematician Thomas Bayes.  It acknowledges that we have to make our beliefs explicit before we can interpret data, and proposes that our interpretations will be more meaningful if we do so.  It allows us to use all the data we have gathered, and does not require us to specify the sample size before performing experiments: we can continually make use of new data as we obtain it.  (To understand why this is an issue, have a look at Robin's Hypothesis Testing.)  One technical introduction to this approach is by Spiegelhalter and Rice; there are, of course, many other overviews available.  A readable popular book by Nate Silver explains the elements of Bayesian inference, and how he applied the ideas to many areas of life including gambling on sports events and predicting election results.  In the last few decades, Bayesian approaches have started appearing in an increasing number of papers in the scientific literature.

The approach we started with, null-hypothesis significance testing, is actually a composite of two different approaches developed in the 20th century.

The first approach was developed by Fisher and others.  This approach had a null hypothesis and p-values; one could only accept or reject the null hypothesis, and there was no alternative hypothesis.  Another issue with this approach that we did not mention earlier is that there is no such thing as "the" p-value for an experimental result.  The p-value, which gives the probability of obtaining this test statistic or more extreme assuming that the null hypothesis is true, depends on the test statistic used.  There may be some very natural ones, for example, the number of green balls drawn in our above example.  But if we had used some other statistic, for example the length of the longest consecutive sequence of greens drawn, then we would obtain a different p-value from our experiment.  Which is the "correct" or "best" statistic to use?  In some scenarios, such as ours, there is a clear answer (the total number of greens drawn), but in other scenarios it is not so clear.  Therefore in some cases, the p-value can be somewhat misleading: it may come out at 0.03, say, but a different, equally sensible-looking statistic might give a p-value of 0.07.  There is a skill in choosing the test statistic that is most suitable for testing for different sorts of departures from the null hypothesis, for example the "longest consecutive sequence" might be appropriate for testing a different null hypothesis: that the draws are independent.

Another issue with the Fisher approach is that it doesn't necessarily answer the question we want to ask.  We may be testing a drug or some sort of intervention (a change in policy or teaching method or ...), and we want to know how we should act.  We do a trial and want to know whether we should prescribe this drug or implement this intervention.  So we want to know whether prescribing this drug is better than the existing ones or whether the intervention is effective.  But the Fisher approach only tells us whether the results of our experiment are unlikely to have happened by chance if there is no real effect, it does not tell us whether the alternative hypothesis is (much) more likely to be true than the null hypothesis.

A different approach was developed by Neyman and Pearson, based on the idea of likelihoods: they introduced the alternative hypothesis, which could be specified either as we have done ($\pi\ne\frac{1}{2}$), or as a precise alternative to the null hypothesis.  Their view is that we want to know the answer to the question: "Which of these two hypotheses should we assume to be true in the way we act?  Should we assume that the null hypothesis is true or the alternative hypothesis?"  For example, if we are testing a new medicine, we want to know: "Should we prescribe the new medicine or stay with the current one?"

As a practical example, in our green balls scenario, the alternative hypothesis might be $H_1\colon \pi=0.6$ if we had some reason to be interested in this value.  We then ask the question: given the observed data, what is the likelihood of $H_0$ given this data, and what is the likelihood of $H_1$? [note 3]  The ratio of these two, the likelihood ratio, tells us how many more times $H_0$ is likely to be true than $H_1$.  If this is small enough - below some threshold such as $\frac{1}{5}$ - then we accept $H_1$, otherwise we accept $H_0$, and we should act according to the hypothesis we have accepted.  It turns out that we can choose the threshold value so that the probability of rejecting $H_0$ incorrectly is still our chosen significance level (say 5%).  This approach also allows us to talk about the power of the test (as explored in Powerful Hypothesis Testing), which means the probability of the null hypothesis being rejected if the alternative hypothesis is true.  This approach does require us to be able to specify a meaningful alternative hypothesis, and a precise type of alternative can only be chosen by thinking about the actual physical context.  Statistics does not exist in a theoretical vacuum, though, so this is not necessarily a bad thing.  This approach has the additional benefit that it does not suffer from the test statistic problem mentioned above in the context of p-values: the null hypothesis and alternative hypothesis together determine exactly what test statistic should be used.
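
To make this concrete, here is a minimal Python sketch of the likelihood-ratio calculation for the balls scenario; the observed count of 32 greens in 50 draws and the threshold of $\frac{1}{5}$ are purely illustrative:

```python
from math import comb

def likelihood(x, n, pi):
    """Likelihood of the hypothesis 'proportion = pi' given x greens in
    n draws, i.e. P(X = x | pi) -- see note 3."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

# Hypothetical data: 32 greens in 50 draws; H0: pi = 0.5, H1: pi = 0.6
x, n = 32, 50
ratio = likelihood(x, n, 0.5) / likelihood(x, n, 0.6)
print(ratio)   # roughly 0.16, which is below the 1/5 threshold,
               # so here we would accept H1 and act accordingly
```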

These great statisticians argued for decades about which was the better approach, and their argument was never resolved.  The composite of these two approaches that we see in the (UK) school curriculum dominated statistical inference throughout most of the 20th century and into the 21st.  With the Bayesian approach now regaining popularity (it actually predates the current null hypothesis approach), and the validity of the NHST approach being questioned ever more, it will be interesting to see how the statistics battles develop over the coming years.

One final warning is in order.  Hypothesis testing makes some basic assumptions.  It assumes that our model of the situation is correct; if it is not, then our data will not follow the behaviour we expect, and so our analysis will be unreliable.  It also assumes that we collect data in an unbiased manner, which may well not be as straightforward as it sounds.  We generally require the individual measurements to be independent, and if we ask people questions, we assume - usually in vain - that the answers we receive are all honest.  Since this collection of assumptions is essentially impossible to achieve in practice, in spite of our best efforts, it is always worth treating results of hypothesis tests with a degree of caution.  Nevertheless, hypothesis tests are a very useful tool in the statistician's armoury, and are used regularly in practice.

Now that you have read all about Hypothesis Testing, you might like to test your understanding by exploring Hypothetical Shorts.

Notes

  1. Because our test is two-tailed (in the alternative hypothesis, the true proportion could be less than $\frac{1}{2}$ or more than $\frac{1}{2}$), we must be careful when calculating the p-value: we calculate the probability of the observed outcome or more extreme occurring, and then double the answer to account for the other tail.  We could also compare the probability of the value or more extreme to 0.025 instead of 0.05, but that would not be called a p-value.

    Likewise, when we determine the critical region, we will have two parts: a tail with large values of $X$ and a tail with small values of $X$; we require that the probability of $X$ lying in the large-value tail is as close as possible to 0.025 without going over it, and the same for the probability of $X$ lying in the small-value tail.

     
  2. There are complications here when working with two-tail tests as opposed to one-tail tests.  We will ignore this problem, as it does not significantly affect the overall discussion.

     
  3. "Likelihood" is a technical term.  For a discrete test statistic $X$, the likelihood of $H_0$ given the data $X=x$ means $P(X=x|H_0)$, in other words, how likely would this data be if $H_0$ were true.  It is not the probability of $H_0$ being true given the data.

Further reading



Dienes, Z. (2014) Using Bayes to get the most out of non-significant results. Front. Psychol. 5:781. doi: 10.3389/fpsyg.2014.00781

Spiegelhalter, D. and Rice, K. (2009) Bayesian statistics. Scholarpedia, 4(8):5230.

Silver, N. (2012) The signal and the noise. Penguin.

This resource is part of the collection Statistics - Maths of Real Life