Published 2018 Revised 2019

How to draw conclusions from data is one of the most hotly debated questions in mathematics at present. The arguments have raged for about a century and show no obvious signs of being resolved. And this is not just a theoretical question: it is a matter of life and death. Newspapers are full of stories such as "scientists have shown that eating three elephants a day will help you live longer" or similar (that one was made up!). But the statistical basis of these claims is often suspect, and eating the elephants (or worrying that you should be) may cause more harm than good. In medicine, should we give patients drug X or drug Y, or treatment A or treatment B, for their condition? We can do experiments to explore these questions, but how do we interpret the results?

The general situation is this: we want to find out about some aspect of the real world, and we do this by performing an experiment. From the data collected in the experiment, we want to make a deduction about reality, a process known as *statistical inference*.

We call the default state our *null hypothesis*.

We also choose a

Let's take a simple (though unrealistic) scenario which captures all of the essential ideas to show what this means in practice. This is the scenario used in Robin's Hypothesis Testing and Powerful Hypothesis Testing. There are many other situations in which hypothesis testing is used, some of which we will mention below, but they all share key characteristics with this one.

Imagine that we have a bag with red and green balls in it. The proportion of green balls in the bag is $\pi$ (for "proportion"), but we do not know this value.

For some good reason, we believe that this proportion ought to be $\frac{1}{2}$. But we are concerned that somehow the actual proportion is different. We'd like to know whether the proportion is actually $\frac{1}{2}$.

We capture these beliefs with a pair of hypotheses:

- Our null hypothesis is $H_0\colon \pi=\frac{1}{2}$. This says that the proportion is what we believe it should be.
- Our alternative hypothesis is $H_1\colon \pi\ne\frac{1}{2}$. This says that the proportion has changed.

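To test our hypotheses, we perform an experiment, collecting data in the process, do some computations using it, and arrive at a number which we call a *test statistic*. In what follows, we denote the test statistic by $X$; in our scenario, a natural choice is the number of green balls drawn.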

We will reject our null hypothesis $H_0$ if the statistic is so extreme that it is

There are two equivalent ways of deciding whether the statistic is this extreme:

- We can work out the *critical region* for $X$, that is, those extreme values of $X$ which would lead us to reject the null hypothesis at 5% significance. (This can be done even before performing the experiment.) The probability of $X$ taking a value in this critical region, assuming that the null hypothesis is true, should be 5%, or as close as we can get to 5% without going over it. In symbols, we can say: $$\mathrm{P}(\text{$X$ in critical region} | \text{$H_0$ is true}) \le 0.05.$$ Then we reject the null hypothesis if $X$ lies in that region.
- We can work out the probability of $X$ taking the value it did or a more extreme value, assuming that the null hypothesis is true. This is known as the *p-value*. If the p-value is less than 0.05, then we will reject the null hypothesis at 5% significance. [note 1] In symbols, we can write $$\text{p-value} = \mathrm{P}(\text{$X$ taking this or a more extreme value} | \text{$H_0$ is true}).$$
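The two methods above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the scenario as stated: we suppose that $X$ is the number of green balls in $n=50$ draws made with replacement, so that $X$ follows a binomial distribution with parameters $n$ and $\frac{1}{2}$ when $H_0$ is true. The function names are our own, and the p-value uses the doubling rule described in note 1.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """P(X = k) when X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_region(n, alpha=0.05):
    """Two-tailed critical region for H0: pi = 1/2.

    Returns (lo, hi): we reject H0 when X <= lo or X >= hi.
    Each tail has probability at most alpha / 2 under H0.
    """
    cum, lo = 0.0, -1
    for k in range(n + 1):
        if cum + binom_pmf(k, n) > alpha / 2:
            break
        cum += binom_pmf(k, n)
        lo = k
    # The distribution is symmetric when pi = 1/2, so mirror the lower tail.
    return lo, n - lo

def p_value(x, n):
    """Two-tailed p-value: double the smaller tail probability (capped at 1)."""
    lower = sum(binom_pmf(k, n) for k in range(0, x + 1))
    upper = sum(binom_pmf(k, n) for k in range(x, n + 1))
    return min(1.0, 2 * min(lower, upper))

n = 50  # assumed number of draws
lo, hi = critical_region(n)
print("Reject H0 if X <=", lo, "or X >=", hi)
print("p-value for X = 33:", p_value(33, n))
```

For this choice of $n$, checking whether $X$ lies in the critical region and checking whether the p-value is below 0.05 lead to the same decision for every possible value of $X$.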

In our scenario above, we were testing to see whether the proportion of something was as we expected or different. We might also test for other things, for example:

- Does this drug/treatment/intervention/... have any effect?
- Which of these drugs/... is more effective, or are they equally effective?
- Is the mean height/mass/intelligence/test score/... of this population equal to some predicted value?
- Is the standard deviation of the height/mass/... equal to some predicted value?
- For two distinct groups of people, is the mean height/mass/... of each group the same?
- Does this group of people's heights/masses/... appear to be following the probability distribution we expect?
- Do these two populations' heights/masses/... appear to have the same distribution as each other?
- Do this population's heights and weights appear to be correlated?

What does the result of a null hypothesis significance test mean? What do "the null hypothesis is accepted" and "the null hypothesis is rejected" mean? In this section, we look at some of the significant difficulties associated with this NHST approach; in the final section, we describe some alternative approaches.

The question we have actually answered with our p-value is "Given that the null hypothesis is true, what is the probability of obtaining these results (or more extreme) by chance alone?"

Many would argue that the key question should actually be: "Given these results, what is the probability that the null hypothesis is true?" This is a very different question indeed, though it is superficially similar.

We will bear this in mind as we go on, and consider some examples to show how different the answers can be.

You may be familiar with the idea that "correlation does not imply causation", in other words, just because two features are correlated does not mean that one causes the other. A similar warning applies to hypothesis tests: just because a hypothesis shows statistical significance, it does not necessarily mean that there is a material significance to the results. It could be, but it could also be due to a statistical fluke in this set of data. To be more confident that there is any reality to the results, one would want to perform more experiments or come up with an underlying explanation of why the results are as we see (or both).

We are also going to assume that other factors (such as bias or confounding) have already been addressed; these could otherwise influence the results.

If the p-value is greater than 0.05, we do not reject the null hypothesis. Does this mean that the null hypothesis is true or likely to be true? Maybe, but maybe not. We still have two possibilities: either the null hypothesis is true, or the null hypothesis is false.

- It could be that the null hypothesis is true. In this case, we would have to be unlucky to get a significant p-value, so most of the time, we will end up accepting the null hypothesis. (If the null hypothesis is true, we would reject it with a probability of only 0.05.)

- On the other hand, it could be that the alternative hypothesis is true, but we did not use a large enough sample to obtain a significant result (or we were just unlucky). In such a case, we could say that our test was *insensitive*. In this situation (the alternative hypothesis is true but we do not reject the null hypothesis), we say that we have made a *Type II error*. The probability of this happening depends on the sample size and on how different the true $\pi$ is from $\frac{1}{2}$ (or whatever our null hypothesis says), as is explored in Powerful Hypothesis Testing.
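To see how the probability of a Type II error depends on the sample size and on the true proportion, here is a rough Python sketch. It assumes (this is our assumption, not part of the scenario as stated) that the test statistic is the number of greens in $n$ draws with replacement, so that it is binomially distributed, and that we use a two-tailed test at 5% significance. The function names and the particular values of $n$ and $\pi$ are purely illustrative.

```python
from math import comb

def pmf(k, n, p):
    """P(X = k) when X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def rejection_cutoffs(n, alpha=0.05):
    """Reject H0: pi = 1/2 when X <= lo or X >= hi (each tail at most alpha / 2)."""
    cum, lo = 0.0, -1
    for k in range(n + 1):
        cum += pmf(k, n, 0.5)
        if cum > alpha / 2:
            break
        lo = k
    return lo, n - lo

def type_ii_error(n, true_pi, alpha=0.05):
    """P(we do not reject H0) when the true proportion is true_pi."""
    lo, hi = rejection_cutoffs(n, alpha)
    # X falls strictly between the two cut-offs, so H0 is not rejected.
    return sum(pmf(k, n, true_pi) for k in range(lo + 1, hi))

for n in (20, 50, 200):
    for true_pi in (0.6, 0.8):
        print(n, true_pi, round(type_ii_error(n, true_pi), 3))
```

Running this shows the pattern described above: the Type II error probability falls as the sample size grows and as the true $\pi$ moves further from $\frac{1}{2}$.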

If the p-value is less than 0.05, we reject the null hypothesis. Does this mean that the alternative hypothesis is true? Again, maybe, but maybe not. We consider the same two possibilities as before.

- It could be that the null hypothesis is true. In this case, we reject the null hypothesis with a probability of $0.05=\frac{1}{20}$, that is, one time in 20 (at a significance level of 5%), so we were just unlucky.

- On the other hand, the alternative hypothesis could indeed be true. Either the sample was large enough to obtain a significant result, or the sample size wasn't that large, but we were just lucky.

Let's dive in a little deeper into this uncertainty. One common approach, which implicitly appears in many published articles in the sciences and social sciences, is to look at the size of the p-value: the smaller the p-value, the more significant the result is considered to be (and hence the more likely the alternative hypothesis is). So a p-value of 0.003 would be considered strong evidence for the alternative hypothesis.

We could do better still by attempting to quantify our uncertainty. We can ask:

Given these results, what is the probability that the alternative hypothesis is true?

This is almost the same as our earlier question, but we are now asking for the null hypothesis to be false; the answer to this is 1 minus the probability that the null hypothesis is true. This is "obviously" the right question to ask: we really want to know how likely it is that a drug is effective, or that a proposed government policy will help rather than harm, and so on. Knowing the probability of obtaining these results if the null hypothesis is true (the p-value) seems less important.

To calculate the probability, we can draw a tree diagram to represent this situation:

Using this tree diagram, we can work out the probabilities of $H_0$ being true or $H_1$ being true given our experimental results. To avoid the expressions becoming unwieldy, we will write $H_0$ for "$\text{$H_0$ true}$", $H_1$ for "$\text{$H_1$ true}$" and "$\text{p}^+$" for "observed p-value or more extreme". Then we can write (conditional) probabilities on the branches of the tree diagram leading to our observed p-value: [note 2]

The two routes which give our observed p-value (or more extreme) have the following probabilities:

$$\begin{align*}
\mathrm{P}(H_0\cap \text{p}^+) &= \mathrm{P}(H_0) \times \mathrm{P}(\text{p}^+ | H_0) \\
\mathrm{P}(H_1\cap \text{p}^+) &= \mathrm{P}(H_1) \times \mathrm{P}(\text{p}^+ | H_1)
\end{align*}$$

(Recall that $\mathrm{P}(H_0\cap \text{p}^+)$ means "the probability of $H_0$ being true **and** the p-value being that observed or more extreme".)

We can therefore work out the probability of the alternative hypothesis being true given the observed p-value, using conditional probability:

$$\begin{align*}
\mathrm{P}(H_1|\text{p}^+) &= \frac{\mathrm{P}(H_1\cap \text{p}^+)}{\mathrm{P}(\text{p}^+)} \\
&= \frac{\mathrm{P}(H_1\cap \text{p}^+)}{\mathrm{P}(H_0\cap\text{p}^+)+\mathrm{P}(H_1\cap\text{p}^+)} \\
&= \frac{\mathrm{P}(H_1) \times \mathrm{P}(\text{p}^+ | H_1)}{\mathrm{P}(H_0) \times \mathrm{P}(\text{p}^+ | H_0) + \mathrm{P}(H_1) \times \mathrm{P}(\text{p}^+ | H_1)}
\end{align*}$$

Though this is a mouthful, it is a calculation which only involves the four probabilities on the above tree diagram. (This is an example of *Bayes' Theorem*, discussed further in this resource.)

However, we immediately hit a big difficulty if we try to calculate this for a given experiment. We know $\mathrm{P}(\text{p}^+ | H_0)$: this is just the p-value itself. (The p-value tells us the probability of obtaining a result at least this extreme given that the null hypothesis is true.) But we don't know the probability of the null hypothesis being true or false (that is, $\mathrm{P}(H_0)$ and $\mathrm{P}(H_1)=1-\mathrm{P}(H_0)$), nor do we know the probability of the observed result if the alternative hypothesis is true ($\mathrm{P}(\text{p}^+|H_1)$), as knowing that the proportion of greens is not $\frac{1}{2}$ does not tell us what it actually is. (Similar issues apply to all the other contexts of hypothesis testing listed above.) So we are quite stuck: in the null hypothesis significance testing model, it is impossible to give a numerical answer to our key question: "Given our results, what is the probability that the alternative hypothesis is true?" This is because we don't know two of the three probabilities that we need in order to answer the question.

An example might highlight the issue a little better. Let us suppose that we are trying to work out whether a coin is biased (alternative hypothesis), or whether the probability of heads is exactly $\frac{1}{2}$ (null hypothesis). We toss the coin 50 times and obtain a p-value of 0.02. Do we now believe that the coin is biased? Most people believe that coins are not biased, and so are much more likely to attribute this result to chance or poor coin-tossing technique than to the coin being biased.

On the other hand, consider a case of a road planner who introduces a traffic-calming feature to reduce the number of fatalities along a certain stretch of road. The null hypothesis is that there is no change in fatality rate, while the alternative hypothesis is that the fatality rate has decreased. A hypothesis test is performed on data collected for 24 months before and 24 months after the feature is built. Again, the p-value was 0.02. Do we believe that the alternative hypothesis is true? In this case, we are more likely to believe that the alternative hypothesis is true, because it makes a lot of sense that this feature will reduce the number of fatalities.

Our "instinctive" responses to these results are tied up with assigning values to the unknown probabilities in the formula above. For the coin, we would probably take $\mathrm{P}(H_0)$ to be close to 1, say $0.99$, as we think it is very unlikely that the coin is biased, and $\mathrm{P}(\text{p}^+|H_1)$ will be, say, $0.1$: if the coin is biased, the bias is not likely to be very large, and so it is only a bit more likely that the result will be significant in this case. Putting these figures into the formula above gives:

$$\mathrm{P}(H_1|\text{p}^+) = \frac{0.01 \times 0.1}{0.99 \times 0.02 + 0.01 \times 0.1} \approx 0.05,$$

that is, we are still very doubtful that this coin is biased, even after performing the experiment. Note that in this case, the probability of these results given that the null hypothesis is true is 0.02, whereas the probability that the null hypothesis is true given these results is $1-0.05=0.95$, which is very different. This shows how dramatically different the answers to the two questions can be.

On the other hand, for the fatalities situation, we might assume quite the opposite: we are pretty confident that the traffic-calming feature will help, so we might take $\mathrm{P}(H_0)$ to be $0.4$, and $\mathrm{P}(\text{p}^+|H_1)$ will be, say, $0.25$ (though the traffic-calming may help, the impact may be relatively small). Putting these figures into the formula gives:

$$\mathrm{P}(H_1|\text{p}^+) = \frac{0.6 \times 0.25}{0.4 \times 0.02 + 0.6 \times 0.25} \approx 0.95,$$

so we are now much more convinced that the traffic-calming feature is helping than we were before we had the data. This time, the probability of these results given that the null hypothesis is true is still 0.02, whereas the probability that the null hypothesis is true given these results is $1-0.95=0.05$, which is not that different.
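The two calculations above can be reproduced with a short Python sketch of the formula for $\mathrm{P}(H_1|\text{p}^+)$. The function and variable names are our own; the input values are the assumed ones from the coin and traffic-calming examples.

```python
def posterior_h1(prior_h0, p_given_h0, p_given_h1):
    """P(H1 | p+) from the tree diagram formula.

    prior_h0   : P(H0), our belief in the null hypothesis before the experiment
    p_given_h0 : P(p+ | H0), which is the p-value itself
    p_given_h1 : P(p+ | H1), an assumed value
    """
    prior_h1 = 1 - prior_h0
    route_h0 = prior_h0 * p_given_h0  # P(H0 and p+)
    route_h1 = prior_h1 * p_given_h1  # P(H1 and p+)
    return route_h1 / (route_h0 + route_h1)

# Coin example: we strongly believe the coin is fair beforehand.
coin = posterior_h1(prior_h0=0.99, p_given_h0=0.02, p_given_h1=0.1)

# Traffic-calming example: we expect the feature to help.
road = posterior_h1(prior_h0=0.4, p_given_h0=0.02, p_given_h1=0.25)

print(round(coin, 2), round(road, 2))
```

The same p-value of 0.02 goes into both calls; only the assumed values of $\mathrm{P}(H_0)$ and $\mathrm{P}(\text{p}^+|H_1)$ differ.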

This approach may seem very disturbing, as we have to make assumptions about what we believe before we do the hypothesis test. But as we have seen, we cannot answer our key question without making such assumptions.

The discussion in the last section leads to an approach to hypothesis testing and interpretation of data known as *Bayesian inference*.

The approach we started with, null-hypothesis significance testing, is actually a composite of two different approaches developed in the 20th century.

The first approach was developed by Fisher and others. This approach had a null hypothesis and p-values; one could only accept or reject the null hypothesis, and there was no alternative hypothesis. Another issue with this approach that we did not mention earlier is that there is no such thing as "the" p-value for an experimental result. The p-value, which gives the probability of obtaining this test statistic or more extreme assuming that the null hypothesis is true, depends on the test statistic used. There may be some very natural ones, for example, the number of green balls drawn in our above example. But if we had used some other statistic, for example the length of the longest consecutive sequence of greens drawn, then we would obtain a different p-value from our experiment. Which is the "correct" or "best" statistic to use? In some scenarios, such as ours, there is a clear answer (the total number of greens drawn), but in other scenarios it is not so clear. Therefore in some cases, the p-value can be somewhat misleading: it may come out at 0.03, say, but a different, equally sensible-looking statistic might give a p-value of 0.07. There is a skill in choosing the test statistic that is most suitable for testing for different sorts of departures from the null hypothesis, for example the "longest consecutive sequence" might be appropriate for testing a different null hypothesis: that the draws are independent.
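To illustrate how the choice of test statistic changes the p-value, here is a small Python sketch. It rests on several assumptions of our own: we make only 10 draws with replacement (so that all $2^{10}$ sequences can be listed and are equally likely when $\pi=\frac{1}{2}$), we use a made-up observed sequence, we double the tail for the "number of greens" statistic, and we use only the upper tail for the "longest run of greens" statistic.

```python
from itertools import product

def longest_green_run(seq):
    """Length of the longest consecutive run of 'G' in the sequence."""
    best = run = 0
    for ball in seq:
        run = run + 1 if ball == "G" else 0
        best = max(best, run)
    return best

def exact_p_values(observed):
    """p-values for the same data under two different test statistics."""
    n = len(observed)
    obs_count = observed.count("G")
    obs_run = longest_green_run(observed)
    upper = lower = run_tail = 0
    # Under H0 every one of the 2**n colour sequences is equally likely.
    for seq in product("GR", repeat=n):
        greens = seq.count("G")
        upper += greens >= obs_count
        lower += greens <= obs_count
        run_tail += longest_green_run(seq) >= obs_run
    total = 2 ** n
    p_count = min(1.0, 2 * min(upper, lower) / total)  # two-tailed, doubled
    p_run = run_tail / total                           # upper tail only
    return p_count, p_run

# Hypothetical data: 7 greens out of 10, with a longest run of 6 greens.
p_count, p_run = exact_p_values("GGGGGGRRGR")
print(p_count, p_run)
```

With this made-up sequence, the number of greens is unremarkable while the long run of greens is quite unusual, so the two statistics give very different p-values for the same data.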

Another issue with the Fisher approach is that it doesn't necessarily answer the question we want to ask. We may be testing a drug or some sort of intervention (a change in policy or teaching method or ...), and we want to know how we should act. We do a trial and want to know whether we should prescribe this drug or implement this intervention. So we want to know whether prescribing this drug is better than the existing ones or whether the intervention is effective. But the Fisher approach only tells us whether the results of our experiment are unlikely to have happened by chance if there is no real effect; it does not tell us whether the alternative hypothesis is (much) more likely to be true than the null hypothesis.

A different approach was developed by Neyman and Pearson, based on the idea of likelihoods: they introduced the alternative hypothesis, which could be specified either as we have done ($\pi\ne\frac{1}{2}$), or as a precise alternative to the null hypothesis. Their view is that we want to know the answer to the question: "Which of these two hypotheses should we assume to be true in the way we act? Should we assume that the null hypothesis is true or the alternative hypothesis?" For example, if we are testing a new medicine, we want to know: "Should we prescribe the new medicine or remain with the current one?"

As a practical example, in our green balls scenario, the alternative hypothesis might be $H_1\colon \pi=0.6$ if we had some reason to be interested in this value. We then ask: given the observed data, what is the likelihood of $H_0$, and what is the likelihood of $H_1$? [note 3] The ratio of these two, the likelihood ratio, tells us how many times more likely the observed data would be if $H_0$ were true than if $H_1$ were true. If this is small enough - below some threshold such as $\frac{1}{5}$ - then we accept $H_1$; otherwise we accept $H_0$, and we should act according to the hypothesis we have accepted. It turns out that we can choose the threshold value so that the probability of rejecting $H_0$ incorrectly is still our chosen significance level (say 5%). This approach also allows us to talk about the power of the test (as explored in Powerful Hypothesis Testing), which means the probability of the null hypothesis being rejected if the alternative hypothesis is true. This approach does require us to be able to specify a meaningful alternative hypothesis, and a precise type of alternative can only be chosen by thinking about the actual physical context. Statistics does not exist in a theoretical vacuum, though, so this is not necessarily a bad thing. This approach has the additional benefit that it does not suffer from the test statistic problem mentioned above in the context of p-values: the null hypothesis and alternative hypothesis together determine exactly what test statistic should be used.
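A likelihood ratio of this kind can be sketched in Python as follows. We assume (our assumption, for illustration only) that the data is the number of greens $x$ in $n=50$ draws with replacement, so that the likelihoods are binomial probabilities, and we use the example threshold of $\frac{1}{5}$ mentioned above rather than one tuned to give a 5% significance level. The function names are hypothetical.

```python
from math import comb

def likelihood(pi, x, n):
    """Likelihood of 'proportion = pi' given x greens in n draws: P(X = x | pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

def likelihood_ratio(x, n, pi0=0.5, pi1=0.6):
    """Likelihood of H0 divided by likelihood of H1 for the observed x."""
    return likelihood(pi0, x, n) / likelihood(pi1, x, n)

def decide(x, n, threshold=1 / 5):
    """Accept H1 if the ratio falls below the threshold, otherwise accept H0."""
    return "H1" if likelihood_ratio(x, n) < threshold else "H0"

n = 50  # assumed number of draws
for x in (25, 30, 32, 35):
    print(x, round(likelihood_ratio(x, n), 3), decide(x, n))
```

Because the ratio decreases steadily as $x$ increases, comparing the ratio with a threshold is the same as comparing $x$ with a cut-off value. This is one way to see that the pair of hypotheses determines which test statistic to use.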

These great statisticians argued for decades about which was the better approach, and their argument was never resolved. The composite of these two approaches that we see in the (UK) school curriculum dominated statistical inference throughout most of the 20th century and into the 21st. With the Bayesian approach now regaining popularity (it actually predates the current null hypothesis approach), and the validity of the NHST approach being questioned ever more, it will be interesting to see how the statistics battles develop over the coming years.

One final warning is in order. Hypothesis testing makes some basic assumptions. It assumes that our model of the situation is correct; if it is not, then our data will not follow the behaviour we expect, and so our analysis will be unreliable. It also assumes that we collect data in an unbiased manner, which may well not be as straightforward as it sounds. We generally require the individual measurements to be independent, and if we ask people questions, we assume - usually in vain - that the answers we receive are all honest. Since this collection of assumptions is essentially impossible to achieve in practice, in spite of our best efforts, it is always worth treating results of hypothesis tests with a degree of caution. Nevertheless, hypothesis tests are a very useful tool in the statistician's armoury, and are used regularly in practice.

- Because our test is two-tailed (in the alternative hypothesis, the true proportion could be less than $\frac{1}{2}$ or more than $\frac{1}{2}$), we must be careful when calculating the p-value: we calculate the probability of the observed outcome or more extreme occurring, and then double the answer to account for the other tail. We could also compare the probability of the value or more extreme to 0.025 instead of 0.05, but that would not be called a p-value.

Likewise, when we determine the critical region, we will have two parts: a tail with large values of $X$ and a tail with small values of $X$; we require that the probability of $X$ lying in the large-value tail is as close as possible to 0.025 without going over it, and the same for the probability of $X$ lying in the small-value tail.

- There are complications here when working with two-tail tests as opposed to one-tail tests. We will ignore this problem, as it does not significantly affect the overall discussion.

- "Likelihood" is a technical term. For a discrete test statistic $X$, the likelihood of $H_0$ given the data $X=x$ means $\mathrm{P}(X=x|H_0)$, in other words, how likely would this data be if $H_0$ were true. It is *not* the probability of $H_0$ being true given the data.
