# Hypothetical shorts

## Problem

**Are the following statements about hypothesis testing true or false?**

Give convincing reasons why that is the case. If a statement is false, can you give an example to show why? And if so, is there some sense in which the statement is "usually" true, but there are just a few special cases where it is false, or is it "usually" false?

*If you have not met p-values before, you could look at the article What is a Hypothesis Test?*

- A significance level of 5% means that there is a 5% probability of getting a test statistic in the critical region if the null hypothesis is true.
- A significance level of 5% means that there is a 5% probability of the null hypothesis being true if the test statistic lies in the critical region.
- The p-value of an experiment gives the probability of the null hypothesis being true.
- If the p-value is less than 0.05, then the alternative hypothesis is true.
- If the p-value is less than 0.05, then the alternative hypothesis is more likely to be true than the null hypothesis.
- The closer the p-value is to 1, the greater the probability that the null hypothesis is true.
- If we have a larger sample size, we will get a more reliable result from the hypothesis test.
- If we repeat an experiment and we get a p-value less than 0.05 in either experiment, then we must reject the null hypothesis.
- If we do not get a significant result from our experiment, we should go on increasing our sample size until we do.

The XKCD cartoon Significant provides a nice illustration of the idea in question 8.

*This resource is part of the collection Statistics - Maths of Real Life*

## Getting Started

Some of the statements are subtle and easy to misinterpret. Sometimes, the interactives mentioned above will answer a different question from the one being asked. For example, statement 4 begins "If the p-value is less than 0.05, ...", but you cannot control the p-value in the simulations (why not?). Instead, you might want to think about other representations of the situation first, before or instead of running a simulation. Tree diagrams will be particularly helpful here.

## Student Solutions

*A significance level of 5% means that there is a 5% probability of getting a test statistic in the critical region if the null hypothesis is true.*

This is at least approximately true, and in some cases it is exactly true. But the critical region may have a probability a little less than 0.05, for example in a test based on a binomial distribution, where it is usually impossible to obtain a critical region with probability exactly 0.05.
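
To see this concretely, consider a hypothetical example (not one from this problem): testing a coin for bias towards heads with 20 tosses and null hypothesis P(heads) = 0.5. This Python sketch finds the one-sided critical region whose probability is as large as possible without exceeding 5%:

```python
from math import comb

def binom_tail(n, p, c):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

n, p = 20, 0.5  # 20 tosses; null hypothesis: the coin is fair
# Smallest c with P(X >= c) <= 0.05 under the null hypothesis:
# the critical region is then {c, c + 1, ..., n}.
c = next(c for c in range(n + 1) if binom_tail(n, p, c) <= 0.05)
print(c, round(binom_tail(n, p, c), 4))  # 15 0.0207
```

The critical region is "15 or more heads", but its actual probability is about 2.07%, not 5%: including 14 heads as well would push the probability above 5% (to about 5.77%).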

Furthermore, for the 5% probability to be correct, we also have to assume that all of our assumptions are correct. This includes: we have the right model of the situation and there are no unaccounted-for factors; our observations are independent (if we are assuming them to be); we have no systematic bias; if we are asking people questions, their answers are honest, and so on. In practice, this is never the case; we just do our best to minimise these factors.

*A significance level of 5% means that there is a 5% probability of the null hypothesis being true if the test statistic lies in the critical region.*

This is false: the order of dependency is the wrong way round. See the previous question.

*The p-value of an experiment gives the probability of the null hypothesis being true.*

False. We have little idea of the probability of the null hypothesis being true. The p-value tells us only the probability of obtaining this result or a more extreme one if the null hypothesis is true. Even in a Bayesian setup, the posterior probability of the null hypothesis being true is not equal to the p-value.

*If the p-value is less than 0.05, then the alternative hypothesis is true.*

We cannot say anything as definite as this! We can only talk in terms of probabilities.

*If the p-value is less than 0.05, then the alternative hypothesis is more likely to be true than the null hypothesis.*

This depends both on the p-value itself and on our prior beliefs about the probability of the alternative hypothesis being true. See What is a Hypothesis Test?
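
A small simulation can illustrate this dependence on prior beliefs. In the hypothetical set-up sketched below (the numbers are made up for illustration), the alternative hypothesis is true in only 10% of experiments, and each experiment is a one-sided z-test on 25 observations with a modest true effect under $H_1$:

```python
import random
from math import erf, sqrt

random.seed(3)

def one_trial():
    """One experiment: H1 (mean 0.3) holds with prior probability 0.1,
    otherwise H0 (mean 0) holds.  Test H0 with a one-sided z-test on
    25 observations from a normal distribution with known sd 1."""
    h1 = random.random() < 0.1
    mean = 0.3 if h1 else 0.0
    z = sum(random.gauss(mean, 1) for _ in range(25)) / sqrt(25)
    p = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # P(Z >= z) under H0
    return h1, p < 0.05

results = [one_trial() for _ in range(20000)]
significant = [h1 for h1, sig in results if sig]
# Fraction of "significant" experiments in which H1 is actually true:
print(round(sum(significant) / len(significant), 2))
```

With these numbers, a p-value below 0.05 leaves the two hypotheses roughly equally likely; with an even smaller prior probability for $H_1$, the null hypothesis would remain the more likely one despite the significant result.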

*The closer the p-value is to 1, the greater the probability that the null hypothesis is true.*

It may well be true, but for a somewhat subtle reason. If the null hypothesis is true, the p-value can take any value in the range 0 to 1, and roughly *p* of the time it will be less than *p*: all values are equally likely. If the alternative hypothesis is true, though, then smaller p-values are more likely than larger ones. So this statement seems likely to be true. To prove it, though, one could use the results derived in What is a Hypothesis Test?, together with a determination of $\mathrm{P}(\mathrm{p}^+|H_1)$ for different p-values. Drawing a graph showing either $\mathrm{P}(H_1|\mathrm{p}^+)$ or $\mathrm{P}(H_0|\mathrm{p}^+) = 1 - \mathrm{P}(H_1|\mathrm{p}^+)$ against the p-value would show whether, for this case of $\mathrm{P}(\mathrm{p}^+|H_1)$ at least, the statement is true.
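
This behaviour is easy to see in a simulation. The hypothetical sketch below (a one-sided z-test on normal data, with an assumed effect size of 0.5 under the alternative) compares the p-value distributions when the null hypothesis is true and when the alternative is true:

```python
import random
import statistics
from math import erf, sqrt

random.seed(1)

def p_value(sample, mu0=0.0, sigma=1.0):
    """One-sided z-test p-value for H0: mean = mu0 vs H1: mean > mu0,
    assuming the standard deviation sigma is known."""
    z = (statistics.mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))  # P(Z >= z)

def simulate(true_mean, trials=2000, n=25):
    return [p_value([random.gauss(true_mean, 1) for _ in range(n)])
            for _ in range(trials)]

under_h0 = simulate(0.0)  # null true: p-values roughly uniform on (0, 1)
under_h1 = simulate(0.5)  # alternative true: p-values pile up near 0
print(sum(p < 0.5 for p in under_h0) / len(under_h0))  # about a half
print(sum(p < 0.5 for p in under_h1) / len(under_h1))  # almost all
```

Under the null hypothesis, about half the p-values land below 0.5 (they are uniformly spread); under the alternative, almost all of them do.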

*If we have a larger sample size, we will get a more reliable result from the hypothesis test.*

We need to be clear what we mean by the word "reliable". Assuming that we mean something like "the probability of correctly accepting $H_0$ if it's true, and the probability of rejecting $H_0$ if the alternative hypothesis is true", then in general, a larger sample size will lead to a more reliable result. This can also be expressed in terms of the probabilities of a Type I or Type II error.
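
Assuming a simple one-sided z-test with a modest true effect (a hypothetical set-up, not tied to any particular experiment), a short simulation shows the probability of correctly rejecting $H_0$ — the power of the test — rising as the sample size grows, which is one way of making "more reliable" precise:

```python
import random
from math import erf, sqrt

random.seed(2)

def rejects(n, true_mean, alpha=0.05):
    """Draw n observations from N(true_mean, 1) and test H0: mean = 0
    against H1: mean > 0 with a one-sided z-test at level alpha."""
    z = sum(random.gauss(true_mean, 1) for _ in range(n)) / sqrt(n)
    p = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return p < alpha

def power(n, true_mean=0.3, trials=2000):
    """Estimated probability of (correctly) rejecting H0."""
    return sum(rejects(n, true_mean) for _ in range(trials)) / trials

print([round(power(n), 2) for n in (10, 40, 160)])  # power grows with n
```

The probability of a Type II error (failing to reject a false $H_0$) is one minus the power, so it falls as the sample size grows, while the Type I error rate stays fixed at the significance level.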

*If we repeat an experiment and we get a p-value less than 0.05 in either experiment, then we must reject the null hypothesis.*

As an extreme case (illustrated in the referenced XKCD cartoon), if we repeat the experiment 20 times, it is likely that we will obtain a p-value less than 0.05 at least once. (The probability is over 0.6 - why?) So we cannot simply repeat an experiment multiple times and reject the null hypothesis if any of the p-values is less than 0.05: we need to take into account the fact that we are repeating the experiment, and we therefore need a p-value somewhat smaller than 0.05 to reject the null hypothesis at an overall significance level of 5%. Statisticians have calculated exactly how small the p-value would need to be in this case.
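
A quick calculation, assuming the 20 repetitions are independent and the null hypothesis is true in every one of them, confirms the "over 0.6" figure and shows one standard way (the Šidák correction) of choosing a stricter per-experiment threshold:

```python
alpha, m = 0.05, 20

# Probability of at least one p-value below alpha across m independent
# experiments, when the null hypothesis is true in every one:
p_at_least_one = 1 - (1 - alpha) ** m
print(round(p_at_least_one, 4))  # 0.6415, i.e. over 0.6

# Sidak correction: a per-experiment threshold chosen so that the
# overall (family-wise) probability of a false positive is exactly alpha:
per_test = 1 - (1 - alpha) ** (1 / m)
print(round(per_test, 5))  # 0.00256
```

So each individual p-value would need to be below about 0.0026, not 0.05, for the 20 experiments taken together to constitute a test at the 5% significance level.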

*If we do not get a significant result from our experiment, we should go on increasing our sample size until we do.*

We can do a new experiment with a larger sample size, but (as discussed in the previous question) we need to be very careful about interpreting results when we do repeat experiments. In particular, we cannot simply repeat the experiment until we get a small p-value for that particular experiment and interpret it to mean that there is evidence for the alternative hypothesis.

## Teachers' Resources

### Why do this problem?

This task will challenge students to think about the meaning of hypothesis tests and challenge some common misconceptions about them. Hypothesis tests are used (and misused) in many areas of science and social science, so becoming aware of these issues at an early stage will help to protect them from falling into serious traps later on. The questions are grouped thematically.

### Possible approach

You may wish to use the problem Stats Statements before this one, which is about probability and statistics more generally. You may also want to look at some or all of Robin's Hypothesis Testing, Powerful Hypothesis Testing or What is a Hypothesis Test? before or after working on this problem. You might also have an initial discussion about these questions, then work on some or all of the other resources, and return to these questions afterwards. In that way, students' minds will be more attuned to some of the issues involved as they think more deeply about hypothesis testing.

This problem is very well suited to discussion. As there are no calculations involved (at least superficially), it is very easy for all students to get into this problem at a level which suits them. One approach might be to ask for an immediate, instinctive response to the questions before asking them to assess them in more detail. What factors do they need to take into account in order to answer the questions? Does it depend on the specific hypothesis test? Might there be any exceptions? And then to reflect: were their gut-feelings right or wrong? Were there any surprises? (Note that question 5 is certainly true if question 4 is true, but they might both be false.)

This problem offers a good chance to practise explaining complicated ideas in statistics. Students could try to explain their thoughts verbally to each other. Giving a good explanation requires a sound analysis of the statistics. Does the audience think that the explanation is sound or convincing?

This problem gives an opportunity to discuss the idea of conditional (if ... then ...) statements, as several of the questions use them. It also provides a context in which to explore the power of counterexamples: for example, constructing a single example in which '*the p-value is less than 0.05, but the alternative hypothesis is false*' would show that the statement '*If the p-value is less than 0.05, then the alternative hypothesis is true*' cannot be true.

It is important to note that this problem is likely to raise many questions about hypothesis testing. An exploration of such questions will lead to a stronger understanding of this important area of statistics, which can only impact positively on students' future use of this material in their school studies and beyond. Many of the questions are addressed, directly or indirectly, in the other resources mentioned above.

### Key questions

This problem may feel very easy to some students, who might just give an "obvious" answer, while others might find some of the questions too sophisticated to tackle. To encourage students to think their way through each statement, you could pose questions such as:

- Can you give a concrete example of a hypothesis test?
- What might happen in this test?
- What if the null hypothesis is actually true/false - what might happen then?

### Possible extension

Students could be asked how we could do further experiments (questions 7 and 8) while still obtaining valid results. This question has a qualitative answer, but can also (with more sophistication) be answered quantitatively.

### Possible support

Students could use the interactivities in Robin's Hypothesis Testing or Powerful Hypothesis Testing to gain a sense of what p-values mean in a hypothesis test. In this way, they can control the parameters of an experiment and observe the results, repeating as many times as they need to in order to give meaningful answers to the questions.

If students need support with understanding conditional statements, they could explore Iffy Logic.