Robin's hypothesis testing

How many trials should we do in order to accept or reject our null hypothesis?

Age

16 to 18

Problem

Robin has a bag containing red and green balls. Robin wants to test the following hypotheses, where $\pi$ is the proportion of green balls in the bag:

$H_0\colon \pi=\frac{1}{2}$ and $H_1\colon \pi\ne\frac{1}{2}$

Robin is allowed to take out a ball at random, note its colour and then replace it: this is called a trial. Robin can do lots of trials, but each trial has a certain cost.

Robin wants to test these hypotheses as cheaply as possible, so suggests the following approach:

"I will do at most 50 trials. If the p-value* drops below 0.05 at any point, then I will stop and reject the null hypothesis at the 5% significance level, otherwise I will accept it."

Robin tells you about this plan. What advice could you give to Robin?

Warning - the computer needs a little bit of thinking time to do the simulations!

In this simulation, you can:

specify the number of green and red balls actually in the bag (and the true ratio is shown with a green dashed line on the graph) - note that in a real experiment we would not know this!
specify the number of trials (up to 200)
specify the proportion for the null hypothesis (which we took to be $\frac{1}{2}$ above)
choose whether to show the proportion of green balls after each ball is picked
choose whether to show the p-value after each ball is picked*
rerun the simulation ("Repeat experiment")

The "Final p-value" shows the p-value at the end of the experiment, and the orange lines are at 0.1, 0.05 and 0.01.

Here are some questions you could consider as you think about Robin's approach:

What do you notice about the patterns of proportions and p-values? Is there anything which is the same every time or most times you run the simulation?
If we repeat the experiment lots of times, how often does $H_0$ get rejected using Robin's approach? Does the answer to this depend on how many trials we perform?
Does the answer change if you change the true proportion of greens in the bag?
What would happen if you changed the hypothesised proportion $\pi$?
What would happen if you changed the significance level from 5% to 10% or 1%?

You may want to ask and explore other questions as well.

Rejecting $H_0$ when it is true is called a Type I error.

* To read more about p-values, have a look at What is a Hypothesis Test? The p-values here are calculated like this: after $k$ trials, we find twice the probability of obtaining this number of greens or a more extreme number in $k$ trials, assuming that $H_0$ is true. The graph shows how this p-value changes with $k$.
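As a rough illustration of the calculation described in this footnote, here is a minimal Python sketch. The function names are made up for this example, and "a more extreme number" is assumed to mean the smaller of the two tails, doubled and capped at 1. The simulation itself may differ in detail.

```python
from math import comb


def binom_pmf(k, i, pi0):
    """Probability of exactly i greens in k trials if the proportion is pi0."""
    return comb(k, i) * pi0**i * (1 - pi0) ** (k - i)


def two_sided_p_value(greens, k, pi0=0.5):
    """Twice the smaller tail probability under H0, capped at 1 (an assumed convention)."""
    lower = sum(binom_pmf(k, i, pi0) for i in range(0, greens + 1))  # P(X <= greens)
    upper = sum(binom_pmf(k, i, pi0) for i in range(greens, k + 1))  # P(X >= greens)
    return min(1.0, 2 * min(lower, upper))


# No greens in the first 5 trials is not yet significant at 5%,
# but no greens in the first 6 trials is.
print(two_sided_p_value(0, 5))  # 0.0625
print(two_sided_p_value(0, 6))  # 0.03125
```

Under this convention, with $H_0\colon \pi=\frac{1}{2}$ the p-value cannot drop below 0.05 before the 6th trial.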

This resource was inspired by the controversy surrounding a paper published in Nature Communications, as discussed by Casper Albers here.

This resource is part of the collection Statistics - Maths of Real Life

Student Solutions

Does Robin's approach work?

If we run the simulation with 50 trials, 2 red balls and 2 green balls, with $H_0\colon\pi=\frac{1}{2}$, we discover that about 5% of the time, the final p-value is less than 0.05. It might take a lot of experiments to get an accurate percentage: I did 100 experiments, and in 3 of them the final p-value was less than 0.05. (This is what the significance level means: it is the probability that the null hypothesis will be rejected given that it is true.)

I then did another 100 experiments, this time counting the experiments in which the p-value went below 0.05 at any point: this happened in 18 of the 100. This suggests that the probability of the p-value going below 0.05 at some point is much higher than 0.05, and so Robin's approach is likely to reject the null hypothesis even when there is insufficient evidence to do so.
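The two kinds of counting described above can be imitated with a small Monte Carlo sketch in Python. It is only an approximation of what the simulation does: the function names, the random seed and the p-value convention (twice the smaller tail, capped at 1) are assumptions made for this example.

```python
import random
from math import comb


def p_value_table(max_trials, pi0=0.5):
    """Two-sided p-value for every (number of trials k, number of greens g)."""
    table = {}
    for k in range(1, max_trials + 1):
        pmf = [comb(k, i) * pi0**i * (1 - pi0) ** (k - i) for i in range(k + 1)]
        for g in range(k + 1):
            lower = sum(pmf[: g + 1])  # P(X <= g) under H0
            upper = sum(pmf[g:])       # P(X >= g) under H0
            table[(k, g)] = min(1.0, 2 * min(lower, upper))
    return table


def run_experiments(n_experiments=2000, max_trials=50, true_pi=0.5,
                    pi0=0.5, alpha=0.05, seed=0):
    """Return (fraction rejected on the final p-value,
               fraction in which the p-value dipped below alpha at any point)."""
    rng = random.Random(seed)
    table = p_value_table(max_trials, pi0)
    final_rejections = 0
    ever_rejections = 0
    for _ in range(n_experiments):
        greens = 0
        dipped = False
        for k in range(1, max_trials + 1):
            if rng.random() < true_pi:  # draw a ball with replacement
                greens += 1
            if table[(k, greens)] < alpha:
                dipped = True           # Robin would have stopped here
        final_rejections += table[(max_trials, greens)] < alpha
        ever_rejections += dipped
    return final_rejections / n_experiments, ever_rejections / n_experiments


final_rate, ever_rate = run_experiments()
print("rejected on final p-value:", final_rate)
print("p-value below 0.05 at some point:", ever_rate)
```

The second fraction can never be smaller than the first, since an experiment that ends with a p-value below 0.05 has certainly dipped below it at some point.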

In fact, there is a theorem which says that if the null hypothesis is true and we keep doing trials for ever, the probability that the p-value will go below 0.05 at some point is 1. (This is clearly true if the null hypothesis is false, as the proportion of green balls will tend to the true proportion and so the p-value will tend to 0. The amazing result is that this statement is true even if the null hypothesis is true.) So if we allow ourselves to do more and more trials, Robin's approach gradually becomes even worse. For example, when I experimented with 200 trials, the p-value went below 0.05 at some point in 23 experiments out of 100. This seems a little worse than with 50 trials, but not by that much. It turns out that one needs to do a huge number of trials to reach, say, a probability of 0.5 of obtaining a p-value less than 0.05 at some point.
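Rather than repeating experiments, the probability of the p-value ever dropping below 0.05 can also be calculated exactly by keeping track of every possible number of greens after each trial. The following Python sketch does this. The function names and the p-value convention (twice the smaller tail, capped at 1) are assumptions made for this example, not a description of how the simulation works.

```python
from math import comb


def p_values_for_k(k, pi0):
    """Two-sided p-values for every possible number of greens 0..k after k trials."""
    pmf = [comb(k, i) * pi0**i * (1 - pi0) ** (k - i) for i in range(k + 1)]
    lower, total = [], 0.0
    for prob in pmf:                 # running sum gives P(X <= g)
        total += prob
        lower.append(total)
    upper, total = [0.0] * (k + 1), 0.0
    for g in range(k, -1, -1):       # reverse running sum gives P(X >= g)
        total += pmf[g]
        upper[g] = total
    return [min(1.0, 2 * min(lo, up)) for lo, up in zip(lower, upper)]


def prob_ever_rejecting(max_trials, true_pi=0.5, pi0=0.5, alpha=0.05):
    """Probability that the p-value is below alpha after at least one trial."""
    alive = [1.0]   # alive[g] = P(g greens so far and p-value never yet below alpha)
    rejected = 0.0
    for k in range(1, max_trials + 1):
        nxt = [0.0] * (k + 1)
        for g, prob in enumerate(alive):
            nxt[g] += prob * (1 - true_pi)   # next ball is red
            nxt[g + 1] += prob * true_pi     # next ball is green
        pvals = p_values_for_k(k, pi0)
        for g in range(k + 1):
            if pvals[g] < alpha:             # Robin stops and rejects here
                rejected += nxt[g]
                nxt[g] = 0.0
        alive = nxt
    return rejected


def prob_final_rejecting(n, true_pi=0.5, pi0=0.5, alpha=0.05):
    """Probability that the p-value after exactly n trials is below alpha."""
    pvals = p_values_for_k(n, pi0)
    return sum(comb(n, g) * true_pi**g * (1 - true_pi) ** (n - g)
               for g in range(n + 1) if pvals[g] < alpha)


for n in (50, 200):
    print(n, prob_final_rejecting(n), prob_ever_rejecting(n))
```

Comparing the two printed columns for 50 and 200 trials gives a check on the experimental counts quoted above, without the noise that comes from only running 100 experiments.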

Fixing Robin's approach

There is a way that we could sometimes stop early and thereby save money. Let's say that we decide that we're going to do 50 trials. If we reach the 45th trial, say, and see that it is impossible for the p-value to drop below 0.05 by the 50th trial, we can stop and accept the null hypothesis. This would take a little calculation, but could save Robin some money without invalidating the conclusion.
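The "little calculation" mentioned above could look something like the following Python sketch. It assumes that the decision is based only on the p-value after the final trial, and that the p-value is twice the smaller tail probability, capped at 1. The function names are invented for this example.

```python
from math import comb


def two_sided_p_value(greens, k, pi0=0.5):
    """Twice the smaller tail probability under H0, capped at 1 (an assumed convention)."""
    pmf = [comb(k, i) * pi0**i * (1 - pi0) ** (k - i) for i in range(k + 1)]
    lower = sum(pmf[: greens + 1])  # P(X <= greens)
    upper = sum(pmf[greens:])       # P(X >= greens)
    return min(1.0, 2 * min(lower, upper))


def can_still_reject(greens, k, total_trials=50, pi0=0.5, alpha=0.05):
    """After k trials with this many greens, could the final p-value be below alpha?"""
    remaining = total_trials - k
    # The final number of greens lies between greens and greens + remaining.
    return any(
        two_sided_p_value(final_greens, total_trials, pi0) < alpha
        for final_greens in range(greens, greens + remaining + 1)
    )


print(can_still_reject(25, 45))  # False: Robin can stop and accept H0
print(can_still_reject(30, 45))  # True: the last five trials still matter
```

With 25 greens after 45 trials, the final count must lie between 25 and 30, which is too close to 25 for the final p-value to be below 0.05, so the remaining five trials cannot change the conclusion.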

There are also more sophisticated ways of analysing a sequence of trials such as these, which can allow one to reject the null hypothesis earlier if it is wrong. One needs to take account of the above problems, and adjust the calculations of p-values as one goes to ensure that the probability of incorrectly rejecting $H_0$ is still only 5%. This technique is known as sequential analysis, and is very important in modern statistics.
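To give a flavour of the idea, one very simple adjustment is to keep Robin's rule of checking after every trial, but to compare the p-value with a stricter threshold chosen so that the overall probability of wrongly rejecting $H_0$ is at most 5%. The Python sketch below searches for such a threshold. This particular scheme, the grid of candidate thresholds and the function names are all assumptions made for illustration; real sequential analysis uses a range of more refined methods.

```python
from math import comb


def p_values_for_k(k, pi0):
    """Two-sided p-values for every possible number of greens 0..k after k trials."""
    pmf = [comb(k, i) * pi0**i * (1 - pi0) ** (k - i) for i in range(k + 1)]
    lower, total = [], 0.0
    for prob in pmf:
        total += prob
        lower.append(total)          # P(X <= g)
    upper, total = [0.0] * (k + 1), 0.0
    for g in range(k, -1, -1):
        total += pmf[g]
        upper[g] = total             # P(X >= g)
    return [min(1.0, 2 * min(lo, up)) for lo, up in zip(lower, upper)]


def prob_ever_rejecting(max_trials, threshold, pi0=0.5):
    """P(p-value drops below threshold at some point), assuming H0 is true."""
    alive = [1.0]   # alive[g] = P(g greens so far and not yet rejected)
    rejected = 0.0
    for k in range(1, max_trials + 1):
        nxt = [0.0] * (k + 1)
        for g, prob in enumerate(alive):
            nxt[g] += prob * (1 - pi0)
            nxt[g + 1] += prob * pi0
        pvals = p_values_for_k(k, pi0)
        for g in range(k + 1):
            if pvals[g] < threshold:
                rejected += nxt[g]
                nxt[g] = 0.0
        alive = nxt
    return rejected


def largest_safe_threshold(max_trials=50, overall_alpha=0.05, pi0=0.5):
    """Largest threshold on a grid 0.050, 0.049, ..., 0.001 keeping the
    overall Type I error at or below overall_alpha (None if none qualifies)."""
    for thousandths in range(50, 0, -1):
        threshold = thousandths / 1000
        if prob_ever_rejecting(max_trials, threshold, pi0) <= overall_alpha:
            return threshold
    return None


threshold = largest_safe_threshold()
print("stop and reject only if the p-value drops below", threshold)
```

The threshold found is necessarily smaller than 0.05, which is the price paid for being allowed to look at the data after every trial.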

Changing the conditions

If we change the true proportion of green balls and the hypothesised proportion $\pi$ to match it, then we still see similar behaviour to that observed earlier.

If, though, we make the true proportion differ from the hypothesised value, say by having 3 green balls and 2 red balls with $H_0\colon \pi=\frac{1}{2}$ still, we observe that $H_0$ is rejected much more frequently. In my experiments, $H_0$ was rejected 20 times out of 100 in this case. This is good, as in this case we know that $H_0$ is not correct.

The more extreme the difference between the hypothesised $\pi$ and the true proportion, the more frequently $H_0$ is rejected.
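This relationship can be checked without running any experiments by working out exactly how often the final p-value falls below 0.05 for different true proportions. The Python sketch below does this for a test with a fixed number of trials. The function names and the p-value convention (twice the smaller tail, capped at 1) are assumptions made for this example.

```python
from math import comb


def binom_pmf(n, i, p):
    """Probability of exactly i greens in n trials if the proportion is p."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def rejection_region(n=50, pi0=0.5, alpha=0.05):
    """All final numbers of greens for which the two-sided p-value is below alpha."""
    pmf0 = [binom_pmf(n, i, pi0) for i in range(n + 1)]
    region = []
    for g in range(n + 1):
        lower = sum(pmf0[: g + 1])  # P(X <= g) under H0
        upper = sum(pmf0[g:])       # P(X >= g) under H0
        if min(1.0, 2 * min(lower, upper)) < alpha:
            region.append(g)
    return region


def rejection_probability(true_pi, n=50, pi0=0.5, alpha=0.05):
    """Probability of rejecting H0 after n trials when the true proportion is true_pi."""
    return sum(binom_pmf(n, g, true_pi) for g in rejection_region(n, pi0, alpha))


for true_pi in (0.5, 0.6, 0.7, 0.8):
    print(true_pi, rejection_probability(true_pi))
```

Calling `rejection_probability(0.6, n=200)` instead shows the effect of doing more trials when the true proportion is 0.6, as with 3 green balls and 2 red balls.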

Teachers' Resources

Why do this problem?

This problem is designed to help students understand the meaning of hypothesis tests, and especially why it is necessary to fully specify the experiment, including the sample size, before we begin: otherwise our results may be meaningless. There is an important technique called sequential testing which allows one to stop an experiment early while the results remain valid, but significant care must be taken in this situation, as shown by this resource. (Bayesian inference has an alternative approach to this, but that is another story entirely.)

In this resource, we use a binomial test, but the principles are more generally applicable. The solution section provides a more detailed explanation of these ideas.

Possible approach

Students would benefit from having some exposure to hypothesis testing before looking at this simulation. It would also be very helpful for them to have access to the simulation themselves so that they can explore it.

To put the problem in a real-world context as opposed to picking balls from a bag, you could ask students to suggest real-life contexts where we would want to or have to limit the number of trials in an experiment. For example, we could be doing laboratory experiments, and all of the materials involved are expensive. Or we might be trialling a new drug, and it costs a large amount to test it on a person, or there are only a limited number of people with the condition the drug is designed to treat. It might be that this is an experiment on animals, and we wish to limit the number of animals we are working with for ethical reasons. Another reason (which is related to the cost reason) is that each trial takes a large amount of time, perhaps a day or two, so it is not feasible to do very large numbers of trials.

You could then explain that Robin, the experimenter, has suggested a way of saving money, as described in the problem. Your students, as budding statisticians, will need to consider Robin's proposed method, and explain why it is good and will save money, or why it is broken and will potentially give a misleading answer.

Students may require guidance as to how to use the simulation. For example, they could begin with 2 red balls, 2 green balls, $H_0\colon \pi=\frac{1}{2}$ and 50 trials, hide the p-values graph, and just note the proportion of the experiments in which $H_0$ is rejected based on the final p-value. They could then repeat this but note the proportion of the experiments in which the p-value ever drops below 0.05. What does this suggest?

Students could then go on to change some of the parameters in a systematic fashion, exploring whether their initial ideas hold true more generally.

Key questions

Is it necessary to specify the number of trials in advance?
What would happen if we didn't?

Possible extension

Is there any way of stopping the experiment early and still obtaining useful results?
What is the benefit of doing more trials? Surely we would still only reject $H_0$ 5% of the time? You can use the simulation to explore this.

Possible support

There are several things which can be changed in the simulation, and it is easy to get lost. Students will benefit from being systematic, and guiding them to structure their exploration and recording of results will help them to understand what is happening.
