Powerful hypothesis testing
How effective are hypothesis tests at showing that our null hypothesis is wrong?
Robin has a bag containing red and green balls. Robin wants to test the following hypotheses, where $\pi$ is the proportion of green balls in the bag:
$H_0\colon \pi=\frac{1}{2}$ and $H_1\colon \pi\ne\frac{1}{2}$
Robin is allowed to take out a ball at random, note its colour and then replace it: this is called a trial. Robin can do as many trials as desired.
Robin uses the following approach:
"I will do exactly 50 trials. If the p-value* is less than 0.05, then I will reject the null hypothesis at the 5% significance level, otherwise I will accept it."
If the null hypothesis is false, what is the probability that the null hypothesis will be rejected?
You can explore this question with the following simulation.
Warning - the computer needs a little bit of thinking time to do the simulations!
In this simulation, you can:
- specify the number of green and red balls actually in the bag - note that in a real experiment we would not know this!
- specify the number of trials per experiment (up to 200)
- specify the proportion for the null hypothesis (which we took to be $\frac{1}{2}$ above)
- repeat the experiment
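If you do not have the interactive simulation to hand, the same experiment can be sketched in a few lines of Python. This is a minimal sketch, not the resource's own simulation: the function names are ours, and the settings used (a 60% green bag, 50 trials per experiment, 2000 experiments) are purely illustrative defaults.

```python
import random
from math import comb

def p_value(k, n, pi0=0.5):
    """Two-sided p-value as defined in this resource: twice the smaller
    tail probability of getting k greens or a more extreme count, capped at 1."""
    pmf = [comb(n, i) * pi0**i * (1 - pi0)**(n - i) for i in range(n + 1)]
    lower = sum(pmf[: k + 1])  # P(X <= k)
    upper = sum(pmf[k:])       # P(X >= k)
    return min(1.0, 2 * min(lower, upper))

def experiment(n_trials=50, true_pi=0.6, alpha=0.05):
    """One of Robin's experiments: draw with replacement n_trials times,
    then apply the rejection rule. Returns True if H0: pi = 1/2 is rejected."""
    greens = sum(random.random() < true_pi for _ in range(n_trials))
    return p_value(greens, n_trials) < alpha

random.seed(1)  # illustrative seed, for reproducibility
n_experiments = 2000
rejections = sum(experiment() for _ in range(n_experiments))
print(f"H0 rejected in {rejections / n_experiments:.1%} of experiments")
```

Each call to `experiment` applies Robin's rule once; repeating it many times estimates the long-run proportion of experiments in which $H_0$ is rejected.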
Now try changing the settings. Can you predict what will happen as a result of your changes?
Here are some further questions you could consider:
- What is the probability of $H_0$ being rejected?
- If $H_0$ is rejected, how likely is it that the alternative hypothesis $H_1$ is true?
- Do your answers change if the true proportion of greens in the bag changes?
- What would happen if you changed the significance level of the test?
- What would happen if you changed the hypothesised proportion $\pi$?
If Robin wants to be 90% certain of rejecting the null hypothesis if it is wrong, how many trials are needed?
You may want to ask and explore other questions as well.
The probability of correctly rejecting $H_0$ when it is false is called the power of the test. Accepting $H_0$ when it is false is called a Type II error.
* If you want to read about what p-values are, have a look at What is a Hypothesis Test?. In this case, the p-value is calculated like this: after all of the trials, we find twice the probability of obtaining this number of greens or a more extreme number, assuming that $H_0$ is true. For more on the effect of different ways
of choosing the number of trials to perform, see Robin's Hypothesis Testing.
This resource is part of the collection Statistics - Maths of Real Life
It is important to be systematic and record your results as you go. What information will you need to record for each simulation so that you can decide what factors affect the probability of $H_0$ being rejected?
It makes sense to change one factor at a time when exploring how different factors do or don't affect the probability of $H_0$ being rejected.
How many times will you run the experiment before recording the proportion of times that $H_0$ was rejected? Do you need to decide this before you start?
Here are some comments on the questions in the problem (but not full solutions):
What is the probability of $H_0$ being rejected?
Do your answers change if the true proportion of greens in the bag changes?
What would happen if you changed the hypothesised proportion $\pi$?
What would happen if you changed the significance level of the test from 5% to 10% or 1%?
This depends on the proportion in $H_0$, the true proportion, the number of trials and the significance level. We can get evidence from the simulation, or we can work theoretically. In general, we would expect the following: the greater the difference between the value of $\pi$ in $H_0$ and the true proportion, the greater the probability of $H_0$ being rejected (the null hypothesis is "more wrong"); the greater the number of trials, the greater the probability of rejection (the sample proportion is more likely to be close to the true proportion); and the higher the significance level, the greater the probability of $H_0$ being rejected (as we are reducing the range of acceptance).
The probability of rejecting $H_0$ in this problem can be calculated as follows. Let the hypothesised proportion be $\pi_0$ and the true proportion be $\pi_1$. Let $X$ be the number of greens observed after $n$ trials. Under the null hypothesis with significance level $\alpha$ (so typically $\alpha=0.05$), $X\sim \mathrm{B}(n,\pi_0)$, and the null hypothesis will be rejected if $X$ lies in the critical region, which is $X\le x_1$ or $X\ge x_2$, where $x_1$ is the largest integer for which $\mathrm{P}(X\le x_1|H_0)\le \alpha/2$ and $x_2$ is the smallest integer for which $\mathrm{P}(X\ge x_2|H_0)\le \alpha/2$. We can then calculate these probabilities given that $H_1$ is true, so that $X\sim \mathrm{B}(n,\pi_1)$, and deduce that the probability of $H_0$ being rejected is $\mathrm{P}(X\le x_1|H_1)+\mathrm{P}(X\ge x_2|H_1)$. These calculations can easily be performed by computer.
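As a sketch, the calculation just described can be carried out in plain Python (no statistics library is needed; the helper names below are our own):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def critical_region(n, pi0, alpha=0.05):
    """Return (x1, x2): H0 is rejected if X <= x1 or X >= x2,
    with at most alpha/2 of probability in each tail under H0."""
    x1 = max((k for k in range(n + 1) if binom_cdf(k, n, pi0) <= alpha / 2),
             default=-1)
    x2 = min((k for k in range(n + 1) if 1 - binom_cdf(k - 1, n, pi0) <= alpha / 2),
             default=n + 1)
    return x1, x2

def power(n, pi0, pi1, alpha=0.05):
    """P(H0 rejected) when the true proportion of greens is pi1."""
    x1, x2 = critical_region(n, pi0, alpha)
    low = binom_cdf(x1, n, pi1) if x1 >= 0 else 0.0
    high = 1 - binom_cdf(x2 - 1, n, pi1) if x2 <= n else 0.0
    return low + high

print(critical_region(50, 0.5))  # Robin's critical region for 50 trials
print(power(50, 0.5, 0.6))       # chance of rejecting H0 if the bag is 60% green
```

Comparing the exact value of `power` with the proportion of rejections observed in the simulation is a good check on both.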
Note that it is only possible to perform this calculation if we know the actual proportion. But if we know the actual proportion, why are we doing a hypothesis test?! This makes the power of a test a somewhat difficult idea. We could, though, be more specific, and say that we are testing $H_0\colon \pi=0.5$ against $H_1\colon \pi=0.6$, and ask which of these hypotheses is more likely to be true. This is a different way of performing hypothesis testing, which is dealt with in the article [yet to be written].
If $H_0$ is rejected, how likely is it that the alternative hypothesis $H_1$ is true?
A tree diagram will help here: there are two possibilities, $H_0$ is true or $H_1$ is true, and in each case $H_0$ will be either accepted or rejected. Reading the probabilities from the tree diagram, we have
$$\mathrm{P}(\text{$H_1$ true} | \text{$H_0$ rejected}) = \frac{\mathrm{P}(\text{$H_1$ true} \cap \text{$H_0$ rejected})}{\mathrm{P}(\text{$H_1$ true} \cap \text{$H_0$ rejected})+\mathrm{P}(\text{$H_0$ true} \cap \text{$H_0$ rejected})} = \frac{\mathrm{P}(\text{$H_0$ rejected} | \text{$H_1$ true})\mathrm{P}(\text{$H_1$ true})}{\mathrm{P}(\text{$H_0$ rejected} | \text{$H_1$ true})\mathrm{P}(\text{$H_1$ true})+\mathrm{P}(\text{$H_0$ rejected} | \text{$H_0$ true})\mathrm{P}(\text{$H_0$ true})}.$$
But we don't know the majority of probabilities in this calculation! We only know that $\mathrm{P}(\text{$H_0$ rejected} | \text{$H_0$ true})$ is the significance of the test, which we have chosen. So without some idea of how likely it is that $H_1$ is true, and some idea of the probability of rejecting $H_0$ if $H_1$ is true, we cannot say how likely it is that $H_1$ is true even if we reject $H_0$! Likewise, we cannot say how likely it is that $H_0$ is true if we accept it.
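To see how much these unknown probabilities matter, here is the tree-diagram formula evaluated with purely illustrative numbers: the prior probabilities and the power used below are assumptions for the sake of the example, not values given in the problem.

```python
def prob_h1_given_rejection(prior_h1, power, alpha):
    """P(H1 true | H0 rejected), from the tree-diagram formula above."""
    numerator = power * prior_h1
    return numerator / (numerator + alpha * (1 - prior_h1))

# Purely illustrative numbers: a 50-50 prior and a power of 0.8.
print(prob_h1_given_rejection(0.5, 0.8, 0.05))

# With a sceptical prior of P(H1 true) = 0.1, the same rejection
# is much less convincing.
print(prob_h1_given_rejection(0.1, 0.8, 0.05))
```

The point of the comparison is that the same significant result supports $H_1$ to very different degrees depending on how plausible $H_1$ was to begin with.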
If Robin wants to be 90% certain of rejecting the null hypothesis if it is wrong, how many trials are needed?
This again depends on the actual proportion of green balls. If, though, Robin makes an assumption about what the actual proportion might be, we can use the above calculations, trying different values of $n$ until we find one large enough that $\mathrm{P}(X\le x_1|H_1)+\mathrm{P}(X\ge x_2|H_1)>0.9$.
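Assuming, purely for illustration, that the true proportion is 0.6, that search can be sketched as follows. The power calculation is the same binomial one described above, here written with an iteratively built pmf so that scanning values of $n$ one at a time stays fast.

```python
def binom_cdf_list(n, p):
    """Cumulative probabilities P(X <= k), k = 0..n, for X ~ B(n, p),
    built iteratively so repeated calls stay cheap."""
    pmf = [(1 - p) ** n]
    for k in range(1, n + 1):
        pmf.append(pmf[-1] * (n - k + 1) / k * p / (1 - p))
    cdf, total = [], 0.0
    for q in pmf:
        total += q
        cdf.append(total)
    return cdf

def power(n, pi0, pi1, alpha=0.05):
    """P(H0 rejected) for the two-tailed test when the true proportion is pi1."""
    cdf0, cdf1 = binom_cdf_list(n, pi0), binom_cdf_list(n, pi1)
    x1 = max((k for k in range(n + 1) if cdf0[k] <= alpha / 2), default=-1)
    x2 = min((k for k in range(1, n + 1) if 1 - cdf0[k - 1] <= alpha / 2),
             default=n + 1)
    low = cdf1[x1] if x1 >= 0 else 0.0
    high = 1 - cdf1[x2 - 1] if x2 <= n else 0.0
    return low + high

# The power is not quite monotone in n (the test is discrete), so this
# simply finds the first number of trials whose power reaches 0.9.
n = 1
while power(n, 0.5, 0.6) < 0.9:
    n += 1
print(n, power(n, 0.5, 0.6))
```

Note how much larger the required $n$ is than Robin's 50 trials; detecting a small difference in proportions reliably takes many trials.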
Remembering that each trial costs a certain amount, what is the best number of trials to perform? (And what does "best" mean?)
This is a hard question! It depends on what is most important to Robin. It is a balance between getting the "correct" answer, avoiding the "wrong" answer, the cost of the trials, and the actual proportion assumed under the alternative hypothesis.
Why do this problem?
This problem is designed to help students understand that the power of a test depends on a variety of factors. It is thus a far more intricate question than that of handling the significance of a test. It can also lead to an understanding that interpreting the result of a hypothesis test is not straightforward: what does a non-significant result actually mean? Is it that the null hypothesis is true, or that the experiment was simply not powerful enough to discover that it is false? The distinction between these possibilities is crucial in many areas where hypothesis testing is performed: it is too easy to incorrectly assert that the null hypothesis is true (or likely to be true). This links in well with the activity Hypothetical Shorts.
As an extension, it is also possible to work out algebraically the probability of rejecting the null hypothesis if it is false; it is important, though, to also develop a sense of how different factors affect the answer.
In this resource, we use a binomial hypothesis test for the simplicity of description, but the principles are applicable more generally.
Possible approach
Students would benefit from having some exposure to hypothesis testing before looking at this simulation. It would also be very helpful for them to have access to the simulation themselves so that they can explore it.
The problem could be posed in a real-world context as opposed to picking balls from a bag: you could ask students to suggest real-life contexts where we would be interested in distinguishing between two competing hypotheses. For example, we could be trying to find out whether a new drug is better than the standard one, or whether eating certain foods for breakfast or doing a certain amount of exercise improves students' chances of passing a particular test. The former would lead to a decision about whether to use the drug in future, while the latter might affect advice on how best to prepare for tests. Nevertheless, the theoretical ideas are subtle enough that it is probably simpler to work with abstract coloured balls for the actual activity.
You could then explain that Robin, the experimenter, wants to know how likely it is that the experiment will successfully reject the null hypothesis if it is false. (Robin knows that if the null hypothesis is true, it will be rejected with probability 5%, the significance level.)
Students may require guidance as to how to use the simulation. For example, they could begin with the default of 2 red balls, 3 green balls, $H_0\colon \pi=\frac{1}{2}$ and 50 trials, and note the proportion of the experiments in which $H_0$ is rejected after running a large number of experiments. (The simulation provides this figure for students.) They could then do this again with a different proportion of red and green balls and note what changes. It would be good to ask students to make a prediction before they rerun the simulation, and compare their prediction with the actual results.
Students could then go on to change some of the parameters in a systematic fashion and consider the questions provided.
Key questions
- What does a significant result (one with the p-value below 0.05) tell us?
- What factors affect the probability of obtaining a significant result if the null hypothesis is false?
- What does a non-significant result (one with the p-value above 0.05) tell us?
Possible extension
- Can you theoretically work out the probability of obtaining a significant result if the null hypothesis is false?
Possible support
Students will benefit from being systematic when working with the simulation and recording their results as they go. There are several factors involved, and adjusting just one factor at a time is a wise thing to do.
To work out the answer to the question of what a significant result means, students may need prompting to use a tree diagram.