# Stats statements

- Half of the students taking a test score less than the average mark.

- Nobody scores higher than the average mark in a test.

- In a large population of animals, about half of the adult animals are heavier than the average adult weight.

- Suppose that in a game you can only score an even number of points: 0, 2, 10, 50. So, the average score over a series of games is an even number.

- A random process is defined by a certain (unknown) probability distribution. The standard deviation of the random process is not larger than the range of the observed data.

- A random process is defined by a certain probability distribution. The standard deviation of the random process is not larger than half of the maximum theoretical range of the observed data.

- The chance of observing an outcome more than three standard deviations from the mean is less than 1 in 100.

- I repeat an experiment with a random numerical outcome many times. Eventually the average of my outcomes will be within 1% of the theoretical average outcome.

- The chance of observing an outcome more than ten standard deviations from the mean is not more than 1%.

- If two statistical processes are uncorrelated then they must be independent.

You can view these on cards if you like.

*This resource is part of the collection Statistics - Maths of Real Life*

Statistics is full of many powerful results, but also full of many traps for the unwary. One of the main challenges in becoming a successful statistician is to understand how to perform calculations correctly in situations which seem either obvious or confusing. Knowledge of a statistical technique does not necessarily confer knowledge about when that technique can meaningfully be used! Statisticians are very careful in the language used to set up problems; this requires particular care when assessing whether events are likely or not likely to occur.

In very advanced statistics, mathematicians use the fascinating concept of 'almost surely'. An event will not occur 'almost surely' if the event 'could' occur in principle, but the probability of the event happening is exactly zero. Whilst this might sound impossible, it can occur in a situation in which there is a continuum of possibilities. To visualise this, imagine throwing an infinitely fine dart at a number line. What is the chance of hitting the exact value of $\pi$? Could you hit $\pi$ in principle?

These questions are not necessarily designed to have a 'right' or 'wrong' answer -- there are various shades of grey. The purpose is to get you thinking about the statistics involved. You can do this at any level of statistics (ignore any parts which seem too complicated).

Can you think of concrete examples (which may or may not be simple) where the statements in the questions do or do not happen?

Russell from Willenhall School Sports College gave answers to five of the parts of this problem using a good mix of examples and results from distributions. Other contributions came from anonymous solution submitters and from teachers attending the Goldman Sachs Teacher Inspiration Day.

1) This doesn't have to be true. For example, in the set of results $0,0,58,72,51,63,60,56$ only $2$ out of $8$ got less than the average mark of $45$, because of the two extreme cases: the two people who put their name on the paper and then left! It is true if the results are normally (or symmetrically) distributed. The less symmetrical the distribution, the less likely that half the students will be under average.

This is usually true when lots of people take a test and the results are symmetrically distributed about the mean (like the normal distribution). It is not usually true when the results are skewed by large outliers for some reason.
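The example marks above can be checked with a short Python sketch. The variable names are illustrative only; the marks are the ones quoted in the solution.

```python
from statistics import mean, median

# Marks quoted in the solution above
marks = [0, 0, 58, 72, 51, 63, 60, 56]

average = mean(marks)
below = [m for m in marks if m < average]

print(f"mean = {average}, median = {median(marks)}")
print(f"{len(below)} of {len(marks)} marks are below the mean")
```

The median is always a value that splits the marks in half, which is why the two zero marks pull the mean well below it.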

2) This is always false unless everyone gets exactly the same mark.

3) The population is large, the question only says 'about half', and weights of adults are likely to be normally distributed. So the result is likely to be true.

4) The total score over $N$ games will be an even number. But the average might be even or odd, or not a whole number at all. For example, scoring $0$ and $10$ over $2$ games gives an average of $5$. Scoring $2$ and $10$ over $2$ games gives an average of $6$. Scoring $0$, $0$ and $2$ over $3$ games gives an average of $\frac{2}{3}$.
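As a rough check, the averages of every pair of permitted scores ($0$, $2$, $10$, $50$, as listed in the statement) can be enumerated. This is only a sketch; the helper `describe` is a made-up name for illustration.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

allowed = [0, 2, 10, 50]  # scores permitted by the statement

def describe(value):
    """Classify an exact average as even, odd, or not a whole number."""
    if value.denominator != 1:
        return "not a whole number"
    return "even" if value % 2 == 0 else "odd"

# Exact average for every unordered pair of permitted scores
pair_averages = {
    pair: Fraction(sum(pair), len(pair))
    for pair in combinations_with_replacement(allowed, 2)
}
for pair, avg in pair_averages.items():
    print(pair, avg, describe(avg))

# Three games can give an average that is not a whole number
three_games = Fraction(0 + 0 + 2, 3)
print((0, 0, 2), three_games, describe(three_games))
```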

5) This is sometimes true. For example, when rolling a fair die the standard deviation is $\sqrt{\frac{35}{12}} \approx 1.71$. I could roll the die three times and get $3, 4, 4$. This has a range of $1$, which is less than $1.71$. It can also obviously be false. For the example of the roll of a die you are very likely to observe a range larger than the standard deviation.

For a normal $N(0,1)$ distribution, the probability of a random variable $X$ being within half a standard deviation of the mean is

$$P(-0.5< X< 0.5) = \Phi(0.5) -\Phi(-0.5) =0.69-0.31=0.38$$

The chance of 3 results all occurring in this range is $0.38^3 \approx 0.05$. From this we can see that there is a small chance that 3 or more results will lie within 1 standard deviation of each other. (This does not show it directly, because we could, in a very unlikely set of results, draw 3 numbers far from the mean which just happen to be close to each other.)

We think that this helps to show that in almost all situations it is very unlikely that 3 or more randomly generated numbers are within 1 standard deviation of each other.
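For the die example, the chance that three rolls have a range smaller than the standard deviation can be counted exactly by listing all $216$ outcomes. The sketch below assumes a fair six-sided die and three rolls, as in the example above.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

faces = range(1, 7)
sd = sqrt(35 / 12)  # standard deviation of one fair die roll, about 1.71

# All 6^3 equally likely outcomes of three rolls
outcomes = list(product(faces, repeat=3))

# Outcomes whose observed range is smaller than the standard deviation
small_range = [o for o in outcomes if max(o) - min(o) < sd]

prob = Fraction(len(small_range), len(outcomes))
print(f"P(range of 3 rolls < sd) = {prob}")
```

With more rolls the observed range can only grow, so this probability shrinks further.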

6) This is definitely true for distributions like normal where the range of possible values is infinite. Let's look at a different distribution. For a binomial distribution $B(N, p)$ the variance is $Np(1-p)$. With a binomial distribution the smallest possible outcome is $0$ and the largest is $N$. So the theoretical maximum range is $N$. The result is true for a binomial $B(N, p)$ if

$$\sqrt{Np(1-p)}\leq \frac{1}{2}N$$

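Squaring both sides, this is equivalent to $p(1-p)\leq \frac{N}{4}$. Since $p(1-p)\leq \frac{1}{4}$ and $N\geq 1$, this always holds, with equality only in the special case when $N=1$ and $p=0.5$. For a die, half the range is $2.5$, which is bigger than the standard deviation of about $1.71$. So it seems that the result is true, and the standard deviation can equal half the range only under very special circumstances. In fact, for any distribution whose values lie in a bounded range, the variance is at most a quarter of the squared range.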
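A quick numerical check of the binomial inequality $\sqrt{Np(1-p)}\leq \frac{1}{2}N$ over a grid of values is sketched below. The grid ($N$ from $1$ to $50$, $p$ in steps of $0.01$) is an arbitrary choice for illustration, not a proof.

```python
from math import sqrt

def binomial_sd(n, p):
    """Standard deviation of a binomial B(n, p) distribution."""
    return sqrt(n * p * (1 - p))

# Gap between half the theoretical range (n / 2) and the standard deviation
gaps = {}
for n in range(1, 51):
    for i in range(0, 101):
        p = i / 100
        gaps[(n, p)] = n / 2 - binomial_sd(n, p)

smallest = min(gaps, key=gaps.get)
print("smallest gap", gaps[smallest], "at (N, p) =", smallest)
print("gap never negative:", all(g >= 0 for g in gaps.values()))
```

On this grid the gap is never negative, and it reaches zero only at $N=1$, $p=0.5$.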

7) Chebyshev's inequality says that the probability that a random number is more than $k$ standard deviations from the mean is not more than $\frac{1}{k^2}$. So, in this case the probability is at most $\frac{1}{9}$. This bound is larger than 1 in 100, and some distributions do have a tail probability above 1 in 100, so the result is sometimes false. For the special case of a normal distribution, the chance of being more than $3$ standard deviations from the mean is about $0.0027$. So, the result is true for normal distributions.
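To see that the statement can fail, here is a sketch using a made-up three-point distribution (chosen purely for illustration, not taken from the submitted solutions) in which outcomes more than three standard deviations from the mean occur 1 time in 10.

```python
from fractions import Fraction
from math import sqrt

# Hypothetical distribution: mostly 0, occasionally -1 or +1
dist = {-1: Fraction(1, 20), 0: Fraction(9, 10), 1: Fraction(1, 20)}

mean = sum(x * p for x, p in dist.items())
variance = sum((x - mean) ** 2 * p for x, p in dist.items())
sd = sqrt(variance)  # about 0.316, so 3 * sd is about 0.949

# Probability of landing more than 3 standard deviations from the mean
tail = sum(p for x, p in dist.items() if abs(x - mean) > 3 * sd)

print(f"mean = {mean}, sd = {sd:.3f}, P(more than 3 sd away) = {tail}")
```

The tail probability here is above 1 in 100 but still below Chebyshev's bound of $\frac{1}{9}$.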

8) This is always true by the law of large numbers, assuming that the average outcome is defined. (The precise statement of the law of large numbers is somewhat technical, but in most everyday cases this is true.)
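The law of large numbers can be watched in action with a small simulation. This sketch assumes a fair die (theoretical average $3.5$) and uses a fixed seed so that runs are repeatable; the function name and the number of rolls are arbitrary choices.

```python
import random

def running_average_of_die_rolls(n_rolls, seed=0):
    """Return the running average after each of n_rolls fair die rolls."""
    rng = random.Random(seed)
    total = 0
    averages = []
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)
        averages.append(total / i)
    return averages

averages = running_average_of_die_rolls(200_000)
theoretical = 3.5

# Relative error of the running average at a few checkpoints
for n in (10, 100, 1_000, 10_000, 200_000):
    avg = averages[n - 1]
    print(n, round(avg, 4), abs(avg - theoretical) / theoretical)
```

The relative error typically wanders at first and then settles well inside 1%, although any single run may behave differently early on.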

9) This is always the case, using Chebyshev's inequality with $k=10$, which gives a bound of $\frac{1}{100}$. For a normal distribution, the probability of being more than 10 standard deviations from the mean is about $1.5\times 10^{-23}$. So, for most distributions it is really, really, really likely that the sample is within 10 standard deviations of the mean.
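The Chebyshev bound and the exact normal tail probability can be compared side by side. This sketch uses the standard identity $P(|Z|>k) = \operatorname{erfc}(k/\sqrt{2})$ for a standard normal $Z$; the function names are illustrative.

```python
from math import erfc, sqrt

def chebyshev_bound(k):
    """Upper bound on P(more than k standard deviations from the mean)."""
    return 1 / k**2

def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard normal random variable Z."""
    return erfc(k / sqrt(2))

for k in (3, 10):
    print(k, chebyshev_bound(k), normal_two_sided_tail(k))
```

Chebyshev's bound holds for every distribution with a finite standard deviation, which is why it is so much weaker than the normal figure.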

10) Although this sounds like it ought to be true, it is not. This counterexample shows why. The correlation between two random variables $X$ and $Y$ with standard deviations $\sigma_X$ and $\sigma_Y$ is

$$\frac{E(XY)-E(X)E(Y)}{\sigma_X\sigma_Y}$$

So, this is zero if and only if $E(XY) = E(X)E(Y)$.

Consider rolling a die twice. Let $A$ and $B$ be the result in each case. Then make two new random variables $X=A+B$ and $Y=A-B$. Then $E(XY) = E((A+B)(A-B)) = E(A^2-B^2) = E(A^2) - E(B^2)$. Since $A$ and $B$ are identically distributed, we see that $E(XY)=0$. Also, it is easy to see that $E(Y)=0$, so $E(X)E(Y)=0$ as well. So, the two random variables $X$ and $Y$ have correlation zero. However, they are clearly dependent: for example, if $X=12$ then both rolls were $6$, so $Y$ must be $0$.

So we have shown that zero correlation does not imply independence, although independence DOES imply zero correlation.
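The two-dice counterexample can be verified exactly by listing all $36$ equally likely rolls. In this sketch the helpers `expect` and `prob` are illustrative names, and the event $X=12$, $Y=0$ is just one convenient choice for showing dependence.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely results (A, B) of rolling a fair die twice
rolls = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(rolls))

def expect(f):
    """Exact expectation of f(A, B) over the two rolls."""
    return sum(f(a, b) * p for a, b in rolls)

def prob(event):
    """Exact probability that event(A, B) is true."""
    return sum(p for a, b in rolls if event(a, b))

# X = A + B and Y = A - B
e_x = expect(lambda a, b: a + b)
e_y = expect(lambda a, b: a - b)
e_xy = expect(lambda a, b: (a + b) * (a - b))
covariance = e_xy - e_x * e_y
print("E(XY) - E(X)E(Y) =", covariance)

# Independence would require P(X=12 and Y=0) = P(X=12) * P(Y=0)
p_x12 = prob(lambda a, b: a + b == 12)
p_y0 = prob(lambda a, b: a - b == 0)
p_both = prob(lambda a, b: a + b == 12 and a - b == 0)
print("P(X=12 and Y=0) =", p_both, "but P(X=12)P(Y=0) =", p_x12 * p_y0)
```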

### Why do this problem?

This task will get students into statistics without them necessarily needing to engage in detailed calculation (although some students, depending on their level, might wish to calculate various statistical quantities). It will lead to a better intuitive grasp of statistics, which will inevitably lead to a better grasp of various statistical techniques. The questions are approximately listed in order of difficulty.

### Possible approach

*You may wish to use the problem Statistical Shorts alongside or before this one.*

This problem is very well suited to discussion. As there are no numbers, it is very easy for all students to get into this problem at a level which suits them. One approach might be to ask for an immediate, instinctive response to the questions before asking them to assess them in more detail. Were their gut-feelings right or wrong? Are there any surprises?

To give a sound analysis will require some quantification of the concepts of the key words: 'sometimes', 'always', or 'never'.

This question gives an opportunity to explore the power of counter-examples in a mathematical analysis: for example, constructing a single example in which 'Half of the students taking a test DON'T score less than the average mark' shows that the statement 'Half of the students taking a test score less than the average mark' cannot ALWAYS be true.

Assessing the meanings of 'sometimes' and 'nearly always' will be more open to discussion. This could easily lead to discussions of normal distributions, statistical testing and confidence limits.

It would be good to use this problem at an early (intuitive) stage in the study of statistics and then revisit it towards the end of a course of statistics (once computation skills are developed). Comparison of answers at these two stages would be an interesting exercise.

It is important to note that this problem is likely to raise many questions (such as the meaning of the word 'average'). All questions are valid, and exploring the issues raised will lead to a stronger intuitive understanding of statistics, which can only have a positive impact on the subsequent learning of more formal statistical techniques.