*Why is this likely to be an ineffective survey question?*

This question will probably be answered truthfully by someone who *does* brush their teeth daily. Someone who does not brush their teeth daily, however, may or may not tell the truth, as in many parts of society, it is expected that people will brush their teeth at least daily.

*Will this allow the survey company to work out the approximate proportion of people who brush their teeth every day? You could try this case: the company surveys 1200 people; 500 answer "Yes" and 700 answer "No".*

| | Answer "Yes" | Answer "No" | Total |
|---|---|---|---|
| Truth | (teeth-brushers) | (non-brushers) | 400 |
| Lie | (non-brushers) | (teeth-brushers) | 800 |
| Total | 500 | 700 | 1200 |

(We could alternatively use a two-way table with the columns being "teeth-brusher" and "not teeth-brusher"; we would then have to think carefully about where the "500" and "700" fit.)

For this table, we now have to fill in the four missing numbers. Once we have decided how many people truthfully answered "Yes", we can fill in the remaining three numbers. But how do we decide on this number?

The answer is that we would expect about $\frac{2}{6}$ of teeth-brushers to tell the truth and also $\frac{2}{6}$ of non-teeth-brushers to do so. So $\frac{2}{6}$ of the teeth-brushers will be in the top-left corner of this table and $\frac{4}{6}$ in the bottom-right. A similar thing applies to the non-teeth-brushers.

Let's say there are $B$ teeth-brushers and $A$ non-teeth-brushers. ($A$ could stand for something; we have chosen not to use $N$ for "non-teeth-brusher", as this might be confused with "number of people who answer 'No'" or "total number of people". We could also use something like $\bar B$, $B^c$ or $B'$ to indicate the complement of $B$.)

As there are 1200 people in total, $B+A=1200$. Our table now looks like this:

| | Answer "Yes" | Answer "No" | Total |
|---|---|---|---|
| Truth | $\frac{2}{6}B$ | $\frac{2}{6}A$ | 400 |
| Lie | $\frac{4}{6}A$ | $\frac{4}{6}B$ | 800 |
| Total | 500 | 700 | 1200 |

So we now have three equations, looking at the columns and using what we have just said:

$$\begin{align*}
\tfrac{2}{6}B + \tfrac{4}{6}A &= 500\\
\tfrac{4}{6}B + \tfrac{2}{6}A &= 700\\
B+A &= 1200
\end{align*}$$

The third equation is the sum of the other two, so we can choose any two of these equations to find $B$ and $A$.

If we multiply the first equation by 3, we get $B+2A=1500$, so $A=300$ (using the third equation) and $B=900$. Therefore the proportion of teeth-brushers is $\frac{900}{1200}=\frac{3}{4}$.

So this works - we have found out the (approximate) proportion of teeth-brushers. We could use this approach whatever the numbers of people who said "Yes" and "No".
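The arithmetic of this worked example can be checked with a short Python sketch. The variable names below are illustrative choices rather than part of the original solution, and `Fraction` is used so that every value stays exact.

```python
from fractions import Fraction

# Numbers from the worked example above.
n, yes, no = 1200, 500, 700
p_truth, p_lie = Fraction(2, 6), Fraction(4, 6)

# "Yes" column: p_truth*B + p_lie*A = yes, with A = n - B substituted in.
B = (yes - p_lie * n) / (p_truth - p_lie)  # teeth-brushers
A = n - B                                  # non-teeth-brushers

# The four cells of the two-way table.
truth_yes, truth_no = p_truth * B, p_truth * A
lie_yes, lie_no = p_lie * A, p_lie * B

print(B, A)                 # 900 300
print(truth_yes, truth_no)  # 300 100
print(lie_yes, lie_no)      # 200 600
```

The row sums are 400 and 800 and the column sums are 500 and 700, matching the totals in the table.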

We could now generalise this. Let's say that there are $n$ people in total: $Y$ people say "Yes" and the rest say no, that is, $n-Y$ people say "No". (Using $N$ for the total number of people might lead to confusion with the number of people saying "No".) Then our table looks like this:

| | Answer "Yes" | Answer "No" | Total |
|---|---|---|---|
| Truth | $\frac{2}{6}B$ | $\frac{2}{6}A$ | $\frac{2}{6}n$ |
| Lie | $\frac{4}{6}A$ | $\frac{4}{6}B$ | $\frac{4}{6}n$ |
| Total | $Y$ | $n-Y$ | $n$ |

This gives the equations

$$\begin{align*}
\tfrac{2}{6}B + \tfrac{4}{6}A &= Y\\
\tfrac{4}{6}B + \tfrac{2}{6}A &= n-Y\\
B+A &= n
\end{align*}$$

Again, multiplying the first of these equations by 3 gives $B+2A=3Y$, so subtracting the third equation gives us $A=3Y-n$, so $B=2n-3Y$. Hence the proportion of people who are teeth-brushers is $\frac{B}{n}=\frac{2n-3Y}{n}$.

So we can work out the (approximate) proportion of teeth-brushers, even though we have asked most people to lie!
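As a quick sanity check (not part of the original solution), substituting the numbers from the earlier example, $n=1200$ and $Y=500$, into the formula $\frac{B}{n}=\frac{2n-3Y}{n}$ recovers the same answer:

$$\frac{B}{n}=\frac{2\times 1200-3\times 500}{1200}=\frac{900}{1200}=\frac{3}{4}.$$

The formula only gives a sensible proportion (between 0 and 1) when $\frac{n}{3}\le Y\le\frac{2n}{3}$. A value of $Y$ outside this range would suggest that random variation, or respondents not following the instructions, has affected the count.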

(A subtle point is that we might sometimes get fractional values for $\frac{2}{6}A$ and so on, which is not physically possible. But this is only an approximation, as we do not know exactly how many people rolled a 5 or a 6, so we should not be concerned about this. The same applies if we change the probability of answering truthfully: we may then get fractional values for $A$ and $B$ as well.)

*What would be different if the question read: "If you rolled a 6, please answer the next question honestly..."?*

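In this case, we would expect only about $\frac{1}{6}$ of people to tell the truth and $\frac{5}{6}$ to lie, so the equations become

$$\begin{align*}
\tfrac{1}{6}B + \tfrac{5}{6}A &= Y\\
\tfrac{5}{6}B + \tfrac{1}{6}A &= n-Y\\
B+A &= n
\end{align*}$$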

so multiplying the first equation by 6 gives $B+5A=6Y$; now subtracting the third equation gives $4A=6Y-n$, so $A=\frac{6Y-n}{4}$ and $B=\frac{5n-6Y}{4}$.

However, in this case, it is somewhat more likely that people will not follow the instructions, as the chances are high that the die will not land on a 6; some people may therefore decide to tell the truth rather than lie, especially if they are a non-teeth-brusher. This may well distort the results.
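The two dice rules differ only in the probability of answering truthfully, so both can be handled by one small function. The following is a minimal sketch, assuming that everyone follows the instructions; the name `estimate_brushers` and the parameter `p_truth` are hypothetical and do not come from the original problem.

```python
from fractions import Fraction

def estimate_brushers(n, yes, p_truth):
    """Estimate the number of teeth-brushers B from the "Yes" count.

    Solves p*B + (1 - p)*A = yes together with B + A = n, where p is the
    (assumed known) probability of answering truthfully.
    """
    p = Fraction(p_truth)
    # The denominator 2p - 1 vanishes when p = 1/2.
    if p == Fraction(1, 2):
        raise ValueError("p_truth must not equal 1/2")
    # Substituting A = n - B gives (2p - 1) * B = yes - (1 - p) * n.
    return (yes - (1 - p) * n) / (2 * p - 1)

n, yes = 1200, 500
print(estimate_brushers(n, yes, Fraction(2, 6)))  # 900, agrees with 2n - 3Y
print(estimate_brushers(n, yes, Fraction(1, 6)))  # 750, agrees with (5n - 6Y)/4
```

The same value of $Y$ is reused for both rules purely to exercise the two formulas; in a real survey the two rules would be expected to produce different "Yes" counts.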

*What would be different if the question read: "Secretly flip a coin. If you got heads, please answer the next question honestly, otherwise please LIE"?*

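Now we would expect about half of the people to tell the truth and half to lie, so the equations become

$$\begin{align*}
\tfrac{1}{2}B + \tfrac{1}{2}A &= Y\\
\tfrac{1}{2}B + \tfrac{1}{2}A &= n-Y\\
B+A &= n
\end{align*}$$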

Multiplying the first equation by 2 gives $B+A=2Y$, which is approximately the same as the final equation, though $2Y$ may well not equal $n$ exactly due to random effects. So we cannot subtract the third equation to find an expression for $A$ or $B$.

We see that this approach does not give us any useful information: approximately half of respondents will say "Yes" and half will say "No", irrespective of how many are teeth-brushers.
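This can be illustrated by computing the expected number of "Yes" answers for several different numbers of teeth-brushers. The sketch below assumes that everyone follows the instructions; the function name `expected_yes` is an illustrative choice, not part of the original problem.

```python
from fractions import Fraction

def expected_yes(n, brushers, p_truth):
    """Expected number of "Yes" answers if everyone follows the instructions."""
    non_brushers = n - brushers
    # Brushers say "Yes" when truthful; non-brushers say "Yes" when lying.
    return p_truth * brushers + (1 - p_truth) * non_brushers

n = 1200
for brushers in (0, 300, 900, 1200):
    coin = expected_yes(n, brushers, Fraction(1, 2))
    die = expected_yes(n, brushers, Fraction(2, 6))
    print(brushers, coin, die)
# 0 600 800
# 300 600 700
# 900 600 500
# 1200 600 400
```

With the coin, the expected "Yes" count is 600 whatever the number of teeth-brushers, whereas with the "5 or 6" dice rule it changes as the number of teeth-brushers changes, which is what makes the estimate possible.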

*How effective do you think this method would be? How could you work it out?*

The effectiveness of this technique may well depend on the exact question being asked.

We could potentially work out how likely people are to follow the instructions by gathering data in a different way and then comparing the two sets of results. In this case, for example, some dentists could be asked to judge, over the course of a week, how many of the patients they see that week are regular teeth-brushers. This, though, is a biased sample, as many people do not visit a dentist on a regular basis.

We have also not worked out how the margin of error would be affected by using this method, but we could do this in principle, either theoretically or by simulation, assuming that everyone follows the instructions. To get a reasonable estimate of the proportion of the population who are teeth-brushers, this method might require a much larger sample than simply asking a direct question, but the results are likely to be less affected by people giving the socially expected answer.
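A simulation of this kind might look like the sketch below. It rests on several assumptions that are not in the original problem: a true proportion of 0.75 teeth-brushers, a sample of 1200 people, every respondent following the "5 or 6" dice instructions exactly, and (unrealistically) every respondent answering a direct question honestly. All function names are illustrative.

```python
import random
import statistics

def simulate_direct(n, true_prop, rng):
    """Direct question, assuming (unrealistically) that everyone is honest."""
    yes = sum(rng.random() < true_prop for _ in range(n))
    return yes / n

def simulate_dice(n, true_prop, rng):
    """Dice method: truthful on a 5 or 6 (probability 2/6), otherwise lie."""
    yes = 0
    for _ in range(n):
        brusher = rng.random() < true_prop
        truthful = rng.randint(1, 6) >= 5
        # A truthful brusher or a lying non-brusher answers "Yes".
        if brusher == truthful:
            yes += 1
    return (2 * n - 3 * yes) / n  # the estimate of B/n derived earlier

def compare(n=1200, true_prop=0.75, trials=500, seed=1):
    """Repeat both surveys many times; return mean and spread of each estimate."""
    rng = random.Random(seed)
    direct = [simulate_direct(n, true_prop, rng) for _ in range(trials)]
    dice = [simulate_dice(n, true_prop, rng) for _ in range(trials)]
    return (statistics.mean(direct), statistics.stdev(direct),
            statistics.mean(dice), statistics.stdev(dice))

mean_direct, sd_direct, mean_dice, sd_dice = compare()
print(f"direct: mean {mean_direct:.3f}, sd {sd_direct:.4f}")
print(f"dice:   mean {mean_dice:.3f}, sd {sd_dice:.4f}")
```

Under these assumptions, both estimates should centre on 0.75, but the dice estimate should be noticeably more spread out. Treating the "Yes" count as binomial suggests standard deviations of roughly 0.0125 and 0.043 respectively, which would mean a sample around twelve times larger is needed for the dice method to match the precision of an honest direct question. The comparison says nothing about the bias of a direct question when people do not answer honestly.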