Do you brush your teeth every day?
How can we find out answers to questions like this if people often lie?
Sometimes we want to gather data about embarrassing, socially questionable or illegal behaviours, where people may not feel comfortable telling the truth. In this problem, we explore one way of tackling this.
A survey company wants to find out what proportion (or percentage) of people brush their teeth every day.
Do you brush your teeth every day? Tick your answer: Yes ____ No _____

Why might this be an ineffective survey question?
The survey company decides to try something different. They give the interviewee a fair 6-sided die and ask them to roll it without showing the interviewer.
They then ask the following question, explaining that the interviewer will not be able to know whether or not their answer is honest:
If you rolled a 5 or 6, please answer the next question honestly, otherwise please LIE. Do you brush your teeth every day? Yes ____ No _____

 Will this allow the survey company to work out the approximate proportion of people who brush their teeth every day?
You could try this case: the company surveys 1200 people; 500 answer "Yes" and 700 answer "No".
 What would be different if the question read: "If you rolled a 6, please answer the next question honestly..."?
 What would be different if the question read: "Secretly flip a coin. If you got heads, please answer the next question honestly, otherwise please LIE"?
 How effective do you think this method would be? How could you work it out?
This is just one approach to obtaining data about difficult topics. There are many other approaches used by professional statisticians, including the Capture-Recapture method explored in Counting Fish, where different types of survey or other data sources are compared to try to fill in gaps and to improve the quality of the collected data.
This resource is part of the collection Statistics - Maths of Real Life
You might find it helpful to draw a two-way table, a Venn diagram or a tree diagram, in order to work out the proportions and any missing numbers.
For simplicity, assume that exactly $\frac{2}{6}$ of the die rolls land on a 5 or a 6.
If $T$ participants actually brush their teeth daily, how many of them tell the truth?
 Why is this likely to be an ineffective survey question?
This question will probably be answered truthfully by someone who does brush their teeth daily. Someone who does not brush their teeth daily, however, may or may not tell the truth, as in many parts of society, it is expected that people will brush their teeth at least daily.
 Will this allow the survey company to work out the approximate proportion of people who brush their teeth every day? You could try this case: the company surveys 1200 people; 500 answer "Yes" and 700 answer "No".
|  | Answer "Yes" | Answer "No" | Total |
|---|---|---|---|
| Truth | (teeth-brushers) | (non-brushers) | 400 |
| Lie | (non-brushers) | (teeth-brushers) | 800 |
| Total | 500 | 700 | 1200 |
(We could alternatively use a two-way table with the columns being "teeth-brusher" and "not teeth-brusher"; we would then have to think carefully about where the "500" and "700" fit.)
For this table, we now have to fill in the missing four numbers. Once we have decided how many people truthfully answer "Yes", we can fill in the remaining three numbers. But how do we decide on this number?
The answer is that we would expect about $\frac{2}{6}$ of teeth-brushers to tell the truth and also $\frac{2}{6}$ of non-teeth-brushers to do so. So $\frac{2}{6}$ of the teeth-brushers will be in the top-left corner of this table and $\frac{4}{6}$ in the bottom-right. A similar thing applies to the non-teeth-brushers.
Let's say there are $B$ teeth-brushers and $A$ non-teeth-brushers. ($A$ could stand for something; we have chosen not to use $N$ for "non-teeth-brusher", as this might be confused with "number of people who answer 'No'" or "total number of people". We could also use something like $\bar B$, $B^c$ or $B'$ to indicate the complement of $B$.)
As there are 1200 people in total, $B+A=1200$. Our table now looks like this:
|  | Answer "Yes" | Answer "No" | Total |
|---|---|---|---|
| Truth | $\frac{2}{6}B$ | $\frac{2}{6}A$ | 400 |
| Lie | $\frac{4}{6}A$ | $\frac{4}{6}B$ | 800 |
| Total | 500 | 700 | 1200 |
So we now have three equations, looking at the columns and using what we have just said:
$$\begin{align*}
\tfrac{2}{6}B + \tfrac{4}{6}A &= 500\\
\tfrac{4}{6}B + \tfrac{2}{6}A &= 700\\
B+A &= 1200
\end{align*}$$
The third equation is the sum of the other two, so we can choose any two of these equations to find $B$ and $A$.
If we multiply the first equation by 3, we get $B+2A=1500$, so $A=300$ (using the third equation) and $B=900$. Therefore the proportion of teeth-brushers is $\frac{900}{1200}=\frac{3}{4}$.
So this works: we have found out the (approximate) proportion of teeth-brushers. We could use this approach whatever the numbers of people who said "Yes" and "No".
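As a quick check of the worked example, the sketch below rebuilds the four cells of the two-way table from $B=900$ and $A=300$ and confirms that the columns add up to the observed 500 and 700. The function name `two_way_table` and the dictionary layout are illustrative choices, not part of the original problem, and the code makes the same simplifying assumption that exactly $\frac{2}{6}$ of each group tells the truth.

```python
from fractions import Fraction


def two_way_table(brushers, non_brushers, p_truth=Fraction(2, 6)):
    """Expected cell counts, assuming exactly p_truth of each group is honest."""
    p_lie = 1 - p_truth
    return {
        ("truth", "yes"): p_truth * brushers,      # honest teeth-brushers
        ("truth", "no"): p_truth * non_brushers,   # honest non-brushers
        ("lie", "yes"): p_lie * non_brushers,      # lying non-brushers
        ("lie", "no"): p_lie * brushers,           # lying teeth-brushers
    }


table = two_way_table(900, 300)
yes_total = table[("truth", "yes")] + table[("lie", "yes")]
no_total = table[("truth", "no")] + table[("lie", "no")]
print(yes_total, no_total)  # 500 700
```

Using `Fraction` keeps the arithmetic exact, so the totals match the survey figures without rounding.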
We could now generalise this. Let's say that there are $n$ people in total: $Y$ people say "Yes" and the rest say "No", that is, $n-Y$ people say "No". (Using $N$ for the total number of people might lead to confusion with the number of people saying "No".) Then our table looks like this:
|  | Answer "Yes" | Answer "No" | Total |
|---|---|---|---|
| Truth | $\frac{2}{6}B$ | $\frac{2}{6}A$ | $\frac{2}{6}n$ |
| Lie | $\frac{4}{6}A$ | $\frac{4}{6}B$ | $\frac{4}{6}n$ |
| Total | $Y$ | $n-Y$ | $n$ |
This gives the equations
$$\begin{align*}
\tfrac{2}{6}B + \tfrac{4}{6}A &= Y\\
\tfrac{4}{6}B + \tfrac{2}{6}A &= n-Y\\
B+A &= n
\end{align*}$$
Again, multiplying the first of these equations by 3 gives $B+2A=3Y$, so subtracting the third equation gives us $A=3Y-n$, so $B=2n-3Y$. Hence the proportion of people who are teeth-brushers is $\frac{B}{n}=\frac{2n-3Y}{n}$.
So we can work out the (approximate) proportion of teeth-brushers, even though we have asked most people to lie!
(A subtle point is that we might sometimes get fractional values for $\frac{2}{6}A$ and so on, which is not physically possible. But this is only an approximation, as we do not know exactly how many people rolled a 5 or a 6, so we should not be concerned about this. The same applies if we change the probability of answering truthfully: we may then get fractional values for $A$ and $B$ as well.)
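The general result for the "5 or 6" version can be written as a small function. This is only a sketch: the name `estimate_from_5_or_6` is made up for illustration, and it relies on the same assumption as the derivation, namely that exactly $\frac{2}{6}$ of respondents answer truthfully.

```python
from fractions import Fraction


def estimate_from_5_or_6(n, yes):
    """Return (B, B/n) for the '5 or 6 means tell the truth' survey.

    Uses B = 2n - 3Y from the derivation; the estimate is approximate
    because the real number of truthful answers varies around 2n/6.
    """
    brushers = 2 * n - 3 * yes
    return brushers, Fraction(brushers, n)


brushers, proportion = estimate_from_5_or_6(1200, 500)
print(brushers, proportion)  # 900 3/4
```

Note that the formula can return a value below 0 or above $n$ when $Y$ is far from what the method expects, which is one more sign that the result is an estimate rather than an exact count.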
 What would be different if the question read: "If you rolled a 6, please answer the next question honestly..."?
$$\begin{align*}
\tfrac{1}{6}B + \tfrac{5}{6}A &= Y\\
\tfrac{5}{6}B + \tfrac{1}{6}A &= n-Y\\
B+A &= n
\end{align*}$$
so multiplying the first equation by 6 gives $B+5A=6Y$; now subtracting the third equation gives $4A=6Y-n$, so $A=\frac{6Y-n}{4}$ and $B=\frac{5n-6Y}{4}$.
However, in this case, it is somewhat more likely that people will not follow the instructions, as the chances are high that the die will not land on a 6; some people may therefore decide to tell the truth rather than lie, especially if they are a non-teeth-brusher. This may well distort the results.
 What would be different if the question read: "Secretly flip a coin. If you got heads, please answer the next question honestly, otherwise please LIE"?
$$\begin{align*}
\tfrac{1}{2}B + \tfrac{1}{2}A &= Y\\
\tfrac{1}{2}B + \tfrac{1}{2}A &= n-Y\\
B+A &= n
\end{align*}$$
Multiplying the first equation by 2 gives $B+A=2Y$, which is approximately the same as the final equation, though $2Y$ may well not equal $n$ exactly due to random effects. So we cannot subtract the third equation to find an expression for $A$ or $B$.
We see that this approach does not give us any useful information: approximately half of respondents will say "Yes" and half will say "No", irrespective of how many are teeth-brushers.
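The three variants differ only in the probability of being asked to tell the truth, so they can be handled by one function. The sketch below calls that probability `p_truth` (a name chosen here, not taken from the problem) and solves $p B + (1-p)(n-B) = Y$ for $B$, giving $B = \frac{Y-(1-p)n}{2p-1}$. It assumes everyone follows the instructions, and it refuses the coin case, where the denominator is zero.

```python
from fractions import Fraction


def estimate_brushers(n, yes, p_truth):
    """Estimate the number of teeth-brushers for a given truth probability."""
    p_truth = Fraction(p_truth)
    if p_truth == Fraction(1, 2):
        # Both columns expect n/2 answers whatever B is, so B is not identifiable.
        raise ValueError("a truth probability of 1/2 gives no information about B")
    return (yes - (1 - p_truth) * n) / (2 * p_truth - 1)


print(estimate_brushers(1200, 500, Fraction(2, 6)))  # 900  (rolled a 5 or 6)
print(estimate_brushers(1200, 500, Fraction(1, 6)))  # 750  (rolled a 6)

try:
    estimate_brushers(1200, 500, Fraction(1, 2))     # fair coin
except ValueError as error:
    print("coin version:", error)
```

The same 500 "Yes" answers lead to different estimates under the two die rules, because the expected mix of truthful and lying answers is different in each case.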
 How effective do you think this method would be? How could you work it out?
It might also be the case that the effectiveness of this technique depends on the exact question being asked.
We could potentially work out how likely people are to follow the instructions by using a different data-gathering technique and then comparing the results of the two. In this case, for example, some dentists could be asked to make a judgement over the course of a week on how many of their patients that week are regular teeth-brushers. This, though, is a biased sample, as many people do not visit a dentist on a regular basis.
We have also not worked out how the margin of error would be affected by using this method, but we could do this in principle, either theoretically or by simulation, assuming that everyone follows the instructions. Getting a reasonable estimate of the proportion of the population who are teeth-brushers might require a much larger sample size using this method than just asking a direct question, but the results are likely to be better.
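One way to explore the margin of error by simulation is sketched below. Everything specific here is an assumption made for illustration: a true proportion of 0.75, a sample of 1200, 300 repeated surveys, the function names, and above all the idealised behaviour that every respondent follows the instructions (and, in the direct-question baseline, that everyone answers honestly).

```python
import random
import statistics


def randomized_estimate(n, true_prop, p_truth, rng):
    """Simulate one die-based survey and return the estimated proportion."""
    yes = 0
    for _ in range(n):
        brushes = rng.random() < true_prop   # is this person a teeth-brusher?
        honest = rng.random() < p_truth      # did the die tell them to be honest?
        # Honest respondents report the truth; the rest report the opposite.
        yes += brushes if honest else not brushes
    # Solve p*B + (1 - p)*(n - B) = yes for B, then divide by n.
    return (yes - (1 - p_truth) * n) / ((2 * p_truth - 1) * n)


def direct_estimate(n, true_prop, rng):
    """Idealised baseline: a direct question that everyone answers honestly."""
    return sum(rng.random() < true_prop for _ in range(n)) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 300
randomized = [randomized_estimate(1200, 0.75, 2 / 6, rng) for _ in range(trials)]
direct = [direct_estimate(1200, 0.75, rng) for _ in range(trials)]

print("die method:      mean", statistics.mean(randomized),
      "sd", statistics.stdev(randomized))
print("direct question: mean", statistics.mean(direct),
      "sd", statistics.stdev(direct))
```

Comparing the two standard deviations gives a feel for how much extra sampling variation the die introduces. The baseline is deliberately unrealistic, since the whole point of the method is that people do not answer a direct question honestly.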
Why do this problem?
This problem offers an opportunity to use tree diagrams or two-way tables to analyse a useful survey technique. Some moderately sophisticated reasoning is required to find the proportions involved. The results may also surprise and intrigue students.
The issue of people answering opinion polls dishonestly may also have been a factor in some recent pre-election surveys, where the polling results indicated significantly different predictions from the final results: some people may have been embarrassed to admit that they were voting for a particular person, party or position and so claimed they would be voting for a different one. This could provide an interesting discussion point about the real-life relevance of the techniques discussed in the problem.
Possible approach
You could use the context in the problem or change the context to something else appropriately embarrassing. You could start by explaining: "We're going to be looking at one of the challenges of performing surveys on difficult topics. And to get some understanding of the problem, we'll do an example survey." Then perform the survey for real with your students: ask them the question, ask them to secretly write Y or N on a slip of paper and fold it, and then collect them into a box.
Then ask the students to secretly and honestly answer the question "Did you tell the truth? Write T for Truth and L for Lie." Again, they write their answer on a slip of paper which they fold and then collect.
You can then count and announce the number of Yes and No responses, and also the number of Truth and Lie responses. (An alternative version, closer to many types of real survey, would be to ask them to also write their name on the original response.
After collecting the T/L responses, you could then publicly shred the original responses without looking at them, to avoid actual embarrassment.)
There is then a chance to discuss the question of how we could work out the true proportion of people who brush their teeth every day (or whatever your question was) when so many people will lie about it; it will also be far worse if they are being asked by an interviewer in person.
It is worth saying at the start that you plan to share one specific technique later in the lesson, but it is far from being the only one, and statisticians often need to be quite creative to gather useful data. Therefore they are not looking to "guess what's in the teacher's head", but genuinely trying to come up with approaches to tackle this problem.
The students may well come up with some interesting ideas which could be followed up, either in this lesson or in a later lesson. (They can be tested with questions like: Would you follow the instructions exactly as given if it were a really sensitive question, such as "do you ..."? Having anonymous responses to this question, as before, might give the class an idea of the likely effectiveness of the suggested techniques.)
You can then introduce the dice technique, and give the example presented in the problem for students to work on and discuss.
Key questions
 Does this approach give us a good estimate of the true proportion of people surveyed who brush their teeth every day?
 How likely are interviewees to follow the instructions as given?
Possible extension
 Another possible approach is to ask interviewees "If your birthday is in January or February, AND you brush your teeth every day, then tick YES, otherwise tick no." How effective would this be?
 To get a better understanding of the margin of error inherent in this method, students could write a simulation. (See A Well-stirred Sample for more on margins of error.) How much worse is this method than a case where people are likely to answer honestly?
 What is the optimal implementation of this method? More precisely, let the probability that the interviewee is asked to tell the truth be $\tau$ (for "truth"). If the true proportion of people who pick their nose once a week is $\pi$, what would be the optimal value of $\tau$ to choose? What would be the optimal value of $\tau$ if you don't know the value of $\pi$? And what might we mean by the word "optimal" here? Remember also that if $\tau$ is too close to 0 or 1, then people are more likely to not follow the instructions, so there is a balance to be struck...
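As a starting point for the last question, here is a sketch of the theoretical standard error of the estimate as a function of $\tau$ and $\pi$. It assumes that respondents are independent and all follow the instructions, so the number of "Yes" answers is binomial with success probability $q=\tau\pi+(1-\tau)(1-\pi)$, and the estimate $\hat\pi=\frac{Y/n-(1-\tau)}{2\tau-1}$ has variance $\frac{q(1-q)}{n(2\tau-1)^2}$. The function name and the sample values are illustrative only, and the model says nothing about how willingness to follow the instructions changes with $\tau$.

```python
import math


def standard_error(n, pi, tau):
    """Standard error of the estimated proportion, assuming full compliance."""
    if tau == 0.5:
        raise ValueError("tau = 1/2 carries no information about pi")
    q = tau * pi + (1 - tau) * (1 - pi)   # probability of a "Yes" answer
    return math.sqrt(q * (1 - q) / n) / abs(2 * tau - 1)


for tau in (1 / 6, 2 / 6, 0.4, 0.9, 1.0):
    print(f"tau = {tau:.3f}: standard error = {standard_error(1200, 0.75, tau):.4f}")
```

Under these assumptions the standard error grows without bound as $\tau$ approaches $\frac{1}{2}$, which is the coin case again; the trade-off with compliance near 0 or 1 has to be judged separately.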
Possible support
Encourage students to draw a two-way table, a Venn diagram or a tree diagram, in order to obtain all the necessary numbers.