Matching Criminals
DNA profiling is an invaluable tool for the police. However, when it comes to probability, things aren't always as straightforward as they seem.
Problem
This resource is part of the collection Probability and Evidence.
This problem considers a surprising result that can have some real life implications. Read the text below, and then take a look at the questions that follow.
DNA profiling is an invaluable tool for the police and the courts to identify criminals. This is because it's extremely accurate: the chance that a random person will match a DNA profile taken from a crime scene is estimated to be less than 1 in a billion.
However, when it comes to chance and probability, things aren't always as straightforward as they seem. You have to very careful to do your sums properly before jumping to conclusions.
As an example, let's take a controversy sparked by the discovery of a lab worker in the US state of Arizona in 2001. While she was sifting through a database containing DNA profiles of 65,000 convicted criminals the lab worker found two matching profiles. The two convicts to whom the profiles belonged were unrelated. Subsequent research showed that there were many more matching pairs in the database.
This surprised people, because the chance of one profile matching a given profile is so small. Some people suggested that something must be seriously wrong with the technology used to analyse and compare DNA samples. This would pose a terrible problem because if the technology gives rise to unnaturally many matches, then many innocent people might be wrongly convicted of crimes.
However, when it comes to chance and probability, things aren't always as straightforward as they seem. You have to very careful to do your sums properly before jumping to conclusions.
As an example, let's take a controversy sparked by the discovery of a lab worker in the US state of Arizona in 2001. While she was sifting through a database containing DNA profiles of 65,000 convicted criminals the lab worker found two matching profiles. The two convicts to whom the profiles belonged were unrelated. Subsequent research showed that there were many more matching pairs in the database.
This surprised people, because the chance of one profile matching a given profile is so small. Some people suggested that something must be seriously wrong with the technology used to analyse and compare DNA samples. This would pose a terrible problem because if the technology gives rise to unnaturally many matches, then many innocent people might be wrongly convicted of crimes.
We can model a similar situation where the probability of any two random people matching is 1 in 225, and we have a database of 30 people. Each person is given a random number between 1 and 225; this represents their DNA profile.
Imgine you are one of the group of 30 people on the database.
What is the probability that someone else has the same number as you?
The hidden text below may be able to get you started:
Consider the other people in order. What is the probability that the first person has a different number to you?
What is the probability that the second person has a different number to you?
What is the probability that both the first and second people have different numbers to you?
What is the probability that the first three people all have different numbers to you?
What is the probability that all 29 other people have different numbers to you?
What is the probability that at least one other person has the same number as you?
How likely do you think it is that there will be at least one match amongst the 30 people in the database?
What is this probability?
You may want to take a look at Same Number to help answer this question.
Why is it so much more likely that two people will share the same number than someone sharing your number?
Does this help to explain why so many pairs were found in the Arizona database?
Student Solutions
We received lots of solutions to this problem, so thank you to Sakura from De Stafford, Harry, Nicollete, Jay, Kieran, Harry and Riya from St Stephens School Carramar in Australia, Alexi from the United Kingdom, Weida from Churston Ferrers Grammar School and James from West Island School in Hong Kong for sending us your solutions!
What is the probability that someone else has the same number as you?
Alexi sent us the following solution to the initial question in this problem:
The first part of this question can be seen by first looking at the probability of not matching with any of the other $29$ people, chosen at random. This is simply $\frac{224}{225}$, obtained from the fact that there is a $\frac{1}{225}$ chance of matching with any specified person on the database, so the chance of that not happening is $1-\frac{1}{225}$.
This is essentially a "with replacement" problem, so the chances of matching with any of the $29$ other people is not conditional on whether or not you've matched with any of the other $28$. Therefore, since they are independent, the probability of not matching with any of the $29$ is equal to $(\frac{224}{225})^{29}$, which is obtained by multiplying together the individual probabilities of not matching each specific person on the database. The answer is therefore $0.879$, to 3 significant figures, that you won't match with anybody else. The chance of matching with at least one other person is therefore one minus this, which gives an answer of $0.121$.
James sent us a more formalised solution to the initial problem:
To calculate the probability that somebody else has the same number as you, we can apply the Total Probability Theorem. From the definition of the theorem, we know that $$
P(P) + P(P') = 1
$$ where $P$ is an event and $P'$ is the event that $P$ does not happen (the converse of $P$).
For this problem, let $P$ be the event that another person has the same number, so $P'$ is the event that all people have different numbers to you. We now know that $$
P(P) = 1 - P(P')
$$To find the $P(\text{All people have different numbers to you})$, we have to consider every single person. In this sense, we have to consider the probability that your number is different from all $29$ other people. To do this, we can use the multiplication rule of probabilities. That is, that the probability that both events $A$ and $B$ occur is equal to the probability that event $A$ occurs times the probability that event $B$ occurs: $$
P(\text{$A$ and $B$ occur}) = P(\text{A occurs})\times P(\text{$B$ occurs})
$$From all of this, we know that $$
\begin{align}P(\text{Two random people have the same number}) &= \frac{1}{225} \\
P(\text{Two random people have different numbers}) &= \frac{224}{225} \\
P(\text{All people have different numbers to you}) &= \left(\frac{224}{225}\right)^{29} = 0.879 \end{align}
$$Hence, we know that $$
P(\text{Another person has the same number}) = 1 - \left(\frac{224}{225}\right)^{29} = 1 - 0.879 = 0.121
$$
How likely do you think it is that there will be at least one match amongst the 30 people in the database?
Alexi gave us the following estimate of this probability:
Before calculating the chance of at least one match, it is worth thinking about what range we are expecting the answer to fall into. It is a (reasonably) well known fact that if you have $23$ people in a room, the chance of at least two of them sharing a birthday is just over half, despite the probability of two people chosen at random having approximately a $\frac{1}{365}$ chance of sharing a birthday. This is called the "Birthday Paradox". In this case we've increased the "people in the room" to $30$, and reduced $365$ to $225$, so we should definitely expect a probability greater than $0.5$, and I will use an estimate of $0.6-0.95$ as a wide interval range where I expect the true probability to fall.
What is this probability?
Alexi sent us the following solution to this part of the problem:
First, let us examine the probability of one person, chosen at random, not matching with at least one other person. We have seen already that this is $(\frac{224}{225})^{29}$. Now let us randomly pick another person in the group, putting the person we know didn't match with anybody to one side because we know they can't possibly match with anybody. The probability of this new person not matching with anybody is $(\frac{223}{224})^{28}$, as there are now $28$ people in the room not including the person we chose at random, instead of $29$ as before, and the numbers on the numerator and denominator have both decreased by one. This is because the total available numbers between $1$ and $225$ has decreased by one because we've taken out the number which the first person had which we know wasn't a match for anybody else. The numerator for this problem is always equal to the value of the denominator minus one, hence $223$ in this case. We can continue to take out one person at a time, finding the probability that they don't match with anybody, and multiply all these probabilities together to find the probability that no two people match.
We can simplify the number we get which is $(\frac{224}{225})^{29} \times (\frac{223}{224})^{28} \times \dots \times (\frac{196}{197})^1$ by cancelling similar terms in the numerator and denominator, leaving us with $\frac{224!}{195! \cdot 225^{29}}$.
Thus the probability of no two people matching is one minus this number, which is $0.868$ to 3 significant figures.
Weida had a different way of constructing a solution for finding this probability:
It is easier (although this is subjective) to find the probability that there are no matches in the database, and reason that $$
P(\text{there is at least one match}) = 1 - P(\text{there are no matches})
$$When calculating the probability that there are no matches amongst the $30$ people, notice that this time if we consider the people in order, the next person cannot have the same number as any of the people who came before, unlike when we calculated the probability that every other individual than me did not have the same number as me. Thus $$
\begin{align}P(\text{2nd person's number does not match 1st person's}) &= \frac{225-1}{225} = \frac{224}{225}, \\
P(\text{3rd person's number does not match 1st or 2nd person's}) &= \frac{225-2}{225} = \frac{223}{225}, \\
P(\text{4th person's number does not match 1st, 2nd or 3rd person's}) &= \frac{225-3}{225} = \frac{222}{225},\end{align}
$$ and so on, up to $$
P(\text{30th person's number does not match anyone else's}) = \frac{225-29}{225} = \frac{196}{225}
$$Therefore $$
\begin{align}P(\text{no matches amongst the 30 people}) &= \frac{224}{225} \times\frac{223}{225} \times\frac{222}{225} \times ”¦ \times\frac{197}{225} \times\frac{196}{225} \\
&= \frac{224\times 223\times ...\times197\times196}{225^{29}} \\
&= \frac{224!}{225^{29} \times 195!}\end{align}
$$ (as there are 29 terms in the multiplication). So we have that $$
P(\text{at least one match amongst the 30 people}) = 1 - \frac{224!}{225^{29} \times 195!}
$$which is approximately $86.8\%$.
Why is it so much more likely that two people will share the same number than someone sharing your number?
James offered us the following explanation:
It is much more likely that there exists a pair of two people who share the same number, because the converse requires all numbers to be different, whereas the converse of the probability that someone shares your number only requires the other people to be different from only you.
Weida offered us this explanation:
The probability that everyone has a different number to each other is much smaller than the probability that everyone has a different number to me, because in the first case adding another person eliminates one number from the collection of numbers that the next person can have, but in the second case each other person can have any of the $224$ numbers that I did not choose.
Alexi also gave us an explanation, using a visual analogy:
The probability of one specific person sharing a number is always going to be a lot smaller than the probability of any two or more people sharing a number. A simple way of thinking about it is that the probability of the latter is going to include the probability of the former, as well as the different combinations with other people too. Visually it may help to imagine the case with one specific person sharing a number with somebody else as being like one person connected with a string to everybody else in the room, whereas "any two or more" is like every single person being connected to every single other person. The string represents the possibility of sharing a number, so there's clearly a lot more string and a lot more possibilities for matches in the second case!
Does this help to explain why so many pairs were found in the Arizona database?
Alexi gave us the following explanation:
The Arizona case is simply an extrapolation of these results, with $225$ replaced with one billion and $30$ replaced with $65,000$. The individual probability may be extremely low, but when you factor in all possible combinations of people, the chance of at least one match occurring is actually quite high.
James gave us this explanation:
Yes, this explains why there were so many pairs found in the Arizona database, because there are so many possible pairs of people who could match. The number of possible pairings here is:$$
\begin{pmatrix}
65000 \\
2
\end{pmatrix}
=
\frac{65000!}{64998! \times 2!} = 2,112,467,500
$$ So with $2,112,467,500$ possible pairs of people, it becomes quite likely that there will be some matches in the database.