# Is your DNA unique?

## Problem

As you may know, DNA is made up of four different bases:

- Adenine (A)
- Cytosine (C)
- Guanine (G)
- Thymine (T)

Suppose that the bases are randomly distributed along a single strand of the DNA:

i) If my DNA single strand is 10 bases in length, what is the probability that it contains only a single adenine?

ii) If my DNA single strand is 150 bases in length, what is the probability of a 30% cytosine content?

iii) If my DNA single strand is 1000 bases in length, what is the probability of getting at least 5 thymines in a row, at least once?

iv) The human genome is approximately 6 billion bases in length. What is the probability that another individual has the same genetic composition as me?

v) The bacterial restriction enzyme BamHI cuts DNA at the site GGATCC. If I digest my genome with this enzyme, how many cuts would I expect to occur?

DNA sequencing is a very laborious task, requiring expensive machinery and considerable computational power. DNA fingerprinting is a technique used by forensic scientists to match a sample of DNA to one of a number of suspects who may have been at a crime scene.

However, since sequencing the entire human genome is so difficult, a different approach must be adopted. It has been found that most of the human genome is identical between individuals, except for single bases which are particularly varied in a population. These single bases occur approximately once in every 1000 bases. By comparing these particular sites between individual samples of DNA, it is much quicker to determine, to a high degree of accuracy, whether two DNA samples are identical.

vi) If approximately 1 in 1000 bases is variable, what is the probability of an individual having the same genetic composition as me?

vii) How many of these variable sites should be investigated to identify a suspect to 99.99% probability?

viii) If we remember that DNA occurs as homologous chromosomes, and that these variable sites occur in the same places across a pair of homologous chromosomes, how many of the sites should be investigated such that the probability of a misidentification is smaller than 1 in 1,000,000?

## Student Solutions

This problem makes heavy use of combinatorics:

i) We are asked for the probability of a single adenine among 10 bases. If the adenine were the first base in the sequence, the 9 following bases could be any of the other three types. Thus the probability of this is:

$$p(ANNNNNNNNN) = \left(\frac{1}{4}\right)\left(\frac{3}{4}\right)^9 = 0.0188$$

However, the adenine could equally have been in any of the other nine positions instead. Thus the probability is increased tenfold. We can express this choice of position using combinations notation: $^{10}C_1$ counts the ways of placing 1 adenine among 10 bases.

Thus, overall the probability we require is:

$$p(one\ adenine) = ^{10}C_1\left(\frac{1}{4}\right)\left(\frac{3}{4}\right)^9 = 0.188$$

ii) A 30% cytosine content implies the need for 45 cytosines from among the 150 bases.

Thus,

$$p(45C) = ^{150}C_{45}\left(\frac{1}{4}\right)^{45}\left(\frac{3}{4}\right)^{105} = 0.0272$$
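Parts i and ii are both binomial probabilities, so they can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `binomial_pmf` is an illustrative choice, not something from the original solution.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Part i: exactly one adenine among 10 bases
p_one_adenine = binomial_pmf(1, 10, 0.25)

# Part ii: exactly 45 cytosines (30%) among 150 bases
p_45_cytosine = binomial_pmf(45, 150, 0.25)

print(f"{p_one_adenine:.3f}")   # 0.188
print(f"{p_45_cytosine:.4f}")   # 0.0272
```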

iii) We are asked for the probability that there is at least one chain of at least 5 Thymines among 1000 bases.

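To tackle this, we must realise that a group of 5 thymines has 996 possible locations within 1000 bases, and that the remaining 995 bases can be of any sort.

Thus,

$$p \leq ^{996}C_{1}\left(\frac{1}{4}\right)^5 = 0.973$$

Because the 996 placements overlap and are not mutually exclusive, this sum counts some sequences more than once, so it is an upper bound on the probability rather than its exact value.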
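The figure above adds up 996 placements that can overlap with one another, so it can count the same sequence several times. As a hedged cross-check, the probability can also be computed exactly by tracking how many thymines the strand currently ends in. This is a sketch under the problem's assumption of independent, equally likely bases; the function name `prob_run_at_least` is hypothetical.

```python
def prob_run_at_least(n: int, run: int, p: float) -> float:
    """Probability that n independent bases contain at least one run of
    `run` or more consecutive copies of a base occurring with probability p."""
    # state[j] = probability that no qualifying run has occurred so far and
    # the sequence currently ends in exactly j consecutive copies of the base
    state = [0.0] * run
    state[0] = 1.0
    for _ in range(n):
        new_state = [0.0] * run
        # Any other base resets the trailing run to zero
        new_state[0] = (1 - p) * sum(state)
        # The base of interest extends the trailing run by one; extending a
        # run of length run-1 completes the run, so that mass is dropped
        for j in range(run - 1):
            new_state[j + 1] = p * state[j]
        state = new_state
    # Whatever probability was dropped belongs to sequences containing a run
    return 1 - sum(state)

p_exact = prob_run_at_least(1000, 5, 0.25)
upper_bound = 996 * 0.25**5
print(p_exact, upper_bound)
```

Under these assumptions the exact value comes out noticeably lower than the 0.973 obtained by summing over placements.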

iv) For an individual to have the same genetic composition as me, their every base must be identical in type and placement to mine.

Therefore:

$$p(same) = \left(\frac{1}{4}\right)^{6,000,000,000} = \text{exceptionally small!}$$

v) The probability of a random 6 base sequence of DNA forming GGATCC is $\left(\frac{1}{4}\right)^6$. If we simplistically say that the 6 billion base-pair human genome is composed of 1 billion different possible sites, then the number of expected sites with the correct restriction sequence is:

$$\left(\frac{1}{4}\right)^6\times 1,000,000,000 = 2.44 \times 10^5$$
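The arithmetic for the expected number of cuts can be sketched in a few lines. The first figure follows the simplification used above (non-overlapping 6-base blocks). The second is an alternative assumption, not part of the original solution, in which a recognition site may begin at any position along the strand.

```python
SITE = "GGATCC"
GENOME_LENGTH = 6_000_000_000  # bases, as stated in the problem

# Probability that six random bases spell out the site
p_site = 0.25 ** len(SITE)

# Simplification used in the solution: non-overlapping 6-base blocks
blocks = GENOME_LENGTH // len(SITE)
expected_cuts_blocks = blocks * p_site

# Alternative assumption: a site may start at any position on the strand
windows = GENOME_LENGTH - len(SITE) + 1
expected_cuts_windows = windows * p_site

print(f"{expected_cuts_blocks:.3g}")   # 2.44e+05
print(f"{expected_cuts_windows:.3g}")  # 1.46e+06
```

Expectations add even when the windows overlap, so the second figure is still a valid expected count under the random-base model.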

vi) If only one in every 1000 bases varies across a population, then there are only 6 million variable sites in the genome. Thus, the probability of an individual being identical to me is:

$$ \left(\frac{1}{4}\right)^{6,000,000} = \text{very small}$$
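To put a size on "exceptionally small" and "very small" in parts iv and vi, it helps to work with logarithms, since powers this extreme would underflow ordinary floating-point arithmetic. A minimal sketch, with the hypothetical helper `log10_match_probability`:

```python
from math import log10

def log10_match_probability(sites: int, p_match: float = 0.25) -> float:
    """Base-10 logarithm of p_match ** sites, computed without underflow."""
    return sites * log10(p_match)

# Part iv: all 6 billion bases must match
log_p_whole_genome = log10_match_probability(6_000_000_000)

# Part vi: only the 6 million variable sites must match
log_p_variable_sites = log10_match_probability(6_000_000)

print(f"{log_p_whole_genome:.3e}")    # -3.612e+09
print(f"{log_p_variable_sites:.3e}")  # -3.612e+06
```

In other words, the two probabilities are of the order of 10 raised to the power of minus 3.6 billion and minus 3.6 million respectively.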

vii) We wish to find the number of sites needed to match an individual to a piece of DNA with 99.99% probability. Thus, we want the probability of the two samples of DNA being the same by chance to be 0.01%.

$$p = \left(\frac{1}{4}\right)^n = \frac{0.01}{100}$$

$$n = \frac{\ln(10,000)}{\ln(4)} = 6.64$$

Therefore, at least 7 of the variable sites should be investigated.

viii) As before, a misidentification occurs when the two DNA samples are the same purely by chance. We want the probability of this happening to be less than 1 in 1,000,000. However, since the same variable sites are present in the same place on homologous chromosomes, the probability of two individuals being identical at both these loci is $\frac{1}{4} \times \frac{1}{4} = \frac{1}{16}$.

$$\therefore \left(\frac{1}{16}\right)^n = \frac{1}{1,000,000}$$

$$n = \frac{\ln(1,000,000)}{\ln(16)} = 4.98$$

Therefore, at least 5 sites should be investigated.
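The logarithm step in parts vii and viii can be sketched as a small helper that returns the smallest whole number of sites meeting a given threshold. The name `sites_needed` and its parameters are illustrative assumptions, not part of the original solution.

```python
from math import ceil, log

def sites_needed(p_match_per_site: float, max_false_match: float) -> int:
    """Smallest n such that p_match_per_site ** n <= max_false_match."""
    return ceil(log(max_false_match) / log(p_match_per_site))

# Part vii: one base per site, chance match at most 0.01%
n_single = sites_needed(1 / 4, 1e-4)

# Part viii: both homologous chromosomes must match, chance at most 1 in 1,000,000
n_homologous = sites_needed(1 / 16, 1e-6)

print(n_single, n_homologous)  # 7 5
```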