Published 2018 Revised 2019

Mathematicians have been thinking about random events and random processes for many centuries. Throughout the 19th century, they developed very effective approaches, and as a result were able to apply the theory of probability to many important problems. One of the key ideas is that of a random variable. Mathematicians were generally not that concerned about the precise meanings of terms, because it was clear what they intended. One such example was the term "random quantity", introduced by the outstanding Russian mathematician Chebyshev. The meaning of this was, in some sense, taken as given: it was a numerical quantity which behaved in a random fashion, and the precise nature of the randomness could be described in terms of probabilities.

Mathematics changed dramatically at the end of the 19th and start of the 20th century. For various reasons which would take us too far afield here, a group of mathematicians decided that mathematics needed to be put on a solid foundation, and set theory became the heart of this foundation. Most of mathematics was eventually swept up in this movement, and as a result, it became expected that everything would be given a very precise definition. The theory of probability was no exception, and in the 1930s, another Russian mathematician, Kolmogorov, succeeded in doing this for the idea of a "random variable" (as well as for much of the rest of probability theory). In this article, we will give a simplified explanation of the modern approach. (We will also indicate where we have simplified things.)

We first need the concept of a probability space. The first ingredient of a probability space is a sample space: the set of all possible outcomes of the experiment. If our experiment is flipping a coin three times, the sample space is

$$\Omega=\{\text{HHH},\text{HHT},\text{HTH},\text{HTT},\text{THH},\text{THT},\text{TTH},\text{TTT}\}.$$

We then have lots of possible events, consisting of all possible subsets of the sample space $\Omega$ (there are $2^8=256$ of them in total). For example, we could consider events such as:

$$\begin{gather*}

&\{\text{HHT}\}\\

&\{\text{HHT}, \text{HTH}\}\\

&\{\}\quad\text{(the empty set)}\\

&\Omega\quad\text{(the whole set)}

\end{gather*}$$
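As a concrete check, we can enumerate the sample space and its events directly. Here is a minimal sketch in Python (the names `Omega`, `all_events` and `events` are ours, not from the text):

```python
from itertools import combinations, product

# The sample space for three coin flips: all strings of H and T of length 3.
Omega = {"".join(flips) for flips in product("HT", repeat=3)}
assert len(Omega) == 8

def all_events(omega):
    """Generate every subset of the sample space (i.e. every event)."""
    elems = sorted(omega)
    return [set(c) for r in range(len(elems) + 1)
            for c in combinations(elems, r)]

events = all_events(Omega)
assert len(events) == 2 ** 8  # 256 events, including {} and Omega itself
```

The empty set and the whole sample space both appear in the list, matching the examples above.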

The second ingredient for a probability space is a probability function $\mathrm{P}$, which assigns to each event a number between $0$ and $1$, called the probability of that event.

If the coin is assumed to be unbiased, then we would have, for example,

$$\begin{align*}

\mathrm{P}(\{\text{HHT}\}) &= \tfrac{1}{8}\\

\mathrm{P}(\{\text{HHT}, \text{HTH}, \text{THH}\}) &= \tfrac{3}{8}\\

\mathrm{P}(\{\text{HHH},\text{TTT}\}) &= \tfrac{2}{8}\\

\mathrm{P}(\{\}) &= 0

\end{align*}$$
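With an unbiased coin, every outcome is equally likely, so the probability of an event is just its size divided by the size of the sample space. A quick sketch of this (the function name `P` mirrors the text's $\mathrm{P}$):

```python
from fractions import Fraction
from itertools import product

# Sample space for three flips of a coin.
Omega = {"".join(flips) for flips in product("HT", repeat=3)}

def P(event):
    """Probability of an event under the uniform (unbiased-coin) measure."""
    return Fraction(len(event), len(Omega))

assert P({"HHT"}) == Fraction(1, 8)
assert P({"HHT", "HTH", "THH"}) == Fraction(3, 8)
assert P({"HHH", "TTT"}) == Fraction(2, 8)
assert P(set()) == 0
```

Using exact fractions rather than floating-point numbers keeps the comparisons with $\tfrac18$, $\tfrac38$ and so on exact.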

Note that we don't talk about the probability of individual outcomes, but only of events (sets of outcomes).

Once we have a probability space, we can define a random variable: a function from the sample space $\Omega$ to the real numbers. For example, we could let $X$ be the number of heads obtained in the three flips,

or we could let $Y$ be the absolute difference between the number of heads and the number of tails, giving this diagram:

We could come up with many other random variables for this particular probability space, such as "$\sqrt{37}$ if the first flip is a head and $-\pi$ if it is a tail"; the probability space (or experiment) itself does not tell us what random variable to use, though some may be more natural than others.

Once we have random variables, there are events naturally related to them. A typical event will be something like "$X$ is equal to this number". For example, we could consider events such as $X=0$, $X\ge2$, $Y=0$ and so on, as shown in these diagrams:

Technically, $X=0$ is shorthand for the set $\{\omega\in\Omega:X(\omega)=0\}$. But as that is quite unwieldy, we usually just shorten it to $X=0$, leaving out any explicit reference to the sample space.
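This shorthand can be made literal in code: treating $X$ as a function on the sample space, the event $X=0$ really is a set of outcomes. A brief sketch (with $X$ counting heads, as in the text):

```python
from itertools import product

# Sample space for three coin flips.
Omega = {"".join(flips) for flips in product("HT", repeat=3)}

def X(omega):
    """The random variable X: the number of heads in an outcome."""
    return omega.count("H")

# "X = 0" is shorthand for the set of outcomes on which X takes the value 0.
event_X_equals_0 = {omega for omega in Omega if X(omega) == 0}
assert event_X_equals_0 == {"TTT"}
```

The set comprehension is a direct transcription of $\{\omega\in\Omega:X(\omega)=0\}$.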

Since we know about the probability of an event (through the probability function $\mathrm{P}$), we can now talk about $\mathrm{P}(X=0)$: it is the probability of this event. In this case, we see that

$$\begin{align*}

\mathrm{P}(X=0) &= \tfrac{1}{8} \\

\mathrm{P}(X\ge2) &= \tfrac{4}{8} \\

\mathrm{P}(Y=0) &= 0

\end{align*}$$
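Putting the pieces together, the three probabilities above can be verified directly. This sketch defines $\mathrm{P}$, $X$ and $Y$ as in the text:

```python
from fractions import Fraction
from itertools import product

# Sample space for three coin flips, with the uniform probability function.
Omega = {"".join(flips) for flips in product("HT", repeat=3)}

def P(event):
    """Uniform probability: size of the event over size of the sample space."""
    return Fraction(len(event), len(Omega))

def X(omega):
    """Number of heads in the outcome."""
    return omega.count("H")

def Y(omega):
    """Absolute difference between the numbers of heads and tails."""
    return abs(omega.count("H") - omega.count("T"))

assert P({w for w in Omega if X(w) == 0}) == Fraction(1, 8)
assert P({w for w in Omega if X(w) >= 2}) == Fraction(4, 8)
assert P({w for w in Omega if Y(w) == 0}) == 0
```

The last line confirms the perhaps surprising value $\mathrm{P}(Y=0)=0$: with an odd number of flips, the numbers of heads and tails can never be equal.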

The example above has a finite sample space, and things are quite straightforward there. Technical difficulties begin to surface when we work with infinite sample spaces. The same essential ideas apply in this case, but we have to be more careful with some of the technical details. For example, it is still the case that a random variable $X$ is a function from the sample space to the real numbers, and $\mathrm{P}(X>5)$ still means the probability of the event $X>5$.

This approach to random variables turns out to be a very useful way to think about what is going on. The technical details for the continuous random variable case require an area of mathematics called measure theory.

- This is a simplification of the full definition of a probability space, and does not work in all cases; we actually have to specify the events (subsets of $\Omega$) on which the probability function is defined. It turns out that, in general, it is impossible to consistently define the probability function on all possible events. This is closely related to the Banach-Tarski paradox, which you might find interesting to explore.
- There is actually one other requirement, which is that the additivity extends to an infinite list of events. That is, if $A_1$, $A_2$, ... are an infinite list of pairwise-disjoint events, then $\mathrm{P}(A_1\cup A_2\cup \cdots)=\mathrm{P}(A_1)+\mathrm{P}(A_2)+\cdots$.
- When working with infinite sample spaces, for example with continuous random variables, the probability of any particular outcome may well be zero. We cannot add up infinitely many zeros to get something non-zero, so we work with the probability of sets of outcomes (events) instead.
- This is a slight simplification: we require the function to be "well-behaved" in a certain technical way (it has to be "measurable"). Pretty much every function we can write down explicitly is measurable.