Introduction to Probability (Part 1)

Learn outcomes, events, distributions, conditional probability, Bayes' rule, and independence, with simple examples and diagrams that build a solid foundation for statistics and machine learning.

Updated Jan 2026
16 min

In mathematics, probability gives us a precise way to reason about uncertainty, allowing us to describe and manipulate uncertain outcomes using clear, consistent rules. In machine learning, statistics, and probabilistic modelling, nearly every concept builds on these foundations, from estimating unknown quantities and making predictions to learning from data and quantifying uncertainty in model outputs.

This post builds our understanding of probability theory from the ground up. We focus on formal mathematical rules, explaining each with simple examples and diagrams.

By the end, you will be comfortable with the fundamentals of probability theory, capable of reading and manipulating mathematical notation, and you will understand how these ideas form a single, consistent structure that underpins our current tools.

Definitions of Outcomes, Events, and Event Spaces

Before using probability terminology for reasoning and decision-making, we must first define uncertainty. Uncertainty arises from imperfect or unknown information, making outcomes unpredictable.

An experiment is any process with an unknown outcome. Tossing a coin, rolling a die, or drawing a card from a deck are typical examples. These are called experiments because of their randomness. Before performing the experiment, we do not know which outcome will occur.

The outcome space of such an experiment is denoted by $\Omega$, and is the set of all possible outcomes of the experiment. For a fair six-sided die,

$$\Omega = \{1, 2, 3, 4, 5, 6\}.$$

Each element of $\Omega$ represents one complete, mutually exclusive description of the outcome. To illustrate this concept, consider a simple thought experiment: imagine a die with faces numbered 1 to 6. It is impossible for the die to land on two different numbers simultaneously; if the die shows a 3, it cannot, at the same time, show a 5. This mutual exclusivity is crucial in defining each element in $\Omega$ as representing exactly one outcome. When the experiment is run, exactly one outcome in $\Omega$ occurs.

Importantly, an event is not a single outcome, but a set of outcomes. An event is a statement about the result of the experiment. All of the following are examples of events in the die-rolling experiment, and each statement corresponds to a set of outcomes:

  • “The die shows an even number” corresponds to the set $\{2, 4, 6\}$.
  • “The die shows a six” corresponds to the set $\{6\}$.
  • “The die shows a number greater than four” corresponds to the set $\{5, 6\}$.

The empty event, denoted $\varnothing$, contains no outcomes and represents a statement that can never be true for that experiment.
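Since events are just sets of outcomes, they are easy to play with concretely. As a minimal illustration (a Python sketch of my own, purely for intuition), we can represent the outcome space and the events above as sets and test whether an event occurred:

```python
# Outcome space for a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

# Events are subsets of the outcome space.
even = {2, 4, 6}            # "the die shows an even number"
six = {6}                   # "the die shows a six"
greater_than_four = {5, 6}  # "the die shows a number greater than four"
empty = set()               # the empty event: never true

# An event "occurs" when the realised outcome is one of its elements.
outcome = 4
print(outcome in even)               # True
print(outcome in greater_than_four)  # False
```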

At this point, it is important to specify which sets of outcomes we treat as meaningful and why. In principle, any subset of the outcome space $\Omega$ could be considered an event, but in more complex settings, not all subsets are practical or relevant for our purposes. To address this, we introduce the event space $\mathcal{S}$, a collection of events. The event space specifies exactly the sets of outcomes to which we assign probabilities and reflects the questions we are prepared to model and reason about.

For simple, finite cases like dice rolls or cards, the event space $\mathcal{S}$ usually contains all subsets of $\Omega$. That is, every subset of the outcome space is a meaningful event that we want to assign a probability to. In complex situations, however, $\mathcal{S}$ is chosen more carefully to exclude meaningless questions.

The event space must satisfy three basic properties:

  • It contains the empty event $\varnothing$ and the trivial event $\Omega$.
  • It is closed under union: if $A$ and $B$ are events, then $A \cup B$ is also an event.
  • It is closed under complementation: if $A$ is an event, then $\Omega \setminus A$ is also an event.

From left to right, the panels show the union $A \cup B$, the intersection $A \cap B$, and the complement $\Omega \setminus A$. Each is simply another region within the same outcome space $\Omega$, illustrating that basic set operations keep us inside the event space and allow probabilities to be defined consistently.

These requirements ensure that once we decide which questions are meaningful, we can also ask natural follow-up questions. If we can ask whether event $A$ happened and whether event $B$ happened, we must also be able to ask whether at least one of them happened, or whether event $A$ did not happen. Closure also guarantees that probability theory is logically stable under such reasoning.

To help solidify this concept, imagine an event space that is not closed under basic operations such as unions or complements. You might define two valid events, yet find that combining them produces a set that is no longer considered an event. This leads to an immediate problem: probabilities cannot be assigned consistently. Statements such as “$A$ or $B$ occurs” or “$A$ does not occur” become ill-defined, even though they arise naturally in reasoning. Closure ensures that whenever we form such combinations, the result remains within the event space, allowing probability to behave coherently and avoiding contradictions.
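To make the closure properties concrete, here is a small sketch (my own illustration) that takes the full power set of a three-outcome space as the event space and checks all three properties directly:

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3})

# The event space here is the full power set of omega (all 8 subsets).
events = {
    frozenset(c)
    for c in chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))
}

# Property 1: contains the empty event and the trivial event.
assert frozenset() in events and omega in events

# Property 2: closed under union.
assert all(a | b in events for a in events for b in events)

# Property 3: closed under complementation.
assert all(omega - a in events for a in events)

print(f"All {len(events)} events satisfy the closure properties.")
```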

Probability Distributions and Axioms

Once we have decided which events we care about, we can quantify their likelihood of occurring. To do that, we first need to understand probability distributions and some of the axioms (fundamental statements or rules accepted as true without proof) of probability. Think of these axioms like the rules of a game: they provide the structure within which everything else operates.

A probability distribution assigns a number to each event in a way that reflects how plausible that event is. Importantly, these numbers are not arbitrary. They must obey a small set of basic rules that prevent contradictions and ensure that probabilities behave sensibly.

This is just an example of a Gaussian probability distribution, and its only purpose (for now) is illustration.

Formally, a probability distribution $P$ over $(\Omega, \mathcal{S})$ is a function that maps each event in $\mathcal{S}$ to a real number, subject to the following axioms:

$$P(A) \ge 0 \quad \text{for all } A \in \mathcal{S}, \qquad P(\Omega) = 1,$$

and if $A$ and $B$ are disjoint events,

$$P(A \cup B) = P(A) + P(B).$$

The first axiom rules out negative probabilities. The second fixes the probability of the entire outcome space to one, expressing the fact that the experiment must produce some outcome in $\Omega$. The third axiom states that if two events are disjoint, the probability that either occurs is the sum of their probabilities, since there is no overlap to count twice.

Together, these axioms are minimal but sufficient. They do not aim to describe every aspect of probability, only to guarantee internal consistency. With no redundancy and no extra assumptions, they provide just enough structure to support the whole theory.
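As a quick sanity check, we can pick one concrete distribution, the uniform distribution $P(A) = |A|/|\Omega|$ on a fair die, and verify the axioms numerically (a minimal sketch; the uniform choice is just one valid distribution):

```python
from fractions import Fraction

omega = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """Uniform distribution: P(A) = |A| / |Omega|."""
    return Fraction(len(event), len(omega))

even, odd = frozenset({2, 4, 6}), frozenset({1, 3, 5})

assert prob(even) >= 0                             # axiom 1: non-negativity
assert prob(omega) == 1                            # axiom 2: P(Omega) = 1
assert not even & odd                              # even and odd are disjoint...
assert prob(even | odd) == prob(even) + prob(odd)  # ...so axiom 3 applies

print(prob(even))  # 1/2
```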

Useful consequences follow from these axioms. In particular, since $\varnothing$ and $\Omega$ are complements,

$$P(\varnothing) = 0.$$

This means that there is a 0% chance of the empty event occurring: it is impossible.

Furthermore, for any two events, whether disjoint or not, the probability of either or both of these events occurring can be calculated as

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

This formula arises because outcomes in $A \cap B$ are counted twice: once in $P(A)$ and once in $P(B)$. Subtracting $P(A \cap B)$ removes the duplicate contribution.

To illustrate why this is necessary, consider the following example:

Adding $P(A)$ and $P(B)$ results in us counting the intersection $P(A \cap B)$ twice, meaning we have to subtract it to arrive at the correct result.

For a fair die, let $A = \{2, 4, 6\}$ and $B = \{4, 5, 6\}$. Then $P(A) = 3/6$, $P(B) = 3/6$, and $P(A \cap B) = 2/6$. Applying the formula gives $P(A \cup B) = 4/6$, which corresponds to the four outcomes $\{2, 4, 5, 6\}$.
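The same calculation can be checked mechanically; here is a small sketch using exact fractions:

```python
from fractions import Fraction

def prob(event):
    return Fraction(len(event), 6)  # uniform probability on a fair die

A = {2, 4, 6}
B = {4, 5, 6}

lhs = prob(A | B)                      # P(A union B)
rhs = prob(A) + prob(B) - prob(A & B)  # P(A) + P(B) - P(A intersect B)

print(lhs, rhs)  # 2/3 2/3
assert lhs == rhs
```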

Interpreting Probability

So far, we have defined probability as a numerical system with rules that guide our questions and answers. To understand what probabilities represent, we rely on frequentist or subjective interpretations. A practical decision, such as betting on whether it will rain tomorrow, can highlight the differences between these interpretations.

The frequentist interpretation defines probability in terms of long-run behaviour. If we repeat an experiment many times under identical conditions, the probability of an event is the fraction of times it occurs as the number of repetitions increases. In our rainfall example, a frequentist might consider historical weather data to estimate the probability of rain.

Frequentist probabilities are computed from counts: if an event $A$ occurs $k$ times in $n$ independent repetitions of the experiment, we estimate

$$P(A) \approx \frac{k \text{ occurrences}}{n \text{ experiments}}.$$

As $n$ increases, the empirical frequency should stabilise and converge to a fixed value that we call the probability of the event $A$.
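We can watch this convergence happen in simulation. The sketch below estimates the probability of an even roll of a fair die by counting occurrences over increasingly many simulated rolls (the exact numbers will vary from run to run):

```python
import random

random.seed(0)
event = {2, 4, 6}  # "the die shows an even number"; true probability 1/2

k, n = 0, 0
for target in (100, 10_000, 1_000_000):
    while n < target:
        if random.randint(1, 6) in event:
            k += 1
        n += 1
    print(f"n = {n:>9}: estimate = {k / n:.4f}")  # approaches 0.5
```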

On the other hand, the subjective interpretation measures the degree of belief an agent holds based on current information. It expresses how plausible an event seems, not how often it would occur if repeated. In the context of predicting tomorrow’s rain, a subjective interpretation might include the meteorologist's insights or recent changes in atmospheric conditions that are not captured in the historical data.

While it may seem that this interpretation is neither useful nor mathematically sound, it is essential when repeating an experiment is impossible or meaningless, such as predicting tomorrow’s weather or assessing whether a specific system will fail.

Both interpretations use the same mathematical rules. The frequentist view relies on repetition. The subjective view treats probability as a tool for reasoning about uncertainty, allowing beliefs to update in light of new information. Humans use this reasoning all the time, even when we do not state beliefs or assign numerical values to them.

Conditional Probability

In many situations, we gain partial information about possible outcomes before an experiment is complete, which, in turn, affects the probabilities we assign to events. Conditional probability formalises how these probabilities change once we restrict attention to outcomes consistent with the partial information we have obtained.

For events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is defined as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

This definition may be better understood graphically. The event $B$ restricts the outcome space to a smaller region. Within that restricted space, we ask what fraction of outcomes also belong to $A$. Consider the following example, in which we calculate the probability of drawing a heart, given that we draw a red card.

In this example, we draw a card from a standard deck. Let $B$ be the event that the card is red, and $A$ the event that it is a heart. Since every heart is red, conditioning on $B$ restricts our attention to the 26 red cards. Among those 26 cards, 13 are hearts, so the conditional probability is $13/26 = 1/2$.
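Counting outcomes directly reproduces this result. Here is a small sketch that builds the deck and applies the definition (the rank/suit encoding is my own):

```python
from fractions import Fraction
from itertools import product

# A standard 52-card deck: 13 ranks in each of 4 suits.
deck = set(product(range(1, 14), ["hearts", "diamonds", "clubs", "spades"]))

hearts = {card for card in deck if card[1] == "hearts"}             # event A
red = {card for card in deck if card[1] in ("hearts", "diamonds")}  # event B

def prob(event):
    return Fraction(len(event), len(deck))

# P(A | B) = P(A intersect B) / P(B)
print(prob(hearts & red) / prob(red))  # 1/2
```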

The Chain Rule and Bayes' Rule

Conditional probability is useful as it allows us to break down joint events (two or more events happening at the same time) into separate, sequential pieces. Specifically, this can be achieved through the chain rule, which splits a joint probability into more granular, useful components.

For two events $A$ and $B$, the chain rule can be applied as

$$P(A \cap B) = P(A \mid B) P(B).$$

This follows directly from the definition of conditional probability (the denominator is just brought over to the other side of the equation!). The probability that both events occur is the probability that $B$ occurs, multiplied by the probability that $A$ occurs within the subset of outcomes where $B$ is true.

The chain rule can also be applied to scenarios with more than two joint events. For example, for three events $A$, $B$, and $C$,

$$P(A \cap B \cap C) = P(A \mid B, C) P(B \mid C) P(C).$$

This decomposition can be extended to any number of events and provides a systematic way to construct complex probabilities from simpler conditional components.
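To see the three-event version in action, the sketch below verifies the decomposition on the die, with $A$ = even, $B$ = greater than two, and $C$ = less than six (my own choice of events):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    return Fraction(len(event), len(omega))

A = {2, 4, 6}        # even
B = {3, 4, 5, 6}     # greater than two
C = {1, 2, 3, 4, 5}  # less than six

# Conditional probability from its definition: P(X | Y) = P(X intersect Y) / P(Y).
def cond(x, y):
    return prob(x & y) / prob(y)

lhs = prob(A & B & C)
rhs = cond(A, B & C) * cond(B, C) * prob(C)
print(lhs, rhs)  # 1/6 1/6
assert lhs == rhs
```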

Bayes' Rule

One of the most useful consequences of the chain rule is Bayes' rule. The key idea is that the same joint probability can be written in two different ways. From the chain rule, we have both $P(A \cap B) = P(A \mid B) P(B)$ and $P(A \cap B) = P(B \mid A) P(A)$.

Since these expressions describe the same joint probability, they must be equal. Solving this equality for $P(A \mid B)$ gives Bayes' rule, which shows how to update the probability of $A$ (the hypothesis) after observing $B$ (the evidence), resulting in

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}.$$

In Bayes' rule, $P(A)$ is called the prior probability. It represents how likely we believe event $A$ is before observing $B$, based only on the information available up to that point. The term $P(B \mid A)$, the likelihood, measures how compatible the observation $B$ is with the assumption that $A$ is true. Together, these two quantities form the numerator and express how strongly $A$ explains the observed event $B$.

The denominator $P(B)$ acts as a normalising constant. It accounts for all possible ways in which $B$ could occur and rescales the numerator so that the resulting value of $P(A \mid B)$ lies between 0 and 1. In this way, Bayes' rule converts an unnormalised score into a valid probability that reflects our updated belief about $A$ after observing $B$.
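Returning to the die from earlier, the following sketch treats “the roll is even” as the hypothesis and “the roll is greater than four” as the evidence, and confirms that Bayes' rule agrees with direct conditioning:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    return Fraction(len(event), len(omega))

A = {2, 4, 6}  # hypothesis: the roll is even
B = {5, 6}     # evidence: the roll is greater than four

prior = prob(A)
likelihood = prob(A & B) / prob(A)  # P(B | A)
evidence = prob(B)

posterior = likelihood * prior / evidence  # Bayes' rule
print(posterior)                           # 1/2
assert posterior == prob(A & B) / prob(B)  # matches direct conditioning
```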

Although I can try to explain this here in great detail, the very best explanation I have found for Bayes' rule is 3blue1brown's YouTube video. Please watch it and support him — he has done so much for the mathematics and machine learning community!

Random Variables

Reasoning directly about sets of outcomes quickly becomes tedious, and we need a more succinct way of writing about probabilities. This is exactly where random variables are useful. Specifically, random variables provide a numerical representation of uncertainty, simplifying both notation and analysis.

It is crucial to distinguish between the random outcome (e.g., the physical result of tossing a coin) and the variable $X$, which is a numeric representation of that outcome. This separation helps prevent misconceptions by framing the random event as one thing and its mathematical mapping as another.

A random variable $X$ can be defined as a function that maps each outcome in $\Omega$ to a real number. For a coin toss, this can be written as

$$X = \begin{cases} 1 & \text{heads} \\ 0 & \text{tails} \end{cases}$$

The randomness in a "random variable" originates in the experiment's outcomes, since it is the experiment that produces random outcomes, not the function or mapping itself. The random variable simply records the outcome numerically.
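Because a random variable is literally a function on $\Omega$, it can be written as one. In this sketch, the randomness lives entirely in drawing the outcome; the function `X` itself is deterministic:

```python
import random

omega = ["heads", "tails"]

def X(outcome):
    """A random variable: a deterministic map from outcomes to numbers."""
    return 1 if outcome == "heads" else 0

outcome = random.choice(omega)  # the randomness comes from the experiment...
print(outcome, X(outcome))      # ...X merely records it numerically
```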

Once outcomes are expressed numerically, we can finally define probabilities, relationships between variables, and summary quantities such as expected values and variances in a uniform way.

When working with multiple random variables (as in conditional probabilities), we need to describe how they behave both individually and together. This is done through joint, marginal, and conditional distributions.

The joint distribution of two random variables $X$ and $Y$ assigns probabilities to pairs of values. In joint distributions, we are interested in the probability that $X$ takes the specific value $x$ at the same time that $Y$ takes the specific value $y$, which is written as

$$P(X = x, Y = y).$$

This distribution fully characterises the system. From it, we can recover the behaviour of each variable separately through marginalisation: summing the joint distribution over all possible values of the other variable collapses it down to a distribution over a single variable. Mathematically, this is written as

$$P(X = x) = \sum_y P(X = x, Y = y).$$

Joint and marginal distributions can be easily understood using a table of discrete probabilities. For a fair die, let $X$ be whether a roll is even or odd, and let $Y$ be whether the roll is small ($\leq 3$) or large ($\geq 4$). The probability table is given below.

|  | $Y = \text{Small}$ | $Y = \text{Large}$ | Row sum |
| --- | --- | --- | --- |
| $X = \text{Even}$ | $1/6$ | $2/6$ | $3/6$ |
| $X = \text{Odd}$ | $2/6$ | $1/6$ | $3/6$ |
| Column sum | $3/6$ | $3/6$ | $1$ |

A joint distribution over two variables derived from a fair die, where row and column sums give the marginal distributions, and each cell represents the probability of the corresponding pair of events.

This table can be interpreted as follows:

  • The entry $P(X = \text{Even}, Y = \text{Small}) = 1/6$ corresponds to rolling a 2.
  • The entry $P(X = \text{Even}, Y = \text{Large}) = 2/6$ corresponds to rolling a 4 or 6.
  • Each row sum ($\sum_y$) gives the marginal distribution of $X$.
  • Each column sum ($\sum_x$) gives the marginal distribution of $Y$.
  • The total probability sums to 1, as required.
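This table is small enough to reproduce in code. The sketch below derives the joint distribution by counting die outcomes and recovers both marginals by summing, matching the row and column sums above:

```python
from fractions import Fraction

# Map each die outcome to its (X, Y) pair.
def x(roll):
    return "Even" if roll % 2 == 0 else "Odd"

def y(roll):
    return "Small" if roll <= 3 else "Large"

# Joint distribution P(X = x, Y = y), built by counting outcomes.
joint = {}
for roll in range(1, 7):
    key = (x(roll), y(roll))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 6)

# Marginalise by summing over the other variable's values.
p_x, p_y = {}, {}
for (xv, yv), p in joint.items():
    p_x[xv] = p_x.get(xv, Fraction(0)) + p
    p_y[yv] = p_y.get(yv, Fraction(0)) + p

for (xv, yv), p in sorted(joint.items()):
    print(f"P(X={xv}, Y={yv}) = {p}")  # e.g. P(X=Even, Y=Large) = 1/3
print(p_x["Even"], p_y["Large"])       # 1/2 1/2
```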

In addition to these distributions, conditional distributions describe how one variable behaves once the value of another is known, and can be calculated as

$$P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}.$$

For the example above, if we were to condition on the die roll being large, the probability that the roll is even can be calculated as

$$P(X = \text{Even} \mid Y = \text{Large}) = \frac{P(X = \text{Even}, Y = \text{Large})}{P(Y = \text{Large})} = \frac{2/6}{3/6} = \frac{2}{3},$$

which makes sense, since two of the three possible large rolls are even (4 and 6), while 5 is not.
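The same answer drops out of the joint table numerically (a short, self-contained sketch):

```python
from fractions import Fraction

# The joint table from above, as a dictionary.
joint = {
    ("Even", "Small"): Fraction(1, 6), ("Even", "Large"): Fraction(2, 6),
    ("Odd", "Small"): Fraction(2, 6), ("Odd", "Large"): Fraction(1, 6),
}

# Marginal: P(Y = Large) = sum over x of P(X = x, Y = Large).
p_large = sum(p for (xv, yv), p in joint.items() if yv == "Large")

# Conditional: P(X = Even | Y = Large).
print(joint[("Even", "Large")] / p_large)  # 2/3
```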

Independence and Conditional Independence

When working with multiple events, such as in joint and conditional distributions, an important question is whether variables genuinely influence one another, or whether their apparent relationship disappears once we account for additional information. This leads to the ideas of independence and conditional independence.

Here, $A$ and $B$ are independent events: learning whether one occurred tells us nothing about the other.

Independence means that knowing the outcome of one event tells us nothing about the outcome of another. Two events $A$ and $B$ are independent if

$$P(A \cap B) = P(A) P(B).$$

Equivalently, $P(A \mid B) = P(A)$ whenever $P(B) > 0$. Learning that $B$ occurred does not change the probability of $A$.

For random variables $X$ and $Y$, independence means that their joint distribution can be factorised as

$$P(X = x, Y = y) = P(X = x) P(Y = y).$$
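Interestingly, the even/odd and small/large variables from the earlier die table are not independent, and the factorisation test detects this immediately (a sketch reusing that joint table):

```python
from fractions import Fraction

joint = {
    ("Even", "Small"): Fraction(1, 6), ("Even", "Large"): Fraction(2, 6),
    ("Odd", "Small"): Fraction(2, 6), ("Odd", "Large"): Fraction(1, 6),
}

# Marginals, obtained by summing out the other variable.
p_x = {xv: sum(p for (x2, _), p in joint.items() if x2 == xv) for xv in ("Even", "Odd")}
p_y = {yv: sum(p for (_, y2), p in joint.items() if y2 == yv) for yv in ("Small", "Large")}

# Independence requires every joint entry to equal the product of marginals.
independent = all(joint[(xv, yv)] == p_x[xv] * p_y[yv] for (xv, yv) in joint)
print(independent)  # False
```

Here $P(X = \text{Even}, Y = \text{Small}) = 1/6$, while the product of marginals is $1/4$, so learning that the roll is small genuinely changes the probability that it is even.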

Conditional independence weakens the notion of independence by allowing dependence to disappear once additional information is taken into account. Two variables $X$ and $Y$ are conditionally independent given a third variable $Z$ if, after fixing the value of $Z$, knowing $X$ provides no further information about $Y$, and vice versa. In this setting, any apparent relationship between $X$ and $Y$ is fully explained by their shared dependence on $Z$.

Once $Z$ is known, knowing $X$ does not influence $Y$, and vice versa. There is no direct link between $X$ and $Y$.

Expressed mathematically, $X$ and $Y$ are conditionally independent given $Z$ if

$$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z) P(Y = y \mid Z = z).$$
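As an illustration (a hypothetical setup of my own, not from the discussion above): let $Z$ choose one of two coins with different biases, and let $X$ and $Y$ be two tosses of the chosen coin. Given $Z$, the joint factorises by construction, but marginally the two tosses are dependent, since the first toss carries information about which coin was chosen:

```python
from fractions import Fraction

# Z chooses a coin uniformly; each coin has its own probability of heads.
p_heads = {"coin_a": Fraction(9, 10), "coin_b": Fraction(1, 10)}
p_z = {z: Fraction(1, 2) for z in p_heads}

def p_toss(outcome, z):
    """P(single toss = outcome | Z = z); outcome is 1 (heads) or 0 (tails)."""
    return p_heads[z] if outcome == 1 else 1 - p_heads[z]

# Given Z, the tosses are independent, so the conditional joint factorises
# by construction: P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z).
def p_xy_given_z(xv, yv, z):
    return p_toss(xv, z) * p_toss(yv, z)

# Marginally, we sum Z out of the joint.
def p_xy(xv, yv):
    return sum(p_z[z] * p_xy_given_z(xv, yv, z) for z in p_z)

def p_x(xv):
    return sum(p_z[z] * p_toss(xv, z) for z in p_z)

# X and Y are NOT marginally independent: one head hints at the 9/10 coin.
print(p_xy(1, 1), p_x(1) * p_x(1))  # 41/100 1/4
```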

Conclusion

In this post, we explored basic probability theory from first principles. We started by defining outcomes, events, and event spaces, then introduced probability distributions through a small set of axioms that ensure consistency. From these foundations, we developed conditional probability, the chain rule, Bayes' rule, and the language of random variables and distributions, showing how joint, marginal, and conditional behaviour fit into a single framework. Independence and conditional independence then clarified when variables genuinely interact and when apparent relationships disappear once relevant information is taken into account.

Together, these ideas form the core of probability theory. In the next part, we will extend this framework by learning how to query probability distributions directly, move beyond discrete outcomes to continuous spaces, and introduce expectation and variance as tools for summarising and reasoning about random variables.