Introduction to Probability (Part 2)
Learn how to query probability distributions using conditioning, marginalisation, and MAP optimisation. Covers probability queries, MAP, marginal MAP, and continuous distributions with diagrams.
Querying Probability Distributions
We have covered the basics of probability theory: outcome spaces, events, random variables, and probability distributions. We described uncertainty using joint, marginal, and conditional probabilities. With these in place, we can now ask a practical question:
How do we use a probability distribution to answer real questions?
In practice, we rarely work with a single random variable. Most problems in statistics, machine learning, and probabilistic modelling involve multiple variables. We model these with a joint distribution. We then ask structured questions about them. Here, we formalise these questions and explain the types of answers probability theory provides.
First, we explore probability queries, then compare them with optimisation-based queries like MAP, and finally extend the concepts to continuous variables.
Probability Queries
The simplest way to query a probability distribution is to ask about likelihoods. Given a joint distribution over several variables, we find the probabilities it assigns to specific events, often after conditioning on observed data.
Given a joint probability distribution over observed and unknown variables, a probability query asks: after fixing observed variables, how is probability distributed over the remaining variables?
The variables in a probability query fall into two groups.
- The evidence variables, denoted by $E$, which are observed and take on a specific value $e$.
- The query variables, denoted by $Y$, which are the variables whose distribution we want to compute.
Our goal is to compute the conditional distribution, written as $P(Y \mid E = e)$, where $E = e$ means the evidence variables are fixed at specific values. The conditioning bar limits us to compatible outcomes. The result is a full distribution over $Y$, not just a single probability.
This is called the posterior distribution over $Y$. It assigns a probability to each possible value of $Y$, given the observed evidence.
Though this is an application of conditional probability, it’s useful to treat it as a query. Again, we seek a distribution over values of $Y$, not a single number.
When we condition on $E = e$ in the full joint distribution, we restrict ourselves to compatible outcomes. If we marginalise all variables except $Y$, we obtain the same posterior. Probability queries, therefore, combine conditioning and marginalisation.
The example above demonstrates how we can compute a posterior distribution, starting with a full joint distribution over three variables. The following steps are taken:
- Joint Distribution: We begin with a joint distribution over all variables; in this case, the query variables ($Y$), the evidence variables ($E$), and any hidden variables ($Z$). This distribution represents every possible world the model can imagine.
- Conditioning: First, we apply our observations. When we condition on $E = e$, we discard all outcomes in the distribution that are inconsistent with our evidence. Visually, this is represented by the green bars and is similar to taking a cross-section of our 3D distribution. This restricts us to a smaller set of compatible outcomes, giving us the conditional distribution $P(Y, Z \mid E = e)$.
- Marginalisation: After conditioning, we focus on the variables of interest. We do this by summing (or integrating) over the variables we do not care about ($Z$) in the conditional distribution $P(Y, Z \mid E = e)$. This operation, called marginalisation, removes $Z$ and leaves us with the distribution for $Y$ alone. In the diagram, this is shown by collapsing or summing the remaining bars along the $Z$-axis, resulting in a 2D distribution over $Y$.
The end result is the posterior distribution, $P(Y \mid E = e)$, which tells us exactly how probability is distributed over the possible values of $Y$ given what we have observed.
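To make the mechanics concrete, here is a minimal NumPy sketch using a made-up joint table over three binary variables $Y$, $E$, and $Z$ (the numbers and variable names are purely illustrative): we condition on the evidence by slicing and renormalising, then marginalise out the hidden variable.

```python
import numpy as np

# Hypothetical joint distribution P(Y, E, Z) over three binary variables.
# Axis 0 = Y, axis 1 = E, axis 2 = Z. Entries are made up and sum to 1.
joint = np.array([
    [[0.08, 0.02], [0.10, 0.05]],   # Y = 0
    [[0.15, 0.20], [0.25, 0.15]],   # Y = 1
])
assert np.isclose(joint.sum(), 1.0)

e_observed = 1  # evidence: E = 1

# Conditioning: keep only outcomes consistent with E = 1 and renormalise.
slice_e = joint[:, e_observed, :]        # unnormalised P(Y, Z, E = 1)
cond = slice_e / slice_e.sum()           # P(Y, Z | E = 1)

# Marginalisation: sum out the hidden variable Z.
posterior = cond.sum(axis=1)             # P(Y | E = 1)

print(posterior)  # a full distribution over Y, approximately [0.27, 0.73]
```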
MAP Queries
Although we have established that probability queries return full distributions, in some situations, we would prefer a single concrete answer rather than a distribution.
This leads to a different type of query: the maximum a posteriori (MAP) query. Here, we ask which joint assignment of unknowns is most likely, given the observed evidence.
Let $W$ denote the set of all non-evidence variables (variables that are not fixed through observation). A MAP query can then be defined as
$$\mathrm{MAP}(W \mid e) = \arg\max_{w} P(w \mid e).$$
This means: among all possible joint settings of the unknown variables, which single assignment is most likely given the evidence?
Here, $P(w \mid e)$ is the posterior probability of the assignment $w$ to $W$ after conditioning on $E = e$. Maximising over all $w$ finds the most likely assignment.
A MAP query is an optimisation problem, not a probability calculation. The output is a single joint assignment to the unknown variables. MAP queries are used when a single explanation or decision is needed, such as selecting the most likely configuration of hidden variables.
Furthermore, it is important to understand that the MAP assignment is the most likely joint configuration of variables, and this query does not assign each variable its individually most likely value.
For example, consider two variables where their joint behaviour matters. Each variable may have a clear most-likely value on its own, but the combination of those values may be unlikely. MAP accounts for the joint structure, but marginal probabilities do not.
Probability queries and MAP queries should be used differently. If we compute the marginal $P(W_i \mid e)$ for each variable and choose its most likely value, we are optimising each variable in isolation. A MAP query over multiple variables instead optimises their joint behaviour. These operations are not equivalent.
In general,
$$\arg\max_{w} P(w \mid e) \neq \bigl(\arg\max_{w_1} P(w_1 \mid e), \ldots, \arg\max_{w_k} P(w_k \mid e)\bigr).$$
The most likely joint explanation can differ from the combination of the individually most likely values. This is a basic property of joint probability distributions, not a rare exception.
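A small numeric sketch makes the inequality tangible. The $2 \times 2$ table below is a hypothetical posterior, already conditioned on the evidence; its joint MAP assignment differs from the tuple of per-variable marginal argmaxes.

```python
import numpy as np

# Hypothetical posterior P(W1, W2 | e) over two binary variables,
# already conditioned on the evidence. Rows = W1, columns = W2.
posterior = np.array([
    [0.35, 0.05],   # W1 = 0
    [0.30, 0.30],   # W1 = 1
])

# MAP query: the single most likely *joint* assignment.
map_joint = np.unravel_index(posterior.argmax(), posterior.shape)
print("joint MAP:", map_joint)                     # (0, 0), probability 0.35

# Per-variable argmax of the *marginals* -- a different operation.
w1_best = posterior.sum(axis=1).argmax()           # P(W1|e) = [0.40, 0.60] -> 1
w2_best = posterior.sum(axis=0).argmax()           # P(W2|e) = [0.65, 0.35] -> 0
print("per-variable argmax:", (w1_best, w2_best))  # (1, 0), probability only 0.30
```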
This distinction matters when designing inference algorithms. Using the wrong type of query can give mathematically correct answers that are not useful for the task.
Marginal MAP Queries
MAP queries can be refined further. Sometimes, we care only about a subset of variables and want the most likely assignment to those variables, while the remaining variables are deemed irrelevant.
Let $Y$ denote the variables we care about, $E = e$ the evidence, and $Z$ the remaining variables. A marginal MAP query asks for
$$\mathrm{MAP}(Y \mid e) = \arg\max_{y} P(y \mid e).$$
Using the definition of conditional probability, this can be written as
$$\arg\max_{y} \sum_{z} P(y, z \mid e).$$
By this definition, marginal MAP queries mix summation and maximisation. We first sum over "irrelevant" variables to remove them from consideration, and then maximise over the variables of interest.
This combination makes marginal MAP queries harder to compute than pure probability or pure MAP queries. The result is not a full posterior distribution or a complete joint assignment. Instead, we find the most probable values for a chosen subset of variables, accounting for all possible values of the rest.
A crucial property of marginal MAP queries is non-monotonicity. Changing which variables we optimise can change the optimal assignment in unexpected ways. We cannot simply compute a MAP assignment and ignore unwanted variables.
Marginal MAP queries must instead be solved by marginalising out irrelevant variables and jointly optimising the variables of interest. The two steps must be done in the correct order.
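The sketch below, again with made-up numbers, performs a marginal MAP over $Y$ by summing out $Z$ first and then maximising, and contrasts it with naively taking the $Y$-component of the full MAP assignment.

```python
import numpy as np

# Hypothetical P(Y, Z | e), already conditioned on the evidence.
# Rows = Y (binary), columns = Z (three values).
posterior = np.array([
    [0.30, 0.05, 0.05],   # Y = 0
    [0.20, 0.20, 0.20],   # Y = 1
])

# Marginal MAP over Y: first sum out Z, then maximise over Y.
marginal_over_y = posterior.sum(axis=1)          # [0.40, 0.60]
marginal_map_y = marginal_over_y.argmax()        # Y = 1

# Full MAP over (Y, Z), then dropping Z -- NOT the same query.
full_map = np.unravel_index(posterior.argmax(), posterior.shape)  # (0, 0)
print("marginal MAP over Y:", marginal_map_y)    # 1
print("Y-component of full MAP:", full_map[0])   # 0
```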
Continuous Spaces
So far, we have focused on random variables with finite or countable values. Many quantities of interest, however, are naturally continuous. Height, time, cost, and physical measurements all take values in continuous ranges.
Continuous random variables raise a problem. With infinitely many possible values, we cannot assign a nonzero probability to each exact value. The chance of observing any exact value is always zero.
This forces us to shift from probabilities of exact values to probabilities of intervals. Instead of $P(X = x)$ (the probability that the random variable $X$ takes on the value $x$), we work with expressions such as $P(a \le X \le b)$ (the probability that the random variable lies between the values $a$ and $b$).
Probability Density Functions
To handle continuous variables, we introduce probability density functions, or PDFs.
A probability density function for a random variable $X$ is a nonnegative function $p(x)$ that satisfies
$$\int_{-\infty}^{\infty} p(x)\, dx = 1.$$
This condition ensures that the total probability mass over all possible values of $X$ is equal to 1.
The PDF itself is not a probability. In a continuous space, the probability that $X$ takes any exact value is zero. The density shows how probability is spread across the real line. To get actual probabilities, we integrate the density over intervals. For any real numbers $a \le b$,
$$P(a \le X \le b) = \int_{a}^{b} p(x)\, dx.$$
Expressed in words, this means that the probability that $X$ falls between $a$ and $b$ is equal to the total area under the density curve between those two points.
Closely related to the PDF is the cumulative distribution function (CDF), defined as
$$F(x) = P(X \le x) = \int_{-\infty}^{x} p(t)\, dt.$$
The CDF shows how probability accumulates as we move along the real line. For each value $x$, it gives the total probability of all outcomes less than or equal to $x$.
The PDF describes how probability is distributed locally, while the CDF describes how that probability accumulates globally. Where the density is large, probability accumulates quickly, and the CDF rises steeply. Where the density is small, probability accumulates slowly, and the CDF rises more gradually.
In this way, the PDF controls how much probability is contributed by small neighbourhoods around each point, while the CDF tracks the probability that a random variable will take a value less than or equal to a specific number.
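As a quick numerical sanity check, the following sketch uses SciPy's standard normal (chosen only as a familiar example density) to confirm that the area under the PDF over an interval equals the difference of CDF values at its endpoints.

```python
from scipy import stats, integrate

a, b = -1.0, 0.5
dist = stats.norm(loc=0.0, scale=1.0)   # standard normal as an example density

# Probability of the interval via the area under the PDF...
area, _ = integrate.quad(dist.pdf, a, b)

# ...and via the CDF, which accumulates that same area from the left.
via_cdf = dist.cdf(b) - dist.cdf(a)

print(area, via_cdf)   # both approximately 0.5328
```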
Common Continuous Distributions
The simplest continuous distribution is the uniform distribution on an interval $[a, b]$. Its density is constant, equal to $1/(b - a)$, over the interval and zero elsewhere. The probability of any subinterval depends only on its length relative to the total length $b - a$.
More complex distributions assign density unevenly. The most important example is the Gaussian distribution, defined by a mean $\mu$ (also known as the average) and a variance $\sigma^2$ (which measures how spread out values are around the average). Its density,
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$
has a bell shape, with the mean controlling the centre and the variance controlling how spread out the distribution is.
As the variance decreases, the density becomes more concentrated around the mean. As it increases, the distribution flattens out.
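The short sketch below, relying on SciPy's textbook distributions with arbitrary parameters, illustrates both points: a uniform interval probability that depends only on length, and a Gaussian that concentrates more mass near its mean as the variance shrinks.

```python
from scipy import stats

# Uniform on [2, 6]: any subinterval's probability is its length / (6 - 2).
uni = stats.uniform(loc=2.0, scale=4.0)     # scipy parametrises [loc, loc + scale]
print(uni.cdf(4.0) - uni.cdf(3.0))          # 0.25 = 1 / 4

# Two Gaussians with the same mean but different variances.
narrow = stats.norm(loc=0.0, scale=0.5)     # variance 0.25
wide = stats.norm(loc=0.0, scale=2.0)       # variance 4.0

# Probability of landing within one unit of the mean.
print(narrow.cdf(1.0) - narrow.cdf(-1.0))   # ~0.954, tightly concentrated
print(wide.cdf(1.0) - wide.cdf(-1.0))       # ~0.383, much more spread out
```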
Joint Density Functions
All of the ideas developed for single continuous variables extend naturally to multiple continuous variables.
A joint density function $p(x, y)$ assigns density to combinations of values. As with regular probability density functions, it must be non-negative and integrate to one over the full space. Probabilities of joint events are obtained by integrating over regions:
$$P(a \le X \le b,\; c \le Y \le d) = \int_{a}^{b} \int_{c}^{d} p(x, y)\, dy\, dx.$$
As with joint probabilities, we can recover marginal densities from a joint density by integrating out the unwanted variables. For example, the marginal density of $X$ from a joint density $p(x, y)$ can be calculated as
$$p(x) = \int_{-\infty}^{\infty} p(x, y)\, dy.$$
This is the continuous version of marginalisation.
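As a worked example, take the (made-up but valid) joint density $p(x, y) = x + y$ on the unit square; its marginal in $x$ is $x + \tfrac{1}{2}$, which the sketch below recovers by numerical integration.

```python
from scipy import integrate

# A hypothetical joint density on the unit square: p(x, y) = x + y.
def joint_pdf(x, y):
    return x + y

# It is a valid density: non-negative and integrates to 1 over [0,1] x [0,1].
total, _ = integrate.dblquad(lambda y, x: joint_pdf(x, y), 0, 1, 0, 1)
print(total)   # ~1.0

# Marginal density of X at a point: integrate the joint over y.
def marginal_x(x):
    value, _ = integrate.quad(lambda y: joint_pdf(x, y), 0, 1)
    return value

print(marginal_x(0.3))   # ~0.8, matching the analytic marginal x + 1/2
```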
Conditional Density Functions
Defining conditional distributions in continuous spaces requires some care. For a continuous random variable, the probability of taking any exact value is zero. This is because probabilities are obtained by integrating the density over intervals, and an interval of zero width contributes zero area. As a result, $P(X = x) = 0$ for every value of $x$, and we cannot directly apply the discrete definition of conditional probability.
Instead, we use limits over shrinking intervals. We look at a small interval around $x$, such as $[x, x + \epsilon]$ (where $\epsilon$ is a very small number), which has positive probability. We then define conditional probabilities for this interval and see what happens as the interval shrinks.
This leads us to look at quantities of the form
$$P(Y \le y \mid x \le X \le x + \epsilon)$$
and take the limit as $\epsilon \to 0$. Under mild conditions, this limit exists and gives a conditional distribution of $Y$ given $X = x$.
This limiting process gives a simple expression for the conditional density, written as
$$p(y \mid x) = \frac{p(x, y)}{p(x)},$$
provided that $p(x) > 0$.
The limit is therefore implicit in the definition. The formula looks identical to the discrete case, but here it means conditioning on smaller and smaller neighbourhoods around $x$, not on the event $X = x$ itself.
From conditional densities, we recover the familiar identities
$$p(x, y) = p(x)\, p(y \mid x)$$
and
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)},$$
which are the density-based versions of the chain rule and Bayes’ rule.
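Reusing the hypothetical $p(x, y) = x + y$ density from above, the following sketch checks numerically that $p(y \mid x) = p(x, y)/p(x)$ integrates to 1 over $y$ and that the density form of Bayes' rule holds at an arbitrary point.

```python
from scipy import integrate

# Hypothetical joint density on the unit square (same example as before).
def p_xy(x, y):
    return x + y

def p_x(x):   # marginal of X: integrate out y (analytically x + 1/2)
    return integrate.quad(lambda y: p_xy(x, y), 0, 1)[0]

def p_y(y):   # marginal of Y: integrate out x (analytically y + 1/2)
    return integrate.quad(lambda x: p_xy(x, y), 0, 1)[0]

def p_y_given_x(y, x):
    return p_xy(x, y) / p_x(x)

# A conditional density must itself integrate to 1 over y.
print(integrate.quad(lambda y: p_y_given_x(y, x=0.3), 0, 1)[0])   # ~1.0

# Density form of Bayes' rule: p(x | y) = p(y | x) p(x) / p(y).
x0, y0 = 0.3, 0.7
lhs = p_xy(x0, y0) / p_y(y0)                     # p(x | y) from the definition
rhs = p_y_given_x(y0, x0) * p_x(x0) / p_y(y0)    # via Bayes' rule
print(lhs, rhs)                                  # both ~0.8333
```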
Independence in Continuous Variables
With joint and conditional density functions in place, we can now describe how continuous random variables relate to one another. In particular, the notion of independence extends naturally to the continuous setting. Two variables $X$ and $Y$ are independent if their joint density factorises as
$$p(x, y) = p(x)\, p(y).$$
Conditional independence is defined the same way, by factorising conditional densities. These ideas are central in probabilistic modelling, especially in high-dimensional environments where structure is needed for tractable inference.
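As a final sketch (all densities here are illustrative), a two-dimensional Gaussian with zero correlation factorises into the product of its marginals, whereas the $x + y$ density used earlier does not, so its $X$ and $Y$ are dependent.

```python
from scipy import stats

# Independent case: a 2D Gaussian with zero correlation factorises into
# the product of its two marginal (standard normal) densities.
mvn = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.0], [0.0, 1.0]])
x, y = 0.4, -1.2
print(mvn.pdf([x, y]))                           # ~0.0715
print(stats.norm.pdf(x) * stats.norm.pdf(y))     # same value

# Dependent case: p(x, y) = x + y on the unit square has marginals
# x + 1/2 and y + 1/2, whose product is not x + y, so X and Y are dependent.
x, y = 0.3, 0.7
print(x + y, (x + 0.5) * (y + 0.5))              # 1.0 vs 0.96 -> no factorisation
```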
Conclusion
In this post, we looked at how to query and use probability distributions in practice. We started with probability queries, which return full posterior distributions, and compared them with MAP queries, which return the most likely assignments. We saw why joint optimisation is different from marginal optimisation, and how marginal MAP queries combine summation and maximisation.
We then extended these ideas to continuous random variables. By introducing probability density functions, joint and marginal densities, and conditional densities, we showed that the same logic underlies both discrete and continuous probability.
Together, these tools form the basis of probabilistic inference. Next, we will introduce expectation and variance, and show how they help us summarise and reason about random variables.

