## The Rules of Probability

Throughout, the letter x represents some observed data, the Greek letter θ (theta) represents the value of a parameter Θ, and we wish to estimate θ by making use of the data x.

Independent Outcomes: If a set of individual outcomes is independent, then the probability of that set of outcomes is obtained by multiplying together the probabilities of the individual outcomes.

For example, consider a coin for which the probability of a head xh is p(xh) = 0.9, and the probability of a tail xt is therefore p(xt) = (1 − 0.9) = 0.1. If we flip this coin twice then there are 4 possible pairs of outcomes: two heads (xh, xh), two tails (xt, xt), a head followed by a tail (xh, xt), and a tail followed by a head (xt, xh). In order to work out some averages, imagine that we flip this coin twice, 100 times. We label each flip according to whether it came first or second within a pair of flips, so we have 100 first flip outcomes, and 100 corresponding second flip outcomes (see Table C.1).

Table C.1: The number N and probability N/100 of each possible outcome from 100 pairs of flips of a coin which lands heads up 90% of the time. Ordered sequences or permutations are written in round brackets ‘()’, whereas un-ordered sets or combinations are written in curly brackets ‘{}’.

Given that p(xh) = 0.9, we expect 90 heads and 10 tails within the set of 100 first flips, and the same applies to the set of 100 second flips. But what about the number of pairs of outcomes?

For each head obtained on the first flip, we can observe the corresponding outcome on the second flip, and then add up the number of pairs of each type (e.g. (xh, xh)). We already know that there are (on average)

 $90 = 0.9 \times 100,$ (C.1)

heads within the set of 100 first flip outcomes. For each of these 90 heads, the outcome of the corresponding second flip does not depend on the outcome of the first flip, so we would expect

 $81 = 0.9 \times 90,$ (C.2)

of these 90 second flip outcomes to be heads. In other words, 81 out of 100 pairs of coin flips should yield two heads. Notice that the figure of 90 heads was obtained from Equation C.1, so we can re-write Equation C.2 as

 $81 = 0.9 \times (0.9 \times 100)$ (C.3)
 $\phantom{81} = 0.81 \times 100,$ (C.4)

where 0.9 is the probability p(xh) of a head, so the probability of obtaining two heads is $p(x_h)^2 = 0.9^2 = 0.81$.

A similar logic can be applied to find the probability of the other pairs (xh, xt) and (xt, xt). For the pair (xt, xt), there are (on average) 10 tails observed in the set of 100 first flip outcomes. For each of these 10 tails, the corresponding second flip also has an outcome, and we would expect 1 = 0.1 × 10 of these 10 second flips to be a tail too, so that 1 out of 100 pairs of coin flips should consist of two tails (xt, xt).

The final pair is a little more tricky, but only a little. For the ordered pair (xh, xt), there are (on average) 90 heads from the set of 100 first flips, and we would expect 9 = 0.1 × 90 of the corresponding 90 second flips to yield a tail; so that 9 out of 100 pairs of coin flips should be (xh, xt). Similarly, for the ordered pair (xt, xh), there are (on average) 10 tails in the set of 100 first flips, and we would expect 9 = 0.9 × 10 of the corresponding 10 second flips to yield a head; so that 9 out of 100 pairs of coin flips should be (xt, xh). If we now consider pairs that contain a head and a tail in any order then we would expect 18 = 9 + 9 such pairs. Notice that the figure of 90 heads was obtained from 90 = 0.9 × 100, so we can write this as 9 = (0.1 × 0.9) × 100, or p(xh)p(xt) × 100.

In summary, given a coin that lands heads up on 90% of flips, in any given pair of coin flips we have (without actually flipping a single coin) worked out that there is a 0.81 probability of obtaining two heads, a 0.01 probability of obtaining two tails, and a 0.18 probability of obtaining a head and a tail. Notice that these three probabilities sum to one, as they should. More importantly, the probability of obtaining each pair of outcomes is obtained by multiplying the probability associated with each individual coin flip outcome.
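These derived pair probabilities can also be checked empirically. The sketch below (which assumes 100,000 pairs rather than the 100 used in the text, purely so the estimates are stable) simulates pairs of flips of the biased coin and tallies each outcome pair:

```python
import random

random.seed(0)

p_head = 0.9       # probability of a head, as in the text
n_pairs = 100_000  # assumed pair count; larger than 100 for stable estimates

# Tally of each ordered pair of outcomes.
counts = {("h", "h"): 0, ("h", "t"): 0, ("t", "h"): 0, ("t", "t"): 0}
for _ in range(n_pairs):
    pair = tuple("h" if random.random() < p_head else "t" for _ in range(2))
    counts[pair] += 1

# Estimated probabilities should be close to 0.81, 0.09, 0.09, and 0.01.
for pair, n in sorted(counts.items()):
    print(pair, n / n_pairs)
```

The estimated proportions should land close to the derived values, with the two mixed pairs together accounting for roughly 0.18 of all pairs.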

We can apply this rule to any number of coin flips. For example, consider a coin flipped 5 times in succession, which yields the outcome sequence (xh, xh, xt, xt, xh). Given that these individual outcomes are independent, the probability of the observed sequence is

 $p((x_h, x_h, x_t, x_t, x_h)) = p(x_h)\,p(x_h)\,p(x_t)\,p(x_t)\,p(x_h)$ (C.5)
 $\phantom{p((x_h, x_h, x_t, x_t, x_h))} = p(x_h)^3 \times p(x_t)^2$ (C.6)
 $\phantom{p((x_h, x_h, x_t, x_t, x_h))} \approx 0.007.$ (C.7)

Note that this is the probability of a particular sequence or permutation, and the probability of 3 heads and 2 tails in any order involves a constant in the form of a binomial coefficient (see Appendix E).
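As a quick check of Equations C.5–C.7, the sketch below computes the probability of the particular permutation, and then the probability of 3 heads and 2 tails in any order using Python's `math.comb` for the binomial coefficient mentioned above:

```python
from math import comb

p_h, p_t = 0.9, 0.1  # head and tail probabilities from the text

# Probability of the particular permutation (h, h, t, t, h), Equation C.6.
p_sequence = p_h**3 * p_t**2
print(round(p_sequence, 5))  # 0.00729, which the text quotes as 0.007

# Probability of 3 heads and 2 tails in ANY order: multiply by the binomial
# coefficient C(5, 3), which counts the possible orderings.
p_combination = comb(5, 3) * p_sequence
print(round(p_combination, 4))  # 0.0729
```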

Conditional Probability: The conditional probability that X = x given that Θ = θ is defined as

 $p(x \,|\, \theta) = p(x, \theta)/p(\theta).$ (C.8)

The Product Rule: Multiplying both sides of Equation C.8 by p(θ) yields the product rule (also known as the chain rule)

 $p(x, \theta) = p(x \,|\, \theta)\, p(\theta).$ (C.9)

The Sum Rule and Marginalisation: The sum rule is also known as the law of total probability. In the case of a discrete variable,

 $p(x) = \sum_{i} p(x, \theta_i),$ (C.10)

and applying the product rule yields

 $p(x) = \sum_{i} p(x \,|\, \theta_i)\, p(\theta_i).$ (C.11)
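Equation C.11 can be illustrated with a small hypothetical example: suppose θ takes one of two values, a fair coin or the biased coin from the text, each with prior probability 0.5 (these priors are assumptions chosen purely for illustration):

```python
# Hypothetical two-valued parameter: theta is either a fair coin or the
# biased coin from the text, with assumed equal prior probabilities.
p_theta = {"fair": 0.5, "biased": 0.5}
p_head_given_theta = {"fair": 0.5, "biased": 0.9}

# Sum rule (Equation C.11): p(x_h) = sum_i p(x_h | theta_i) p(theta_i).
p_head = sum(p_head_given_theta[t] * p_theta[t] for t in p_theta)
print(round(p_head, 3))  # 0.25 + 0.45 = 0.7
```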

In the case of a continuous variable, the sum and product rules yield

 $p(x) = \int_{\theta} p(x, \theta)\, d\theta = \int_{\theta} p(x \,|\, \theta)\, p(\theta)\, d\theta.$ (C.12)

This is known as marginalisation, and yields the marginal probability p(x) of the joint distribution p(X, Θ) at X = x. When applied to all values of X, marginalisation yields the marginal probability distribution p(X) of p(X, Θ).

Bayes’ Rule: If we swap the roles of x and θ in Equation C.9 then

 $p(\theta, x) = p(\theta \,|\, x)\, p(x),$ (C.13)

where p(θ, x) = p(x, θ). Thus,

 $p(\theta \,|\, x)\, p(x) = p(x \,|\, \theta)\, p(\theta).$ (C.14)

Dividing both sides of Equation C.14 by p(x) yields Bayes’ rule

 $p(\theta \,|\, x) = \frac{p(x \,|\, \theta)\, p(\theta)}{p(x)}.$ (C.15)
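Bayes’ rule can be illustrated with a hypothetical two-coin setup: θ is either a fair coin or the biased coin from the text, with assumed equal priors. After observing a single head, Equation C.15 gives the posterior probability of each coin:

```python
# Hypothetical setup: fair vs biased coin, with assumed equal priors.
p_theta = {"fair": 0.5, "biased": 0.5}
p_head_given_theta = {"fair": 0.5, "biased": 0.9}

# Marginal likelihood p(x_h) via the sum rule (Equation C.11).
p_head = sum(p_head_given_theta[t] * p_theta[t] for t in p_theta)

# Bayes' rule (Equation C.15): posterior over theta after observing a head.
posterior = {t: p_head_given_theta[t] * p_theta[t] / p_head for t in p_theta}
for t, p in posterior.items():
    # The biased coin is now more probable than the fair one.
    print(t, round(p, 3))
```

Observing a head shifts belief towards the biased coin, since a head is more probable under that hypothesis.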