This tutorial is taken from Chapter 1 of the book
Bayes' Rule: A Tutorial Introduction to Bayesian Analysis
which can be downloaded as a pdf file from
here, and includes a table of contents, plus computer code in MatLab, Python and R.
James V Stone, February 2018.
Email: j.v.stone at sheffield.ac.uk
“… we balance probabilities and choose the most likely. It is the scientific use of the imagination … ”
Sherlock Holmes, The Hound of the Baskervilles.
AC Doyle, 1901.
Bayes’ rule is a rigorous method for interpreting evidence in the context of previous experience or knowledge. It was discovered by Thomas Bayes (c. 1701–1761), and independently discovered by Pierre-Simon Laplace (1749–1827). After more than two centuries of controversy, during which Bayesian methods have been both praised and pilloried, Bayes’ rule has recently emerged as a powerful tool with a wide range of applications, which include: genetics^{2}, linguistics^{12}, image processing^{15}, brain imaging^{33}, cosmology^{17}, machine learning^{5}, epidemiology^{26}, psychology^{31;44}, forensic science^{43}, human object recognition^{22}, evolution^{13}, visual perception^{23;41}, ecology^{32} and even the work of the fictional detective Sherlock Holmes^{21}. Historically, Bayesian methods were applied by Alan Turing to the problem of decoding the German Enigma code in the Second World War, but this remained secret until recently^{16;29;37}.
In order to appreciate the inner workings of any of the above applications, we need to understand why Bayes’ rule is useful, and how it constitutes a mathematical foundation for reasoning. We will do this using a few accessible examples, but first, we will establish a few ground rules, and provide a reassuring guarantee.
In the examples in this chapter, we will not delve into the precise meaning of probability, but will instead assume a fairly informal notion based on the frequency with which particular events occur. For example, if a bag contains 40 white balls and 60 black balls then the probability of reaching into the bag and choosing a black ball is the same as the proportion of black balls in the bag (ie 60/100=0.6). From this, it follows that the probability of an event (eg choosing a black ball) can adopt any value between zero and one, with zero meaning it definitely will not occur, and one meaning it definitely will occur. Finally, given a set of mutually exclusive events, such as the outcome of choosing a ball, which has to be either black or white, the probabilities of those events have to add up to one (eg 0.4+0.6=1). We explore the subtleties of the meaning of probability in Section 7.1.
Before embarking on these examples, we should reassure ourselves with a fundamental fact regarding Bayes’ rule, or Bayes’ theorem, as it is also called: Bayes’ theorem is not a matter of conjecture. By definition, a theorem is a mathematical statement that has been proved to be true. This is reassuring because, if we had to establish the rules for calculating with probabilities, we would insist that the result of such calculations must tally with our everyday experience of the physical world, just as surely as we would insist that 1 + 1 = 2. Indeed, if we insist that probabilities must be combined with each other in accordance with certain common-sense principles then Cox (1946)^{7} showed that this leads to a unique set of rules, a set which includes Bayes’ rule. Bayes’ rule also appears as part of Kolmogorov’s (1933)^{24} (arguably, more rigorous) theory of probability.
Suppose that you wake up one day with spots all over your face, as in Figure 1.2. The doctor tells you that 90% of people who have smallpox have the same symptoms as you have. In other words, the probability of having these symptoms given that you have smallpox is 0.9 (ie 90%). As smallpox is often fatal, you are naturally terrified.
However, after a few moments of contemplation, you decide that you do not want to know the probability that you have these symptoms (after all, you already know you have them). Instead, what you really want to know is the probability that you have smallpox.
So you say to your doctor, “Yes, but what is the probability that I have smallpox given that I have these symptoms?”. “Ah”, says your doctor, “a very good question.” After scribbling some equations, your doctor looks up. “The probability that you have smallpox given that you have these symptoms is 1.1%, or equivalently, 0.011.” Of course, this is not good news, but it sounds better than 90%, and (more importantly) it is at least useful information. This demonstrates the stark contrast between the probability of the symptoms given a disease (which you do not want to know) and the probability of the disease given the symptoms (which you do want to know).
Bayes’ rule transforms probabilities that look useful (but are often not), into probabilities that are useful. In the above example, the doctor used Bayes’ rule to transform the uninformative probability of your symptoms given that you have smallpox into the informative probability that you have smallpox given your symptoms.
Now, suppose you are a doctor, confronted with a patient who is covered in spots. The patient’s symptoms are consistent with chickenpox, but they are also consistent with another, more dangerous, disease, smallpox. So you have a dilemma. You know that 80% of people with chickenpox have spots, but also that 90% of people with smallpox have spots. So the probability (0.8) of the symptoms given that the patient has chickenpox is similar to the probability (0.9) of the symptoms given that the patient has smallpox (see Figure 1.2).
If you were a doctor with limited experience then you might well think that both chickenpox and smallpox are equally probable. But, as you are a knowledgeable doctor, you know that chickenpox is common, whereas smallpox is rare. This knowledge, or prior information, can be used to decide which disease the patient probably has. If you had to guess (and you do have to guess because you are the doctor) then you would combine the possible diagnoses implied by the symptoms with your prior knowledge to arrive at a conclusion (ie that the patient probably has chickenpox). In order to make this example more tangible, let’s run through it again, this time with numbers.
We can work out probabilities associated with a disease by making use of public health statistics. Suppose doctors are asked to report the number of cases of smallpox and chickenpox, and the symptoms observed. Using the results of such surveys, it is a simple matter to find the proportion of patients diagnosed with smallpox and chickenpox, and each patient’s symptoms (eg spots). Using these data, we might find that the probability that a patient has spots given that the patient has smallpox is 90% or 0.9. We can write this in an increasingly succinct manner using a special notation
\begin{equation} p({\rm symptoms \: are \: spots}|{\rm disease \: is \: smallpox}) = 0.9, \tag{1.1} \end{equation}
where the letter p stands for probability, and the vertical bar | stands for “given that”. So, this shorthand statement should be read as
“the probability that the patient’s symptoms are spots given that he has smallpox is 90% or 0.9”.
The vertical bar indicates that the probability that the patient has spots depends on the presence of smallpox. Thus, the probability of spots is said to be conditional on the disease under consideration. For this reason, such probabilities are known as conditional probabilities. We can write this even more succinctly as
\begin{equation} p({\rm spots}|{\rm smallpox}) = 0.9. \tag{1.2} \end{equation}
Similarly, we might find that spots are observed in 80% of patients who have chickenpox, which is written as
\begin{equation} p({\rm spots}|{\rm chickenpox}) = 0.8. \tag{1.3} \end{equation}
Equations 1.2 and 1.3 formalise why we should not use the symptoms alone to decide which disease the patient has. These equations take no account of our previous experience of the relative prevalence of smallpox and chickenpox, and are based only on the observed symptoms. As we shall see later, this is equivalent to making a decision based on the (in this case, false) assumption that both diseases are equally prevalent in the population, and that they are therefore a priori equally probable.
Note that the conditional probability p(spots|smallpox) is the probability of spots given that the patient has smallpox, but it is called the likelihood of smallpox (which is confusing, but standard, nomenclature). In this example, the disease smallpox has a larger likelihood than chickenpox. Indeed, as there are only two diseases under consideration, this means that, of the two possible alternatives, smallpox has the maximum likelihood. The disease with the maximum value of likelihood is known as the maximum likelihood estimate (MLE) of the disease that the patient has. Thus, in this case, the MLE of the disease is smallpox.
As discussed above, it would be hard to argue that we should disregard our knowledge or previous experience when deciding which disease the patient has. But exactly how should this previous experience be combined with current evidence (eg symptoms)? From a purely intuitive perspective, it would seem sensible to weight the likelihood of each disease according to previous experience of that disease, as in Figure 1.3. Since smallpox is rare, and is therefore intrinsically improbable, it might be sensible to weight the likelihood of smallpox by a small number. This would yield a small ‘weighted likelihood’, which would be a more realistic estimate of the probability that the patient has smallpox. For example, public health statistics may inform us that the prevalence of smallpox in the general population is 0.001, meaning that there is a one in a thousand chance that a randomly chosen individual has smallpox. Thus, the probability that a randomly chosen individual has smallpox is
p(smallpox) = 0.001. 
This represents our prior knowledge about the disease in the population before we have observed our patient, and is known as the prior probability that any given individual has smallpox. As our patient (before we have observed his symptoms) is as likely as any other individual to have smallpox, we know that the prior probability that he has smallpox is 0.001.
If we follow our commonsense prescription, and simply weight (ie multiply) each likelihood by its prior probability then we obtain ‘weighted likelihood’ quantities which take account of the current evidence and of our prior knowledge of each disease. In short, this commonsense prescription leads to Bayes’ rule. Even so, the equation for Bayes’ rule given below is not obvious, and should be taken on trust for now. In the case of smallpox, Bayes’ rule is \begin{equation} p({\rm smallpox}|{\rm spots}) = \frac{p({\rm spots}|{\rm smallpox}) \times p({\rm smallpox}) }{p({\rm spots})}. \tag{1.5} \end{equation}
The term p(spots) in the denominator of Equation 1.5 is the proportion of people in the general population who have spots, and therefore represents the probability that a randomly chosen individual has spots. As will be explained on p15, this term is often disregarded, but we use a value that makes our sums come out neatly, and assume that p(spots) = 0.081 (ie 81 in every 1,000 individuals have spots). If we now substitute numbers into this equation then we obtain
\begin{eqnarray} p({\rm smallpox}|{\rm spots}) & = & 0.9 \times 0.001/0.081 \nonumber \\ & = & 0.011, \tag{1.6} \end{eqnarray}
which is the conditional probability that the patient has smallpox given that his symptoms are spots.
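The substitution above can be checked numerically. The following is a minimal Python sketch; the function name `posterior` and its argument names are illustrative choices rather than anything defined in the text, and the three numbers are the values assumed in this example.

```python
def posterior(likelihood, prior, marginal):
    """Bayes' rule: posterior = likelihood * prior / marginal likelihood."""
    return likelihood * prior / marginal


# Values assumed in the text for the smallpox example.
p_spots_given_smallpox = 0.9   # likelihood, p(spots|smallpox)
p_smallpox = 0.001             # prior, p(smallpox)
p_spots = 0.081                # marginal likelihood, p(spots)

p_smallpox_given_spots = posterior(p_spots_given_smallpox, p_smallpox, p_spots)
print(round(p_smallpox_given_spots, 3))  # 0.011
```

Changing only the prior (for example, to a higher prevalence) and re-running shows how strongly the posterior depends on prior knowledge, even when the likelihood is unchanged.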
Crucially, the ‘weighted likelihood’ p(smallpox|spots) is also a conditional probability, but it is the probability of the disease smallpox given the symptoms observed, as shown in Figure 1.4. So, by making use of prior experience, we have transformed the conditional probability of the observed symptoms given a specific disease (the likelihood, which is based only on the available evidence) into a more useful conditional probability: the probability that the patient has a particular disease (smallpox) given that he has particular symptoms (spots).
In fact, we have just made use of Bayes’ rule to convert one conditional probability, the likelihood p(spots|smallpox), into a more useful conditional probability, which we have been calling a ‘weighted likelihood’, but is formally known as the posterior probability p(smallpox|spots).
As noted above, both p(smallpox|spots) and p(spots|smallpox) are conditional probabilities, which have the same status from a mathematical viewpoint. However, for Bayes’ rule, we treat them very differently.
The conditional probability p(spots|smallpox) is based only on the observed data (symptoms), and is therefore easier to obtain than the conditional probability we really want, namely p(smallpox|spots), which is also based on the observed data, but also on prior knowledge. For historical reasons, these two conditional probabilities have special names. As we have already seen, the conditional probability p(spots|smallpox) is the probability that a patient has spots given that he has smallpox, and is known as the likelihood of smallpox. The complementary conditional probability p(smallpox|spots) is the posterior probability that a patient has smallpox given that he has spots. In essence, Bayes’ rule is used to combine prior experience (in the form of a prior probability) with observed data (spots) (in the form of a likelihood) to interpret these data (in the form of a posterior probability). This process is known as Bayesian inference.
Bayesian inference is not guaranteed to provide the correct answer. Instead, it provides the probability that each of a number of alternative answers is true, and these can then be used to find the answer that is most probably true. In other words, it provides an informed guess. While this may not sound like much, it is far from random guessing. Indeed, it can be shown that no other procedure can provide a better guess, so that Bayesian inference can be justifiably interpreted as the output of a perfect guessing machine, a perfect inference engine (see Section 4.9, p92). This perfect inference engine is fallible, but it is provably less fallible than any other.
In order to make a diagnosis, we need to know the posterior probability of both of the diseases under consideration. Once we have both posterior probabilities, we can compare them in order to choose the disease that is most probable given the observed symptoms.
Suppose that the prevalence of chickenpox in the general population is 10% or 0.1. This represents our prior knowledge about chickenpox before we have observed any symptoms, and is written as
p(chickenpox) = 0.1, 
which is the prior probability of chickenpox. As was done in Equation 1.6 for smallpox, we can weight the likelihood of chickenpox with its prior probability to obtain the posterior probability of chickenpox
\begin{eqnarray} p({\rm chickenpox}|{\rm spots}) & = & p({\rm spots}|{\rm chickenpox}) \times p({\rm chickenpox})/p({\rm spots}) \nonumber \\ & = & 0.8 \times 0.1/0.081 \nonumber \\ & = & 0.988. \tag{1.9} \end{eqnarray}
The two posterior probabilities, summarised in Figure 1.4, are therefore
\begin{eqnarray} p({\rm smallpox}|{\rm spots}) & = & 0.011 \tag{1.10} \\ p({\rm chickenpox}|{\rm spots}) & = & 0.988. \tag{1.11} \end{eqnarray}
Thus, the posterior probability that the patient has smallpox is 0.011, and the posterior probability that the patient has chickenpox is 0.988. Aside from a rounding error, these sum to one.
Notice that we cannot be certain that the patient has chickenpox, but we can be certain that there is a 98.8% probability that he does. This is not only our best guess, but it is provably the best guess that can be obtained; it is effectively the output of a perfect inference engine.
In summary, if we ignore all previous knowledge regarding the prevalence of each disease then we have to use the likelihoods to decide which disease is present. The likelihoods shown in Equations 1.2 and 1.3 would lead us to diagnose the patient as probably having smallpox. However, a more informed decision can be obtained by taking account of prior information regarding the diseases under consideration. When we do take account of prior knowledge, Equations 1.10 and 1.11 indicate that the patient probably has chickenpox. In fact, these equations imply that the patient is about 89 times more likely to have chickenpox than smallpox (the unrounded ratio is 0.08/0.0009 = 88.9, whereas dividing the rounded values 0.988/0.011 gives about 90). As we shall see later, this ratio of posterior probabilities plays a key role in Bayesian statistical analysis (Section 1.1, p14).
Taking account of previous experience yields the diagnosis that is most probable, given the evidence (spots). As this is the decision associated with the maximum value of the posterior probability, it is known as the maximum a posteriori or MAP estimate of the disease.
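The contrast between the maximum likelihood estimate and the MAP estimate can be made concrete with a short Python sketch. The dictionary names below are illustrative, and the numbers are the likelihoods, priors and marginal likelihood assumed in this example.

```python
# Likelihoods and priors assumed in the text.
likelihood = {"smallpox": 0.9, "chickenpox": 0.8}   # p(spots|disease)
prior = {"smallpox": 0.001, "chickenpox": 0.1}      # p(disease)
p_spots = 0.081                                     # p(spots), as assumed in the text

# Posterior probability of each disease, using Bayes' rule.
posterior = {d: likelihood[d] * prior[d] / p_spots for d in likelihood}

# The MLE ignores the prior; the MAP estimate maximises the posterior.
mle = max(likelihood, key=likelihood.get)
map_estimate = max(posterior, key=posterior.get)

print(mle)           # smallpox
print(map_estimate)  # chickenpox
for disease, p in posterior.items():
    print(disease, round(p, 3))  # smallpox 0.011, then chickenpox 0.988
```

The two estimates disagree here only because the priors are so unequal; if both diseases were equally prevalent, the MLE and the MAP estimate would coincide.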
The equation used to perform Bayesian inference is called Bayes’ rule, and in the context of diagnosis is \begin{equation} p({\rm disease}|{\rm symptoms}) = \frac{p({\rm symptoms}|{\rm disease}) \times p({\rm disease}) }{p({\rm symptoms})}, \tag{1.12} \end{equation}
which is easier to remember as \begin{equation} {\rm posterior} = \frac{ {\rm likelihood} \times {\rm prior} }{ {\rm marginal \: likelihood }}. \tag{1.13} \end{equation}
The marginal likelihood is also known as evidence, and we shall have more to say about it shortly.
If we consider a putative disease to represent a specific hypothesis, and the symptoms to be some observed data then Bayes’ rule becomes \begin{equation} p({\rm hypothesis}|{\rm data}) = \frac{p({\rm data}|{\rm hypothesis}) \times p({\rm hypothesis}) }{p({\rm data})}, \nonumber \end{equation}
where the word “hypothesis” should be interpreted as, “hypothesis is true”. Written in this form, the contrast between the likelihood and the posterior probability is more apparent. Specifically, the probability that the proposed hypothesis is true given some data that were actually observed is the posterior probability
\begin{equation} p({\rm hypothesis}|{\rm data}), \tag{1.14} \end{equation}
whereas the probability of observing the data given that the hypothesis is true is the likelihood
\begin{equation} p({\rm data}|{\rm hypothesis}). \tag{1.15} \end{equation}
We now introduce a succinct, and reasonably conventional, notation for the terms defined above. There is nothing new in the mathematics of this section, just a rewriting of equations used above. If we represent the observed symptoms by x, and the disease by the Greek letter theta θ_{s} (where the subscript s stands for smallpox) then we can write the conditional probability (ie the likelihood of smallpox) in Equation 1.2 as \begin{equation} p(x|\theta_s) = p({\rm spots}|{\rm smallpox}) = 0.9. \tag{1.16} \end{equation}
Similarly, the background rate of smallpox θ_{s} in the population can be represented as the prior probability
p(θ_{s}) = p(smallpox) = 0.001, 
and the probability of the symptoms (the marginal likelihood) is
p(x) = p(spots) = 0.081. 
Substituting this notation into Equation 1.5 (repeated here) \begin{equation} p({\rm smallpox}|{\rm spots}) = \frac{p({\rm spots}|{\rm smallpox}) \times p({\rm smallpox}) }{p({\rm spots})}, \tag{1.19} \end{equation}
yields \begin{equation} p(\theta_s|x) = \frac{p(x|\theta_s) \times p(\theta_s) } {p(x)}. \tag{1.20} \end{equation}
Similarly, if we define \begin{eqnarray} p(x|\theta_c) & = & p( {\rm spots}|{\rm chickenpox }) \nonumber \\ p(\theta_c|x) & = & p( {\rm chickenpox}|{\rm spots} ) \nonumber \\ p(\theta_c) & = & p( {\rm chickenpox}), \tag{1.21} \end{eqnarray}
then we can rewrite Equation 1.9 to obtain the posterior probability of chickenpox as \begin{equation} p(\theta_c|x) = \frac{p(x|\theta_c) \times p(\theta_c) } {p(x)}. \tag{1.22} \end{equation}
If we use θ without a subscript to represent any disease (or hypothesis), and x to represent any observed symptoms (or data) then Bayes’ rule can be written as (we now drop the use of the × symbol) \begin{equation} p(\theta|x) = \frac{p(x|\theta) p(\theta) } {p(x)}. \tag{1.23} \end{equation}
Finally, we should note that smallpox made history by being the first disease to be eradicated from the Earth in 1979, which makes the prior probability of catching it somewhat less than the value p(θ_{s}) = 0.001 assumed in the above example.
Parameters and Variables: Notice that there is nothing special about which symbol stands for disease and which for symptoms, and that we could equally well have used θ to represent symptoms, and x to represent diseases. However, it is common to use a Greek letter like θ to represent the thing we wish to estimate, and x to represent the evidence (eg symptoms) on which our estimated value of θ will be based. Similarly, using an equally arbitrary but standard convention, the symbol that represents the thing we wish to estimate is usually called a parameter (θ), whereas the evidence used to estimate that thing is usually called a variable (x).
As noted above, when we take account of prior knowledge, it turns out that the patient is about 89 times more likely (ie 0.988 vs 0.011) to have chickenpox than smallpox. Indeed, it is often the case that we wish to compare the relative probabilities of two hypotheses (eg diseases). As each hypothesis acts as a (simple) model for the data, and we wish to select the most probable model, this is known as model selection, which involves a comparison using a ratio of posterior probabilities.
The posterior ratio, which is also known as the posterior odds between the hypotheses θ_{c} and θ_{s}, is \begin{equation} R_{post} = \frac{p(\theta_c|x)} {p(\theta_s|x) }. \tag{1.24} \end{equation}
If we apply Bayes’ rule to the numerator and denominator then \begin{equation} R_{post} = \frac{p(x|\theta_c) p(\theta_c) / p(x)} {p(x|\theta_s) p(\theta_s) / p(x)}, \tag{1.25} \end{equation}
where the marginal likelihood p(x) cancels, so that \begin{equation} R_{post} = \frac{p(x|\theta_c)} { p(x|\theta_s) } \times \frac{p(\theta_c)} { p(\theta_s) }. \tag{1.26} \end{equation}
This is a product of two ratios, the ratio of likelihoods, or Bayes factor \begin{equation} B = \frac{p(x|\theta_c)} { p(x|\theta_s) }, \tag{1.27} \end{equation}
and the ratio of priors, or prior odds between θ_{c} and θ_{s}, which is \begin{equation} R_{prior} = \frac{p(\theta_c)} { p(\theta_s) }. \tag{1.28} \end{equation}
Thus, the posterior odds can be written as
R_{post} = B × R_{prior}, 
which, in words, is: posterior odds = Bayes factor × prior odds. In this example, we have
${R}_{\mathit{post}}=\frac{0.80}{0.90}\times \frac{0.1}{0.001}=88.9$.
Note that the likelihood ratio (Bayes factor) is less than one (and so favours θ_{s}), whereas the prior odds is much greater than one (and favours θ_{c}), with the result that the posterior odds come out massively in favour of θ_{c}. If the posterior odds is greater than 3 or less than 1/3 (in both cases one hypothesis is more than 3 times more probable than the other) then this is considered to represent a substantial difference between the probabilities of the two hypotheses^{19}, so a posterior odds of 88.9 is definitely substantial.
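The decomposition of the posterior odds into a Bayes factor and prior odds can be reproduced in a few lines of Python. This is only a sketch using the values assumed in this example; the variable names are illustrative.

```python
# Values assumed in the text (theta_c = chickenpox, theta_s = smallpox).
lik_c, lik_s = 0.8, 0.9          # p(x|theta_c), p(x|theta_s)
prior_c, prior_s = 0.1, 0.001    # p(theta_c), p(theta_s)

bayes_factor = lik_c / lik_s                 # ratio of likelihoods, B
prior_odds = prior_c / prior_s               # ratio of priors, R_prior
posterior_odds = bayes_factor * prior_odds   # R_post = B x R_prior

print(round(bayes_factor, 3))    # 0.889
print(round(prior_odds, 1))      # 100.0
print(round(posterior_odds, 1))  # 88.9
```

Note that the marginal likelihood p(x) does not appear anywhere in this calculation, because it cancels in the ratio.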
As promised, we consider the marginal likelihood p(symptoms) or p(x) briefly here (and in Chapter 2 and Section 4.5). The marginal likelihood refers to the probability that a randomly chosen individual has the symptoms that were actually observed, which we can interpret as the prevalence of spots in the general population.
Crucially, the decision as to which disease the patient has depends only on the relative sizes of different posterior probabilities (eg Equations 1.10 and 1.11, or equivalently Equations 1.20 and 1.22). Note that each of these posterior probabilities is proportional to 1/p(symptoms) in Equations 1.10 and 1.11, also expressed as 1/p(x) in Equations 1.20 and 1.22. This means that a different value of the marginal likelihood p(symptoms) would change all of the posterior probabilities by the same proportion, and therefore has no effect on their relative magnitudes. For example, if we arbitrarily decided to double the value of the marginal likelihood from 0.081 to 0.162 then both posterior probabilities would be halved (from 0.011 and 0.988 to about 0.0056 and 0.494), but the posterior probability of chickenpox would still be 88.9 times larger than the posterior probability of smallpox. Indeed, the previous section on Bayes factors relies on the fact that the ratio of two posterior probabilities is independent of the value of the marginal likelihood.
In summary, the value of the marginal likelihood has no effect on which disease yields the largest posterior probability (eg Equations 1.10 and 1.11), and therefore has no effect on the decision regarding which disease the patient probably has.
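The claim that the marginal likelihood rescales both posteriors without changing their ratio can be checked directly. In the Python sketch below, the helper `posteriors` is a hypothetical name introduced for illustration, and the likelihoods and priors are those assumed in the diagnosis example.

```python
def posteriors(p_x):
    """Return (smallpox, chickenpox) posteriors for a given marginal likelihood p(x)."""
    p_s = 0.9 * 0.001 / p_x   # p(x|theta_s) p(theta_s) / p(x)
    p_c = 0.8 * 0.1 / p_x     # p(x|theta_c) p(theta_c) / p(x)
    return p_s, p_c


# Compare the value assumed in the text with an arbitrarily doubled value.
for p_x in (0.081, 0.162):
    p_s, p_c = posteriors(p_x)
    # Columns: p(x), smallpox posterior, chickenpox posterior, posterior odds.
    print(p_x, round(p_s, 4), round(p_c, 3), round(p_c / p_s, 1))
# 0.081 0.0111 0.988 88.9
# 0.162 0.0056 0.494 88.9
```

Both rows end with the same posterior odds, so the diagnosis is unaffected by the choice of p(x).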
The example above is based on medical diagnosis, but Bayes’ rule can be applied to any situation where there is uncertainty regarding the value of a measured quantity, such as the acoustic signal that reaches the ear when some words are spoken. The following example follows a similar line of argument as the previous one, and aside from the change in context, provides no new information for the reader to absorb.
If you walked into a hardware store and asked, “Have you got fork handles?”, then you would be surprised to be presented with four candles. Even though the phrases fork handles and four candles are acoustically almost identical, the shop assistant knows that he sells many more candles than fork handles (Figure 1.5). This in turn, means that he probably does not even hear the words fork handles, but instead hears four candles. What has this got to do with Bayes’ rule?
The acoustic data that correspond to the sounds spoken by the customer are equally consistent with two interpretations, but the assistant assigns a higher weighting to one interpretation. This weighting is based on his prior experience, so he knows that customers are more likely to request four candles than fork handles. The experience of the assistant allows him to hear what was probably said by the customer, even though the acoustic data were pretty ambiguous. Without knowing it, he has probably used something like Bayes’ rule to hear what the customer probably said.
Given that the two possible phrases are four candles and fork handles, we can formalise this scenario by considering the probability of the acoustic data given each of the two possible phrases. In both cases, the probability of the acoustic data depends on the words spoken, and this dependence is made explicit as two probabilities:
1) the probability of the acoustic data given four candles was spoken,
2) the probability of the acoustic data given fork handles was spoken.
A shorthand way of writing these is
\begin{eqnarray} p({\rm acoustic \: data}|{\rm four \: candles}) \nonumber \\ p({\rm acoustic \: data}|{\rm fork \: handles}), \nonumber \end{eqnarray}
where the expression p(acoustic data|four candles), for example, is interpreted as the likelihood that the phrase spoken was four candles. As both phrases are consistent with the acoustic data, the probability of the data is almost the same in both cases. That is, the probability of the data given that four candles was spoken is almost the same as the probability of the data given that fork handles was spoken. For simplicity, we will assume that these probabilities are
\begin{eqnarray} p({\rm data}|{\rm four \: candles}) & = & 0.6 \nonumber \\ p({\rm data}|{\rm fork \: handles}) & = & 0.7. \tag{1.31} \end{eqnarray}
Knowing these two likelihoods does allow us to find an answer, but it is an answer to the wrong question. Each likelihood above provides an answer to the (wrong) question: what is the probability of the observed acoustic data given that each of two possible phrases was spoken?
The right question, the question to which we would like an answer is: what is the probability that each of the two possible phrases was spoken given the acoustic data? The answer to this, the right question, is implicit in two new conditional probabilities, the posterior probabilities
\begin{eqnarray} p({\rm four \: candles}|{\rm data}) \nonumber \\ p({\rm fork \: handles}|{\rm data}), \tag{1.32} \end{eqnarray}
as shown in Figures 1.6 and 1.7. Notice the subtle difference between the pair of Equations 1.31 and the pair 1.32. Equations 1.31 tell us the likelihoods, the probability of the data given each of the two possible phrases, which turn out to be almost identical for both phrases in this example. In contrast, Equations 1.32 tell us the posterior probabilities, the probability of each phrase given the acoustic data.
Crucially, each likelihood tells us the probability of the data given a particular phrase, but takes no account of how often that phrase has been given (ie has been encountered) in the past. In contrast, each posterior probability depends, not only on the data (in the form of the likelihood), but also on how frequently each phrase has been encountered in the past; that is, on prior experience.
So, we want the posterior probability, but we have the likelihood. Fortunately, Bayes’ rule provides a means of getting from the likelihood to the posterior, by making use of extra knowledge in the form of prior experience, as shown in Figure 1.6.
Let’s suppose that the assistant has been asked for four candles a total of 90 times in the past, whereas he has been asked for fork handles only 10 times. To keep matters simple, let’s also assume that the next customer will ask either for four candles or fork handles (we will revisit this simplification later). Thus, before the customer has uttered a single word, the assistant estimates that the probability that he will say each of the two phrases is
\begin{eqnarray} p({\rm four \: candles}) & = & 90/100 = 0.9 \nonumber \\ p({\rm fork \: handles}) & = & 10/100 = 0.1. \tag{1.33} \end{eqnarray}
These two prior probabilities represent the prior knowledge of the assistant, based on his previous experience of what customers say.
When confronted with an acoustic signal that has one of two possible interpretations, the assistant naturally interprets this as four candles, because, according to his past experience, this is what such ambiguous acoustic data usually mean in practice. So, he takes the two almost equal likelihood values, and assigns a weighting to each one, a weighting that depends on past experience, as in Figure 1.7. In other words, he uses the acoustic data, and combines them with his previous experience to make an inference about which phrase was spoken.
One way to implement this weighting (ie to do this inference) is to simply multiply the likelihood of each phrase by how often that phrase has occurred in the past. In other words, we multiply the likelihood of each putative phrase by its corresponding prior probability. The result yields a posterior probability for each possible phrase
\begin{eqnarray} p({\rm four \: candles}|{\rm data}) & = & \frac{p({\rm data}|{\rm four \: candles}) p({\rm four \: candles})}{p({\rm data})} \nonumber \\ p({\rm fork \: handles}|{\rm data}) & = & \frac{p({\rm data}|{\rm fork \: handles}) p({\rm fork \: handles})}{p({\rm data})}, \tag{1.34} \end{eqnarray}
where p(data) is the marginal likelihood, which is the probability of the observed data.
In order to ensure that the posterior probabilities sum to one, the value of p(data) is 0.61 in this example, but as we already know from Section 1.1 (p15), its value is not important for our purposes. If we substitute the likelihood and prior probability values defined in Equations 1.31 and 1.33 into Equations 1.34 then we obtain the posterior probabilities as
\begin{eqnarray} p({\rm four \: candles}|{\rm data}) & = & p({\rm data}|{\rm four \: candles}) p({\rm four \: candles})/p({\rm data}) \nonumber \\ & = & 0.6 \times 0.9/0.61 = 0.885, \nonumber \\ p({\rm fork \: handles}|{\rm data}) & = & p({\rm data}|{\rm fork \: handles}) p({\rm fork \: handles})/p({\rm data}) \nonumber \\ & = & 0.7 \times 0.1/0.61 = 0.115. \nonumber \end{eqnarray}
As in the previous example, we can write this more succinctly by defining
$\begin{array}{ccc}x& =& \mathrm{acoustic}\phantom{\rule[0.0ex]{0.2em}{0.0ex}}\mathrm{data},\\ {\theta}_{c}& =& \mathrm{four}\phantom{\rule[0.0ex]{0.2em}{0.0ex}}\mathrm{candles},\\ {\theta}_{h}& =& \mathrm{fork}\phantom{\rule[0.0ex]{0.2em}{0.0ex}}\mathrm{handles},\end{array}$
so that
$\begin{array}{rcl}p({\theta}_{c}\mid x)& =& p(x\mid{\theta}_{c})\,p({\theta}_{c})/p(x)=0.885\\ p({\theta}_{h}\mid x)& =& p(x\mid{\theta}_{h})\,p({\theta}_{h})/p(x)=0.115.\end{array}$
These two posterior probabilities represent the answer to the right question, so we can now see that the probability that the customer said four candles is 0.885, whereas the probability that the customer said fork handles is 0.115. As four candles is associated with the highest value of the posterior probability, it is the maximum a posteriori (MAP) estimate of the phrase that was spoken. The process that makes use of evidence (here, the acoustic data) to produce these posterior probabilities is called Bayesian inference.
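The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch rather than the book's accompanying code: the dictionary names and the helper `posterior` are illustrative assumptions, and the likelihood and prior values are the ones substituted into the equations above.

```python
# Likelihoods p(data | phrase) and priors p(phrase), as used in the text.
likelihood = {"four candles": 0.6, "fork handles": 0.7}
prior = {"four candles": 0.9, "fork handles": 0.1}


def posterior(likelihood, prior):
    """Return (p(phrase | data) for each phrase, p(data)) via Bayes' rule."""
    # Unnormalised posterior: likelihood times prior.
    joint = {h: likelihood[h] * prior[h] for h in likelihood}
    # Marginal likelihood p(data): the sum over all candidate phrases.
    p_data = sum(joint.values())
    return {h: joint[h] / p_data for h in joint}, p_data


post, p_data = posterior(likelihood, prior)

# The MAP estimate is the phrase with the largest posterior probability.
map_estimate = max(post, key=post.get)

print(f"p(data) = {p_data:.2f}")
for phrase, p in post.items():
    print(f"p({phrase} | data) = {p:.3f}")
print("MAP estimate:", map_estimate)
```

Because p(data) is the same for both phrases, comparing the unnormalised products (likelihood times prior) would select the same MAP estimate.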
This example follows the same line of reasoning as those above, but also contains specific information on how to combine probabilities from independent events, such as coin flips. This will prove crucial in a variety of contexts, and in examples considered later in this book.
Here, our task is to decide how unfair a coin is, based on just two coin flips. Normally, we assume that coins are fair or unbiased, so that a large number of coin flips (eg 1000) yields roughly equal numbers of heads and tails. But suppose there was a fault in the machine that minted coins, so that each coin had more metal on one side or the other, with the result that each coin is biased to produce more heads than tails, or vice versa. Specifically, 25% of the coins produced by the machine have a bias of 0.4, and 75% have a bias of 0.6. By definition, a coin with a bias of 0.4 produces a head on 40% of flips, whereas a coin with a bias of 0.6 produces a head on 60% of flips (on average). Now, suppose we choose one coin at random, and attempt to decide which of the two bias values it has. For brevity, we denote the coin’s bias by the parameter θ, so the true value of θ for each coin is either θ_{0.4} = 0.4, or θ_{0.6} = 0.6.
One Coin Flip: Here we use one coin flip to define a few terms that will prove useful below. For each coin flip, there are two possible outcomes, a head x_{h}, and a tail x_{t}. For example, if the coin’s bias is θ_{0.6} then, by definition, the conditional probability of observing a head is θ_{0.6}
p(x_{h}|θ_{0.6}) = θ_{0.6} = 0.6.
Similarly, the conditional probability of observing a tail is
p(x_{t}|θ_{0.6}) = (1 − θ_{0.6}) = 0.4,
where both of these conditional probabilities are likelihoods. Note that we follow the convention of the previous examples by using θ to represent the parameter whose value we wish to estimate, and x to represent the data used to estimate the true value of θ.
Two Coin Flips: Consider a coin with a bias θ (where θ could be 0.4 or 0.6, for example). Suppose we flip this coin twice, and obtain a head x_{h} followed by a tail x_{t}, which define the ordered list or permutation
x = (x_{h}, x_{t}). 
As the outcome of one flip is not affected by any other flip outcome, outcomes are said to be independent (see Section 2.2 or Appendix C). This independence means that the probability of observing any two outcomes can be obtained by multiplying their probabilities
p(x|θ) = p((x_{h}, x_{t})|θ) = p(x_{h}|θ) × p(x_{t}|θ).
More generally, for a coin with a bias θ, the probability of a head x_{h} is p(x_{h}|θ) = θ, and the probability of a tail x_{t} is therefore p(x_{t}|θ) = (1 − θ). It follows that Equation 1.40 can be written as
p(x|θ) = θ × (1 − θ),
which will prove useful below.
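The multiplication rule for independent flips extends to any ordered sequence of outcomes. The following Python sketch illustrates this under stated assumptions: the function name `sequence_likelihood` and the encoding of flips as a string of 'h' and 't' characters are choices made here for illustration, not notation from the book.

```python
from math import prod


def sequence_likelihood(flips, theta):
    """Return p(x | theta) for an ordered sequence of independent flips.

    flips: string of 'h' (head) and 't' (tail) characters.
    theta: the coin's bias, ie the probability of a head on each flip.
    """
    # Independence means the sequence probability is the product of
    # the per-flip probabilities: theta for a head, 1 - theta for a tail.
    return prod(theta if flip == "h" else 1.0 - theta for flip in flips)


theta = 0.6
x = "ht"  # a head followed by a tail
print(f"p(x | theta={theta}) = {sequence_likelihood(x, theta):.2f}")
```

For the two-flip sequence used in the text, this reduces to θ × (1 − θ).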
The Likelihoods of Different Coin Biases: According to Equation 1.41, if the coin bias is θ_{0.6} then
p(x|θ_{0.6}) = θ_{0.6} × (1 − θ_{0.6}) = 0.6 × 0.4 = 0.24,
and if the coin bias is θ_{0.4} then (the result is the same)
p(x|θ_{0.4}) = θ_{0.4} × (1 − θ_{0.4}) = 0.4 × 0.6 = 0.24.
Note that the only difference between these two cases is the reversed ordering of terms in Equations 1.43 and 1.46, so that both values of θ have equal likelihood values. In other words, the observed data x are equally probable given the assumption that θ_{0.4} = 0.4 or θ_{0.6} = 0.6, so they do not help in deciding which bias our chosen coin has.
Prior Probabilities of Different Coin Biases: We know (from above) that 25% of all coins have a bias of θ_{0.4}, and that 75% of all coins have a bias of θ_{0.6}. Thus, even before we have chosen our coin, we know (for example) there is a 75% chance that it has a bias of 0.6. This information defines the prior probability that any coin has one of two bias values, either p(θ_{0.4}) = 0.25, or p(θ_{0.6}) = 0.75.
Posterior Probabilities of Different Coin Biases: As in previous examples, we adopt the naïve strategy of simply weighting each likelihood value by its corresponding prior (and dividing by p(x)) to obtain Bayes’ rule
$\begin{array}{rcl}p({\theta}_{0.4}\mid\mathbf{x})& =& p(\mathbf{x}\mid{\theta}_{0.4})\,p({\theta}_{0.4})/p(\mathbf{x})\\ & =& 0.24\times 0.25/0.24\\ & =& 0.25,\end{array}$
$\begin{array}{rcl}p({\theta}_{0.6}\mid\mathbf{x})& =& p(\mathbf{x}\mid{\theta}_{0.6})\,p({\theta}_{0.6})/p(\mathbf{x})\\ & =& 0.24\times 0.75/0.24\\ & =& 0.75.\end{array}$
In order to ensure that the posterior probabilities sum to one, the marginal likelihood must take the value p(x) = 0.24 (= 0.24 × 0.25 + 0.24 × 0.75), but we know from p15 that its value makes no difference to our final decision about coin bias. As shown in Figures 1.9 and 1.10, the probabilities in Equations 1.48 and 1.49 take account of both the data and of prior experience, and are therefore posterior probabilities. In summary, whereas the equal likelihoods in this example (Equations 1.44 and 1.47) did not allow us to choose between the coin biases θ_{0.4} and θ_{0.6}, the values of the posterior probabilities (Equations 1.48 and 1.49) imply that a bias of θ_{0.6} is 3 (=0.75/0.25) times more probable than a bias of θ_{0.4}.
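The whole coin example can be reproduced in a short Python sketch. This is a hedged illustration, not the book's own code: the variable names, the helper `likelihood` and the use of the bias values as dictionary keys are assumptions made here, while the prior values and the observed flips are those given in the text.

```python
# Prior probability of each candidate bias, as given in the text.
prior = {0.4: 0.25, 0.6: 0.75}

# Observed data: a head followed by a tail.
flips = "ht"


def likelihood(flips, theta):
    """Return p(x | theta) for independent flips encoded as 'h'/'t'."""
    p = 1.0
    for flip in flips:
        p *= theta if flip == "h" else 1.0 - theta
    return p


# Weight each likelihood by its prior, then normalise by p(x).
joint = {theta: likelihood(flips, theta) * p for theta, p in prior.items()}
p_x = sum(joint.values())
post = {theta: j / p_x for theta, j in joint.items()}

for theta, p in post.items():
    print(f"p(theta={theta} | x) = {p:.2f}")
print(f"posterior ratio = {post[0.6] / post[0.4]:.1f}")
```

Because the two likelihoods are equal for these data, the posterior probabilities simply reproduce the priors; with different data (for example two heads) the likelihoods would differ and the posterior would shift accordingly.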
When you look at Figure 1.11, do you see a hill or a crater? Now turn the page upside-down. When you invert the page, the content of the picture does not change, but what you see does change (from a hill to a crater). This illusion almost certainly depends on the fact that your visual system assumes that the scene is lit from above. This, in turn, forces you to interpret Figure 1.11 as a hill, and the inverted version as a crater (which it is, in reality).
In terms of Bayes’ rule, the image data are equally consistent with a hill and a crater, where each interpretation corresponds to a different maximum likelihood value. Therefore, in the absence of any prior assumptions on your part, you should see the image as depicting either a hill or a crater with equal probability. However, the assumption that light comes from above corresponds to a prior, and this effectively forces you to interpret the image as a hill or a crater, depending on whether the image is inverted or not. Note that there is no uncertainty or noise; the image is perfectly clear, but also perfectly ambiguous without the addition of a prior regarding the light source. This example demonstrates that Bayesian inference is useful even when there is no noise in the observed data, and that even the apparently simple act of seeing requires the use of prior information^{10;40;41;42}:
Seeing is not a direct apprehension of reality, as we often like to pretend. Quite the contrary: seeing is inference from incomplete information …
If we are given a coin with a known bias of, say, θ = 0.6, then the probability of a head for each coin flip is given by the likelihood p(x_{h}|θ) = 0.6. This is an example of a forward probability, which involves calculating the probability of each of a number of different consequences (eg obtaining two heads) given some known cause or fact, see Figure 1.12. If this coin is flipped 100 times then the number of heads could be 62, so the actual proportion of heads is x_{true} = 0.62. But, because no measurement is perfectly reliable, we may miscount 62 as 64 heads, so the measured proportion is x = 0.64. Consequently, there is a difference, often called noise, between the true coin bias and the measured proportion of heads. This noise may be due to the probabilistic nature of coin flips or to our inability to measure the number of heads accurately. Whatever the cause of the noise, the only information we have is the measured number of heads, and we must use this information as wisely as possible.
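A forward probability of this kind can be computed directly once the bias is known. The sketch below uses the standard binomial formula, which is not introduced in this section and is brought in here as an assumption, to obtain the probability of each possible number of heads; the function name `prob_heads` is illustrative.

```python
from math import comb


def prob_heads(k, n, theta):
    """Forward probability of exactly k heads in n independent flips."""
    # comb(n, k) counts the orderings of k heads among n flips; each
    # ordering has probability theta**k * (1 - theta)**(n - k).
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


theta = 0.6  # known bias (the cause)

# Every possible consequence of two flips: 0, 1 or 2 heads.
two_flips = {k: prob_heads(k, 2, theta) for k in range(3)}
for k, p in two_flips.items():
    print(f"p({k} heads in 2 flips | theta={theta}) = {p:.2f}")

# The same forward calculation for 100 flips, eg exactly 62 heads.
p_62 = prob_heads(62, 100, theta)
print(f"p(62 heads in 100 flips | theta={theta}) = {p_62:.4f}")
```

Note that this calculation runs from cause (the bias) to consequence (the number of heads), and it covers only the randomness of the flips, not any miscounting of the outcome.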
The converse of reasoning forwards from a given physical parameter or scenario involves a harder problem, also illustrated in Figure 1.12. Reasoning backwards from measurements (eg coin flips or images) amounts to finding the posterior or inverse probability of the value of an unobserved variable (eg coin bias, 3D shape), which is usually the cause of the observed measurement. By analogy, arriving at the scene of a crime, a detective must reason backwards from the clues, as eloquently expressed by Sherlock Holmes:
Most people, if you describe a train of events to them, will tell you what the result would be. They can put those events together in their minds, and argue from them that something will come to pass. There are few people, however, who, if you told them a result, would be able to evolve from their own inner consciousness what the steps were that led to that result. This power is what I mean when I talk of reasoning backward, or analytically.
Sherlock Holmes, from A Study in Scarlet. AC Doyle, 1887.
Indeed, finding inverse probabilities is precisely the problem Bayes’ rule is designed to tackle.
All decisions should be based on evidence, but the best decisions should also be based on previous experience. The above examples demonstrate not only that prior experience is crucial for interpreting evidence, but also that Bayes’ rule provides a rigorous method for doing so.
Note that this text includes corrections in the 8th printing (2017).