# MCMCorner: Bayesian bootcamp

MCMC robot needs love!

Probability theory is nothing but common sense reduced to calculation.” Pierre Laplace

The ability to rigorously diagnose MCMC performance is founded on a solid understanding of how the algorithms work. Fortunately, this does not necessarily entail a lot of complex probability theory—knowledge of some basic probability concepts and a strong intuitive understanding of the mechanics is sufficient. We’ll dedicate this installment to a brief introduction to Bayesian inference and some related concepts from probability theory. Next time we’ll get into the mechanics of MCMC.

First, let’s consider what the MCMC is trying to approximate. Bayesian inference is focussed on posterior probabilities, P($\theta$|X), the probability of the parameter, $\theta$, given the data, X, which is inferred using Bayes’ Theorem:

where P(X|$\theta$) is the likelihood function (the probability of observing the data given the parameter value), P($\theta$) is the prior probability for the parameter, and P(X) is the marginal likelihood of observing the data under the model.

We’ve all seen Bayes’ theorem a million times, but have you ever wondered where it comes from? Even if you’re not enthusiastically shouting Yes! at your computer, working through the derivation of Bayes’ theorem is useful: it provides an opportunity to cover some important concepts. Plus, how many fundamental probability theorems can you derive from first principles? Just imagine casually working this little gem into conversations with your friends! Surprisingly, Bayes’ theorem—one of the most important results in probability theory—simply derives from the definition of conditional probability: Given that B is observed, we write the probability of observing A as P(A|B). The vertical bar, “|”, is read “given that” or “conditional upon”. By definition:

This isn’t fancy theory or complicated math, it’s just common sense. The definition of conditional probability is simply stating that the probability of observing A given that we have observed B is equivalent to the fraction of cases in which we observe B, where we also observe A. We can make this idea more concrete. Imagine that we have collected some census data from the students in our class (on gender and nationality), which we then summarize in the following table:

We might ask: “What is the probability that a student is Canadian, given that he is male?” The conditional probability, P(Canadian | male), is just the fraction of cases in which we observed male students, 15/40, in which students were also observed to be Canadian, 3/40.  Accordingly, the solution is simply (3/40÷15/40) =  3/15 = 0.2. One in five students in our class are Canadian, given that we are considering male students.

By contrast, the joint probability—probability of observing both A and B, which we write P(AB)—can be easily obtained by rearranging the expression for conditional probability:

which is the conditional probability of observing A given that B is observed, times the probability of observing B (the condition).

For example, the joint probability of being both male and Canadian in our class is equal to the product of the conditional probability P(Canadian | male), which we have already determined to be 3/15, times the probability of the condition, P(male), which is 15/40.  So, the joint probability of being both male and Canadian is 3/15 x 15/40 = 3/40 = 0.075. Again, this result agrees with common sense—3 of the 40 students are Canadian males.

Note that we could have arrived at the joint probability for Canadian males via a different route, similar to that for the joint probability of A and B:

which is the conditional probability of observing B given that A has occurred, times the probability of observing the condition, A. Here, the joint probability of being both male and Canadian is the product of the conditional probability P(male | Canadian) x P(Canadian). The conditional probability of being male, given that the student is Canadian is (3/40÷5/40) = 3/5. The unconditional probability of a student being Canadian is 5/40. Therefore, the joint probability of a student being a Canadian male is 3/5 x 5/40 = 3/40 = 0.075. Same number, different route: cool!

A final probability concept is important in Bayesian inference: marginal probability. This is the unconditional probability of an observation. For example, the probability that a student in our class is Canadian (without conditioning on gender) is written in the margin of our table: 5/40 = 0.125. We obtain the marginal probability by summing over the values for gender: there are 2 female and 3 male Canadian students. We can query the table marginally with respect to any variable; for example, we can calculate the marginal probability that a student is female by summing over the categories for nationality: 23/40 + 2/40 = 25/40 = 0.63.

In phylogenetics, we are often interested in the tree parameter. Bayesian phylogenetic methods will typically generate a tree file: the trees it contains collectively comprise a sample from (and approximation of) the marginal posterior probability distribution of trees. It is the distribution of trees that is marginal with respect to all of the other parameters in the phylogenetic model (base frequencies, substitution rates, etc.). In this case, rather than marginalizing over a two-dimensional table of census data, the distribution of trees is marginal with respect to an N-dimensional hypervolume, where N is the number of free (estimated) parameters in our phylogenetic model.

When the marginalization is made with respect to continuous parameters, we typically describe the process as integrating over’ the nuisance parameters, and when the marginalization entails discrete parameters, we describe it as summing over’ the nuisance parameters—conceptually these operations are identical in that they are effectively averaging over some nuisance parameter(s). Nuisance parameters are just the parameters that we are marginalizing over (i.e., those not currently of interest), and may change depending on the parameter that we are focussing on: e.g., when we are calculating the marginal probability of being female, nationality is the nuisance parameter, but if we are interested in calculating the marginal probability of being Canadian, gender is the nuisance parameter.

We can summarize aspects of the marginal posterior probability densities (for continuous variables) or distributions (for discrete variables): the credible interval or highest posterior density, the mean, median, standard deviation, etc. We’ll talk more about these later.

Deriving Bayes’ Theorem.

Because the joint probability of observing two outcomes, P(AB), equals both P(B) P(A | B) and P(A) P(B | A), we can determine one conditional probability from the other using Bayes theorem. Recall that the definition of conditional probability is:

and that we can expand the term for the joint probability in the numerator as follows:

which gives:

Done: it’s that simple! The basic idea is to estimate the probability of an unobserved event, A, given the observed event, B. Here, the unobserved event is a parameter or an hypothesis, and the observed event is the dataset that we have collected to estimate the parameter/evaluate the hypothesis. In fact, let’s swap the names of the variables to reflect this relationship (using X to represent the data, and $\theta$ to represent a generic parameter):

P(parameter | data) = P(data | parameter) x P(parameter)
P(data)

or equivalently:

Now let’s walk through the components of Bayes’ theorem.

The posterior—Bayesian inference is focussed on the posterior probability, P($\theta$|X), which is simply the probability of an hypothesis or a parameter estimate given a sample of data (and also conditional on the model—as in maximum-likelihood estimation, parametric inferences are conditioned on the model, which amounts to assuming that the model provides an adequate description of the stochastic process that gave rise to the observed data). The posterior probability is a statement of our beliefs about the parameter after evaluating the data at hand.

The likelihood function—The likelihood function, P(X|$\theta$), is the vehicle that conveys the information in the data. A given phylogenetic model, such as GTR+$\Gamma$, will have the same likelihood function whether inference is performed via maximum-likelihood estimation or Bayesian inference.

The prior—The prior probability, P($\theta$), specifies our beliefs about the parameter value before looking at the new data at hand. During an analysis, our prior beliefs (prior probability) are updated to become our posterior beliefs (posterior probability). Our beliefs are updated by the information in the data via the likelihood function. I think this would make a nice Gregorian chant: “The joint posterior probability density is simply the updated version of the joint prior probability density: the prior is updated by information in the data via the likelihood function to provide the posterior estimate”.

This reflects the way (I think) we do science—we have some idea about an hypothesis (even if that belief is `all possible outcomes are equally probable’), we collect some new data relevant to evaluating that hypothesis, and then revise our beliefs based on the new evidence. Because the posterior probability is proportional to the product of the likelihood and the prior probability, the prior will always exert some influence on the inferred posterior probability. However, the influence of the prior will diminish as the amount of data increases and the posterior probability becomes dominated by the information in the data via the likelihood function. It is sometimes a difficult task to appropriately represent a prior belief as a probability distribution, but at least it is explicit about the fact that we have prior beliefs.

The marginal likelihood—The denominator—the marginal likelihood of the data, P(X)—simply serves to normalize the numerator. Posterior probabilities are, after all, probabilities, so they must range between zero and one. The numerator is the product of the likelihood of the data (for a specific parameter value) times the prior probability of that specific parameter value. This means that the corresponding denominator is the sum of the equivalent term for all possible parameter values. Just like the monster that lives under your bed, the denominator of Bayes theorem is a hairy beast. For the phylogeny problem, this is a multi-dimensional integration of the likelihood function over the joint prior probability densities for all parameters (including the vector of branch lengths and substitution-model parameters) for each tree, and a summation over all possible trees. Yikes!

The denominator is clearly way too nasty to solve analytically, and for many years was a ‘deal breaker’ for Bayesian data analysis. However, the rise of fast, cheap computing power and the development of some clever numerical methods (algorithms) made Bayesian inference possible. The magic of the Metropolis-Hastings MCMC algorithm is that it allows us to sample from (i.e., to approximate) the posterior probability density without having to calculate the marginal likelihood. Next time we’ll delve into the mechanics of the Metropolis-Hastings MCMC algorithm. In the mean time, if you have questions or comments about this post, please leave a message below or drop me an email.