MCMC Corner: I, Robot

MCMC robot needs love!

“It is the obvious things that are so difficult to see most of the time. People say ‘It’s as plain as the nose on your face.’ But how much of the nose on your face can you see, unless someone holds a mirror up to you?” (Isaac Asimov)

The ability to rigorously diagnose MCMC performance requires familiarity with some basic concepts from probability theory (discussed last time) and a strong intuitive understanding of the underlying mechanics—we need to know how the algorithms work in order to understand when they are not working. In this installment we’ll briefly cover the mechanics of the Metropolis-Hastings MCMC algorithm.

Recall that Bayesian inference is focused on the posterior probability density of the parameters. The posterior probability of the parameters can, in principle, be calculated using Bayes’ theorem. However, (most) phylogenetic problems cannot be solved analytically, owing mainly to the denominator of Bayes’ theorem—the marginal likelihood requires solving multiple integrals (over all of the continuous parameters, such as branch lengths, substitution rates, stationary frequencies, etc.) for each tree, and summing over all trees.
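To make the bottleneck explicit, here is Bayes’ theorem in the notation used below (data X, tree topology \tau , and the full set of model parameters \theta ; writing \nu loosely for all of the continuous parameters):

P(\theta \mid X) = \frac{P(X \mid \theta) \, P(\theta)}{P(X)}, \qquad P(X) = \sum_{\tau} \int P(X \mid \tau , \nu) \, P(\tau , \nu) \, d\nu

The sum ranges over all tree topologies and the integral over all of the continuous parameters, which is why the denominator is intractable.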

Accordingly, Bayesian inference of phylogeny typically resorts to numerical methods that approximate the posterior probability density. There are many flavors of Markov chain Monte Carlo (MCMC) algorithms—Gibbs samplers, Metropolis-coupled and reversible-jump MCMC, etc.—but we will consider the Metropolis-Hastings (MH) algorithm because it is commonly used for phylogenetic problems, and because it is similar to many other variants (which we will cover elsewhere). Note that MCMC and Bayesian inference are distinct animals: they have a relationship similar to that between ‘optimization algorithms’ and ‘maximum-likelihood estimation.’ Some Bayesian inference can be accomplished without MCMC algorithms, and MCMC algorithms can be used to solve problems in non-Bayesian statistical frameworks.

To introduce the MH algorithm, we will imagine a robot that is programmed to explore an area. Specifically, the goal of our robot is to generate a topographic map of an unknown terrain. This terrain has a total surface area of one hectare. We deploy our robot by parachute at a random location within the terrain. We have programmed the robot with three simple rules:

  1. If a proposed step will take the robot uphill, it automatically takes the proposed step.
  2. If a proposed step will take the robot downhill, it divides the elevation of the proposed location by the elevation of the current location: we call this quotient R. It then draws a uniform random number, U, between 0 and 1. If U < R, the robot takes the proposed step; otherwise, it stays put.
  3. The distribution for proposing steps is symmetrical. That is, the probability of proposing a step (but not necessarily accepting a proposed step) from point A to point B is equal to the probability of proposing a step from point B to point A.

We allow our little robot to wander through the terrain following these three simple rules. At prescribed intervals (e.g., every 10 proposed steps), the robot records the details of his position (his elevation, latitude, longitude, etc.) in a log. If we allow our robot to wander long enough, his log is guaranteed to provide an arbitrarily precise description of the topography of the terrain. 
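These three rules translate almost line-for-line into code. Here is a minimal sketch in R; the terrain is a hypothetical two-bump surface (the elevation function below is a made-up stand-in, and step_sd and thin are illustrative tuning choices):

    ## Hypothetical terrain: two Gaussian "hills" (a stand-in for a real surface).
    elevation <- function(xy) {
      exp(-sum((xy - c(2, 2))^2)) + 0.5 * exp(-sum((xy + c(1, 1))^2) / 2)
    }

    robot_walk <- function(n_steps = 1e4, step_sd = 0.5, thin = 10) {
      xy <- runif(2, -5, 5)                     # parachute in at a random spot
      positions <- matrix(NA, nrow = n_steps %/% thin, ncol = 2)
      for (i in seq_len(n_steps)) {
        xy_new <- xy + rnorm(2, 0, step_sd)     # rule 3: symmetric proposal
        R <- elevation(xy_new) / elevation(xy)  # rule 2: elevation quotient
        if (runif(1) < R) xy <- xy_new          # rules 1 & 2 in one test: since
                                                # runif(1) < 1, uphill steps
                                                # (R >= 1) are always accepted
        if (i %% thin == 0) positions[i %/% thin, ] <- xy  # write to the log
      }
      positions
    }

Note that a single runif(1) < R test implements both rules 1 and 2, because a uniform draw on (0,1) is always less than an R of one or more.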

Let’s consider a slightly more formal description of the Metropolis-Hastings MCMC algorithm. First, some preliminaries. The algorithm entails simulating a Markov chain whose stationary distribution is the joint posterior probability density of the phylogenetic model parameters. The ‘state’ of the chain is a set of parameter values that fully specifies the phylogenetic model: e.g., a tree topology, a vector of branch lengths, and a set of parameters specifying the stochastic model of trait change (e.g., for the GTR + \Gamma substitution model, this comprises a vector of values for the four stationary frequencies, a vector of values for the six exchangeability parameters, and the value of the alpha parameter describing the shape of the gamma-distributed among-site rate variation). Under the robot analogy, the state of the chain corresponds to a unique point in the terrain. The Markov property of the MCMC reflects the fact that the next state of the chain depends only on the current state, not on previous states—that is, the past affects the future only through the present. The Monte Carlo aspect of the MCMC reflects the fact that it is a simulation: a numerical method that relies on repeated random sampling.
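In code, a state is nothing more exotic than a container of parameter values. A hypothetical sketch in R (the toy values are purely illustrative, not estimates):

    state <- list(
      tau   = "((A,B),(C,D));",                 # tree topology (toy Newick string)
      nu    = c(0.12, 0.05, 0.21, 0.08, 0.30),  # branch lengths
      pi    = c(0.25, 0.25, 0.25, 0.25),        # four stationary frequencies
      q     = c(1, 2, 1, 1, 2, 1),              # six exchangeability parameters
      alpha = 0.5                               # gamma shape for rate variation
    )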

Now, on to the MH algorithm:

  1. Initialize the chain with values for all parameters, including the tree topology, \tau , the vector of branch lengths \nu , and substitution model parameters \pi_i, q_i, \alpha. We will call the set of model parameters \theta . The initial parameter values might be specified arbitrarily, or might be drawn from the corresponding prior probability density for each parameter.
  2. Select a single parameter (or set of parameters) to alter according to its proposal probability. For example, here are the default proposal probabilities used by MrBayes:
      The MCMC sampler will use the following moves:
         With prob.  Chain will use move
            1.00 %   Dirichlet(Revmat)
            1.00 %   Slider(Revmat)
            1.00 %   Dirichlet(Pi)
            1.00 %   Slider(Pi)
            2.00 %   Multiplier(Alpha)
           10.00 %   ExtSPR(Tau,V)
           10.00 %   ExtTBR(Tau,V)
           10.00 %   NNI(Tau,V)
           10.00 %   ParsSPR(Tau,V)
           40.00 %   Multiplier(V)
           10.00 %   Nodeslider(V)
            4.00 %   TLMultiplier(V)

We can see that ~1% of the time (i.e., with probability ~0.01), the MCMC will propose changes to the exchangeability (‘revmat’) and stationary frequency (‘pi’) parameters using a ‘Dirichlet’ proposal mechanism, and it devotes equal effort to proposing changes to those same parameters using something called a ‘Slider’ proposal mechanism. Similarly, ~10% of the time (i.e., with probability ~0.1), the MCMC will propose changes to the tree and branch lengths (‘Tau’ and ‘V’) using the ‘extending SPR’ proposal mechanism, and with equal probability using the ‘NNI’, ‘extending TBR’, and ‘parsimony SPR’ proposal mechanisms.

For the moment, we won’t worry about the details of these proposal mechanisms—they basically involve different ways of ‘poking’ the current parameter value, \theta , to generate a new (proposed) parameter value, \theta ' . The important question at the moment is: where do these proposal probabilities come from? The answer is: experience. We want to design the MCMC so that it invests effort in a given parameter in proportion to the difficulty of approximating that parameter. Note that the MCMC summarized above will spend ~2% of its time proposing changes to each of the exchangeability and stationary frequency parameters, but will invest ~40% of its time proposing changes to the topology. Experience suggests that the tree topology is a much more difficult parameter for the MCMC to approximate than the exchangeability and stationary frequency parameters (in fact, it appears that the developers of MrBayes determined from their experience that the topology is ~20 times harder to approximate).
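Mechanically, this selection step is just a weighted draw. A sketch in R using the weights from the table above (sample() normalizes the weights internally, so they need not sum to one):

    moves <- c("Dirichlet(Revmat)", "Slider(Revmat)", "Dirichlet(Pi)",
               "Slider(Pi)", "Multiplier(Alpha)", "ExtSPR", "ExtTBR",
               "NNI", "ParsSPR", "Multiplier(V)", "Nodeslider(V)",
               "TLMultiplier(V)")
    weights <- c(1, 1, 1, 1, 2, 10, 10, 10, 10, 40, 10, 4)
    chosen <- sample(moves, size = 1, prob = weights)  # one weighted draw per iteration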

3. Propose a change to the selected parameter using the parameter-specific proposal mechanism. Different parameters naturally have different prior probability densities—for example, the stationary frequencies are proportions (ranging between 0 and 1), and so are conveniently described using a Dirichlet prior probability density, whereas the alpha-shape parameter can range between zero and infinity, and so is better described using a uniform prior probability density. Different kinds of prior probability densities call for specific proposal mechanisms—for example, a ‘Dirichlet’ proposal mechanism is used to propose new values for parameters described by a Dirichlet prior, and a ‘Slider’ proposal mechanism is used to propose new values for parameters described by a uniform prior. (Again, we’ll go into the gory details of these proposal mechanisms another time. The main idea for now is that a proposal mechanism generates a new parameter value by poking the current parameter value.)

The design of proposal mechanisms is something of a dark art—trial and error guide the development of mechanisms that work well. Nevertheless, there are basic criteria that every proposal mechanism must meet (a sketch of two simple mechanisms follows the list below). All proposal mechanisms must be:

(i) stochastic (new parameter values must be proposed ‘randomly’)

(ii) irreducible (all parameter values must be potentially accessible by the chain)

(iii) aperiodic (the proposal mechanism must not induce epicycles of the chain)
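As a small taste of what such mechanisms look like, here is a sketch of two generic proposals in R. These are textbook versions, not MrBayes’s exact implementations, and the tuning constants delta and lambda are illustrative:

    ## Symmetric sliding-window proposal: Hastings ratio = 1
    ## (boundary reflection for bounded parameters omitted for brevity).
    slider <- function(x, delta = 0.1) {
      list(value = x + runif(1, -delta, delta), hastings = 1)
    }

    ## Multiplier proposal: x' = x * m with m = exp(lambda * (u - 1/2));
    ## asymmetric, so the Hastings ratio is m rather than 1.
    multiplier <- function(x, lambda = 0.5) {
      m <- exp(lambda * (runif(1) - 0.5))
      list(value = x * m, hastings = m)
    }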

4. Calculate the probability of accepting the proposed change, R:

R = \min\Bigg[ 1, \underbrace{\frac{P(X \mid \theta ')}{P(X \mid \theta)}}_{\text{likelihood ratio}} \cdot \underbrace{\frac{P(\theta ')}{P(\theta)}}_{\text{prior ratio}} \cdot \underbrace{\frac{P(\theta \mid \theta ')}{P(\theta ' \mid \theta)}}_{\text{proposal ratio}} \Bigg]

The acceptance probability, R, is the product of three ratios, capped at one (because R is a probability, it cannot be greater than one). So, what are these three ratios?

Likelihood ratio: the likelihood ratio is simply the likelihood of the observations given the proposed value of the parameter, divided by the likelihood of the observations given the current value of the parameter. We can calculate the likelihood for any given parameterization of the phylogenetic model using the pruning (Felsenstein) algorithm.

Prior ratio: in Bayesian inference, each parameter is a random variable, and so is described by a prior probability density. Accordingly, we can ‘look up’ (calculate) the prior probability of any specific parameter value. The prior ratio is simply the prior probability of the proposed parameter value divided by the prior probability of the current parameter value.

Proposal ratio: the proposal ratio (also called the Hastings ratio) is the probability of proposing the current parameter value given the proposed parameter value, divided by the probability of proposing the proposed value given the current value. It generalizes the rule that forces our robot to propose steps symmetrically; i.e., when the probability of proposing a step from point A to point B in the terrain is equal to the probability of proposing a step from point B to point A, the ratio is one. For inference problems in which the dimensions of the model are static, the proposal ratio is often equal to one (we’ll discuss exceptions to this, in the context of reversible-jump MCMC, in a future post).

Note that if the proposal ratio is equal to one, the acceptance probability, R, is based only on the product of the likelihood and prior ratios. Does that ring a bell? [Hint: think of Bayes’ theorem.] [Hint 2: think of the right side of Bayes’ theorem.] [Hint 3: think of the numerator of the right side of Bayes’ theorem.] That is, the posterior probability is proportional to the product of the likelihood and the prior probability. Accordingly, the acceptance probability is the posterior probability of the proposed state divided by the posterior probability of the current state. Just as we programmed our robot: if the ratio of the proposed and current elevations (posterior probabilities) is greater than 1 (i.e., it is an uphill step), we always accept the proposed change; when the ratio is less than 1 (i.e., it is a downhill step), we may or may not take the proposed step (stay tuned).

5. Generate a uniform random number between zero and one, U[0,1]. If U < R, accept the proposed change; otherwise, the current state of the chain is retained as the next state of the chain. In this way, downhill moves are accepted ‘randomly’ in proportion to the ratio of the elevations.

6. Repeat steps 2–5 an ‘adequate’ number of times.
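Putting steps 1–6 together: below is a minimal sketch of the whole algorithm in R for a single continuous parameter. The standard-normal target stands in for the (unnormalized) posterior (everything here is a toy, not a phylogenetic implementation), and the densities are handled on the log scale to avoid numerical underflow:

    metropolis_hastings <- function(n_iter = 1e5, sample_freq = 10, delta = 1) {
      log_post <- function(x) dnorm(x, 0, 1, log = TRUE)  # toy log(likelihood * prior)
      theta <- runif(1, -10, 10)                          # step 1: initialize
      samples <- numeric(n_iter %/% sample_freq)
      for (i in seq_len(n_iter)) {
        theta_new <- theta + runif(1, -delta, delta)      # steps 2-3: symmetric slider
        logR <- log_post(theta_new) - log_post(theta)     # step 4 (proposal ratio = 1)
        if (log(runif(1)) < logR) theta <- theta_new      # step 5: accept or stay put
        if (i %% sample_freq == 0) samples[i %/% sample_freq] <- theta
      }
      samples                                             # step 6: repeat, then return the log
    }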

That’s it! Notice that the decision to accept or reject proposed steps in the MCMC (and thus to sample from the joint posterior probability density) is based exclusively on the likelihood and prior probability of the proposed and current states—two quantities that are easy to calculate. The beautiful, fantabulous achievement of the Metropolis-Hastings algorithm is the slaying of the beastly denominator of Bayes’ theorem. Specifically, the algorithm allows us to do inference while entirely avoiding the need to calculate the (completely intractable) marginal likelihood!

The approximation of the joint posterior probability density is based on samples drawn from the MCMC: a chain following the simple rules outlined above will sample parameter values in proportion to their posterior probability. That is, the proportion of time that the chain spends in any particular state is a valid approximation of the posterior probability of that state (e.g., if a clade is present in 87% of the samples drawn from the chain, then the posterior probability of that clade is estimated to be 0.87).
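With the toy sampler sketched above, such summaries are literally sample proportions:

    samples <- metropolis_hastings()
    mean(samples > 1)            # approximates P(theta > 1 | data); ~0.16 for the toy target
    hist(samples, breaks = 50)   # approximates the marginal posterior density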

Why do we accept proposals to states with lower posterior probability? People familiar with maximum-likelihood estimation often misunderstand the purpose of accepting proposals to states with a lower posterior probability. In maximum-likelihood inference, the game is simply to climb relentlessly and mindlessly up the likelihood surface in search of the very tip of the globally highest peak. Accordingly, the only reason these zombie hill-climbers might consider a downward move is concern that they are climbing a local (rather than the global) optimum. By contrast, Bayesians are not interested merely in the elevation of the very tip of the highest peak of the posterior probability surface; rather, they are interested in exploring and estimating the entire joint posterior probability density—we want a topographical map of the entire posterior probability surface, not the elevation of the tip of the very highest peak in parameter space.

This survey of the joint posterior probability density results in a more robust inference procedure (inferences are averaged over the joint posterior probability of all model parameters), and allows us to estimate and evaluate the marginal posterior probability density for each of the parameters (these can be viewed by querying the MCMC samples with respect to the parameter of interest—we will do this in future tutorials). Accordingly, it is not sufficient for the chain to reach the peak of the joint posterior probability density; we need the chain to mix over the entire stationary distribution, spending time at each location in proportion to its posterior probability.

Diagnosing MCMC performance. Next to the specification of priors (which is inspired by much more philosophical considerations), the performance of the MCMC algorithm used to approximate the joint posterior probability distribution is the greatest concern associated with Bayesian inference (it is certainly a more general concern, as it holds even when the priors are uncontroversial, and, in my opinion, it is a more legitimate and practical concern). There are three closely related issues associated with MCMC performance: (1) convergence (is the robot sampling from the stationary distribution?); (2) mixing (is the robot moving efficiently over the posterior probability density?); and (3) adequacy (has the robot collected sufficient samples from the target distribution to describe it adequately?). We’ll address these issues in future posts. In the meantime, if you have questions or comments about this post, please leave a message below or drop me an email.

10 thoughts on “MCMC Corner: I, Robot”

  1. G

    Hi, nice tutorial. I am trying to implement M-H in code, but I am stuck on the Hastings correction ratio when using a Gaussian distribution. It is supposed to be equal to 1, right?
    In the case of random-walk M-H it will indeed be equal to one, because the mu of the proposal moves to the position of the current candidate.
    R code:
    dnorm(y_current, y_candidate, my_sigma) / dnorm(y_candidate, y_current, my_sigma)
    Unfortunately I cannot see how this can be equal to 1 when using a NON-random-walk M-H.
    In a non-random-walk M-H the mu will always remain the same, so the ratio above won’t be equal to 1. Did I misunderstand something? Thanks.

  2. Jeremy Brown

    Hi G,

    Can you be more specific about what you mean by a non-random-walk M-H? Are you talking about a situation where your Gaussian proposal distribution is not centered on the current value?

    1. G

      Hello Jeremy, Yes this is what I mean. Or in other words, when the proposal distribution does not move between iterations and stays fixed. Thank you.

    2. G

      When the proposal distribution is centered on the current value the Hastings correction ratio should be:
      dnorm(y_current,y_candidate,my_sigma)/dnorm(y_candidate,y_current,my_sigma)
      Right?
      When the proposal distribution is not centered on the current value and it stays fixed, what is the Hastings correction ratio? I thought it was:
      dnorm(y_current,fixed_mu,my_sigma)/dnorm(y_candidate,fixed_mu,my_sigma)
      But this is not equal to 1. Maybe I am calculating the ratio wrong? Maybe I misunderstood the bibliography and this ratio should not be equal to 1 when using a symmetric distribution? Maybe a “symmetric distribution” is not what I think it is?
      Thank you!

    3. G

      Maybe what I have in mind is the M-H independence sampler, where the ratio, when using a Gaussian distribution, doesn’t have to be equal to 1. I guess I cannot understand how to implement a vanilla M-H without making it a random-walk M-H. In theory I understand that “I have to draw the candidate sample from my proposal distribution taking into consideration the current sample.” In practice what everyone does is re-center the proposal distribution on the current sample, and then draw from that distribution. But this makes the algorithm a “random-walk M-H.” Unfortunately I cannot find a vanilla M-H implementation on the internet. I only find random-walk M-H and independence M-H. Sorry for the confusion! 🙂

  3. Jeremy Brown

    Hi G,

    The Hastings ratio essentially corrects for asymmetries in the probability of proposals between two states. If your proposal distribution remains constant, the probability (density) of proposing each state is also constant. However, if you’re using a Gaussian distribution the ratio of those probability densities is unlikely to equal 1. You can use essentially the same formula you have listed above, but keep the mean the same in both the numerator and denominator.

    Is there a reason you want to use a fixed proposal distribution? It seems less efficient, unless you have strong suspicions about the location of your posterior.

    1. G

      Hi Jeremy,

      Thank you for your reply and the explanation. No, there is no particular reason. I was just trying to implement all the M-H variations and put a label on each. Now I am stuck on the non-random-walk M-H implementation (if such a thing exists). The algorithm description says: “Draw the candidate sample GIVEN/TAKING INTO CONSIDERATION the current sample.” All the implementations I find just re-center the Gaussian around the current sample, but that makes the algorithm a random-walk M-H. Some other implementations use a fixed proposal, which makes the algorithm an independence M-H. So what is the implementation of a “vanilla M-H”? I am kinda confused. Thank you.

  4. Pingback: Jukes Cantor Model of DNA substitution | Workshop in Applied Phylogenetics

  5. kaiba

    Why is the convergence of the MCMC dependent mostly on the width of the Gaussian proposal and data uncertainties?

  6. Rachel

    Hi, thanks for doing this series of blogs, it’s very well-written. I am working on my own analysis and am frantically trying to understand how to diagnose my own results from MrBayes. Are there any more posts that cover that topic?
