# MCMC Corner: I, Robot

MCMC robot needs love!

“It is the obvious things that are so difficult to see most of the time. People say ‘It’s as plain as the nose on your face.’ But how much of the nose on your face can you see, unless someone holds a mirror up to you?” Isaac Asimov

The ability to rigorously diagnose MCMC performance requires familiarity with some basic concepts from probability theory (discussed last time) and a strong intuitive understanding of the underlying mechanics: we need to know how the algorithms work in order to understand when they are not working. In this installment we’ll briefly cover the mechanics of the Metropolis-Hastings MCMC algorithm.

Recall that Bayesian inference is focused on the posterior probability density of parameters. The posterior probability of the parameters can, in principle, be solved using Bayes’ theorem. However, (most) phylogenetic problems cannot be solved analytically, owing mainly to the denominator of Bayes’ theorem—the marginal likelihood requires solving multiple integrals (for all of the continuous parameters, such as branch lengths, substitution rates, stationary frequencies, etc.) for each tree, and summing over all trees.
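To make the difficulty concrete, the marginal likelihood described above can be written out as follows. This is only a notational sketch: the symbol $\tau$ for the tree topology is an assumption introduced here, and $\theta$ stands for all of the continuous parameters together.

$$P(X) = \sum_{\tau} \int_{\theta} P(X \mid \tau, \theta)\, P(\tau, \theta)\, d\theta$$

The sum runs over every possible tree, and for each tree the integral runs over all branch lengths, substitution rates, stationary frequencies, and so on, which is why this quantity is generally out of reach analytically.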

Accordingly, Bayesian inference of phylogeny typically resorts to numerical methods that approximate the posterior probability density. There are many flavors of Markov chain Monte Carlo (MCMC) algorithms, including Gibbs samplers, Metropolis-coupled MCMC, and reversible-jump MCMC. We will consider the Metropolis-Hastings (MH) algorithm because it is commonly used for phylogenetic problems, and because it is similar to many other variants (which we will cover elsewhere). Note that MCMC and Bayesian inference are distinct animals: they have a relationship similar to that between ‘optimization algorithms’ and ‘maximum-likelihood estimation.’ Some Bayesian inference can be accomplished without MCMC algorithms, and MCMC algorithms can be used to solve problems in non-Bayesian statistical frameworks.
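As a rough illustration of the idea, here is a minimal Metropolis-Hastings sketch in Python. It is not a phylogenetic sampler: the target is a toy posterior for a coin's probability of heads (7 heads in 10 flips, uniform prior), and the function names, the sliding-window proposal, and the window width are all assumptions chosen for illustration. The point to notice is that only the unnormalized posterior (likelihood times prior) is ever evaluated, so the troublesome marginal likelihood is not needed.

```python
import math
import random


def log_unnormalized_posterior(theta, heads=7, flips=10):
    """Log of likelihood x prior for a coin with a uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero prior probability outside (0, 1)
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis_hastings(n_iter=20000, window=0.2, seed=1):
    """Sample theta with a symmetric sliding-window proposal (toy example)."""
    rng = random.Random(seed)
    theta = 0.5
    log_p = log_unnormalized_posterior(theta)
    samples = []
    accepted = 0
    for _ in range(n_iter):
        # Propose a new value near the current one.
        proposal = theta + rng.uniform(-window, window)
        log_p_new = log_unnormalized_posterior(proposal)
        # The proposal is symmetric, so the Hastings ratio is 1 and the
        # acceptance probability reduces to the posterior ratio.
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            theta, log_p = proposal, log_p_new
            accepted += 1
        samples.append(theta)  # record the current state either way
    return samples, accepted / n_iter


samples, acceptance_rate = metropolis_hastings()
kept = samples[2000:]  # discard an arbitrary burn-in
posterior_mean = sum(kept) / len(kept)
print(f"acceptance rate: {acceptance_rate:.2f}")
print(f"posterior mean estimate: {posterior_mean:.3f}")
```

For this toy problem the exact posterior is a Beta(8, 4) distribution with mean 8/12, so the estimate can be checked against a known answer, a luxury we do not have in real phylogenetic analyses.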

# List of General Advice for Aspiring Phylogeneticists

For the past few years, we’ve maintained a growing list of general advice for folks interested in doing applied phylogenetics. We’ve now transferred this page to the new site. The first piece of advice on this page is to use a simple text editor rather than a complicated word processor when working with input and output files from phylogenetic software; the figures below show how much of a difference this can make.

Fig. 1: This is what a text file should look like when opened in a text editor (in this case, the text editor is TextWrangler).

Fig. 2: This is what the same text above looks like if we save it as a Microsoft .doc formatted file.

Drop us a line if you can think of other basic advice you’d like to see added to our list!

TreeThinkers is a blog devoted to phylogenetic and phylogeny-based inference. We aim to use it as a place to discuss recent research and methods, to ask and answer questions, and to serve as a general resource for news and trivia in phylogenetics. We have already had several posts since the 2013 Bodega Workshop ended and we plan to keep things going strong. This post is meant to give a (belated) introduction to the blog and provide some general information about how things work around here.

The group of us who organize the Bodega workshop have been talking about developing a blog associated with the course for a few years. Following the switch from our increasingly clunky wiki to this shiny new site, we’ve decided that the time was ripe... and here we are! Several of us Bodega instructors have signed on as contributors, and we welcome guest posts or regular contributions from the rest of the community. Get in touch with me if you’re interested.

As a blog associated with a course, one of our central focuses will always be on teaching. Brian Moore has a running series of posts called MCMC Corner that discusses various aspects of Bayesian inference. Judging from his drafts of upcoming posts, this looks to be an informative and useful set of articles. We will also be posting on general topics of interest, best practices, and common sources of confusion encountered in phylogenetic analysis. Rich Glor’s tutorial explaining the various parameterizations of the Gamma distribution is a great example. Finally, we’ll post about recent findings, news, and announcements that are relevant to phylogenetics. One of our major goals here is to lower the learning curve associated with phylogenetic inference. With that goal in mind, if you have a question, ask it! Leave a comment, tweet @treethinkers, email me, or email one of the contributors and we’ll do our best to answer it in a post or a tutorial.

# New Tutorial on the Gamma Distribution

Figure 1: The impact of varying the shape parameter (alpha) on the gamma distribution.

Although the gamma distribution is widely used in phylogenetics, it remains something of a mystery to many of the students in our annual workshop in applied phylogenetics. Acquiring a basic understanding of the gamma distribution is key to understanding how many widely used phylogenetic methods work (e.g., MrBayes, BEAST, SIMMAP, BayesTraits). With a bit of help from Brian Moore, I’ve just posted a short tutorial that uses a series of simple R scripts and resulting figures to illustrate how and why the gamma distribution can assume such a wide range of shapes. I hope folks will find this tutorial useful, and encourage you to play around with the scripts I’ve posted to further your understanding of the gamma distribution. Drop me a line if you have any questions.
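For readers who want a quick feel for the shape parameter before opening the tutorial, here is a small Python sketch (the tutorial itself uses R, and this is not one of its scripts). It assumes the shape/rate parameterization, and it assumes the convention, common for among-site rate variation, of setting the rate equal to the shape so that the mean is 1. The function names are made up for illustration.

```python
import math


def gamma_pdf(x, alpha, beta):
    """Density of a gamma distribution with shape alpha and rate beta."""
    if x <= 0:
        return 0.0
    log_pdf = (alpha * math.log(beta) - math.lgamma(alpha)
               + (alpha - 1.0) * math.log(x) - beta * x)
    return math.exp(log_pdf)


def numeric_summary(alpha, beta, step=0.001, upper=50.0):
    """Approximate the total mass and the mean with a midpoint Riemann sum."""
    mass = 0.0
    mean = 0.0
    n_bins = int(upper / step)
    for i in range(n_bins):
        x = (i + 0.5) * step
        density = gamma_pdf(x, alpha, beta)
        mass += density * step
        mean += x * density * step
    return mass, mean


# With rate == shape the mean stays at 1 while the variance (1 / alpha) shrinks,
# so larger alpha values give a tighter, more bell-shaped curve around 1.
for alpha in (1.0, 2.0, 5.0, 20.0):
    densities = [gamma_pdf(x, alpha, alpha) for x in (0.25, 1.0, 2.0)]
    print(alpha, [round(d, 3) for d in densities])
```

Values of alpha below 1 give an L-shaped density that piles up near zero. The simple Riemann sum above is not accurate in that regime, so those values are left out of the loop.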

# Updated TreeSetViz

Between the Bodega Bay workshop and my spring semester class on computational phylogenetics at LSU, I’ve been talking a lot about phylogenetic analyses recently. In my opinion, one of the most underutilized approaches for summarizing phylogenetic information is to visualize collections of trees in tree space (or an approximation of tree space projected into 2 or 3 dimensions). These visualizations can help with diagnosing MCMC convergence problems, comparing the phylogenetic signal coming from different genes, picking an appropriate burn-in, and a whole bunch of other stuff. I routinely recommend such visualizations to students. There are some drawbacks (e.g., lots of information can be lost during the projection to such low-dimensional spaces), but on balance it seems to me that the advantages outweigh these drawbacks. There really isn’t another way to summarize that amount of phylogenetic information in a single plot.

As far as I know, the first paper to describe this approach was by Hillis, Heath, and St. John (2005, Syst. Biol., 54: 471-482), which included the release of a Mesquite plug-in called TreeSetViz for performing such visualizations using multi-dimensional scaling (MDS). I used this tool quite a lot and was disappointed that it became incompatible with newer releases of Mesquite. In fact, I maintained an outdated version of Mesquite just to run TreeSetViz. Well, unbeknownst to me, an updated version of TreeSetViz was released last year that is compatible with the most recent versions of Mesquite (hat tip to Vinson Doyle for pointing this out) and includes a wide variety of new options for comparing and visualizing trees. Strangely, this new release isn’t mentioned on the original TreeSetViz website, but it is announced, along with very simple installation instructions, on the Mesquite website.

More recently, additional tools that are independent of Mesquite, such as TreeScaper, have been developed that also perform tree set visualization and projection. Give them a try!
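Any projection of this kind starts from a matrix of pairwise distances between trees. As a hedged sketch of that first step, the Python below computes the unweighted Robinson-Foulds distance, one common tree-to-tree distance, for a handful of toy trees. The nested-tuple tree encoding and the function names are assumptions made for illustration, and nothing here reproduces the actual code or settings of TreeSetViz or TreeScaper. The resulting matrix is what an MDS routine would then project into 2 or 3 dimensions.

```python
def leaf_set(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions (splits) implied by a tree."""
    taxa = leaf_set(tree)
    reference = min(taxa)  # used to pick one canonical side of each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Store the side of the split that lacks the reference taxon.
        side = taxa - clade if reference in clade else clade
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Unweighted Robinson-Foulds distance: splits found in only one tree."""
    return len(splits(tree_a) ^ splits(tree_b))


# Three toy trees on the same five taxa (hypothetical data).
trees = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "C"), ("B", ("D", "E"))),
    (("A", "B"), ("D", ("C", "E"))),
]
distance_matrix = [[rf_distance(a, b) for b in trees] for a in trees]
for row in distance_matrix:
    print(row)
```

Because the splits are treated as unrooted bipartitions, the position of the root in each nested tuple does not affect the distances.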

# Has CIPRES helped you out?

CIPRES is preparing its annual usage report which, among other things, helps determine the size of its XSEDE allocation from year to year. Please read the email from Mark Miller (copied below) and participate in the short survey (available at http://www.surveymonkey.com/s/QLCB7WZ).

Dear CIPRES Science Gateway User,

We need your help to keep the CIPRES Science Gateway operating. You received this email because you are one of the 2800+ users who submitted a job from the CIPRES Science Gateway during the past year. Each year we must report on user activities to continue receiving annual allocations of computer time for our Gateway.

In the last year, strong user response to this survey helped us increase the amount of time we received from the XSEDE allocations committee, and an NSF award to create a new set of CIPRES Web Services. Soon you will be able to access CIPRES through tools like Mesquite, or through scripts you write yourself, as well as through the browser interface.

Even if you just used it for a class, you can help by completing a brief survey (just a few questions) describing your activities on the CIPRES Science Gateway, and providing opinions about how the CIPRES Gateway should continue to operate. The survey is located here: http://www.surveymonkey.com/s/QLCB7WZ

The survey results help us continue to provide the community with easy access to computational resources for phylogenetic programs, and help us plan for sustainability of the resource. It requires only a few minutes to complete, but your feedback is a key element in justifying access to NSF resources.
If you do not wish to complete the survey, please help by sending me citations for any publications that were enabled by the CIPRES Gateway.

Mark Miller
Principal Investigator

# MCMCorner: Bayesian bootcamp

“Probability theory is nothing but common sense reduced to calculation.” Pierre Laplace

The ability to rigorously diagnose MCMC performance is founded on a solid understanding of how the algorithms work. Fortunately, this does not necessarily entail a lot of complex probability theory—knowledge of some basic probability concepts and a strong intuitive understanding of the mechanics is sufficient. We’ll dedicate this installment to a brief introduction to Bayesian inference and some related concepts from probability theory. Next time we’ll get into the mechanics of MCMC.

First, let’s consider what the MCMC is trying to approximate. Bayesian inference is focused on posterior probabilities, P($\theta$|X), the probability of the parameter, $\theta$, given the data, X, which is inferred using Bayes’ Theorem:

$$P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{P(X)}$$

where P(X|$\theta$) is the likelihood function (the probability of observing the data given the parameter value), P($\theta$) is the prior probability for the parameter, and P(X) is the marginal likelihood of observing the data under the model.
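A tiny worked example may help fix the roles of the three terms. The Python sketch below applies Bayes’ Theorem to a made-up problem in which $\theta$ is a coin’s probability of heads and can take only three values. The candidate values, the uniform prior, and the data (7 heads in 10 flips) are all assumptions chosen for illustration. With so few candidate values, the marginal likelihood P(X) is just a short sum.

```python
from math import comb


def discrete_posterior(thetas, prior, heads, flips):
    """Apply Bayes' theorem over a finite set of candidate parameter values."""
    # Likelihood P(X | theta): binomial probability of the observed data.
    likelihood = [
        comb(flips, heads) * t**heads * (1.0 - t) ** (flips - heads)
        for t in thetas
    ]
    # Numerator of Bayes' theorem for each candidate value.
    joint = [lik * p for lik, p in zip(likelihood, prior)]
    # Marginal likelihood P(X): sum of the numerator over all candidates.
    marginal = sum(joint)
    posterior = [j / marginal for j in joint]
    return posterior, marginal


thetas = [0.25, 0.50, 0.75]          # hypothetical candidate values
prior = [1.0 / 3.0] * 3              # uniform prior over the candidates
posterior, marginal = discrete_posterior(thetas, prior, heads=7, flips=10)

for t, p in zip(thetas, posterior):
    print(f"theta = {t:.2f}  posterior = {p:.3f}")
print(f"marginal likelihood = {marginal:.5f}")
```

Dividing by the marginal likelihood is what makes the posterior probabilities sum to one. When $\theta$ is continuous and high-dimensional, that sum becomes the kind of integral that MCMC lets us avoid computing.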

# MCMCorner: Hello World!

“You can never be absolutely certain that the MCMC is reliable, you can only identify when something has gone wrong.” Andrew Gelman

Model-based inference is, after all, based on the model. Careful research means both being vigilant about the choice of model and rigorously assessing our ability to estimate under the chosen model. These two concerns pertain both to model-based inference of phylogeny (using programs such as RAxML or MrBayes) and to inferences based on phylogeny (such as the study of character evolution or lineage diversification), and indeed to all model-based inference.

The first issue is model specification, which entails three closely related issues: model selection, model adequacy, and model uncertainty. It is critically important for the simple reason that unbiased estimates can only be obtained under a model that provides a reasonable description of the process that gave rise to our data. Model selection entails assessing the relative fit of our dataset to a pool of candidate models. Rankings are based on model-selection methods that compare the relative fit of candidate models based either on their maximum-likelihood estimates (which measure the fit of the data to the model at a single point in parameter space), or on the marginal likelihood of the candidate models (which measures the average fit of the candidate models to the data). Model adequacy, an equally important but relatively neglected issue, assesses the absolute fit of the data to the model. Model uncertainty is related to the common (and commonly ignored) scenario in which multiple candidate models provide a similar fit to the data: in this scenario, conditioning on any single model (even the best) will lead to biased estimates, and so model averaging is required to accommodate uncertainty in the choice of model.
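To see what model uncertainty looks like in numbers, here is a small Python sketch that turns marginal likelihoods into posterior model probabilities and then averages a parameter estimate across models. The log marginal likelihoods and the per-model estimates are invented for illustration (only the model names JC, HKY, and GTR refer to real substitution models), and a uniform prior over models is assumed.

```python
import math


def posterior_model_probs(log_marginal_likelihoods, prior=None):
    """Posterior probability of each model, computed stably in log space."""
    names = list(log_marginal_likelihoods)
    if prior is None:
        # Assumption: every candidate model is equally probable a priori.
        prior = {name: 1.0 / len(names) for name in names}
    log_weights = {
        name: log_marginal_likelihoods[name] + math.log(prior[name])
        for name in names
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    largest = max(log_weights.values())
    total = sum(math.exp(value - largest) for value in log_weights.values())
    return {
        name: math.exp(log_weights[name] - largest) / total for name in names
    }


# Hypothetical log marginal likelihoods for three substitution models.
log_ml = {"JC": -1052.3, "HKY": -1040.1, "GTR": -1039.6}
probs = posterior_model_probs(log_ml)

# Hypothetical estimates of some shared parameter (e.g., tree length).
estimates = {"JC": 0.82, "HKY": 0.95, "GTR": 0.99}
averaged = sum(probs[name] * estimates[name] for name in log_ml)

for name in log_ml:
    print(f"{name}: posterior probability = {probs[name]:.3f}")
print(f"model-averaged estimate = {averaged:.3f}")
```

In this made-up case HKY and GTR fit almost equally well, so neither can be safely treated as the single true model, and the averaged estimate weights each model’s estimate by its posterior probability.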