“You can never be absolutely certain that the MCMC is reliable, you can only identify when something has gone wrong.” Andrew Gelman
Model-based inference is, after all, based on the model. Careful research means being vigilant both regarding the choice of model and rigorously assessing our ability to estimate under the chosen model. These two concerns pertain both to model-based inference of phylogeny—using programs such as RaXML or MrBayes—and to inferences based on phylogeny—such as the study of character evolution, lineage diversification—and indeed to all model-based inference.
The first issue—model specification, which entails three closely related issues—is critically important for the simple reason that unbiased estimates can only be obtained under a model that provides a reasonable description of the process that gave rise to our data. Model selection entails assessing the relative fit of our dataset to a pool of candidate models. Rankings are based on model-selection methods that compare the relative fit of candidate modes based either on their maximum-likelihood estimates (which measures the fit of the data to the model at a single point in parameter space), or on the marginal likelihood of the candidate models (which measures the average fit of the candidate models to the data). Model adequacy—an equally important but relatively neglected issue—assesses the absolute fit of the data to the model. Model uncertainty is related to the common (and commonly ignored) scenario in which multiple candidate models provide a similar fit to the data: in this scenario, conditioning on any single model (even the best) will lead to biased estimates, and so model averaging is required to accommodate uncertainty in the choice of model.
Much less concern is given to the second aspect of model-based inference: the ability to obtain reliable estimates under the chosen model(s). Our field is currently experiencing a “pioneering era”—increasingly complex (and presumably realistic) phylogenetic models are being proposed and implemented at an unprecedented rate. Our frontier era, however, more closely resembles the ‘wild west’ of the 1760’s than the ‘space race’ of the 1960’s: the statistical behavior of many new models remains unchartered territory, and might aptly carry the warning label ‘Here be dragons’. Nevertheless, most users implicitly assume, it seems, that if a model has been implemented correctly, and if that implementation has been “successfully” used to obtain an estimate from a given dataset (i.e., the input file has been read into the program, the program has been run and an output file has been generated), then we must have performed valid inference under the model. This would be perfectly sound reasoning if inferences were based on analytical methods.
Owing to the complexity of most phylogenetic models, however, it is not possible to estimate parameters analytically. Instead, parameter estimates are based on numerical methods. In the case of maximum-likelihood estimation, these are typically hill-climbing algorithms that search the profile likelihood to identify the vector of point estimates for all phylogenetic model parameters that jointly maximize the likelihood of observing the data under the model. The reliability of these algorithms can (and should) be assessed by comparing estimates obtained from repeated analyses that are initiated from random points in parameter space. Because there is only one maximum likelihood estimate, the terminal parameter values of replicate searches should be identical (within the precision of computer memory).
In the Bayesian statistical framework, inferences focus on the joint posterior probability density of phylogenetic model parameters, which is approximated by Markov chain Monte Carlo (MCMC) algorithms. It may be comforting to know that, in theory, an appropriately constructed and adequately run MCMC simulation is guaranteed to provide an arbitrarily precise description of the joint posterior probability density. In practice, however, even a given MCMC algorithm that provides reliable estimates in most cases will nevertheless fail in some cases and is not guaranteed to work for any given dataset. This raises an obvious question: “When do we know that an MCMC algorithm provides reliable estimates for a given empirical analysis“. The answer is simple: Never.
Fortunately, this problem is not unique to Bayesian inference of phylogeny. Much of Bayesian inference outside our field also relies on MCMC algorithms to approximate the joint posterior probability density of parameters: similar concerns regarding the reliability of those numerical approximations have motivated the development of a suite of diagnostic tools to assess MCMC performance. The trick is learning how to use these tools effectively and rigorously, especially for analyses that entail complex phylogenetic models and/or large datasets. As a field, we have failed both to emphasize the importance of assessing MCMC performance, and also to provide opportunities to develop the skills to do so.
I intend to use this blog as a venue to discuss issues related to the diagnosis of MCMC algorithms used for Bayesian inference of phylogeny. The goal of this thread is to raise general awareness on this issue by stimulating discussion about specific diagnostics, successful strategies, common pathologies, best practices, etc., associated with assessing the performance of phylogenetic MCMC methods. If you have questions or comments, please leave a message below or drop me an email.