MrBayes 3.2.2 Primer (with a focus on model selection)
UCE Example Data
Summary
Primary Contact: Jeremy Brown
A description from the MrBayes webpage:
MrBayes is a program for the Bayesian estimation of phylogeny. Bayesian inference of phylogeny is based upon a quantity called the posterior probability distribution of trees, which is the probability of a tree conditioned on the observations. The conditioning is accomplished using Bayes’s theorem. The posterior probability distribution of trees is impossible to calculate analytically; instead, MrBayes uses a simulation technique called Markov chain Monte Carlo (or MCMC) to approximate the posterior probabilities of trees.
Introduction
This tutorial uses MrBayes version 3.2.1 and one example data file: primates.nex.
Tutorial
Installation
MrBayes can be downloaded from the MrBayes website. Pre-compiled, executable versions of MrBayes for PCs or Macs are available for direct download. Both serial and parallel versions are available. Parallel versions allow a single analysis to take advantage of multiple processors, if they are available.
The source code can also be downloaded and compiled locally for use on Unix systems or for command-line use on other systems. Command-line versions will facilitate batch and cluster-based analyses. Instructions for compilation are given on the website.
Starting MrBayes
If you have downloaded a pre-compiled executable version of MrBayes, simply double-click on the icon to start MrBayes. If you have compiled the source code locally, you can execute MrBayes from the command line (e.g., ./mb in Unix), once you have set your working directory properly.
Once MrBayes has started, it should print out an intro splash screen and then wait for your commands at the command prompt:
MrBayes >
Getting Help
The single most important MrBayes command is help. Simply typing help on the command line,
MrBayes > help
will provide a list of the possible MrBayes commands and a brief description of their purpose. If you would like more information on a particular command, including a list of its associated parameters and their current settings, use help <command_name> where<command_name> should be replaced by the name of the command in which you are interested. For example, to get help on theexecute command, type:
MrBayes > help execute
Loading and Manipulating Data
The first command that you will need to use MrBayes is execute. This command loads information (either a data matrix or a set of MrBayes commands) from a file. The file should be primates.nex file in the same folder as MrBayes and then type:
MrBayes > execute primates.nex
The execute command can be used identically with a file containing MrBayes commands. If this file contains all the commands needed for an analysis, one could simply open MrBayes and execute this command file at the command prompt. You could also run MrBayes with a command file by specifying the file name (e.g., mb_cmds.nex) on the command line when executing MrBayes. For instance, in Unix:
mb mb_cmds.nex
[Note: This command line should work if you’ve run the MrBayes installer, so the MrBayes binary has been automatically placed in your path. If you’ve compiled MrBayes from source code on your own, you’ll need to move the binary to the folder where you want to run your analysis. You’ll then need to type
./mb mb_cmds.nex
instead.]
Once data has been loaded into MrBayes, sites can be excluded from an analysis using the exclude command. Sites are designated for exclusion based on the number corresponding to their column in an alignment. For instance, to exclude sites 90-99 in the primates data that we already loaded, type:
MrBayes > exclude 90-99
If sites need to be re-included, use the include command:
MrBayes > include 90-99
Often, one might want to include/exclude every n-th character. For instance, defining a character set that starts with the first character and includes every 3rd character would circumscribe all 1st codon positions. This type of inclusion/exclusion is accomplished by using the notation \3 at the end of the character list. For example, to exclude all 1st codon positions, type:
MrBayes > exclude 1-. \3
The
"."
symbol is used to mean the last position in our data set. To exclude all 2nd codon positions, use a very similar command, but start the character list with the 2nd site:
MrBayes > exclude 2-. \3
The inclusion/exclusion status of all sites in our dataset can be viewed using the charstat command.
Specifying a Model of Sequence Evolution
Basic Model Manipulations
Most model specification is done using the lset command. The most common model specifications performed in lset involve the number of substitution types and the model of rate variation across sites. For instance, the number of unique substitution rates can be specified using the nst parameter. Possible values for nst are 1 (as in the Jukes-Cantor model), 2 (as in the HKY model), and 6 (as in the GTR model). Possible models of rate variation specified with the rates parameter include a proportion of invariable sites (I), gamma-distributed rates across sites (G), or a combination of the two (I+G). For example, you could specify a GTR+G model by typing:
MrBayes > lset nst=6 rates=gamma
In the newest version of MrBayes (3.2), one can also avoid having to specify only one scheme of substitution types by allowing MrBayes to move across different schemes as part of its MCMC sampling. This procedure is known as reversible jump MCMC (RJ-MCMC). To set up reversible jumping, use the following command:
MrBayes > lset nst=mixed rates=gamma
Reversible jumping is not currently set up for different models of rate variation across sites, so you will still need to specify +I, +G, or +I+G.
While I will not provide detailed instructions here on analyses with doublet and codon models in MrBayes, these models can also be specified using the lset command. See the MrBayes manual for more information.
By default, MrBayes allows the stationary frequencies of the four nucleotides to be estimated from the data. To fix these frequencies at particular values (often 0.25 for all four nucleotides), use the prset command. MrBayes considers the fixation of base frequencies to be a prior setting, rather than a setting of the likelihood model. For instance, to fix all frequencies to be equal, type:
MrBayes > prset statefreqpr=fixed(equal)
Partitioning
It is now widely recognized that the evolutionary process has not been homogeneous across sites. One approach to accomodating heterogeneity in the evolutionary process across sites is to divide sites into distinct subsets (a process called partitioning) and model the evolution of each subset using an independent Markov model of nucleotide substitution.
The first step in performing a partitioned analysis is to define the distinct subsets of your data. These definitions are made using thecharset command. The syntax for specifying sites with charset is the same as with the include and exclude commands. For instance, to specify a subset corresponding to the first codon position, type:
MrBayes > charset cod.pos.1 = 1-. \3
Subsets corresponding to codon positions 2 and 3 could be similarly specified as:
MrBayes > charset cod.pos.2 = 2-. \3 MrBayes > charset cod.pos.3 = 3-. \3
Note that each site must be assigned to one and only one subset to perform a partitioned analysis.
Once all of the appropriate character sets have been defined, they need to be explicity combined into a partition. Appropriately, this is done with a command called partition. For instance, to specify a partitioning scheme with 3 subsets corresponding to the three codon positions, type
MrBayes > partition cod.pos = 3:cod.pos.1,cod.pos.2,cod.pos.3
The number immediately preceding the
":"
gives the number of subsets in the partioning scheme.
After defining a partitioning scheme, you need to tell MrBayes that you actually want to use that scheme. This is done with the setcommand. So, to use the
"cod.pos"
partitioning scheme, type
MrBayes > set partition=cod.pos
Now that the partitioning scheme is all set, you need to define models of evolution for each subset. As above, this is done with thelset command. The one additional consideration is that you need to tell MrBayes to which subsets you want any particular lset call to apply. To apply one lset command to all subsets within a partitioning scheme, type
MrBayes > lset applyto=(all) nst=6 rates=gamma
If you want to apply different lset calls to different partitions, use the numbers corresponding to the subset (defined by the order in which the subsets are listed in the partition definition). So, to apply an HKY model to codon positions 1 & 2 and a GTR+G model to codon position 3, type
MrBayes > lset applyto=(1,2) nst=2 MrBayes > lset applyto=(3) nst=6 rates=gamma
After defining the appropriate models, you need to explicitly tell MrBayes that parameters (some or all) need to be estimated independently (or
"unlinked"
) for each partition. The
"unlinking"
of parameters across parititions is accomplished with the unlinkcommand
MrBayes > unlink revmat=(all) tratio=(all) statefreq=(all) shape=(all)
Each parameter (or related set of parameters) needs to be unlinked individually.
"revmat"
corresponds to the relative rates of substitution,
"tratio"
corresponds to the transition/transversion rate ratio,
"statefreq"
corresponds to the stationary nucleotide frequencies, and
"shape"
corresponds to the alpha shape parameter of the gamma distribution that models rate variation across sites. You may also need to unlink the proportion of invariable sites across partitions if they’re included in your models.
To check that you’ve unlinked parameters appropriately across partitions, use the showmodel command.
It is also possible to allow the overall tree length to be estimated independently for each subset, through the use of independent
"scaling parameters"
. By default, these scaling parameters are linked across partitions. Unlike the unlinking of other parameters, the scaling parameters are unlinked using prset by typing
MrBayes > prset ratepr=variable
Note that the use of rate multipliers across partitions still assumes that the relative branch lengths in the tree are identical across partitions. If, instead, one wants to model different relative branch lengths across partitions, every branch length can be estimated independently across partitions. The unlinking of every branch length across partitions is done with unlink
MrBayes > unlink brlens=(all)
Unlink commands can be reversed using the link command, with identical syntax.
Specifying Priors on Model Parameters
Priors for nearly all parameters (i.e., relative rates of substitution, base frequencies, branch lengths, etc) can be set using the prsetcommand. The available prior distributions vary by parameter type. Lots of information on possibile distributions can be gained fromhelp prset. As one example, a more informative prior around equal base frequencies (but not fixed at equal frequencies) can be set by increasing the values of the Dirichlet parameters
MrBayes > prset applyto=(all) statefreqpr=Dirichlet(100,100,100,100)
One particularly important prior to consider is the branch-length prior. This prior is important because this single distribution is applied to all the branch lengths in the tree, which can be a very large number of independent parameters. This is a particularly good prior to try testing for prior sensitivity. In other words, run multiple analyses with different branch-length priors and examine the sensitivity of the resulting posteriors to the specification of the prior. Also note that the parameter specified for the exponential prior on branch lengths is the rate of the exponential distribution, which is the inverse of its mean. To specify an exponential prior that is pushed up more tightly around small branch lengths, you should specify a larger rate parameter. For instance,
MrBayes > prset brlenspr=Unconstrained:Exp(100)
– Exponential prior on branch lengths with a mean of 0.01.
MrBayes > prset brlenspr=Unconstrained:Exp(10)
– Exponential prior on branch lengths with a mean of 0.1. (Default in MrBayes)
MrBayes > prset brlenspr=Unconstrained:Exp(1)
– Exponential prior on branch lengths with a mean of 1.
It has also been shown that the branch-length prior specification can affect the inferred posterior probabilities of the topology. Therefore, when performing prior sensitivity analyses with different branch-length priors, it might be important to monitor sensitivity to inferred branch lengths and topologies.
Manipulating Markov chain Monte Carlo (MCMC) Settings
There are a huge number of potential settings pertaining to the Markov chain Monte Carlo (MCMC) analysis that can be manipulated in MrBayes. Most of these changes are made using the mcmcp and mcmc commands. These commands differ only in whether or not they begin the Markov chain when the command is issued. mcmcp just sets the parameter values without starting the analysis. mcmcwill begin the analysis after setting the values of parameters. Some of the MCMC parameters that can be set using these commands include:
ngen – The number of MCMC generations MrBayes will run before pausing
nruns – The number of independent MCMC analyses run by MrBayes
nchains – The number of Metropolis-coupled chains for each independent MCMC analysis
samplefreq – The frequency with which MrBayes writes output to files
printfreq – The frequency with which MrBayes prints output to the screen
filename – The root name for output file names
temp – The degree of heating for Metropolis-coupled chains
Many other MCMC options can be set with these commands. The full list of these options can be obtained by typing help mcmcp orhelp mcmc.
The other command useful in tweaking the details of the MCMC analysis is the props command. props can alter the relative frequency of proposals to different parameters, as well as the distributions from which proposals are drawn. Proposal distributions differ according to the parameter of interest. The final page of the MrBayes manual lists the proposal distributions for each parameter type and their associated tuning parameters. It is also important to note that in the current version of MrBayes (3.1.2), the propscommand can only be issued
"interactively"
. In other words, you cannot put a props command in a MrBayes block of a command file. Typing props will cause MrBayes to prompt you for more information about how to change proposal frequencies and distributions. Finally, a ‘word of caution‘. Certain changes to proposals can make it virtually impossible for the MCMC analysis to mix properly across the posterior distribution and give biased results. Using rigorous checks of convergence should allow you to detect cases of very poor mixing.
Summarizing Output
After you have run an MCMC analysis in MrBayes and made sure that your runs have converged (a topic not covered here), you can summarize the estimated posterior distributions of both the parameter values and trees using the sump (parameters) and sumt(trees) commands. You can also discard those samples biased by the starting point of an MCMC analysis, a process known as discarding the
"burn-in"
. Removing burn-in when summarizing the posterior distributions is done by specifying a value for the burn-in parameter with sump and sumt
MrBayes > sump burnin=1000
– Summarizing posterior samples of parameter values, discarding 1,000 samples (NOT generations)
MrBayes > sumt burnin=1000
– Summarizing posterior samples of trees, discarding 1,000 samples
Several other parameters can be specified when performing posterior summaries. A full list can be obtained by typing help sump orhelp sumt.
The output of the sump command will include means, medians, variances, and 95% credible intervals for all scalar parameters.
The output of the sumt command includes estimated posterior probabilities of branches (.parts file), estimated posterior probabilities of trees (.trprobs file), and a majority-rule consensus tree calculated from the posterior tree samples (.con file).
Other software (some listed below) can help you summarize posterior distributions of trees and parameter values, as well as diagnose convergence.
Estimating Marginal Likelihoods Using Steppingstone Sampling
The marginal likelihood of a model is an important quantity that allows comparison of fit across different models, accounting for uncertainty in the values of model parameters. The simplest way to estimate the marginal likelihood of a model based on the output of MCMC sampling of the posterior is to calculate the harmonic mean of their likelihood scores. However, the harmonic mean can have some highly undesirable statistical properties, including bias and extremely high variance. This leads to low repeatability across replicates for estimates that are inaccurate. A much more accurate and repeatable approach was recently developed, also based on importance sampling, and is known as steppingstone sampling (Xie et al., 2011; Fan et al., 2011). In this approach, an MCMC chain is still used, but it samples a series of “power posterior distributions” that move progressively between the posterior and the prior (or vice versa). A theoretical treatment of this approach is beyond the scope of this tutorial, but I encourage any user interested in estimating marginal likelihoods to read the papers describing and characterizing this approach (Xie et al., 2011; Fan et al., 2011).
To estimate marginal likelihoods via steppingstone in MrBayes, one first needs to specify appropriate settings. This can be done using the
mcmcp
and
ssp
commands. Relevant parameters include the total number of generations sampled (set with
mcmcp
), the burn-in before the first step of the sampling (set with
ssp
), the burn-in for each sampling step (set with
mcmcp
), and the number of different power posterior distributions sampled (set with
ssp
).
After everything is set up, start the steppingstone sampling process using the simple
ss
command:
MrBayes > ss
Once sampling is completed across all power posterior distributions, MrBayes will provide an estimate of the marginal likelihood. It is good practice to estimate these values several times with independent runs to ensure their repeatability.
NOTE: Because MrBayes is sampling many different distributions that vary between the posterior and prior when steppingstone is run, do NOT use the trees and parameter files from an
ss
run to estimate posterior probabilities and credible intervals.
Related Software
Tracer
Andrew Rambaut to summarize posterior distributions of scalar values (e.g., log-likelihoods and model parameters). It was originally written to summarize output from BEAST, but it also works quite nicely with MrBayes .p files. Tracer allows the generation of trace plots, plots of marginal distributions (both single and joint), and summary statistics of marginal distributions.
AWTY
David Swofford. It’s currently available as a web-based tool and will generate plots comparing the estimated posterior distributions of branch support from multiple independent MCMC runs.
TreeSetViz
Mesquite implementing multidimensional scaling to visualize the distances between a set of phylogenetic trees in two dimensions, written by a number of students and faculty at City University of New York (CUNY) and the Univ. of Texas. It is extremely useful for visualizing the movement of an MCMC run through tree space, as well as comparing the tree topologies sampled by independent analyses.
NOTE: As of the last update of this wiki page, TreeSetViz was not compatible with the most recent version of TreeSetViz are still available.
TreeScaper
TreeScaper allows visualization of the distances between a set of trees in 2D (similar to TreeSetViz) or 3D using multidimensional scaling. TreeScaper is stand-alone software and provides a good deal of flexibility in projecting the tree-to-tree distances. It is useful for many applications, including the comparison of trees sampled across independent MrBayes runs or the comparison of trees sampled during analyses of different genes.
References
Fan, Y., R. Wu, M.-H. Chen, L. Kuo, and P.O. Lewis. 2011. Choosing among partition models in Bayesian phylogenetics. Systematic Biology 28: 523-532. Available Here
Hillis, D.M., T.A. Heath, and K. St. John. 2005. Analysis and visualization of tree space. Systematic Biology 54: 471-482. Available Here
Ronquist, F. and J.P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574 Available Here
Ronquist, F., M. Teslenko, P. van der Mark, D.L. Ayres, A. Darling, S. Hohna, B. Larget, L. Liu, M.A. Suchard, and J.P. Huelsenbeck. 2012. MrBayes3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology In Press Available Here
Wilgenbusch, J.C., D.L. Warren, and D.L. Swofford. 2004. AWTY: A system for graphical exploration of MCMC convergence in Bayesian phylogenetic inference Available Here
Xie, W., P.O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. 2011. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology 60:150-160 Available Here
Hi, it`s a awsome tutorial but I had a problem with the partition by codon and evolutive model.
I have three gene and two are coding and one is a rDNA, each gene has a different evolutive model and I try something like
charset citb = 1-470;
charset citb_p1 = 1-470\3;
charset citb_p2 = 2-470\3;
charset citb_p3 = 3-470\3;
charset coi = 471-1076;
charset coi_p1 = 471-1076\3;
charset coi_p2 = 472-1076\3;
charset coi_p3 = 473-1076\3;
charset 16S = 1077-1505;
partition by_codon= 3: citb, coi, 16S;
set partition = by_codon;
lset applyto=(1) nst=6 rates=gamma;
prset apply=(1) statefreqpr=fixed(equal);
lset applyto=(2) nst=2 rates=invgamma;
lset applyto=(3) nst=6 rates=invgamma;
unlink revmat=(all) pinvar=(all) shape=(all) statefreq=(all);
prset ratepr=variable;
mcmcp ngen= 20000000 printfreq=1000 samplefreq=1000 nchains=4;
end;
I wrote this blocks on the .nex file, and when I put this file on mrbayes, the software read but don`t start the analysis what do you think?
Thanks
If you change the “mcmcp” to just “mcmc” or add a separate “mcmc” command after the block above, the analysis will start.
— Jeremy
Hi Jeremy
It works
Thank you very much