We have had a series of posts introducing several foundational tools in phylogenetic inference, including Bayesian reasoning, Markov chain Monte Carlo, and the gamma distribution’s many uses in phylogenetics. Today, we’ll continue with this theme in a crosspost from my UH colleague Floyd Reed’s laboratory blog, in which Floyd gives a simple derivation of the Jukes-Cantor model of DNA substitution. Here it is in lightly edited form:
In previous posts I talked about irreversible and reversible mutations between two states or alleles. However, there are four nucleotides, A, C, G, and T. How can we model mutations among these four states at a single nucleotide site? It turns out that this is important to consider for things like making gene trees to represent species relationships. If we just use the raw number of differences between two species’ DNA sequences we can get misleading results. It is actually better to estimate and correct for the total number of changes that have occurred, some fraction of which may not be visible to us. The simplest way to do this is the Jukes-Cantor (1969) model.
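To make the idea of a correction concrete, here is a minimal Python sketch of the standard Jukes-Cantor distance formula, d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of sites that differ between two aligned sequences. The function names and the example values are illustrative assumptions, not part of the original post, and the formula is simply stated here rather than derived.

```python
import math


def observed_proportion(seq_a, seq_b):
    """Raw proportion of differing sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or len(seq_a) == 0:
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jc69_distance(p):
    """Jukes-Cantor estimate of substitutions per site, given observed proportion p."""
    # The correction is only defined below the saturation point p = 3/4,
    # where two sequences look no more similar than random ones.
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


if __name__ == "__main__":
    # Hypothetical proportions, chosen only to show how the gap widens.
    for p in (0.01, 0.1, 0.3, 0.5):
        print(f"observed p = {p:.2f}, corrected d = {jc69_distance(p):.4f}")
```

For small p the corrected distance is almost the same as the raw proportion. As p grows, the estimate climbs increasingly above it, which reflects the hidden changes that the raw count misses.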
Imagine a nucleotide can mutate with the same probability to any other nucleotide, so that the mutation rates in all directions are equal and symbolized by μ.
So from the point of view of the “A” state you can mutate away with a probability of 3μ (lower left above), because there are three other nucleotides to mutate to. However, another state will only mutate to an “A” with a probability of μ (lower right above); the “T” could have just as easily mutated to a “G” or “C” instead of an “A”.
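The equal-rates setup described above can be sketched as a small transition matrix in Python. This sketch assumes the per-direction mutation probability is written μ (`mu` in the code) and that time advances in discrete generations; both are assumptions made for illustration, as are the function names. It multiplies the one-generation matrix by itself t times and compares the result with the closed form 1/4 + (3/4)(1 - 4μ)^t for the chance that a site shows the same nucleotide after t generations, which follows from the symmetry of the matrix.

```python
NUCLEOTIDES = "ACGT"


def jc_step_matrix(mu):
    """One-generation matrix: leave a state with total probability 3*mu, mu per target."""
    return [[1.0 - 3.0 * mu if i == j else mu for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Plain matrix product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(m, t):
    """Multiply m by itself t times, starting from the identity (fine for modest t)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, m)
    return result


def prob_same_closed_form(mu, t):
    """Probability of observing the starting nucleotide again after t generations."""
    return 0.25 + 0.75 * (1.0 - 4.0 * mu) ** t


if __name__ == "__main__":
    mu, t = 0.01, 50  # hypothetical values, chosen only for illustration
    p_t = mat_pow(jc_step_matrix(mu), t)
    print("P(A -> A) by matrix power:", p_t[0][0])
    print("P(A -> A) by closed form: ", prob_same_closed_form(mu, t))
    print("P(A -> C) by matrix power:", p_t[0][1])
```

As t grows, every entry of the matrix approaches 1/4, so the starting nucleotide is eventually forgotten. That loss of signal is why raw difference counts understate the true number of changes.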
Many participants in the molecular evolution workshops I attend are very interested in methods for estimating the evolutionary dynamics of serially sampled pathogens. Recent versions of BEAST and BEAST2 have some of the most exciting and cutting-edge models for understanding evolutionary processes in these data. Because of this, I wanted to call your attention to a new tutorial on this subject:
Trevor Bedford has posted a tutorial entitled: Inferring spatiotemporal dynamics of the H1N1 influenza pandemic from sequence data.
He provides several detailed exercises that will surely help anyone new to these methods understand how to analyze their own data in BEAST. Interestingly, Trevor is hosting his tutorial on GitHub, which I think is a great idea.
I really enjoy teaching and participating in phylogenetics workshops. Currently, I’m preparing my teaching materials for the Wellcome Trust-EMBL-EBI Advanced Course on Computational Molecular Evolution, where I have the awesome opportunity to teach a section on divergence time estimation with Jeff Thorne. Since I’ve made some minor updates to the BEAST tutorial that I’ve given at recent workshops, I wanted to create a more permanent page to host the document and data files. So, for those interested, you can find the updated tutorial here. I will try to keep this tutorial as up to date as possible.
Figure 1: The impact of varying the shape parameter (alpha) on the gamma distribution.
Although the gamma distribution is widely used in phylogenetics, it remains something of a mystery to many of the students in our annual workshop in applied phylogenetics. Acquiring a basic understanding of the gamma distribution is key to understanding how many widely used phylogenetic methods work (e.g., MrBayes, BEAST, SIMMAP, BayesTraits). With a bit of help from Brian Moore, I’ve just posted a short tutorial that uses a series of simple R scripts and resulting figures to illustrate how and why the gamma distribution can assume such a wide range of shapes. I hope folks will find this tutorial useful, and encourage you to play around with the scripts I’ve posted to further your understanding of the gamma distribution. Drop me a line if you have any questions.
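For readers who want a quick feel for how the shape parameter changes the curve, here is a minimal Python sketch. It is not one of the tutorial’s R scripts. It assumes the parameterization commonly used for among-site rate variation, in which the rate parameter beta is set equal to the shape parameter alpha so that the mean is 1; the function name and the alpha values are illustrative choices.

```python
import math


def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta, evaluated at x > 0."""
    if x <= 0 or alpha <= 0 or beta <= 0:
        raise ValueError("x, alpha, and beta must all be positive")
    # Work on the log scale so large alpha values do not overflow.
    log_density = (alpha * math.log(beta) + (alpha - 1.0) * math.log(x)
                   - beta * x - math.lgamma(alpha))
    return math.exp(log_density)


if __name__ == "__main__":
    xs = (0.1, 0.5, 1.0, 2.0)
    for alpha in (0.5, 1.0, 2.0, 10.0):  # hypothetical shape values
        # Setting beta = alpha fixes the mean at 1 for every shape.
        row = "  ".join(f"f({x}) = {gamma_pdf(x, alpha, alpha):.4f}" for x in xs)
        print(f"alpha = {alpha:>4}: {row}")
```

With alpha below 1 the density piles up near zero and falls away steadily, with alpha equal to 1 it is an exponential distribution, and with larger alpha it becomes a hump that tightens around the mean.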