Author Archives: Bob Thomson

About Bob Thomson

Assistant professor in the Department of Biology at University of Hawaii Manoa.

UPDATE: Must read papers for graduate students

Following up on my previous post, here is the list of ‘Must Read’ papers in phylogenetics that were suggested on Twitter. I think that this is a great start, even though it is missing some classics and some important topics (divergence time estimation, for example). Thanks to everyone for chipping in with their thoughts and thanks again to Matt Hahn and Matt Pennell for getting the conversation started.

I apologize if I missed anyone’s contributions. Feel free to suggest additions, either here in the comments or on Twitter with the hashtag #mustreadphylo.

Bull, J. J., Huelsenbeck, J. P., Cunningham, C. W., Swofford, D. L., & Waddell, P. J. (1993). Partitioning and combining data in phylogenetic analysis. Systematic Biology, 42(3), 384–397.

Cavalli-Sforza, L. L., & Edwards, A. W. F. (1967). Phylogenetic analysis. Models and estimation procedures. The American Journal of Human Genetics, 19, 233–257.

Edwards, S. V. (2009). Is a new and general theory of molecular systematics emerging? Evolution, 63, 1–19.

Felsenstein, J. (1973). Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic Zoology, 22, 240–249.

Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology, 27, 401–410.

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17, 368–376.

Felsenstein, J. (1985). Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39, 783–791.

Felsenstein, J. (1985). Phylogenies and the comparative method. American Naturalist, 125, 1–15.

Goldman, N. (1993). Statistical tests of models of DNA substitution. Journal of Molecular Evolution, 36, 182–198.

Hillis, D. M., & Bull, J. J. (1993). An Empirical Test of Bootstrapping as a Method for Assessing Confidence in Phylogenetic Analysis. Systematic Biology, 42, 182–192.

Holder, M., & Lewis, P. O. (2003). Phylogeny estimation: traditional and Bayesian approaches. Nature Reviews Genetics, 4, 275–284.

Kumar, S., Filipski, A. J., Battistuzzi, F. U., Kosakovsky Pond, S. L., & Tamura, K. (2012). Statistics and truth in phylogenomics. Molecular Biology and Evolution, 29, 457–472.

Maddison, W. P. (1997). Gene Trees in Species Trees. Systematic Biology, 46, 523–536.

Pauling, L., & Zuckerkandl, E. (1963). Chemical paleogenetics. Acta Chemica Scandinavica, 17, S9–S16.

Sullivan, J., & Swofford, D. (1997). Are Guinea Pigs Rodents? The Importance of Adequate Models in Molecular Phylogenetics. Journal of Mammalian Evolution, 4, 77–86.

Must read papers for graduate students

This post is sparked by an ongoing conversation on Twitter that was kicked off when Matthew Hahn and Matt Pennell got to talking about developing a list of papers that should be required reading for graduate students with an interest in phylogenetics. This is a good idea, and I can’t recall seeing such a list. I start teaching phylogenetics in our graduate core course here at UH next week and the 2015 Bodega workshop is only a few weeks away, so I’m finding this to be a timely and useful conversation.

There are already several good suggestions from folks on Twitter, including, well... most of Joe Felsenstein’s early phylogenetics papers and his book, Maddison’s 1997 and Edwards’ 2009 papers on gene tree conflict, and Sullivan and Swofford’s 1997 paper on the importance of adequate models (of course, guinea pigs are also noble beasts deserving of study in their own right).

Please jump into the conversation on Twitter with your suggestions, or leave them here in the comments. I’ll post an update with a bibliography in a few days. Thanks to Matt and Matt for bringing this up!

On building a small cluster

Treethinkers reader Nick left a comment on one of my earlier posts asking for some details about the cluster that I built for my lab. I’ll do that with this post. I’ll start by outlining some information about the cluster, list the specific parts I used (although note that this was two years ago, so good choices would likely be different today), and then give a couple of general thoughts on building and maintaining your own cluster.

Our cluster is a small machine intended to crunch through moderate numbers of phylogenetic analyses and to serve as a resource for projects where it’s convenient to have more administrative access than you typically have on large shared clusters. It comprises 4 compute machines and a head node. Each compute machine has two 6-core Xeons, 500 GB of storage, and 24 GB of memory. Because these processors are hyper-threaded, each chip with 6 physical cores has 12 threads available, meaning the 4 compute machines (8 chips in total) have 96 threads available. I built it using pretty standard commodity parts available from your favorite internet-based vendor. Many of these parts are tailored to the gaming market, which is actually a little annoying... lots of fancy LEDs lighting everything up. I built the head node from a cheap barebones PC that I bought from Newegg. It provides a lot of storage and has plenty of power for compiling, transfers, and other maintenance tasks. This cluster is far from being blazing fast, but it’s a good workhorse for us that is roughly on par with 4 high-end Mac Pros from a couple of years ago. It’s small enough to not cause any problems with cooling and it can run on a single 20 amp breaker. In short, I built it trying to find a balance between processing power and difficulty in setup and maintenance.



Jukes-Cantor Model of DNA substitution

We have had a series of posts introducing several foundational tools in phylogenetic inference, including Bayesian reasoning, Markov Chain Monte Carlo, and the gamma distribution’s many uses in phylogenetics. Today, we’ll continue with this theme in a crosspost from my UH colleague Floyd Reed’s laboratory blog. Here, Floyd gives a simple derivation of the Jukes-Cantor model of DNA substitution. Here it is in lightly edited form:

In previous posts I talked about irreversible and reversible mutations between two states or alleles.  However, there are four nucleotides, A, C, G, and T.  How can we model mutations among these four states at a single nucleotide site?  It turns out that this is important to consider for things like making gene trees to represent species relationships.  If we just use the raw number of differences between two species’ DNA sequences we can get misleading results.  It is actually better to estimate and correct for the total number of changes that have occurred, some fraction of which may not be visible to us.  The simplest way to do this is the Jukes-Cantor (1969) model.

Imagine a nucleotide can mutate with the same probability to any other nucleotide, so that the mutation rates in all directions are equal and symbolized by $\mu$.

So from the point of view of the “A” state you can mutate away with a probability of $3\mu$ (lower left above).  However, another state will only mutate to an “A” with a probability of $\mu$ (lower right above); the “T” could have just as easily mutated to a “G” or “C” instead of an “A”.
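The derivation itself continues in Floyd’s original post. As a rough sketch of where it ends up, here are the standard continuous-time results for this model, stated from the general literature rather than from the text above. With rate $\mu$ to each of the three other nucleotides, the probability that a site is in the same state after time t is 1/4 + (3/4)exp(-4μt), and an observed proportion of differing sites p corresponds to a corrected distance d = -(3/4) ln(1 - 4p/3). The function names below are hypothetical and purely illustrative.

```python
import math


def jc69_prob_same(mu, t):
    """Probability a site shows the same nucleotide after time t.

    Assumes the Jukes-Cantor setup described above: rate mu to each of
    the three other nucleotides, treated as a continuous-time process.
    """
    return 0.25 + 0.75 * math.exp(-4.0 * mu * t)


def jc69_distance(p):
    """Corrected distance (expected substitutions per site) from the
    observed proportion p of sites that differ between two sequences."""
    if not 0.0 <= p < 0.75:
        # At p >= 0.75 the sequences look random and the log is undefined.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The correction grows with p: 10% observed differences imply roughly
# 0.107 substitutions per site once hidden changes are accounted for.
observed = 0.10
corrected = jc69_distance(observed)
```

The two functions are consistent with each other: if a site evolves for time t at rate $\mu$ per direction, the expected number of changes is 3μt, and feeding the expected proportion of differences, 1 - jc69_prob_same(mu, t), into jc69_distance recovers exactly that value.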

Q&A: Excluding Character Sets with Partitions in MrBayes

Bodega Workshop alum Christoph Hiedtke has the following question regarding excluded character sets when setting up partitions in MrBayes. With his permission I’m posting it here. I’ve run into this exact problem before and I’m sure many others have also.

Christoph writes:

Hey gang, how is everybody doing?

I am going crazy over what initially seemed to be a rather trivial MrBayes operation. Initially I had set up a MrBayes file dividing my alignment into 3 partitions and it executes perfectly. I then wanted to re-run the same file but this time excluding one partition from my analysis with the designated “exclude” command, but for some reason I am getting an error I cannot get around. Does anyone know what’s going on?

Here is part of my command block:

begin MrBayes;
charset p1 = 1-370 371-844 845-1124 2159-2395 3018-3328;
charset p2 = 1125-1404 1685-1921 1922-2158 2396-2706 2707-3017;
charset p3 = 1405-1684;
partition parts = 3: p1, p2, p3;
exclude p3;
set partition = parts;
unlink statefreq=(all) revmat=(all) shape=(all) pinvar=(all);
prset applyto=(all) ratepr=variable;

MrBayes gets stuck on the “set partition = parts” line with the following error:

Defining charset called p1
Defining charset called p2
Defining charset called p3
Defining partition called parts
Excluding character(s)
Setting parts as the partition, dividing characters into 3 parts.
Setting model defaults
Seed (for generating default start values) = 1507443219
You must have at least one site in a partition. Partition 3
has 0 site patterns.
Error when setting parameter “Partition” (2)


OK, before getting to the answer, I’ll just point out that the obvious alternative approach of excluding p3 and THEN defining a partition that contains only the p1 and p2 character sets will also give an error, this time because not every site has been assigned to a partition. All sites need to be assigned to a partition and all partitions need to have at least one site, so how can we get away with excluding anything?

The solution is to change the command block as follows:

begin MrBayes;
charset p1 = 1-370 371-844 845-1124 2159-2395 3018-3328;
charset p2 = 1125-1404 1685-1921 1922-2158 2396-2706 2707-3017;
charset p3 = 1405-1684;
partition parts = 2: p1, p2 p3;
exclude p3;
set partition = parts;
unlink statefreq=(all) revmat=(all) shape=(all) pinvar=(all);
prset applyto=(all) ratepr=variable;

We’ve defined 2 partitions (instead of 3) assigning the ‘extra’ character set to the second partition (note the missing comma). Now we have all sites assigned to a partition and no partitions are empty, so we’re free to exclude the character set.
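To make the two rules concrete, here is a small Python sketch that applies them to the character sets above. This is not MrBayes code and not how MrBayes implements its checks; `parse_charset` and `check_partition` are hypothetical helpers that simply restate the rules from this post (every site assigned to a partition, and every partition left with at least one site after exclusions).

```python
def parse_charset(spec):
    """Expand a range list such as '1-370 371-844' into a set of site numbers."""
    sites = set()
    for token in spec.split():
        start, _, end = token.partition("-")
        end = end or start  # allow a single site such as '42'
        sites.update(range(int(start), int(end) + 1))
    return sites


def check_partition(partition, excluded, n_sites):
    """Return a list of problems for a partition scheme (empty list = OK).

    partition: list of sets of sites, one set per partition
    excluded:  set of sites removed with 'exclude'
    n_sites:   total number of sites in the alignment
    """
    problems = []
    assigned = set().union(*partition)
    missing = set(range(1, n_sites + 1)) - assigned
    if missing:
        problems.append(f"{len(missing)} site(s) not assigned to any partition")
    for i, part in enumerate(partition, start=1):
        if not part - excluded:
            problems.append(f"partition {i} has no included sites")
    return problems


p1 = parse_charset("1-370 371-844 845-1124 2159-2395 3018-3328")
p2 = parse_charset("1125-1404 1685-1921 1922-2158 2396-2706 2707-3017")
p3 = parse_charset("1405-1684")

original = check_partition([p1, p2, p3], excluded=p3, n_sites=3328)
alternative = check_partition([p1, p2], excluded=p3, n_sites=3328)
solution = check_partition([p1, p2 | p3], excluded=p3, n_sites=3328)
```

Under these assumptions, `original` reports an empty third partition, `alternative` reports 280 unassigned sites, and `solution` comes back empty, which mirrors the three cases discussed above.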

Phylogenetic Computing: What’s Your Solution?

Grant deadlines for DEB are coming up and this has me thinking about the best way to go about actually doing the computation that I’m proposing to do. Since my lab is still in its early “get up and running” phase, I’m also in a position to invest in new resources and set up some standard operating procedures for the future. This is an issue that all phylogeneticists struggle with at one point or another, so I thought it would be useful to poll the community. What do you use for big analysis jobs in your lab?

Like many people in my generation of phylogenetics, I started out in the days of taping warning notes to monitors (Figure 1), i.e., cobbling together whatever desktop machines one could get one’s hands on... and then jealously guarding them from the enemy lab-mates for the months it took an analysis to finish. Times have obviously changed since then and we aren’t, as a field, nearly as computationally limited as we were 10 or even 5 years ago. Many free or easily accessible computing options are now available: CIPRES, the iPlant Discovery Environment, Amazon’s EC2, XSEDE (formerly TeraGrid), and any number of university/college/departmental clusters... and that’s not to mention the homebuilt clusters and trusty (dusty) desktops sitting in the corners of our labs. The workhorse software of our field is also faster than it used to be, allowing us to get more done in the same amount of time, irrespective of the hardware being used.

Figure 1 – The classic PAUP* warning note (note: I stole this from the ad for Brian O’Meara’s “Fast Free Phylogenies” HPC workshop at NIMBioS)

For the last few years, I’ve enjoyed the benefit of having my own small but speedy cluster (built cheaply using commodity parts), as well as a TeraGrid allocation. These have worked well for my needs: they’ve allowed me to get analyses finished in a timely fashion; run many tests and toy analyses without feeling limited; lend time to coworkers in a pinch…and aside from all that, the cluster allows for lots of satisfying tinkering during off hours. All that said, the TeraGrid allocation is now finished, the large cluster on Maui that I’d been hoping to get an allocation on is no longer available, and I’m already seeing that the tropical climate here on Oahu is hell on hardware (e.g. my monitor fills up with condensation anytime I leave it off for more than a day or two). I’m thinking about eventually moving completely into EC2 and XSEDE and not having to worry about hardware at all.

I’d appreciate learning about the experiences that others have had. What is your preferred solution for phylogenetic computing?

Evolution Deadline

A quick note that the early registration deadline for this year’s joint annual meeting of SSE, SSB, and ASN is fast approaching. Early registration ends on Friday, April 19th. I’m really looking forward to this one. It’s in a beautiful location and should nicely avoid that convention center wasteland feel that many meetings unfortunately have these days. I’m sure that many people associated with the Bodega workshop will be there. Who’s going?


About TreeThinkers

TreeThinkers is a blog devoted to phylogenetic and phylogeny-based inference. We aim to use it as a place to discuss recent research and methods, to ask and answer questions, and to serve as a general resource for news and trivia in phylogenetics. We have already had several posts since the 2013 Bodega Workshop ended and we plan to keep things going strong. This post is meant to give a (belated) introduction to the blog and provide some general information about how things work around here.

The group of us who organize the Bodega workshop have been talking about developing a blog associated with the course for a few years. Following the switch from our increasingly clunky wiki to this shiny new site, we’ve decided that the time was ripe... and here we are! Several of us Bodega instructors have signed on as contributors, and we welcome guest posts or regular contributions from the rest of the community. Get in touch with me if you’re interested.

As a blog associated with a course, one of our central focuses will always be on teaching. Brian Moore has a running series of posts called MCMC Corner that discusses various aspects of Bayesian inference. Judging from his drafts of upcoming posts, this will be an informative and useful set of articles. We will also be posting on general topics of interest, best practices, and common sources of confusion encountered in phylogenetic analysis. Rich Glor’s tutorial explaining the various parameterizations of the Gamma distribution is a great example. Finally, we’ll post about recent findings, news, and announcements that are relevant to phylogenetics. One of our major goals here is to lower the learning curve associated with phylogenetic inference. With that goal in mind, if you have a question, ask it! Leave a comment, tweet @treethinkers, email me, or email one of the contributors and we’ll do our best to answer it in a post or a tutorial.