On building a small cluster

Treethinkers reader Nick left a comment on one of my earlier posts asking for some details about the cluster that I built for my lab, so that's what this post covers. I'll start by outlining some information about the cluster, list the specific parts I used (although note that this was two years ago, so good choices would likely be different today), and then give a couple of general thoughts on building and maintaining your own cluster.

Our cluster is a small machine intended to crunch through moderate numbers of phylogenetic analyses and to serve as a resource for projects where it's convenient to have more administrative access than you typically have on large shared clusters. It comprises 4 compute machines and a head node. Each compute machine has two 6-core Xeons, 500 GB of storage, and 24 GB of memory. Because these processors are hyperthreaded, each chip with 6 physical cores exposes 12 threads, meaning the 4 compute machines have 96 threads available in total. I built it using pretty standard commodity parts available from your favorite internet-based vendor. Many of these parts are tailored to the gaming market, which is actually a little annoying: lots of fancy LEDs lighting everything up. I built the head node from a cheap barebones PC that I bought from Newegg. It provides a lot of storage and has plenty of power for compiling, transfers, and other maintenance tasks. This cluster is far from blazing fast, but it's a good workhorse for us, roughly on par with 4 high-end Mac Pros from a couple of years ago. It's small enough not to cause any problems with cooling, and it can run on a single 20-amp breaker. In short, I built it trying to strike a balance between processing power and difficulty of setup and maintenance.

Two new workshops on phylogenetics and macroevolution

NESCent Academy will be hosting two workshops this summer that may be of interest to folks reading this blog; the deadline for applications is 1 May 2014.

Paleobiological and Phylogenetic Approaches to Macroevolution, July 22-29

This course will teach participants to use fossil and phylogenetic data to analyze macroevolutionary patterns using traditional paleobiological stratigraphic methods, phylogenetic comparative methods, and combined fossil-and-tree approaches. Macroevolutionary research is currently split into two quite isolated branches, one based on fossils and the other on extant taxa and phylogenies. Increasingly, evolutionary biologists in both camps are realizing that only by combining neontological and paleontological data and approaches can a new, more powerful, integrative macroevolution emerge. Unfortunately, these two disciplines utilize very different data and quantitative methods. Therefore, to truly initiate a synthesis of these two approaches, we need to train students and researchers to understand the intricacies of both fossil and phylogenetic data, and the methods necessary to integrate them. APPLY HERE. More information can be found here.

Instructors
Roger Benson Dept. of Earth Sciences, University of Oxford
Samantha Hopkins Clark Honors College and the Department of Geological Sciences, University of Oregon
Gene Hunt Dept. of Paleobiology, National Museum of Natural History, The Smithsonian Institution, Washington DC 20013-7012, USA.
Samantha Price Dept. Evolution & Ecology, University of California Davis
Daniel Rabosky Dept. of Ecology and Evolutionary Biology, University of Michigan
Lars Schmitz Keck Science Department, Claremont McKenna, Pitzer, and Scripps Colleges
Graham Slater Dept. of Paleobiology, National Museum of Natural History, The Smithsonian Institution

Phylogenetic Analysis Using RevBayes, August 25-31

The Bayesian statistical framework for phylogeny estimation has facilitated the development of models that better capture biological complexity. This course is built around the use of the new, open-source program RevBayes (http://sourceforge.net/projects/revbayes/). RevBayes implements an R-like language (complete with control statements, user-defined functions, and loops) that enables the user to build up phylogenetic models from simple parts. This course covers the basics of probability theory, graphical models, and phylogenetics. Then, building on these concepts, we will provide lectures on statistical methods for phylogenetic inference, macroevolution, and epidemiology. APPLY HERE. More information can be found here.

Instructors
Bastien Boussau, LBBE, Lyon, France
Tracy Heath, UC Berkeley & U Kansas
Sebastian Höhna, UC Davis & UC Berkeley
John Huelsenbeck, UC Berkeley
Michael Landis, UC Berkeley
Nicolas Lartillot, LBBE, Lyon, France
Brian Moore, UC Davis
Fredrik Ronquist, NRM Stockholm

A new phylogeneticist blogger

I’d like to advertise a newcomer among bloggers in phylogenetics: Nicolas Lartillot, now a researcher in Lyon. Nicolas just started blogging a couple of weeks ago but, judging from the number of posts he has already contributed, he seems bound to become a very prolific blogger.

Nicolas has made several noteworthy contributions to the field of phylogenetics, in particular Bayesian phylogenetics. For instance, he developed the CAT model of protein evolution, which seems to be more resilient against the long branch attraction artifact; he proposed thermodynamic integration for computing Bayes factors; he developed a model for investigating correlations between continuous traits and rates of molecular evolution along a phylogeny; and he maintains the PhyloBayes package.

His blog is called “The Bayesian kitchen”, which I believe means that, underneath the nice theoretical properties of Bayesian inference, a fair amount of cooking is sometimes necessary to get things to work. So far his posts have been about the Bayesian/frequentist divide, the philosophy of Bayesian inference, and the interpretation of posterior probabilities, among other things. He uses examples from phylogenetics (e.g. dating, diversification models, comparative methods, or gene tree–species tree methods) or population genetics to help make his points. I’m certain I’m going to learn a lot from his posts, and I believe some of the readers of this blog will enjoy them too!

2014 Bodega Bay Workshop – Apply now

Applications are now being accepted for the 2014 Workshop in Applied Phylogenetics. This year’s workshop will run from March 8 to 15 at the Bodega Bay Marine Lab on the northern California Coast. The application deadline is January 3rd. See the 2014 workshop page for more information and instructions to apply.

Workshop on Integrating Molecular Phylogenies and the Fossil Record

Last week I attended a workshop organized by Hélène Morlon, Tiago Quental, and Charles Marshall on integrating data from the fossil record into phylogenetic methods. This three-day workshop was sponsored by the France-Berkeley Fund, a cool program that provides seed grants to build partnerships between UC Berkeley researchers and French collaborators. All of the events took place at the UCMP on the UC Berkeley campus.

Hélène, Charles, and Tiago recognized the increasing interest in methods and analyses that incorporate data from fossil taxa. Since there are several of us working in this area, particularly in methods development, the need to build a collaborative network is critical. Furthermore, as methods become more and more reliant on data from the fossil record, connections between neontologists and paleontologists must be formed. Notably, a similar working group, organized by Sam Price and Lars Schmitz, was held at NESCent this past spring and was made up of an overlapping set of researchers. One result of the NESCent catalysis meeting will be an SSE Symposium at Evolution 2014 on “Reuniting fossil and extant approaches to macroevolution”.

Jukes-Cantor model of DNA substitution

We have had a series of posts introducing several foundational tools in phylogenetic inference, including Bayesian reasoning, Markov chain Monte Carlo, and the gamma distribution’s many uses in phylogenetics. Today, we’ll continue that theme with a crosspost from my UH colleague Floyd Reed‘s laboratory blog, in which Floyd gives a simple derivation of the Jukes-Cantor model of DNA substitution. Here it is in lightly edited form:

In previous posts I talked about irreversible and reversible mutations between two states or alleles.  However, there are four nucleotides, A, C, G, and T.  How can we model mutations among these four states at a single nucleotide site?  It turns out that this is important to consider for things like making gene trees to represent species relationships.  If we just use the raw number of differences between two species’ DNA sequences we can get misleading results.  It is actually better to estimate and correct for the total number of changes that have occurred, some fraction of which may not be visible to us.  The simplest way to do this is the Jukes-Cantor (1969) model.

Imagine a nucleotide can mutate with the same probability to any other nucleotide, so that the mutation rates in all directions are equal and symbolized by $\mu$.
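Written out as a rate matrix (rows and columns in the order A, C, G, T), this equal-rates assumption is the standard Jukes-Cantor $Q$ matrix, with each diagonal entry set so its row sums to zero:

$$Q = \begin{pmatrix} -3\mu & \mu & \mu & \mu \\ \mu & -3\mu & \mu & \mu \\ \mu & \mu & -3\mu & \mu \\ \mu & \mu & \mu & -3\mu \end{pmatrix}$$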

So from the point of view of the “A” state you can mutate away with a probability of $3\mu$ (lower left above).  However, another state will only mutate to an “A” with a probability of $\mu$ (lower right above); the “T” could have just as easily mutated to a “G” or “C” instead of an “A”.
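To make the “estimate and correct for the total number of changes” idea concrete, here is a minimal Python sketch (mine, not from Floyd’s post) of the standard Jukes-Cantor distance correction, which converts the observed proportion of differing sites $p$ into an estimate of the expected number of substitutions per site:

```python
import math

def jc69_distance(p):
    """Jukes-Cantor (1969) corrected distance.

    p is the observed proportion of sites that differ between two
    aligned sequences. Because multiple substitutions at a site can
    overwrite or revert earlier ones, p underestimates the true amount
    of change; the JC69 correction is d = -(3/4) * ln(1 - (4/3) * p).
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

# For small p the correction is mild; as p approaches 3/4 (saturation,
# where two sequences look essentially random with respect to each
# other) the corrected distance blows up.
print(jc69_distance(0.10))  # slightly above 0.10
print(jc69_distance(0.50))  # substantially above 0.50
```

Note the $3/4$ in the formula: under the equal-rates model above, even two unrelated sequences are expected to agree at a quarter of their sites by chance, so $p$ can never (in expectation) exceed $3/4$.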

Q&A: Excluding Character Sets with Partitions in MrBayes

Bodega Workshop alum Christoph Hiedtke has the following question regarding excluded character sets when setting up partitions in MrBayes. With his permission I’m posting it here. I’ve run into this exact problem before and I’m sure many others have also.

Christoph writes:

Hey gang, how is everybody doing?

I am going crazy over what initially seemed to be a rather trivial MrBayes operation. Initially I had set up a MrBayes file dividing my alignment into 3 partitions, and it executes perfectly. I then wanted to re-run the same file but this time excluding one partition from my analysis with the designated “exclude” command, but for some reason I am getting an error I cannot get around. Does anyone know what’s going on?

Here is part of my command block:

begin MrBayes;
charset p1 = 1-370 371-844 845-1124 2159-2395 3018-3328;
charset p2 = 1125-1404 1685-1921 1922-2158 2396-2706 2707-3017;
charset p3 = 1405-1684;
partition parts = 3: p1, p2, p3;
exclude p3;
set partition = parts;
prset applyto=(all) ratepr=variable;
end;

MrBayes gets stuck on the “set partition = parts” line with the following error:

Defining charset called p1
Defining charset called p2
Defining charset called p3
Defining partition called parts
Excluding character(s)
Setting parts as the partition, dividing characters into 3 parts.
Setting model defaults
Seed (for generating default start values) = 1507443219
You must have at least one site in a partition. Partition 3
has 0 site patterns.
Error when setting parameter “Partition” (2)

HELP!!

Ok, before getting to the answer, I’ll just point out that the obvious alternative approach of excluding p3 and THEN defining a partition that contains only the p1 and p2 character sets will also give an error, this time for not assigning every site to one of the partitions. All sites need to be assigned to a partition and all partitions need to have at least one site, so how can we get away with excluding anything?

The solution is to change the command block as follows:

begin MrBayes;
charset p1 = 1-370 371-844 845-1124 2159-2395 3018-3328;
charset p2 = 1125-1404 1685-1921 1922-2158 2396-2706 2707-3017;
charset p3 = 1405-1684;
partition parts = 2: p1, p2 p3;
exclude p3;
set partition = parts;
prset applyto=(all) ratepr=variable;
end;

We’ve defined 2 partitions (instead of 3), assigning the ‘extra’ character set to the second partition (note the missing comma between p2 and p3). Now all sites are assigned to a partition and no partition is empty, so we’re free to exclude the character set.