# On building a small cluster

Treethinkers reader Nick left a comment on one of my earlier posts asking for some details about the cluster that I built for my lab. I’ll do that with this post. I’ll start by outlining some information about the cluster, list the specific parts I used (although note that this was two years ago, so good choices would likely be different today), and then give a couple of general thoughts on building and maintaining your own cluster.

Our cluster is a small machine intended to crunch through moderate numbers of phylogenetic analyses and to serve as a resource for projects where it’s convenient to have more administrative access than you typically have on large shared clusters. It comprises 4 compute machines and a head node. Each compute machine has two 6-core Xeons, 500 GB of storage, and 24 GB of memory. Because these processors are hyperthreaded, each chip with 6 physical cores has 12 threads available, meaning the 4 compute machines (8 chips in total) have 96 threads available. I built it using pretty standard commodity parts available from your favorite internet-based vendor. Many of these parts are tailored to the gaming market, which is actually a little annoying…lots of fancy LEDs lighting everything up. I built the head node from a cheap barebones PC that I bought from Newegg. It provides a lot of storage and has plenty of power for compiling, transfers, and other maintenance tasks. This cluster is far from being blazing fast, but it’s a good workhorse for us that is roughly on par with 4 high-end Mac Pros from a couple of years ago. It’s small enough to not cause any problems with cooling and it can run on a single 20 amp breaker. In short, I built it trying to find a balance between processing power and difficulty in setup and maintenance.

# Two new workshops on phylogenetics and macroevolution

NESCent Academy will be hosting two workshops this summer that may be of interest to folks reading this blog. The deadline for applications is 1st May 2014.

Paleobiological and Phylogenetic Approaches to Macroevolution, July 22-29

This course will teach participants to use fossil and phylogenetic data to analyze macroevolutionary patterns using traditional paleobiological stratigraphic methods, phylogenetic comparative methods, and combined fossil and tree approaches. Macroevolutionary research is currently split into two quite isolated branches, one based on fossils and the other on extant taxa and phylogenies. Increasingly, evolutionary biologists in both camps are realizing that a new and more powerful integrative macroevolution can emerge only by combining neontological and paleontological data and approaches. Unfortunately, these two disciplines utilize very different data and quantitative methods. Therefore, to truly initiate a synthesis of these two approaches, we need to train students and researchers to understand the intricacies of both fossil and phylogenetic data, and the methods necessary to integrate them. APPLY HERE. More information can be found here.

Instructors
Roger Benson, Dept. of Earth Sciences, University of Oxford
Samantha Hopkins, Clark Honors College and the Department of Geological Sciences, University of Oregon
Gene Hunt, Dept. of Paleobiology, National Museum of Natural History, The Smithsonian Institution
Samantha Price, Dept. of Evolution & Ecology, University of California Davis
Daniel Rabosky, Dept. of Ecology and Evolutionary Biology, University of Michigan
Lars Schmitz, Keck Science Department, Claremont McKenna, Pitzer, and Scripps Colleges
Graham Slater, Dept. of Paleobiology, National Museum of Natural History, The Smithsonian Institution

Phylogenetic Analysis Using RevBayes, August 25-31

The Bayesian statistical framework for phylogeny estimation has facilitated the development of models that better capture biological complexity. This course is built around the use of the new, open-source program RevBayes (http://sourceforge.net/projects/revbayes/). RevBayes implements an R-like language (complete with control statements, user-defined functions, and loops) that enables the user to build up phylogenetic models from simple parts. This course covers the basics of probability theory, graphical models, and phylogenetics. Then, building on these concepts, we will provide lectures on statistical methods for phylogenetic inference, macroevolution, and epidemiology. APPLY HERE. More information can be found here.

Instructors
Bastien Boussau, LBBE, Lyon, France
Tracy Heath, UC Berkeley & U Kansas
Sebastian Höhna, UC Davis & UC Berkeley
John Huelsenbeck, UC Berkeley
Michael Landis, UC Berkeley
Nicolas Lartillot, LBBE, Lyon, France
Brian Moore, UC Davis
Fredrik Ronquist, NRM Stockholm

# A new phylogeneticist blogger

I’d like to advertise a newcomer among bloggers in phylogenetics: Nicolas Lartillot, now a researcher in Lyon. Nicolas just started blogging a couple of weeks ago but, judging from the number of posts he has already contributed, he seems bound to become a very prolific blogger.

Nicolas has made several noteworthy contributions to the field of phylogenetics, in particular Bayesian phylogenetics. For instance, he developed the CAT model of protein evolution, which seems to be more resilient against the Long Branch Attraction artifact; he proposed Thermodynamic Integration for computing Bayes factors; he developed a model for investigating correlations between continuous traits and rates of molecular evolution along a phylogeny; and he maintains the PhyloBayes package.

His blog is called “The Bayesian kitchen”, which I believe means that, underneath the nice theoretical properties of Bayesian inference, a fair amount of cooking is sometimes necessary to get things to work. So far his posts have been about the Bayesian/frequentist divide, about the philosophy of Bayesian inference, or about the interpretation of posterior probabilities, among other things. He uses examples from phylogenetics (e.g. dating, diversification models, comparative methods, or gene tree-species tree methods) or population genetics to help make his points. I’m certain I’m going to learn a lot from his posts, and I believe some of the readers of this blog will enjoy them too!

# 2014 Bodega Bay Workshop – Apply now

Applications are now being accepted for the 2014 Workshop in Applied Phylogenetics. This year’s workshop will run from March 8 to 15 at the Bodega Bay Marine Lab on the northern California Coast. The application deadline is January 3rd. See the 2014 workshop page for more information and instructions to apply.

# Is There Life After Graduate School?

In an earlier post, I discussed the decision about attending graduate school in the sciences. I argued that graduate school is certainly not the right choice for everyone. For people of a certain mind-set, though, it is the perfect choice. And even if you have all the right attributes for graduate school, you can still be miserable if you pick the wrong advisor or graduate program, so that choice is also important. But let’s assume that you decided that graduate school was the right choice for you, you did the research, found the perfect advisor, happily toiled away long hours discovering things about the natural world that no one else in the world knew about, published lots of exciting papers about those results, finished a dissertation, and successfully completed a Ph.D. Now you have to address the question that friends and family have been asking you for years: What will you do for the rest of your life, and how will you make a living doing it? How can you make a living doing something as specialized and arcane as phylogenetics, for example?

At least once a month, I see blog posts from disgruntled current or former graduate students about “The Terrible Experience of Graduate School.” I advise a group of extremely bright undergraduates who are interested in research careers in the sciences, and they get scared to death by all these internet horror stories. The problem is, almost the only people who blog about their graduate school experience are the people who are (or were) extremely unhappy. There are certainly unhappy graduate students, but the truth is that many graduate students love the experience. But no one seems to want to write or read a blog post about the writer’s wonderful experience in graduate school. It sounds like gloating or bragging, and happy people usually are just content to be happy.

# Workshop on Integrating Molecular Phylogenies and the Fossil Record

Last week I attended a workshop organized by Hélène Morlon, Tiago Quental, and Charles Marshall on integrating data from the fossil record into phylogenetic methods. This three-day workshop was sponsored by the France-Berkeley Fund, a cool program that provides seed grants to build partnerships between UC Berkeley researchers and French collaborators. All of the events took place at the UCMP on the UC Berkeley campus.

Hélène, Charles, and Tiago recognized the increasing interest in methods and analyses that incorporate data from fossil taxa. Since several of us are working in this area, particularly in methods development, building a collaborative network is critical. Furthermore, as methods become more and more reliant on data from the fossil record, connections between neontologists and paleontologists must be formed. Notably, a similar working group, organized by Sam Price and Lars Schmitz, was held at NESCent this past spring and was made up of an overlapping set of researchers. One result of the NESCent catalysis meeting will be an SSE Symposium at Evolution 2014 on “Reuniting fossil and extant approaches to macroevolution”.

# Jukes Cantor Model of DNA substitution

We have had a series of posts introducing several foundational tools in phylogenetic inference, including Bayesian reasoning, Markov Chain Monte Carlo, and the gamma distribution’s many uses in phylogenetics. Today, we’ll continue with this theme in a crosspost from my UH colleague Floyd Reed’s laboratory blog. Here, Floyd gives a simple derivation of the Jukes-Cantor model of DNA substitution. Here it is in lightly edited form:

In previous posts I talked about irreversible and reversible mutations between two states or alleles.  However, there are four nucleotides, A, C, G, and T.  How can we model mutations among these four states at a single nucleotide site?  It turns out that this is important to consider for things like making gene trees to represent species relationships.  If we just use the raw number of differences between two species’ DNA sequences we can get misleading results.  It is actually better to estimate and correct for the total number of changes that have occurred, some fraction of which may not be visible to us.  The simplest way to do this is the Jukes-Cantor (1969) model.

Imagine a nucleotide can mutate with the same probability to any other nucleotide, so that the mutation rates in all directions are equal and symbolized by $\mu$.

So from the point of view of the “A” state you can mutate away with a probability of $3\mu$ (lower left above).  However, another state will only mutate to an “A” with a probability of $\mu$ (lower right above); the “T” could have just as easily mutated to a “G” or “C” instead of an “A”.
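To make the idea of correcting for unseen changes concrete, here is a minimal Python sketch of the standard Jukes-Cantor results under the parametrization used above, where $\mu$ is the rate of change to each one of the three other nucleotides. The function names are hypothetical, chosen only for this illustration, and `t` is assumed to be the total time separating the two sequences being compared.

```python
import math


def jc_prob_same(mu, t):
    """Probability that a site shows the same nucleotide after time t.

    mu is the rate of change to each specific other nucleotide,
    so the total rate of leaving the current state is 3 * mu.
    """
    return 0.25 + 0.75 * math.exp(-4.0 * mu * t)


def jc_prob_specific_change(mu, t):
    """Probability that a site shows one particular different nucleotide after time t."""
    return 0.25 - 0.25 * math.exp(-4.0 * mu * t)


def jc_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site).

    p is the observed proportion of sites that differ between two sequences.
    The correction is undefined once p reaches 0.75 (saturation).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


if __name__ == "__main__":
    mu, t = 0.01, 50.0  # illustrative values, not taken from the post
    p_observed = 1.0 - jc_prob_same(mu, t)
    print("proportion of sites expected to differ:", p_observed)
    print("corrected distance:", jc_distance(p_observed))
    print("true expected substitutions per site (3 * mu * t):", 3.0 * mu * t)
```

The observed proportion of differences falls short of the true number of substitutions per site because some sites change more than once; the corrected distance recovers $3\mu t$.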

# Q&A: Excluding Character Sets with Partitions in MrBayes

Bodega Workshop alum Christoph Hiedtke has the following question regarding excluded character sets when setting up partitions in MrBayes. With his permission I’m posting it here. I’ve run into this exact problem before and I’m sure many others have also.

Christoph writes:

Hey gang, how is everybody doing?

I am going crazy over what initially seemed to be a rather trivial MrBayes operation. Initially I had set up a MrBayes file dividing my alignment into 3 partitions and it executes perfectly. I then wanted to re-run the same file but this time excluding one partition from my analysis with the designated “exclude” command, but for some reason I am getting an error I cannot get around. Does anyone know what’s going on?

Here is part of my command block:

begin MrBayes;
charset p1 = 1-370 371-844 845-1124 2159-2395 3018-3328;
charset p2 = 1125-1404 1685-1921 1922-2158 2396-2706 2707-3017;
charset p3 = 1405-1684;
partition parts = 3: p1, p2, p3;
exclude p3;
set partition = parts;
prset applyto=(all) ratepr=variable;
end;

MrBayes gets stuck on the “set partition = parts” line with the following error:

Defining charset called p1
Defining charset called p2
Defining charset called p3
Defining partition called parts
Excluding character(s)
Setting parts as the partition, dividing characters into 3 parts.
Setting model defaults
Seed (for generating default start values) = 1507443219
You must have at least one site in a partition. Partition 3
has 0 site patterns.
Error when setting parameter “Partition” (2)

HELP!!

Ok, before getting to the answer, I’ll just point out that the obvious alternative approach of excluding p3 and THEN defining a partition that contains only the p1 and p2 character sets will also give an error, this time because not every site is assigned to a partition. All sites need to be assigned to a partition and all partitions need to have at least one site, so how can we get away with excluding anything?

The solution is to change the command block as follows:

begin MrBayes;
charset p1 = 1-370 371-844 845-1124 2159-2395 3018-3328;
charset p2 = 1125-1404 1685-1921 1922-2158 2396-2706 2707-3017;
charset p3 = 1405-1684;
partition parts = 2: p1, p2 p3;
exclude p3;
set partition = parts;
prset applyto=(all) ratepr=variable;
end;

We’ve defined 2 partitions (instead of 3) assigning the ‘extra’ character set to the second partition (note the missing comma). Now we have all sites assigned to a partition and no partitions are empty, so we’re free to exclude the character set.
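The two rules described above (every site must belong to a partition, and every partition must keep at least one site after exclusion) can be checked before launching a run. The following is a minimal Python sketch of such a check, using the character sets from Christoph's file. The helper `check_partition` is hypothetical and only mimics the rules as described in this post; it is not part of MrBayes and does not reproduce its actual validation code.

```python
def expand(ranges):
    """Turn a list of inclusive (start, end) ranges into a set of site numbers."""
    sites = set()
    for start, end in ranges:
        sites.update(range(start, end + 1))
    return sites


def check_partition(charsets, partition, excluded, nchar):
    """Return a list of problems with a partition scheme (empty list if none).

    charsets:  dict mapping charset name -> list of inclusive (start, end) ranges
    partition: list of partitions, each a list of charset names
    excluded:  set of charset names passed to 'exclude'
    nchar:     total number of sites in the alignment
    """
    site_sets = {name: expand(ranges) for name, ranges in charsets.items()}
    excluded_sites = set().union(*(site_sets[name] for name in excluded))
    problems = []
    assigned = set()
    for index, names in enumerate(partition, start=1):
        part_sites = set().union(*(site_sets[name] for name in names))
        assigned |= part_sites
        if not part_sites - excluded_sites:
            problems.append(f"partition {index} has no sites left after exclusion")
    missing = set(range(1, nchar + 1)) - assigned
    if missing:
        problems.append(f"{len(missing)} site(s) not assigned to any partition")
    return problems


# Character sets copied from the command block above.
CHARSETS = {
    "p1": [(1, 370), (371, 844), (845, 1124), (2159, 2395), (3018, 3328)],
    "p2": [(1125, 1404), (1685, 1921), (1922, 2158), (2396, 2706), (2707, 3017)],
    "p3": [(1405, 1684)],
}
NCHAR = 3328  # last site referenced by the charsets

if __name__ == "__main__":
    schemes = {
        "original (3 partitions)": [["p1"], ["p2"], ["p3"]],
        "p1 and p2 only": [["p1"], ["p2"]],
        "fixed (p3 folded into partition 2)": [["p1"], ["p2", "p3"]],
    }
    for label, scheme in schemes.items():
        print(label, "->", check_partition(CHARSETS, scheme, {"p3"}, NCHAR) or "ok")
```

Only the third scheme passes both checks, which matches the behavior described above: the excluded character set still has to live in some partition, and that partition must contain other sites as well.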

# Phylogenetic Computing: What’s Your Solution?

Grant deadlines for DEB are coming up and this has me thinking about the best way to go about actually doing the computation that I’m proposing to do. Since my lab is still in its early “get up and running” phase, I’m also in a position to invest in new resources and set up some standard operating procedures for the future. This is an issue that all phylogeneticists struggle with at one point or another, so I thought it would be useful to poll the community. What do you use for big analysis jobs in your lab?

Like many people in my generation of phylogenetics, I started out in the days of taping warning notes to monitors (Figure 1): you cobbled together whatever desktop machines you could get your hands on, and then jealously guarded them from enemy lab-mates for the months it took your analysis to finish. Times have obviously changed since then and we aren’t, as a field, nearly as computationally limited as we were 10 or even 5 years ago. Many free or easily accessible computing options are now available: CIPRES, the iPlant Discovery Environment, Amazon’s EC2, XSEDE (formerly TeraGrid), and any number of university/college/departmental clusters…and that’s not to mention the homebuilt clusters and trusty (dusty) desktops sitting in the corners of our labs. The workhorse software of our field is also faster than it used to be, allowing us to get more done in the same amount of time, irrespective of the hardware being used.

Figure 1 – The classic PAUP* warning note (note: I stole this from the ad for Brian O’Meara’s “Fast Free Phylogenies” HPC workshop at NIMBioS)

For the last few years, I’ve enjoyed the benefit of having my own small but speedy cluster (built cheaply using commodity parts), as well as a TeraGrid allocation. These have worked well for my needs: they’ve allowed me to get analyses finished in a timely fashion; run many tests and toy analyses without feeling limited; lend time to coworkers in a pinch…and aside from all that, the cluster allows for lots of satisfying tinkering during off hours. All that said, the TeraGrid allocation is now finished, the large cluster on Maui that I’d been hoping to get an allocation on is no longer available, and I’m already seeing that the tropical climate here on Oahu is hell on hardware (e.g. my monitor fills up with condensation anytime I leave it off for more than a day or two). I’m thinking about eventually moving completely into EC2 and XSEDE and not having to worry about hardware at all.

I’d appreciate learning about the experiences that others have had. What is your preferred solution for phylogenetic computing?