Phylogenetic Computing: What’s Your Solution?
Grant deadlines for DEB are coming up, and this has me thinking about how best to actually do the computation that I’m proposing. Since my lab is still in its early “get up and running” phase, I’m also in a position to invest in new resources and set up some standard operating procedures for the future. This is an issue that all phylogeneticists struggle with at one point or another, so I thought it would be useful to poll the community. What do you use for big analysis jobs in your lab?
Like many people in my generation of phylogenetics, I started out in the days of taping warning notes to monitors (Figure 1) (i.e., cobbling together whatever desktop machines one can get one’s hands on…and then jealously guarding them from
the enemy lab-mates for the months it takes your analysis to finish). Times have obviously changed since then, and we aren’t, as a field, nearly as computationally limited as we were 10 or even 5 years ago. Many free or easily accessible computing options are now available: CIPRES, the iPlant Discovery Environment, Amazon’s EC2, XSEDE (formerly TeraGrid), and any number of university/college/departmental clusters…and that’s not to mention the homebuilt clusters and trusty (dusty) desktops sitting in the corners of our labs. The workhorse software of our field is also faster than it used to be, allowing us to get more done in the same amount of time, irrespective of the hardware being used.
For the last few years, I’ve enjoyed the benefit of having my own small but speedy cluster (built cheaply using commodity parts), as well as a TeraGrid allocation. These have worked well for my needs: they’ve allowed me to get analyses finished in a timely fashion, run many tests and toy analyses without feeling limited, and lend time to coworkers in a pinch…and aside from all that, the cluster allows for lots of satisfying tinkering during off hours. All that said, the TeraGrid allocation is now finished, the large cluster on Maui that I’d been hoping to get an allocation on is no longer available, and I’m already seeing that the tropical climate here on Oahu is hell on hardware (e.g., my monitor fills up with condensation any time I leave it off for more than a day or two). I’m thinking about eventually moving completely into EC2 and XSEDE and not having to worry about hardware at all.
I’d appreciate learning about the experiences that others have had. What is your preferred solution for phylogenetic computing?
We have access to a University cluster, an XSEDE allocation, and a small lab cluster (12 Mac Pros dedicated to the cluster, plus opportunistic usage on 7 more Pros). You can see usage on our in-house plus University cluster through time on my website (http://www.brianomeara.info/) — we go up to 750 jobs at once. If I’m doing a stock analysis in something like RAxML I’ll use the CIPRES cluster or iPlant, but for a lot of what we do (testing and using new approaches we develop) we have to run our own software. We haven’t used our XSEDE allocation much yet.
EC2 or similar is nice in that you can scale up quickly (handy for impending grant deadlines), and I’d like to do more on it. However, I worry about going entirely that way for two reasons.

One is grant funding: with a 5% funding rate, there’s a good chance a lab will go through a lean spell with little money coming in. With your own hardware, you still have machines that work whether your lab is fat or lean (assuming space, electricity, network access, and cooling are covered). Shared clusters make sense in many ways, but in many cases they’re moving to a condo model: “buying in” gets you use of hardware for three years, but you have to pay annual maintenance fees on top of that, and then your investment disappears when the time period ends. Three-year-old hardware is old by cutting-edge computing standards, but for many problems (such as testing new algorithms) the scale of the problem isn’t very different from what it was five years ago, so an oldish cluster with guaranteed access is still handy. On the other hand, managing it is either a pain for you or someone in your lab, or a fixed expense for a sysadmin.

The other thing I worry about with EC2 is having one neophyte lab member burn through the cash quickly (let’s have each instance independently spend time FTPing GenBank for data….). One could put limits on lab members, but that means a lot of organizing within the lab, and it also discourages experimentation (make sure you know exactly what you want, because every second costs). With your own cluster, or one where there’s a fixed allocation by years rather than CPU time, you can feel freer to do a run, see the results, and tweak.

However, if you’re in a climate where hardware gets waterlogged, that’s a problem. I’d think there would be dehumidifiers or something similar to dry the air. Put the computers in a room with desert tortoises so IACUC requires the air to be dehumidified…. [kidding]
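To make the burn-through worry concrete, here is a minimal sketch of how quickly a batch of forgotten on-demand instances drains a budget. All the figures (hourly rate, budget, instance count) are hypothetical placeholders for illustration, not actual EC2 prices or anyone’s real lab budget:

```python
# Back-of-envelope: how fast do idle/forgotten cloud instances
# drain a lab budget? All numbers below are hypothetical.

ASSUMED_RATE_PER_INSTANCE_HOUR = 0.50  # USD per instance-hour (placeholder)
BUDGET = 5_000.0                       # USD, hypothetical cloud budget
N_INSTANCES = 20                       # e.g., a forgotten batch of workers

def hours_until_exhausted(budget, n_instances, rate):
    """Wall-clock hours before the budget is fully spent."""
    return budget / (n_instances * rate)

h = hours_until_exhausted(BUDGET, N_INSTANCES, ASSUMED_RATE_PER_INSTANCE_HOUR)
print(f"{h:.0f} hours (~{h / 24:.1f} days) until ${BUDGET:,.0f} is spent")
# → 500 hours (~20.8 days) until $5,000 is spent
```

At these placeholder numbers, twenty instances left running would empty the budget in about three weeks, which is why billing alerts or hard spending limits are worth setting up before handing out credentials.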
I’ve opted for the condo-in-a-local-cluster route myself, in large part because we have great university-wide computing resources at LSU. For the purchase price of each node (without paying any maintenance or other annual fees), I get 4 years of priority access. Combined with standard access to queues on several university systems, we generally have more than enough resources available to run jobs, even when urgently trying to meet a deadline. In the lab, I’ve just purchased mid-range desktops for each person to use when coding, running tests, etc. I’ve been toying with the idea of using EC2 or XSEDE, but we haven’t yet hit a point where local resources are exhausted.
As a grad student, I’ve been fortunate enough to cobble together several different computing sources for my various projects and can get the job done without too much trouble. For larger jobs (e.g., transcriptome assembly, simulations, etc.) I have access to two different university clusters, both of which have their limitations. Both resources offer some level of technical support; however, one cluster strictly limits the number of jobs you can be running (2–20 jobs, with a lower allotment for computationally intensive/longer jobs) and often involves a long waitlist, while the other cluster is more flexible but is limited to analyses for a specific project. The lab has several desktops for smaller and/or more specific analyses, and these are regularly in use by various lab members. As far as personal computational resources, I am limited to my trusty laptop and a low-cost custom-built desktop that I invested in early in grad school. Like I said, it’s almost a little bit of everything, but it’s getting the job done!
Thanks for the thoughts, all. It seems that most people are using some combination of in-house and external resources, perhaps meaning that no one is yet comfortable moving totally to offsite resources? I’m tempted to try it, but I can definitely see some headaches developing (for custom software in particular, as Brian points out). I built our current cluster just last year, so I hope it has a few more years of use in it before I need to make a decision…and by then, the options probably will have changed anyway.
Good question. I frequently rely on lab and department clusters, but have found the NERSC cluster to be a much more reasonable option for scaling up than EC2. I’ve twice applied for entry-level NERSC grants (50,000 hours), a process I have found both straightforward and incredibly useful. http://www.nersc.gov/ Having professional support when trying to scale your applications can also be very nice. I haven’t crunched the numbers recently, but I’m guessing that the equivalent time on EC2 would be pretty pricey.
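For anyone who does want to crunch those numbers, here is a minimal sketch comparing a 50,000-hour allocation against buying the same core-hours on demand. The per-core-hour rate is an assumed placeholder, not a quoted EC2 price; plug in the current rate for whatever instance type you’d actually use:

```python
# Back-of-envelope: cost of replacing a fixed CPU-hour allocation
# with on-demand cloud time. The price below is a placeholder,
# NOT a real EC2 rate -- check current pricing before relying on it.

ALLOCATION_CORE_HOURS = 50_000            # entry-level NERSC grant size
ASSUMED_PRICE_PER_CORE_HOUR = 0.05        # USD per core-hour (hypothetical)

def ec2_equivalent_cost(core_hours, price_per_core_hour):
    """Cost of buying the same number of core-hours on demand."""
    return core_hours * price_per_core_hour

cost = ec2_equivalent_cost(ALLOCATION_CORE_HOURS, ASSUMED_PRICE_PER_CORE_HOUR)
print(f"Equivalent on-demand cost at "
      f"${ASSUMED_PRICE_PER_CORE_HOUR}/core-hour: ${cost:,.2f}")
# → Equivalent on-demand cost at $0.05/core-hour: $2,500.00
```

Even at a modest assumed rate, a free allocation is worth thousands of dollars of on-demand time, which supports the point that applying for one is time well spent.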
I know I’m late to the party, but could you write a blog post or two on your own cluster, hardware manifests and the like? I’m really interested, since we can now do parallel GPU computing in at least TNT and R, but I’m not sure about PAUP or MrBayes. I’m not script-handy, but if it allowed analyses in MrBayes or similar software to go much quicker, I’d be all for such a thing.
I use PAUP for most of my analyses, but I use shortcuts such as Wilson’s treespace search to get around computational time issues. While I have a quad-core machine, PAUP can’t take advantage of multicore processing, which is frustrating. I know there’s a version of PAUP out there from someone’s thesis that enabled PAUP to use multiple cores or multiple GPUs, but I don’t know if they released it.
Hi Nick. Sure thing, I’ll pull together something and put it up later this week.