Author Archives: Mike May

Trees R Us: Introduction

“What I cannot create, I do not understand.” Richard Feynman

This series of posts is intended to be a hands-on R-based companion to some of the other things our contributors discuss. We might delve deeper into the behavior of the gamma distribution (or any of the many probability distributions popular in phylogenetics), code up an MCMC algorithm, or work through Felsenstein’s pruning algorithm, to name a few exercises. Playing around with these things in R, even in a simple way, can bring understanding that reading the primary literature or staring at Wikipedia cannot.

I hope this series sheds light on some of the more black-boxy aspects of statistical phylogenetics, and also helps beginning R users develop good programming habits. I invite others to contribute to the series as much as they’d like. I assume that most readers have a rudimentary understanding of R: that is, they can open their R GUI (or favorite IDE), write a script, and execute it.

In this initial post, I will first provide a very rough sketch of some of the salient features of the R language (with a small dose of personal opinion), introduce some good practices for writing in R, and then make sure readers are up to speed on writing functions, using for loops and apply-like functions, and the supremely important concept of vectorization. A basic understanding of these topics will help you navigate the code that I (and others) write and should form a solid foundation for writing your own scripts.
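To give a flavor of those topics before getting into details, here is a minimal sketch that computes the same quantity (the sum of squares of a vector) three ways: with a for loop, with an apply-like function, and with vectorized arithmetic. The function names are made up for illustration and are not part of any package or of later posts in the series.

```r
# Sum of squares of a numeric vector, written three ways.
# All function names here are hypothetical, chosen only for illustration.

# 1. An explicit for loop inside a function.
sum_squares_loop <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value^2
  }
  total
}

# 2. An apply-like function: vapply() calls the anonymous function on each
#    element and checks that every result is a single numeric value.
sum_squares_apply <- function(x) {
  sum(vapply(x, function(value) value^2, numeric(1)))
}

# 3. Vectorized: ^ operates on the whole vector at once, so no loop is written.
sum_squares_vectorized <- function(x) {
  sum(x^2)
}

x <- 1:10
sum_squares_loop(x)
sum_squares_apply(x)
sum_squares_vectorized(x)
```

All three calls return the same value for the same input. The vectorized version is the shortest, and in R it is usually the fastest as well, because the looping happens in compiled code rather than in the interpreter.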

What’s the deal with R anyway?

R is a flexible, extensible programming language with a relatively gentle learning curve. These days, it seems to be the go-to language for young biologists with little background in computer science (like me, for certain values of young) who are trying to put together their own analyses. R code can be executed line-by-line, which makes writing software much easier for people who are not used to assembling a (buggy) program from scratch.
