Trees R Us: Introduction

“What I cannot create, I do not understand.” Richard Feynman

This series of posts is intended to be a hands-on R-based companion to some of the other things our contributors discuss. We might delve deeper into the behavior of the gamma distribution (or any of the many probability distributions popular in phylogenetics), code up an MCMC algorithm, or work through Felsenstein’s pruning algorithm, to name a few exercises. Playing around with these things in R, even in a simple way, can bring understanding that reading the primary literature or staring at Wikipedia cannot.

I hope this series sheds light on some of the more black-boxy aspects of statistical phylogenetics, and also helps beginning R users develop good programming habits. I invite others to contribute to the series as much as they’d like. I assume that most readers have a rudimentary understanding of R, as in have the ability to open their R GUI (or favorite IDE), write a script, and execute it.

As an initial post, I will first provide a very rough sketch of some of the salient features of the R language (with a small dose of personal opinion), introduce some good practices for writing in R, and then make sure readers are up to speed on writing functions, using for loops and apply-like functions, and the supremely important concept of vectorization. A basic understanding of these topics will help you navigate the code that I (and others) write and should form a solid foundation for writing your own scripts.

What’s the deal with R anyways?

R is a flexible, extensible programming language with a relatively gentle learning curve. These days, it seems to be the go-to language for young biologists with little background in computer science (like me, for certain values of young) who are trying to put together their own analyses. R code can be executed line-by-line, which makes writing software much easier for people who are not used to assembling a (buggy) program from scratch.

However, the relative easiness of R comes at a cost: programs written in R are generally much slower than those written in languages like C, for example. This is because R is an interpreted computer language — your code is interpreted into machine language (binary) and executed each time you run your code¹ — whereas C is a compiled computer language — your code is compiled into machine language just once, and then reused repeatedly. That may seem like a trivial difference, but when you execute a program thousands or millions of times, the time taken to translate the code into binary can become substantial. Becoming a skilled R user means learning how to minimize, and sometimes even completely alleviate, the impact of the interpreted language, and requires you to write clean, efficient code.

Phylogeneticists face a particular problem when using R because the way we calculate a phylogenetic likelihood — using Felsenstein’s pruning algorithm — is not amenable to many of the tricks sophisticated R users have at their disposal to mitigate the cost of interpretation. Have you noticed that none of the major tree inference programs (MrBayes, BEAST, RAxML, GARLI, etc.) are implemented in R? There is a reason for that, and its name is for loop.

Writing in Style

Believe it or not, style is very important when it comes to writing code in R, or any programming language for that matter. Making sure that other programmers can understand the code you write is very important for transparency in the modern world of science, but it also makes it easier for you to keep track of and work with code you have written in the past. Opening up one of your year-old scripts and having no idea what younger-you was doing can be a real nightmare.

Google’s R Style Guide is a commonly used style reference, and adhering to it ensures that your code can be easily read by other R users. The main points are:

Comment frequently. Anything on a line that follows # will be ignored by R, so you can leave messages for humans when they look at your code
Be explicit when naming functions and variables — avoid using names like x, res, z, xx, or anything else that does not clearly indicate the identity of your function or variable
When naming a function, never use periods to separate words and always capitalize words, LikeThis
When naming variables, separate words by periods and never use capital letters, like.this

Enough abstraction, let’s do some R.

Being functional

A function is simply a bit of code that takes one or more arguments (the things in parentheses), performs a task depending on those arguments, and returns the result. For example, the basic R function rnorm takes a few arguments (specifically, the number of random normal numbers to generate, and the mean and standard deviation of the normal distribution from which to draw the variables, plus some other arguments that can be safely ignored for now) and returns the desired random numbers. R is an extensible program language that encourages its users to write their own functions, but a lot of beginners are intimidated by this prospect and end up writing huge chunks of unreadable code instead. However, functions are cleaner, faster and more flexible than those walls of code, so it is imperative that we become comfortable with them.

Previously, Rich discussed the gamma distribution and its use in phylogenetics. He makes frequent calls to the dgamma function, which takes a number (x), a shape parameter ( $\alpha$ ), and a rate parameter ( $\beta$ ), and returns the probability of drawing that number (or drawing a number near that number — don’t get all mathosophical on me). As an exercise, let’s write our own version of dgamma.

Let’s look up the probability density of the gamma distribution on Wikipedia. For the preferred parameterization in phylogenetics, the density function (which calculates the probability of a number being drawn from that distribution) looks like this (substituting $P$ -notation for $f$ -notation):

$P(x|\alpha,\beta) = \beta^\alpha \frac{1}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}.$

That’s a nasty thing to do by hand, but fortunately we live in the future and can have our machines do it for us while we drink our caffeinated beverage of choice. For values $x$ , $\alpha$ , and $\beta$ , we can compute the above equation with this line of code:

(beta^alpha)*(1/gamma(alpha))*x^(alpha-1)*exp(-beta*x)

This line of code works, but we may want to use it a lot and not execute the line manually over and over again. To do that, we write a function. Functions look like this (replace MyFunction with your function name, arguments with your arguments, stuff with your operations, and output with your return value):

MyFunction = function(arguments){

  # Write a comment to tell people what your function's arguments are,
  # what the function does, what the return value is, and anything else
  # worth mentioning.

  stuff
  return(output)

}

We write our gamma density function like this:

GammaDensity = function(x,alpha,beta){

  # GammaDensity takes a number x and computes the probability of drawing
  # that numbers from a gamma distribution with shape parameter alpha
  # and rate parameter beta, assigning it to answer. It then returns
  # the numeric answer.

  answer = (beta^alpha)*(1/gamma(alpha))*x^(alpha-1)*exp(-beta*x)
  return(answer)

}

Our function computes the probability of x given our gamma distribution and assigns it to answer, then returns answer. Now that we’ve written a function, let’s take it for a spin (and compare it to R’s own dgamma).

GammaDensity(x=5,alpha=1,beta=3)
[1] 9.17707e-07

dgamma(5,shape=1,rate=3)
[1] 9.17707e-07

dgamma(5,shape=1,rate=3) == GammaDensity(x=5,alpha=1,beta=3)
[1] TRUE

Excellent! Our function and R’s function are in agreement. Had they disagreed, we would be concerned and start looking for bugs.

Getting things done: `for` loop, `apply`, and vectorization

In programming, as in math, there are many different ways to get the answer you want, but not all ways are equally efficient. for loops, apply-like functions, and vectorization are all ways we can use a function repeatedly, but they are not entirely interchangeable. I will present the strengths and weaknesses of each.

So we’ve written a function. We now want to use that function on a bunch of numbers. First, we have to have those numbers. Why not just make them up, since this is an exercise anyways? Let’s make a million uniform random numbers on the interval $[0,1]$ .

our.data = runif(n=1e6)

`for` loop

Now we want to calculate the probability of each of those numbers according to a particular gamma distribution (say $\Gamma(\alpha=5,\beta=3)$ ). Our first instinct is to use a for loop. We would do that by pre-allocating a vector to hold the probabilities (one per random number), and then running a for loop over the numbers, storing each output in the appropriate spot in our vector:

output = numeric(length(our.data))

for(i in 1:length(our.data)){
  output[i] = GammaDensity(our.data[i],alpha=3,beta=5)
}

After a bit of computation, output[i] now holds the probability of our.data[i].

The `apply` family

An alternative to the for loop is the apply family of functions. There are many different apply-like functions, each suitable for a different purpose, but fundamentally they are just slightly-cleaner for loops. Some are more convenient for working with matrices or outputting lists, but in essence they are visiting each element of an object in succession, performing a function, and returning the answer. It is a common misconception that these functions are vectorized (see below), but in fact they are not. Don’t be fooled.

The function sapply is the member of this family that works on vectors, like our very own our.data. Re-writing the for loop above as an sapply would look like this:

output = sapply(our.data,GammaDensity,alpha=3,beta=5)

sapply takes each element of our.data in order and performs GammaDensity on it, using alpha=3 and beta=5. We’ve cleaned up our code a bit (not having to worry about pre-allocation is convenient), but it’s still a little slow. Also, we’ve lost the ability to carry previous calculations forward — apply is only appropriate when you have a function you want to perform independently on every element of an object. For example, if the value of element i depends on element i-1, we cannot use any sort of apply function (without dealing with function scope, which may be discussed in a later installment) because we must have the value i-1 stored somewhere before we can calculate i.

Vectorization

Doing things in sequence is inefficient for an interpreted language. R programmers use vectorized functions to get around the inefficiency of working with long vectors. A vectorized function is one that takes one or more vectors as arguments, performs its duties on all of the elements of those vectors more-or-less concurrently (i.e. interpreting the function once), and returns the answer for each element of the vector. This can be substantially faster than using a for loop, and you should strive to vectorize any function you write.

The R developers have done an excellent job of ensuring that the base functions in R are as vectorized as possible. Many base functions are vectorized by default, and mathematical operators (which are just disguised functions anyways) are automatically vectorized. Since our GammaDensity function contains only vectorized functions, the function itself is already vectorized! Surprise! Try it out:

output = GammaDensity(x=our.data,alpha=3,beta=5)

Lightning fast! This vectorized function is orders of magnitude faster than the alternatives, for loop and sapply. You should always always always vectorize when possible. It may take some time to figure out whether your function can be vectorized, but the extra effort can pay off in the long run, especially if you intend to distribute your code.

Now we might understand why doing phylogeny inference in R is impractical — Felsenstein’s pruning algorithm depends on previously calculated conditional likelihoods, which means it must be done in sequence and therefore is very reliant on for loops. Bayesian inference of phylogeny is especially difficult in R because approximating the posterior density with MCMC is a fundamentally sequential procedure. However, R has a number of strengths that should not be overlooked. It is easy to learn and adapt to your own purposes, it has a great deal of software (packages) already developed that may suit your needs, and it has excellent resources for data visualization and figure preparation. It is also probably popular among visitors to this website, making it a lingua franca and useful pedagogical tool for programming in general.

Next time, we will use our function-writing skills to code up a simple MCMC algorithm to approximate an arbitrary posterior density. Make sure you’ve studied Brian’s lecture!

¹: without packages compiler and jit.

5 thoughts on “Trees R Us: Introduction”

Rich Glor 2 April 2013 at 8:04 pm

Great stuff Mike! Thanks.
Brian Moore 3 April 2013 at 11:58 am

Nice post Mike! I think this will make a great column!—Brian
Yaniv Brandvain 3 April 2013 at 4:26 pm

you are making my dreams come true. keep it up, I’m excited to learn more!!!!
1. Rich Glor 10 April 2013 at 11:56 am
  
  Perhaps we can have a series of posts where Yaniv tells us his dreams and Mike responds with posts that make them come true?
  1. Bob Thomson 11 April 2013 at 11:39 am
    
    I would happily refocus the entire blog toward this endeavor.