Morphological Evolution in R

Primary Contact: Samantha Price (saprice ‘at’ ucdavis ‘dot’ edu)

Data & R code

Download the data and R code for the tutorial here: Data&Rcode 

Download an annotated explanation of the Brownian Motion simulation: BMsims.R

Download R code to run a test of methods that identify rate shifts without a priori hypotheses (auteur & phytools): identifyRateshift.R

Slides

Download pdf of slides: Bodega2013

Comparing rates of phenotypic evolution in R

This tutorial will combine various aspects of the phylogenetic comparative methods covered earlier in the week to show you how to estimate and compare rates of continuous character evolution in species with different discrete states, e.g. habitats or dietary strategies. Rates of character evolution are not only interesting in their own right but are also a useful tool for other comparative analyses such as ancestral state reconstruction; in a morphological context, Brownian rates of evolution can be used as phylogenetically corrected estimates of morphological variance (see O'Meara et al. 2006 and Hutcheon & Garland 2004). This approach has been used to answer a variety of questions, such as "Are rates of floral evolution in the large parasitic Rafflesia faster than in their smaller relatives?" (Barkman et al. 2008), "Does piscivory limit morphological diversification?" (Collar et al. 2009), and "Do coral reefs promote morphological diversification in fishes?" (Price et al. 2011; Price et al. 2012).

This tutorial will show you how to run analyses in the free, open-source statistical software environment R using functions in the phytools package written by Liam Revell. The code is based on methods developed by Brian O'Meara; in previous years (prior to 2011) we used his Brownie program (see the 2010 tutorial), but Brownie can be tricky to get running on Linux or Windows, which is why we have switched to R.

See the papers cited in this tutorial for more details. All the code (with annotations) in this tutorial can be downloaded as Morph_ratein2013R.R, which will open in R.

Installation and Getting Started

You will need to install the phytools package from CRAN with all its dependencies.
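
For example:

install.packages("phytools", dependencies=TRUE) # fetch phytools and its dependencies from CRAN
library(phytools) # load the package into your session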

Change the path to the files as appropriate; I am assuming you have a folder on your Desktop called Bodega where you keep all the code for this workshop.

Dataset and Rationale

We are using a dataset of grunts, a group of primarily nocturnal marine fishes. We are testing the hypothesis that complex reef environments promote a faster rate of morphological evolution within the trophic apparatus of these fishes, so we will be comparing reef-living species (state 1) to non-reef species (state 0). The dataset contains 3 functionally important trophic characters. We are going to incorporate uncertainty in tree topology, branch lengths and the mapping of reef-living onto the phylogeny by sampling from the Bayesian posterior distribution of trees from a BEAST run and applying stochastic mapping to those trees (which you heard about earlier in the week from Rich Glor). When we then integrate the rate of evolution across each of these trees, we will obtain an estimate of the rate of character evolution that incorporates the uncertainty in all of these factors.

We have 10,000,000 trees from a Bayesian posterior distribution generated using BEAST, which we have sub-sampled to give us 9 trees (please note you would need to use many more trees for this to be a valid procedure, but due to time constraints we are going to limit ourselves). On each of these trees we have used stochastic character mapping (see Nielsen 2002, Huelsenbeck et al. 2003) implemented in SIMMAP v1.0 (see http://www.simmap.com) to create a distribution of character histories of reef (1) or non-reef (0) living, sampled in proportion to their Bayesian posterior probability. We have sampled 10 character histories for each tree (again, this is far too few for a proper analysis). The gruntsimmaptrees.nex file contains 90 trees with the reef/non-reef character mapped onto every topology. This can be read into R using read.simmap, which creates a multiPhylo object with an additional element called mapped.edge.

gruntrs<-read.simmap("Desktop/Bodega/gruntsimmaptrees.nex", format="nexus") # you can also use format="phylip", which is the default, so if you get an error check your tree format. Also note that you want the default rev.order=TRUE when reading SIMMAP output, as SIMMAP records the states and times along each branch (from root to tip) in the counterintuitive right-to-left order.

Plot the first few trees to see if they look okay

cols<-c("black", "red"); names(cols)<-c("0", "1") # If you just use the default colors it will not plot the branches associated with the 0 state, as it only recognizes state 1
par(mfrow=c(3,3))
for (i in 1:9) plotSimmap(gruntrs[[i]], cols)

Now read in the data

dat<-read.table("Desktop/Bodega/gruntsdata.txt")

and extract the first trait into its own vector

trait1<-dat[,1]
names(trait1)<-row.names(dat)

Estimating Rate of Brownian Motion Evolution

For trait 1 we are going to run two models, one that fixes the rate of BM to be the same across reef and non-reef habitats (1-rate model) and the other that allows the rate to vary across habitats (2-rate model).

We are now going to analyse trait 1 with the first tree, calculating the p-value from the likelihood ratio test comparing the 1- and 2-rate models using simulation to generate the null distribution (parametric bootstrapping). This is preferable to the chi-squared approximation when you have few taxa in one of your categories, as the approximation may be non-conservative.

trait1tree1res<-brownie.lite(gruntrs[[1]], trait1, test="simulation", nsim=1000)

We are now going to run a loop across all 90 SIMMAP trees. First create an empty results matrix with 90 rows (1 for each tree) and 10 columns taken from the output of brownie.lite (or modifications thereof), and name the columns after the elements of the brownie.lite output. We are also going to calculate the Akaike Information Criterion corrected for small sample size from the maximum likelihood estimates using the standard formula AICc = -2ln(L) + 2k + 2k(k+1)/(n-k-1), where k is the number of parameters and n is the sample size.

resultstrait1<-matrix(nrow=90, ncol=10, dimnames=list(1:90, c("BM_singlerate", "logL_singlerate", "k1", "1rate_AICc", "Nonreef_rate", "Reef_rate", "logL_multiplerate", "k2", "2rate_AICc", "pchisq")))

converge<-vector() # this is an empty vector which will eventually contain information about the convergence of each likelihood analysis

for(i in 1:90){ # This tells R to run the commands below 90 times, for us this is for every tree in gruntrs

res<-brownie.lite(gruntrs[[i]], trait1, test="chisq") # This creates a temporary object res containing the brownie.lite results for the ith tree; it gets overwritten each time, which is why we later pull out the important information and place it in the results matrix. We have chosen to estimate the p-value for the likelihood ratio test using the chi-squared approximation as it is quicker, but you may want to explore the simulation option, which compares the likelihood ratio from the data to likelihood ratios obtained by simulating under the null hypothesis.

res1aicc<-(-2*res$logL1+2*res$k1+((2*res$k1*(res$k1+1))/(90-res$k1-1))) # This creates a temporary vector res1aicc which contains the calculated AICc score for the 1 rate model for the ith tree.

res2aicc<-(-2*res$logL.multiple+2*res$k2+((2*res$k2*(res$k2+1))/(90-res$k2-1))) # This creates a temporary vector res2aicc which contains the calculated AICc score for the 2 rate model for the ith tree.

resultstrait1[i,]<- as.numeric(c(res$sig2.single, res$logL1, res$k1, res1aicc, res$sig2.multiple[sort(names(res$sig2.multiple))][[1]],res$sig2.multiple[sort(names(res$sig2.multiple))][[2]], res$logL.multiple, res$k2, res2aicc, res$P.chisq)) # This takes the important results from the temporary vector res and places them in the ith row of the results matrix we set up earlier labeling them all as numeric so we can apply mathematical functions to them later. Note that the sorting is essential, as depending on the tree, it can vary between giving the reef or non-reef rate first and we want to make sure we always have the reef rates in the reef column and the non-reef rates in their own column!

converge[i]<-res$convergence # This takes the text saying whether the ith likelihood optimization has converged and places it in the vector we set up earlier

}

Now would be a good point to make sure that R thinks all of the likelihood analyses have converged, as well as to take a look at the results and write them to a table.

converge

Most, if not all, of the analyses should have converged – if any have not, you may want to consider removing them from the next set of calculations. If none of them have converged, try to find out what prevented the likelihood from being optimized and, above all, do not trust the results.
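
A minimal sketch for dropping non-converged runs is below; the exact wording of the convergence message is an assumption here and can vary between phytools versions, so inspect converge and adapt the match:

table(converge) # tally the convergence messages
keep<-!grepl("not", converge) # assumes failed optimizations report that they may *not* have converged
resultstrait1ok<-resultstrait1[keep,] # results for the converged runs only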

resultstrait1
write.table(resultstrait1, file="Desktop/Bodega/Trait1BMrateresults.txt")

We are now going to calculate the column means and standard deviations; the standard deviations will be useful when calculating standard errors (see the sketch below).

trait1res<-rbind(colMeans(resultstrait1), apply(resultstrait1, 2, sd)) # this binds the results from the following two commands by row (rbind). The first command calculates the means of the columns for resultstrait1 and the second uses the apply function to quickly calculate the standard deviation for every column (that is what the 2 specifies) in resultstrait1.

rownames(trait1res)<-c("mean","sd")
trait1res
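
For example, one simple (and rough) standard error for each across-tree mean, treating the 90 simmap trees as independent samples:

trait1SE<-trait1res["sd",]/sqrt(90) # SE of each column mean across the 90 trees
trait1SE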

Analyzing Output: Model-selection

You now have all the information at your fingertips to examine the hypothesis. This section mainly uses model selection and multi-model inference; I suggest you refer to Burnham & Anderson 2002 if you are not familiar with this approach.

The first way of deciding whether a 1- or 2-rate model fits the data better is the likelihood ratio test: D = -2[ln(likelihood of the 1-rate model) - ln(likelihood of the 2-rate model)]. Its significance is assessed against a chi-squared distribution with 1 df. This is calculated for you by brownie.lite and is the p-value in the results.
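
As a quick sketch of the arithmetic, here is the test computed by hand from the single-tree fit we ran earlier (the result should match the test="chisq" p-value for that tree):

D<- -2*(trait1tree1res$logL1-trait1tree1res$logL.multiple) # the likelihood ratio test statistic
pchisq(D, df=1, lower.tail=FALSE) # p-value from the chi-squared distribution with 1 df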

The next way is to use model selection. First calculate the difference in AICc score (aka ΔAICc) for the two models (each model's AICc minus that of the best-fitting model), where the best-fitting model is the one with the lowest AICc score, which in our case is the 2-rate model for trait 1:

deltaAICc<-trait1res[1,][4]-trait1res[1,][9] # this takes the 4th element in the 1st row of trait1res, which is the AICc for the 1-rate model, and subtracts the 9th element in the 1st row, which is the AICc score for the best-fitting model, here the 2-rate model.

The general rule of thumb is that if the ΔAICc is >4 there is substantial support for the best-fitting model (see Burnham & Anderson 2002).

Analyzing Output: Model-averaging

We could also use model averaging to incorporate uncertainty about model choice by actually using the estimated rates from both models. For this we need to calculate the AICc weights from the relative likelihoods (relL) of the 1- and 2-rate models, where the relative likelihood of a model is exp(-0.5*ΔAICc) (the best-fitting model has ΔAICc = 0). You then sum the rates of evolution for each habitat across the models, weighted by their AICc weights, to obtain the model-averaged rate parameters.

relL1rate<-exp(-0.5*(trait1res[1,][4]-trait1res[1,][9])) # relative likelihood of the 1-rate model, calculated from its ΔAICc
relL2rate<-exp(-0.5*0) # relative likelihood of the 2-rate model, the best-fitting model, so its ΔAICc is 0

rate1aiccweight<-relL1rate/(relL1rate+relL2rate)
rate2aiccweight<-relL2rate/(relL1rate+relL2rate)

Modelaveragedreefrate=((trait1res[1,][1]* rate1aiccweight)+(trait1res[1,][6]*rate2aiccweight))

Modelaveragednonreefrate=((trait1res[1,][1]* rate1aiccweight)+(trait1res[1,][5]*rate2aiccweight))

Using R on a cluster (non-interactively)

Note that this analysis may take a while if you have 1,000 trees with lots of taxa, so you may want to consider running it on a machine other than your laptop. If you have access to multiple CPUs you could consider splitting the analysis into several portions to run on different CPUs (batch processing) and then concatenating the output. To convert this script for submission to a cluster/server you need to make sure all the important data are written to files and that you plot to a pdf or other file type (don't forget to close the device too – the default is a pdf called Rplots.pdf, which is what you will generate if you run this script as is – you will probably want to make the plotting area larger, etc.). You also need to ensure that you have the right path to your files. Finally, add quit() at the end of your script. To run R non-interactively (on a cluster) use the command R CMD BATCH input-filename (i.e. this script) output-filename (optional).
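
As a rough sketch, the extra lines for a batch version might look like this (the pdf file name is just an example):

pdf("Desktop/Bodega/gruntplots.pdf", width=10, height=10) # send plots to a file instead of the screen
par(mfrow=c(3,3))
for (i in 1:9) plotSimmap(gruntrs[[i]], cols)
dev.off() # close the device so the pdf is written
quit() # end the non-interactive session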

Try running this script, available to download at the top of the page as Morph_rateinR2012.R, non-interactively. On a Mac you should be able to open a Terminal window, navigate to the folder where you have saved the script (using cd) and then type:

R CMD BATCH Morph_rateinR2012.R

This should generate a default output file named Morph_rateinR2012.Rout, which shows exactly the commands that have been executed and can be used to figure out where in the analysis the run currently is.

Identifying rate shifts without a priori hypotheses

Over the past few years several different methods have been published that allow the identification of shifts in the rate of phenotypic evolution without a priori hypotheses (Revell et al. 2012; Eastman et al. 2011; Venditti et al. 2011; Thomas & Freckleton 2012). We are going to have a little fun comparing two of these approaches: Revell et al. 2012 and Eastman et al. 2011. This is just a quick introduction to these new methods to show you what is possible.

Both of these methods implement MCMC, so you will need to apply what you have learnt throughout the workshop to run them correctly – in particular, refer to Brian Moore's tutorial on MCMC diagnosis. If you want to use these methods in your thesis or papers, please spend *a lot* of time familiarizing yourself with your chosen method, its benefits and pitfalls, and how to implement it correctly, i.e. checking that the MCMC has converged, is mixing appropriately, is not autocorrelated, etc.

Eastman et al. auteur package
auteur implements a reversible-jump MCMC (RJMCMC) approach to find one or more shifts in the Brownian motion rate across the topology. For precise details of the RJMCMC approach, see the paper: Eastman et al. 2011 Evolution 65(12), 3578-3589. This method is ideal for data exploration, helping you identify potentially interesting patterns to pursue; it is unlikely to be sufficient on its own.

trait<-dat[,1]
names(trait)<-row.names(dat)

Now we will run 2 chains of 100,000 generations each (this is no guarantee of convergence).

rjmcmc.bm(gruntrs[[1]], trait, summary=TRUE, reml=TRUE, ngen=100000, sample.freq=500, constrainK=1, fileBase="trait1tr1_test1") # We are constraining it to evaluate models with only one additional independent rate using constrainK=1 so that we can more easily compare it to the Revell et al. approach.

rjmcmc.bm(gruntrs[[1]], trait, summary=TRUE, reml=TRUE, ngen=100000, sample.freq=500, fileBase="trait1tr1_test2")

This is where you should do your MCMC diagnostics – we are going to pretend you have done them for now and just discard the first 25% as burn-in.

pool.rjmcmcsamples(base.dirs=c("BM.trait1tr1_test1.parameters", "BM.trait1tr1_test2.parameters"), lab="trait1tr1")

shifts.plot(phy=gruntrs[[1]], base.dir="trait1tr1.combined.rjmcmc", burnin=0.25, legend=TRUE, edge.width=2)

Revell et al. 2012 evol.rate.mcmc phytools package
This method uses a Bayesian MCMC approach to estimate a single rate shift upon a tree.

gruntr<-read.tree(text=write.tree(gruntrs[[1]])) # write the first simmap tree out and read it back in, stripping the character mapping to leave a plain phylo object

gruntres<-evol.rate.mcmc(gruntr, trait, ngen=100000, control=list(sample=200))

The defaults are a symmetric exponential proposal distribution and Gaussian proposal distributions centered on 0 for the two rates and the ancestral value (at the root). Here we have set it to run for 100,000 generations, sampling every 200.

This is where you should do all your MCMC diagnosis – for now we are going to pretend you have done it! We will remove the first 25% as burn-in, which is done by only using the last 75% of the output: gruntres$mcmc[125:501, ]
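
As a minimal sketch of what that diagnosis might look like using the coda package (this assumes the two sampled rates sit in columns named "sig1" and "sig2" of gruntres$mcmc; check colnames(gruntres$mcmc) first):

library(coda) # standard package for MCMC diagnostics
chain<-mcmc(gruntres$mcmc[, c("sig1", "sig2")]) # convert the posterior sample to a coda object
plot(chain) # trace and density plots: look for stationarity and good mixing
effectiveSize(chain) # effective sample sizes after accounting for autocorrelation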

Now calculate the split with the smallest summed distances to all other splits.

gruntsminsplit<-minSplit(gruntr, gruntres$mcmc[125:501, c("node", "bp")], method="sum")

Extract the posterior sample of the average split

gruntpost<-posterior.evolrate(gruntr, ave.shift=gruntsminsplit,gruntres$mcmc[125:501,], gruntres$tips[125:501])

pies<-hist(gruntres$mcmc[125:501,][,"node"], breaks=0.5:(length(gruntr$tip.label)+gruntr$Nnode+0.5)) # one bin per node: the breaks run from 0.5 up to the largest node number plus 0.5

plot(gruntr)

We can plot the posterior probabilities on the tree using pie charts on the edges (the rate shift has to occur somewhere, so the probabilities across the tree sum to 1)

edgelabels(edge=match(which(pies$density>0.01), gruntr$edge[,2]), pie=pies$density[which(pies$density>0.01)], piecol=c("red", "black"), cex=0.5)

Caveats – phylogenetic comparative methods

Is the evolutionary model appropriate for your data?

Brownian motion: It is always good practice to test for BM-like behaviour of your traits prior to using any analysis that assumes traits evolve under BM. You heard about this from Luke Mahler and Rich Glor earlier this week, but as a quick recap: Brownian motion (BM) assumes that at any point in time a trait can increase or decrease in value, and that the direction and magnitude of that change are independent of current or past states (i.e. no trends towards increasing or decreasing character values, such as Cope's rule). It also assumes that evolution is gradual, with no punctuated evolutionary leaps and no slowing down or speeding up. One of several ways to test whether your traits fit a BM model is to use the 3 branch length transformations λ, κ and δ developed by Mark Pagel (1997, 1999).
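
As a rough sketch of one such test, you can fit BM and a lambda model with fitContinuous and compare their AICs by hand; this assumes the older geiger output format used later in this tutorial, where the log-likelihood for a single trait is stored under $Trait1$lnl:

library(geiger)
bmfit<-fitContinuous(gruntrs[[1]], trait1) # the default model is BM
lamfit<-fitContinuous(gruntrs[[1]], trait1, model="lambda")
-2*bmfit$Trait1$lnl+2*2 # AIC for BM: 2 parameters (rate and root state)
-2*lamfit$Trait1$lnl+2*3 # AIC for lambda: the BM parameters plus lambda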

Your hypothesis can also determine whether BM is appropriate: if you expect the different discrete states to select for different optimal phenotypes in the continuous characters, then you may want to consider Ornstein-Uhlenbeck (OU) models (see Hansen 1997 and Butler & King 2004). OU models add an adaptive peak (mean) and a strength of selection to the BM model, to model selection towards a particular phenotype. BM is a special case of OU: when the strength of selection towards the peak approaches zero, the model collapses to BM. These models can be implemented in the R package OUwie (Beaulieu et al. 2012), which can take into account evolution under discrete selective regimes as given by stochastic character maps, as sketched below.
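
A rough OUwie sketch, assuming you have a named vector reefstate of tip states (0/1) matching the tip labels – hypothetical here, as it is not part of this tutorial's data files:

library(OUwie)
# reefstate is a hypothetical named vector of tip states (0 = non-reef, 1 = reef)
oudat<-data.frame(species=names(trait1), regime=reefstate[names(trait1)], trait=trait1)
ou2peak<-OUwie(gruntrs[[1]], oudat, model="OUM", simmap.tree=TRUE) # two-peak OU model using the regimes painted on the simmap tree
ou2peak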

Does your tree/data have the power?

How well do your data and tree estimate the parameters, and do they have the power to distinguish between different models? These are important questions to be able to answer if you are going to run phylogenetic comparative analyses. One way you can look at this is through simulation.

If you are interested in how well you are estimating a parameter, take your maximum likelihood parameter estimate and simulate data using that value, then estimate the parameter from the simulated data to see the distribution around the simulated value. For example, you heard about Pagel's lambda parameter from Rich and Luke earlier in the week, so we are going to estimate it for trait 1 and then simulate data using that estimate. Luke also covered some of this in his lecture on Wednesday using Carl Boettiger's pmc R package.

library(geiger)

First we estimate lambda

trait1lambda<-fitContinuous(gruntrs[[1]], trait1, model="lambda")

Then we simulate data with the estimate of lambda; to do that we create a tree that has been transformed using the estimated lambda

lambdatr<-lambdaTree(gruntrs[[1]], trait1lambda$Trait1$lambda)

and we simulate using Brownian motion with the transformed lambda tree and the estimate of the Brownian motion rate from the original data

simLambdatraits<-fastBM(lambdatr, sig2=trait1lambda$Trait1$beta, nsim=100) # ideally you want to use 1,000 or more.

Now I am going to estimate the lambda parameter for the 100 simulated datasets

lambda<-vector(length=100)
for(i in 1:100){
    tmp<-fitContinuous(gruntrs[[1]], simLambdatraits[,i], model="lambda")
    lambda[i]<-tmp$Trait1$lambda
}

Now if we look at the distribution of the lambda estimates from the simulated datasets, we can see that our estimate of lambda is not at all accurate.

hist(lambda, col="grey")
abline(v=0.65, col="red", lwd=2) # the red line marks the lambda value used to simulate (0.65 here; substitute your own estimate)

Miscellaneous

a) Common errors: Sometimes the default optimization method for maximum likelihood will not work with your dataset, so you may need to change methods; you can do this by altering the optimization call in the brownie.lite code (try adding method="L-BFGS-B", lower=c(0,-Inf)).

b) Best practices: This is beyond the scope of this tutorial, but if you intend to run this kind of analysis for a publication I highly recommend looking at the other output from brownie.lite. The var.single and vcv.multiple elements provide information on the shape of the likelihood surface for the 1- and 2-rate models respectively. This should be incorporated into any estimate of the standard error of the parameter estimates, along with the error associated with different tree topologies, branch lengths and character histories, which is estimated from the standard deviation of the parameter estimates across the simmap trees.
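
As a hedged sketch, one simple way to combine these two sources of error for the single-rate estimate is to treat the variance components as additive:

withinvar<-res$var.single # curvature-based variance of the single-rate estimate from the last brownie.lite fit in the loop
acrossvar<-sd(resultstrait1[,1])^2 # variance of that estimate across the 90 simmap trees
sqrt(withinvar+acrossvar) # a rough combined standard error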

References

Barkman, Bendiksby, Lim, Salleh, Nais, Madulid & Schumacher (2008) Accelerated rates of floral evolution at the upper size limit for flowers. Current Biology 18(19), 1508-1513.

Beaulieu, J.M., Jhwueng, D-C, Boettiger, C. & O’Meara, B.C. (2012) Modelling stabilizing selection: expanding the Ornstein-Uhlenbeck model of adaptive evolution. Evolution 66(8), 2369-83.

Burnham, K.P. & Anderson, D.R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag, New York.

Butler, M.A. & King, A.A. (2004) Phylogenetic comparative analysis: a modeling approach for adaptive evolution. American Naturalist 164, 683-695.

Collar D. et al. 2009. Piscivory limits diversification of feeding morphology in centrarchid fishes. Evolution 63(3), 1557-1573.

Eastman, J. M., Alfaro, M. E., Joyce, P., Hipp, A. L. & Harmon, L. J. (2011) A novel comparative method for identifying shifts in the rate of character evolution on trees. Evolution, 65(12), 3578-3589.

Hansen, T.F. (1997) Stabilizing selection and the comparative analysis of adaptation. Evolution 51, 1341-1351.

Huelsenbeck, J. P., Nielsen, R., Bollback, J. P. 2003. Stochastic mapping of morphological characters. Systematic Biology 52(3), 131-158.

Hutcheon & Garland (2004) Are megabats big? Journal of Mammalian Evolution 11(3-4), 257-277.

Nielsen, R. 2002. Mapping mutations on phylogenies. Systematic Biology 51(5), 729-739.

O'Meara, B.C., Ané, C., Sanderson, M.J. & Wainwright, P.C. (2006) Testing for different rates of continuous trait evolution using likelihood. Evolution 60(5), 922-933.

Pagel, M. (1997) Inferring evolutionary processes from phylogenies. Zoologica Scripta 26(4), 331-348.

Pagel, M (1999) Inferring the historical patterns of biological evolution. Nature 401, 877-884.

Price, S.A., Holzman, R., Near, T.J. & Wainwright, P.C. (2011) Coral reefs promote the evolution of morphological diversity and ecological novelty in labrid fishes. Ecology Letters 14(5), 462-469.

Revell, L. J., Mahler, D. L., Peres-Neto,P. R. & Redelings, B. D. (2012) A new phylogenetic method for identifying exceptional phenotypic diversification. Evolution 66 (1), 135-146.

Thomas, G. & Freckleton, R. P. (2012) MOTMOT: models of trait macroevolution on trees. Methods in Ecology and Evolution 3(1), 145-151 (online early 2011)

Venditti, C., Meade, A. & Pagel, M. (2011) Multiple routes to mammalian diversity. Nature 479, 393-396.
