Primary contacts: Brian Moore, Tracy Heath, Bob Thomson, & Rich Glor
Outline
Part I: Simulating Phylogenies Under Birth-Death Process & Estimating Parameters from these Simulations
Expectations under stochastic branching process model (SBP) models are diffuse to a degree that defies intuition, which makes it extremely difficult to detect departures from expectations. It is instructive to simulate data under these models to get a sense of the variance that occurs stochastically.
We will simulate some trees under a constant birth-death SBP model using the TreeSim R package.
Before going any further, let’s load all the packages we need today:
library(geiger) library(ape) library(TreeSim) library(laser) library(diversitree)
For the first run, we will generate 100 trees under a relatively high extinction rate with a speciation rate of 0.9 and an extinction rate of 0.5, conditioning on the time of the process, 10 units.
sim.bd.age(10, 100, 0.9, 0.5) -> simTrees
Now we’d like to determine what fraction of our trees went extinct. To do this, we’re going to again use the sapply function. The way we’re going to do this is by asking if the simulation ended up producing a tree (when the entire clade goes extinct, the output of the simulation is the number 0 rather than a tree). To ask if a simulation produced a tree, we’re going to use the is "phylo" function.
sapply(1:length(simTrees), function(i) is(simTrees[[i]], "phylo")) -> survivors
We’re now going to estimate what fraction of our trees survived to the present day
sum(survivors)/length(simTrees)
Now let’s restrict our attention to those trees that survived to the present day. To do this, we’re going to make a new list that consists only of those trees that survived:
simTrees[survivors] -> simTrees_survivors
To confirm that this worked, we can plot a sample of the surviving trees. Let’s just plot the first 10 trees.
par(mfcol=c(2,5)) sapply(1:10, function(i) plot(simTrees_survivors[[i]], show.tip.label=FALSE))
We should get a plot that looks like the figure below, along with some output on each tree returned by the sapply function. We’re not going to pay much attention to the text output right now, but we do want to take a look at the trees. To do this, you might need to go to the Window menu bar in R and select the Quartz window.
Now we want to eliminate extinct taxa from the trees we just generated. To do this, we’re going to loop through all of our trees and trim the extinct taxa using the “prune.extinct.taxa” command. We’ll save the resulting trees to a new list called “simTrees_extantOnly.” NOTE: this pruning might take a couple of minutes, so just be patient if you don’t get the output right away.
list() -> simTrees_extantOnly for(i in 1:length(simTrees_survivors)) { prune.extinct.taxa(simTrees_survivors[[i]]) -> simTrees_extantOnly[[i]] }
Let’s take a look at the trees after deleting these extinct species:
sapply(1:10, function(i) plot(simTrees_extantOnly[[i]], show.tip.label=FALSE))
To get a sense of the inherent variance of the constant-rate birth-death SBP, we can plot the species diversity of the realized trees. To do this, we’ll generate a histogram of the number of species in each of the simulated trees.
vector() -> tips for (i in 1:length(simTrees_extantOnly)) { Ntip(simTrees_extantOnly[[i]]) -> tips[[i]] } par(mfcol=c(1,1)) hist(tips)
Now we’re ready to start investigating patterns of diversification from a sample of extant trees. Let’s first take a look at some lineage-through-time (LTT)plots. To do this, we’re going to plot the first five simulated trees over the lineage through time plots generated from each of these trees.
par(mfcol=c(2,5)) #generate 2-row, 5-column plotting window for(i in 1:5) { plot(simTrees_extantOnly[[i]], show.tip.label=FALSE) ltt.plot(simTrees_extantOnly[[i]], log="y") }
Note that we’re going to use the “log=”y”” command to plot lineage accumulation on a log axis (recall that we expect lineages to accumulate exponentially)
We can actually see how deletion of extinct taxa impacts the shape of the lineage through time plot by overlaying the plots for the realized and reconstructed histories. When looking at this plot, we can see the push of the past and the pull of the present that result from high relative extinction rates. Try manipulating the speciation and extinction rates to better understand how these parameters contribute to differences in the “real” lineage accumulation curve and the “reconstructed” linage accumulation curve.
par(mfcol=c(1,1)) ltt.plot(simTrees_survivors[[1]]) time <- sort(branching.times(simTrees_extantOnly[[1]]), decreasing = TRUE) N <- 1:(length(time) + 1) lines(-c(time, 0), N)
Now let’s estimate birth-death parameters for our simulated trees. We’re only going to try to do this for trees that have at least 10 taxa. This cut-off is somewhat arbitrary, but we’d like to avoid doing something ridiculous like trying to estimate rate parameters from trees with only two or three tips. Thus, the next thing we need to do is to get the set of simulated trees that have more than 10 extant taxa. This is easily done with a few lines in R.
count <- 1 list() -> simTrees_extantOnly10plus for(i in 1:length(simTrees_extantOnly)) { if(length(simTrees_extantOnly[[i]]$tip.label)>= 10) { simTrees_extantOnly[[i]] -> simTrees_extantOnly10plus[[count]] count <- count + 1 } }
Now we’re ready to estimate diversification rate parameters from our simulated data. The point of doing this is to ask whether we can obtain the diversification parameters under the best case scenario (i.e., when we know the simple processes that were used to generate the trees we’re investigating). To do this, we’re going to use the bd function in the R package laser. This function will estimate two parameters from our trees using maximum likelihood: (1) the net diversification rate (b-d) and (2) the relative extinction rate (b/d). We’re going to use a loop to obtain parameter estimates for each of our simulated datasets.
list() -> BDresults #Generates empty list to store our results matrix(NA, length(simTrees_extantOnly10plus), 5) -> BDparameters #Generates empty matrix to store parameter values colnames(BDparameters) <- c("netRate", "relativeE", "netRate2", "relativeE2", "gamma") for(i in 1:length(simTrees_extantOnly10plus)) { branching.times(simTrees_extantOnly10plus[[i]]) -> btimes bd(btimes, ai=seq(0.05, 0.99, length.out=20)) -> BDresults[[i]] BDresults[[i]]$r1 -> BDparameters[i,1] BDresults[[i]]$a -> BDparameters[i,2] }
We can investigate the resulting parameter values by simply viewing the table we just generated, which should have the net diversification rate in column 1 and the relative extinction rate in column 2. Compare these estimates to the true values of the parameters
Parameter 1 (b-d) = 0.9-0.5 = 0.4
Parameter 2 (d/b) = 0.5/0.9 = 0.6
We can also generate a box plot from these parameter estimates to get some idea of the uncertainty expected when estimating speciation and extinction parameters from simulated datasets.
par(mfcol=c(1,1)) boxplot(BDparameters[,1], BDparameters[,2])
Let’s just do this one more time to make sure that our ML estimates didn’t settle on the wrong area of the likelihood surface.
list() -> BDresults #Generates empty list to store our results for(i in 1:length(simTrees_extantOnly10plus)) { branching.times(simTrees_extantOnly10plus[[i]]) -> btimes bd(btimes, ai=seq(0.05, 0.99, length.out=20)) -> BDresults[[i]] BDresults[[i]]$r1 -> BDparameters[i,3] BDresults[[i]]$a -> BDparameters[i,4] }
Take a look at your table (BDparameters) and create a boxplot of all estimates to see if you got similar estimates from each run of this analysis.
par(mfcol=c(1,1)) boxplot(BDparameters[,1], BDparameters[,2], BDparameters[,3], BDparameters[,4])
Having completed this simple simulation analysis, how do you feel about your ability to estimate speciation and extinction rates from phylogenetic trees?
Part II: Is Diversification Constant Over Time?
All of the simulations that we’ve considered so far have assumed that diversification rate constancy. That is, rates of both speciation and extinction have been constant over time and across lineages. These assumptions are not very realistic. Specifically, rates may vary across lineages for various reasons (different lineages may possess different traits or occur in different ecological/geographic contexts that impact probabilities of speciation and/or extinction), and rates of species accumulation my tend to decrease through time (owing to ecological or geographic factors that might limit the opportunity for speciation or increase the odds of extinction as niches are filled, etc.). Pybus and Harvey introduced the gamma statistic to test whether rates have decreased through time. We’re going to start by estimating the gamma parameter from our simulated datasets. Our expectation is that we will recover the positive gamma values typical of clades
that did not experience a temporal shift in diversification rate over time.
#Fill in the table with estimates of the gamma parameter vector()->critGam for(i in 1:length(simTrees_extantOnly10plus)) { gammaStat(simTrees_extantOnly10plus[[i]]) -> BDparameters[i,5] mccrTest(Ntip(simTrees_extantOnly10plus[[i]]), 0, 100,ObservedGamma=prunedGammaVal[[i]])$critical.value ->criticalGamma[[i]] } hist(BDparameters[,5]) summary(BDparameters[i,5] < criticalGamma)
Make a histogram or plot of your gamma statistics (BDparameters[,5]). Do any of your simulated datasets support the hypothesis of a decline in diversification rate over time?
In many real datasets, we aren’t able to sample all of the extant taxa. This can bias our inference about slowdowns in diversification rate if we fail to properly account for it in our analysis. To explore this, we can randomly prune taxa from our trees in order to simulate a 50 percent sampling fraction
list()->prunedTrees samplingFraction<-0.50 for(i in 1:length(simTrees_extantOnly10plus)) { ntips<-Ntip(simTrees_extantOnly10plus[[i]]) sample(ntips,round(ntips*samplingFraction),replace=FALSE) -> tipsToDrop drop.tip(simTrees_extantOnly10plus[[i]],tipsToDrop,trim.internal=TRUE) -> prunedTrees[[i]] }
Once we’ve done this, we can recalculate the gamma statistic for each tree as well as the critical gamma value to detect a statistically significant slowdown for the tree (incorrectly assuming complete taxon sampling).
#note that this code can take a long time to run. vector()->prunedGammaVal vector()->criticalGamma for(i in 1:length(prunedTrees)) { gammaStat(prunedTrees[[i]]) -> prunedGammaVal[[i]] mccrTest(Ntip(simTrees_extantOnly10plus[[i]]), 0, 100)$critical.value ->criticalGamma[[i]] } summary(prunedGammaVal < criticalGamma)
Do you see any evidence for statistically significant slowdowns in these trees?
Part III: Empirical Examples
Anolis Lizards
Let’s try estimating the gamma parameter on some real trees. Let’s start with
the example of Anolis lizards, a clade that we suspect has slowed over time as
island niches have been filled. We’ll begin by reading in a treefile that includes the trees from the posterior distribution of a Bayesian analysis.
read.nexus(file="anolis_GA_BEAST.nex") -> anoleTrees
Let’s just take a look at one tree and its associated LTT plot:
par(mfcol=c(2,1)) plot(anoleTrees[[1]], show.tip.label=FALSE) ltt.plot(anoleTrees[[1]], log="y")
You may notice that the tree and plot don’t exactly line up along the x-axis but we’re not going to worry about this for the time being – the important thing is to notice how the structure of branching events in the tree mirrors the accumulation of species in the LTT plot below. Now let’s estimate the gamma parameter for each of the trees in our set of trees from the posterior distribution of our BEAST analysis
matrix(NA, 1, length(anoleTrees)) -> anolisGamma for(i in 1:length(anoleTrees)) { gammaStat(anoleTrees[[i]]) -> anolisGamma[1,i] }
Because we expect incomplete sampling will simulate a slow-down, we want to use a
MCCR technique to assess whether the slow-down indicated by the gamma parameter is
significant given the amount of missing taxa in our dataset. To run this test,
we’re going to use the following lines:
Let’s first get the maximum gamma value from the trees in our posterior distribution.
max(anolisGamma[1,])
Then we can estimate the critical value for gamma given that we’ve sampled 85 of the 91 species of Greater Antillean anoles.
mccrTest(91, 6, 100, max(anolisGamma[1,]))
Valerianaceae
Now let’s run some analyses on a phylogeny for the plant group, Valerianaceae. This phylogeny includes ~2800 bp of sequence data from 4 gene regions for 103 of ~300 species and was dated using the UCL relaxed clock model implemented in BEAST.
First let’s load the tree into R and generate a LTT plot for the consensus tree.
read.tree(file="Val_traits.phy") -> ValTree par(mfcol=c(2,1)) plot(ValTree, show.tip.label=FALSE) ltt.plot(ValTree, log="y")
From the LTT plot, it appears that there was a non-zero extinction rate. Let’s calculate gamma using the following command:
gammaStat(ValTrees) -> ValGamma
Then we can estimate the critical value for gamma given that we’ve sampled 103 of the 300 species in the Valerianaceae.
mccrTest(300, 197, 100, ValGamma)