Category Archives: Trees R Us

CoMET is coming!!

Hey Treethinkers!
Just a quick update on some recent work (by the marvelous Mike May and sensational Sebastian Höhna) that we’re very excited about in the Moore lab.

First, we have a paper in review that describes a new method for detecting mass-extinction events from phylogenies estimated from molecular sequence data. We develop our approach in a Bayesian statistical framework, which enables us to harness prior information on the frequency and magnitude of mass-extinction events. The approach is based on an episodic stochastic-branching process model in which rates of speciation and extinction are constant between rate-shift events. We model three types of events: (1) instantaneous tree-wide shifts in speciation rate; (2) instantaneous tree-wide shifts in extinction rate; and (3) instantaneous tree-wide mass-extinction events.

Each type of event is described by a separate compound Poisson process (CPP) model, where the waiting times between events are exponentially distributed with event-specific rate parameters. The magnitude of each event is drawn from an event-specific prior distribution. Parameters of the model are then estimated using a reversible-jump Markov chain Monte Carlo (rjMCMC) algorithm. We demonstrate via simulation that this method has substantial power to detect the number of mass-extinction events and provides unbiased estimates of their timing, while exhibiting an appropriate (i.e., below 5%) false discovery rate even in the case of background diversification rate variation. Finally, we provide an empirical application of this approach to conifers, which reveals that this group has experienced two major episodes of mass extinction. This new approach, the CPP on Mass Extinction Times (CoMET) model, provides an effective tool for identifying mass-extinction events from molecular phylogenies, even when the history of those groups includes more prosaic temporal variation in diversification rate.
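
To make the CPP idea a bit more concrete, here is a minimal R sketch that simulates one realization of a single event type: event times are built up from exponential waiting times, and each event then gets a magnitude drawn from a prior. This is not the CoMET implementation. The function name `simulate_cpp`, the event rate, and the Beta(2, 18) prior on survival probability are all assumptions chosen purely for illustration.

```r
# Simulate one realization of a compound Poisson process (CPP) on (0, tree_age).
# Illustrative sketch only: the names and parameter values are assumptions,
# not the settings used in the CoMET paper or in TESS.
simulate_cpp <- function(tree_age, event_rate, draw_magnitude) {
  times <- numeric(0)
  # Waiting time to the first event is exponentially distributed.
  current_time <- rexp(1, rate = event_rate)
  while (current_time < tree_age) {
    times <- c(times, current_time)
    # Waiting time from this event to the next one.
    current_time <- current_time + rexp(1, rate = event_rate)
  }
  # Each event gets a magnitude drawn independently from its prior.
  magnitudes <- draw_magnitude(length(times))
  data.frame(time = times, magnitude = magnitudes)
}

set.seed(42)
# Hypothetical prior on the survival probability at a mass extinction:
# Beta(2, 18), which has mean 0.1.
events <- simulate_cpp(
  tree_age = 100,
  event_rate = 2 / 100,  # two events expected over the whole interval
  draw_magnitude = function(n) rbeta(n, shape1 = 2, shape2 = 18)
)
print(events)
```

Running this repeatedly (without the fixed seed) gives a feel for how the number and timing of events vary under the prior; the number of events in each realization follows a Poisson distribution with mean `event_rate * tree_age`.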

This paper is available from the bioRxiv here.

We’ve also submitted an application note for our new R package, TESS 2.0, a Bayesian software package implementing the CoMET model and many other tasty methods for inferring rates of lineage diversification. Briefly, TESS implements statistical approaches for estimating rates of lineage diversification (speciation minus extinction) from phylogenetic trees. The program provides a flexible Bayesian framework for specifying an effectively infinite array of diversification models (where diversification rates are constant, vary continuously, or change episodically through time) and implements numerical methods to estimate parameters of these models from molecular phylogenies.

We provide robust Bayesian methods for assessing the relative fit of these models of lineage diversification to a given study tree (e.g., stepping-stone simulation is used to estimate the marginal likelihoods of competing models, which can then be compared using Bayes factors). We also provide Bayesian methods for evaluating the absolute fit of these branching-process models to a given study tree (i.e., posterior-predictive simulation is used to assess the ability of a candidate model to generate the observed phylogenetic data).
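
As a small illustration of the model-comparison step, the R sketch below turns two log marginal likelihoods into a Bayes factor. The numbers are made up for the example (they are not results from the paper or output from TESS), and the model names are hypothetical labels.

```r
# Hypothetical log marginal likelihoods for two candidate models.
# These values are invented for illustration only.
log_marginal <- c(constant_rate = -352.4, episodic_rate = -348.1)

# The log Bayes factor is the difference in log marginal likelihoods.
log_bf <- log_marginal[["episodic_rate"]] - log_marginal[["constant_rate"]]

# 2 * ln(BF) is a commonly reported scale for the strength of support.
two_ln_bf <- 2 * log_bf

cat("ln BF (episodic vs. constant):", log_bf, "\n")
cat("2 ln BF:", two_ln_bf, "\n")
```

Working on the log scale avoids numerical underflow, since marginal likelihoods for real trees are typically far too small to represent directly.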

This paper is available from the bioRxiv here.

Finally, all this good stuff is implemented in the newly released TESS 2.0 R package, which (including the source code, comprehensive user manual, and example files) is available from CRAN here.

Trees R Us: Introduction

“What I cannot create, I do not understand.” Richard Feynman

This series of posts is intended to be a hands-on R-based companion to some of the other things our contributors discuss. We might delve deeper into the behavior of the gamma distribution (or any of the many probability distributions popular in phylogenetics), code up an MCMC algorithm, or work through Felsenstein’s pruning algorithm, to name a few exercises. Playing around with these things in R, even in a simple way, can bring understanding that reading the primary literature or staring at Wikipedia cannot.

I hope this series sheds light on some of the more black-boxy aspects of statistical phylogenetics, and also helps beginning R users develop good programming habits. I invite others to contribute to the series as much as they’d like. I assume that most readers have a rudimentary understanding of R; that is, they can open their R GUI (or favorite IDE), write a script, and execute it.

As an initial post, I will first provide a very rough sketch of some of the salient features of the R language (with a small dose of personal opinion), introduce some good practices for writing in R, and then make sure readers are up to speed on writing functions, using for loops and apply-like functions, and the supremely important concept of vectorization. A basic understanding of these topics will help you navigate the code that I (and others) write and should form a solid foundation for writing your own scripts.
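
As a quick taste of those three topics, here is a small sketch that computes the same thing with a for loop, an apply-like function, and a vectorized call. The function `squared_distance` and the input values are invented for this example.

```r
# A small function: squared distance of each value from a reference point.
squared_distance <- function(x, reference = 0) {
  (x - reference)^2
}

values <- c(1.5, 2.0, 3.5, 5.0)

# 1. An explicit for loop, filling a preallocated vector.
loop_result <- numeric(length(values))
for (i in seq_along(values)) {
  loop_result[i] <- squared_distance(values[i], reference = 2)
}

# 2. An apply-like function: sapply calls the function once per element.
apply_result <- sapply(values, squared_distance, reference = 2)

# 3. Vectorization: arithmetic in R already works element-wise,
#    so the function can take the whole vector at once.
vector_result <- squared_distance(values, reference = 2)

# All three approaches give the same answer.
stopifnot(isTRUE(all.equal(loop_result, apply_result)),
          isTRUE(all.equal(apply_result, vector_result)))
print(vector_result)
```

The vectorized version is the shortest and usually the fastest, which is why it tends to be the preferred style whenever the underlying operations support it.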

What’s the deal with R anyways?

R is a flexible, extensible programming language with a relatively gentle learning curve. These days, it seems to be the go-to language for young biologists with little background in computer science (like me, for certain values of young) who are trying to put together their own analyses. R code can be executed line-by-line, which makes writing software much easier for people who are not used to assembling a (buggy) program from scratch.
