Summary
Primary Contact: Jeremy Brown
Garli is a program for maximum-likelihood estimation of phylogenetic trees. This tutorial focuses on nucleotide data, although Garli supports other data types as well. Here is a description from the Garli homepage:
GARLI, Genetic Algorithm for Rapid Likelihood Inference is a program for inferring phylogenetic trees. Using an approach similar to a classical genetic algorithm, it rapidly searches the space of evolutionary trees and model parameters to find the solution maximizing the likelihood score. It implements nucleotide, amino acid and codon-based models of sequence evolution, and runs on all platforms. The latest version adds support for partitioned models and morphology-like datatypes. It is written and maintained by Derrick Zwickl.
Outline
Introduction
This tutorial uses Garli v2.0 and one example data file: primates.nex.
Tutorial
Installation
Garli can be downloaded from the Garli website. Pre-compiled, executable versions of Garli for PCs or Macs are available for direct download. Both serial and parallel versions are available. Parallel versions allow a single analysis to take advantage of multiple processors, if they are available. The source code can also be downloaded and compiled locally for use on Unix systems or for command-line use on other systems. Instructions for compilation are given on the website.
Garli Usage Statement
Garli can be run by opening up a Terminal window and changing directories to the folder where the Garli executable is stored. If you don’t remember how to change directories in a Unix environment, see this tutorial. To run Garli without starting an analysis and see a statement about proper Garli usage, type this command:
> ./Garli-2.0 -h
If this ran properly, you should see output from Garli that looks like this:
Usage: ./Garli-2.0 [OPTION] [config filename]
Options:
  -i, --interactive  interactive mode (allow and/or expect user feedback)
  -b, --batch        batch mode (do not expect user input)
                     (batch is the default for the version you are running)
  -v, --version      print version information and exit
  -h, --help         print this help and exit
  -t                 run internal tests (requires dataset and config file)
  -V                 validate: load config file and data, validate config file, data,
                     starting trees and constraint files, print required memory and
                     selected model, then exit
NOTE: If no config filename is passed on the command line the program will look
in the current directory for a file named "garli.conf"
Notice that you can pass several options to Garli (such as the -h help option that we already used). Most of these will not be used when running a standard maximum likelihood (ML) phylogenetic search. After the options, Garli expects to find the name of a config (configuration) file, if the name of that file is not garli.conf. If it is garli.conf, Garli will automatically find it and use it to run your analysis.
All settings necessary to run Garli are specified in the configuration file. There are three categories of options that can be set: [general], [model], and [master]. In the first section below, we will briefly cover some of the important options in the [general] settings. In the next section, we will cover the options for specifying a model of sequence evolution in the [model] settings. Options in the [master] category control the genetic algorithm that Garli uses to search among trees. The default settings in this category work well for most datasets, so we will not discuss changes to these options in this tutorial. A detailed explanation of Garli options can be found in the Garli manual (distributed with the program) and on the Garli support wiki.
Configuring [general] settings
There are essentially only two general Garli settings that have to be changed to run a maximum-likelihood search. First, you must tell Garli in which file your data are stored. This is done with the datafname setting. Simply add the name of your NEXUS file on this line (e.g., datafname = primates.nex). The other required option is ofprefix, which specifies the root filename for the output files that Garli will create when performing an analysis (e.g., ofprefix = primates_ML). It is highly recommended that you name your .conf file (e.g., primates.conf), data file, and output files after the analysis you are performing, so that you do not end up with many files on your computer that share the same generic names.
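The two edits described above can also be scripted, which is convenient when preparing many analyses. The following is a minimal Python sketch, not part of Garli itself. It assumes the config file uses plain "key = value" lines. The helper name set_option and the placeholder values in the toy fragment are hypothetical.

```python
import re


def set_option(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with the line 'key = ...' rewritten to 'key = value'.

    Assumes each option sits on its own 'key = value' line.
    """
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*=).*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f"{m.group(1)} {value}", conf_text)
    if count == 0:
        raise KeyError(f"option {key!r} not found in config text")
    return new_text


# Toy fragment standing in for a real config file (placeholder values).
template = (
    "[general]\n"
    "datafname = placeholder.nex\n"
    "ofprefix = placeholder\n"
    "searchreps = 2\n"
)

conf = set_option(template, "datafname", "primates.nex")
conf = set_option(conf, "ofprefix", "primates_ML")
print(conf)
```

In practice you would read the text from a copy of the default configuration file and write the result to a new file such as primates.conf.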
Here is a brief discussion of a few other general settings that may be particularly useful. A full description of all options is available in the Garli manual or on the support wiki.
- constraintfile – If you wish to search only among trees that contain some particular branch or set of branches, you can specify a constraint. This option takes the name of a file containing the constraints that you wish to employ. Constraints are specified in a parenthetical format with a + or - at the beginning of the string to indicate whether you want to constrain the presence (+) or absence (-) of that branch. Here is an example of a positive constraint:
+((1,3,5),2,4,6,7,8);
This constraint requires the presence of a branch that has taxa 1, 3, and 5 on one side of the branch and taxa 2, 4, 6, 7, and 8 on the other. Multiple positive constraints can be specified simultaneously in one file, but only one negative constraint can be specified at a time. Positive and negative constraints cannot be mixed. More information about specifying constraints is available here.
- logevery – The frequency at which the best score is written to a log file.
- saveevery – If writecheckpoints or outputcurrentbesttopology are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file.
- genthreshfortopoterm – The first of two termination conditions that Garli uses to decide when it has performed a sufficiently thorough search. If no significantly better tree is found in this number of generations, this condition is satisfied. By default, this value is 20,000.
- scorethreshforterm – If the first termination condition is satisfied, Garli checks this second termination condition. If the total improvement in likelihood score over a certain number of generations (500, by default) is less than this value (0.05, by default), Garli terminates a run.
- searchreps – This value determines the number of independent maximum likelihood searches that Garli performs. The default is 2.
- bootstrapreps – This parameter can be set to a number greater than 0 if one wishes to perform a bootstrapping analysis, rather than an ML search. If it is greater than 0, Garli will create this number of pseudoreplicate datasets of the same size as the empirical dataset by sampling columns with replacement from the empirical dataset. It will then perform ML searches using all the other general settings in the config file for each pseudoreplicate dataset.
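To make the bootstrap resampling step concrete, here is a minimal Python sketch of how one pseudoreplicate dataset can be built by sampling alignment columns with replacement. It illustrates the general technique only and is not Garli's own code. The function name and the toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Returns a new dict with the same taxa and the same number of columns.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    # Draw n_sites column indices, each uniformly at random, with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        taxon: "".join(seq[i] for i in columns)
        for taxon, seq in alignment.items()
    }


# Hypothetical toy alignment, for illustration only.
toy = {"Homo": "ACGTAC", "Pan": "ACGTAA", "Gorilla": "ACGAAC"}
replicate = bootstrap_alignment(toy, random.Random(1))
for taxon, seq in replicate.items():
    print(taxon, seq)
```

Every column of the pseudoreplicate is an intact column of the original alignment, so the relationship among taxa at each site is preserved while the mix of sites varies.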
Configuring [model] settings
For this tutorial, we will assume that you are analyzing nucleotide data. However, it is also possible to analyze such data as codons, or to use protein sequences or morphological data.
- datatype – Use this option to specify what kind of data you are analyzing. For the purposes of this tutorial, we will stick with nucleotide data (datatype = nucleotides).
- ratematrix – This option tells Garli how many different relative rates of substitution (aka exchangeabilities) to include in your model and how they should be linked/unlinked. The three built-in options are 1rate (all rates equal), 2rate (different rates of transitions and transversions), and 6rate (all relative rates are different). Custom patterns of rate differences can also be specified by putting six letters inside parentheses. These letters correspond to relative rates in this order: A-C, A-G, A-T, C-G, C-T, and G-T. If any two rates share a letter these relative rates will be set equal to one another. For instance, if we wished to specify the 2rate model as a custom rate string, we would write:
ratematrix = (a b a a b a)
- statefrequencies – This option tells Garli whether or not to estimate the equilibrium frequencies of the model states (e.g., nucleotides). The most common settings that will be used are equal or estimate, depending on whether the model you’re using assumes equal frequencies of states or allows them to be free parameters of the model. This option can also be specified to use the frequencies of the states in your dataset as the parameter values of your model (empirical). State frequencies can also be fixed at arbitrary user-specified values.
- invariantsites – This option allows one to specify models that include a class of sites that are unable to change (perhaps better termed invariable sites). This allows a binary division of sites into two rate categories (variable and invariable). The possible settings for this option are none, estimate, and fixed.
- ratehetmodel – This option can be used to include gamma-distributed heterogeneity in rates of evolution across sites. To estimate the shape parameter of this gamma distribution, use gamma. To include gamma-distributed rate variation but specify your own shape parameter, use gammafixed. If you don’t want to include gamma-distributed rate variation across sites, use none.
- numratecats – If a model with gamma-distributed rates across sites is used, the gamma distribution is actually discretized for computational efficiency. This option allows the user to specify how many categories are used to approximate the full, continuous gamma distribution. As the number of categories is increased, the approximation becomes better, but the computation time increases linearly with the number of categories.
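The custom ratematrix strings described above can be checked before starting a run. The following Python sketch groups the six substitution types by the letter assigned to them, using the order given in the ratematrix entry (A-C, A-G, A-T, C-G, C-T, G-T). The function name is hypothetical, and the parsing is an assumption about the string layout, not Garli's own parser.

```python
# Order of the six relative rates, as listed in the ratematrix description.
PAIRS = ("A-C", "A-G", "A-T", "C-G", "C-T", "G-T")


def rate_classes(spec):
    """Group substitution types that share a letter in a custom rate string.

    spec: a string such as "(a b a a b a)".
    Returns a dict mapping each letter to the list of pairs that share it.
    """
    letters = spec.strip().strip("()").split()
    if len(letters) != len(PAIRS):
        raise ValueError("expected exactly six letters in the rate string")
    classes = {}
    for pair, letter in zip(PAIRS, letters):
        classes.setdefault(letter, []).append(pair)
    return classes


classes = rate_classes("(a b a a b a)")
for letter, pairs in classes.items():
    print(letter, pairs)
```

For the string (a b a a b a), the two transitions (A-G and C-T) fall into one class and the four transversions into the other, which matches the 2rate setting. Because the rates are relative, the number of free rate parameters is usually counted as one less than the number of distinct letters.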
Running Garli
Once a configuration file has been created and customized for your analysis, all you need to do to run the analysis is start Garli on the command line and point it to your .conf file. If you put Garli, your .conf file (e.g., primates.conf), and your data file in the same folder, you can run your analysis simply by typing:
./Garli-2.0 primates.conf
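If you want to launch runs from a script instead of typing the command by hand, the same call can be wrapped in Python. This is a sketch under a few assumptions: the executable is named ./Garli-2.0 as in this tutorial, it sits in the same folder as the config and data files, and the -b (batch) flag from the usage statement is accepted by your version. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def garli_command(conf=None, executable="./Garli-2.0", batch=True):
    """Build the argument list: executable, options, then the config filename."""
    cmd = [executable]
    if batch:
        cmd.append("-b")  # batch mode: do not expect user input
    if conf is not None:
        cmd.append(conf)  # if omitted, Garli looks for garli.conf
    return cmd


def run_garli(workdir, conf=None, executable="./Garli-2.0"):
    """Run Garli inside workdir; raises CalledProcessError on a non-zero exit."""
    workdir = Path(workdir)
    if conf is not None and not (workdir / conf).is_file():
        raise FileNotFoundError(f"config file not found: {workdir / conf}")
    return subprocess.run(garli_command(conf, executable), cwd=workdir, check=True)


# Show the command without executing anything.
print(" ".join(garli_command("primates.conf")))
```

Calling run_garli(".", "primates.conf") would then start the analysis in the current folder.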
Garli Output
Garli will create two types of log files that end with the .log and .screen.log suffixes. These files provide information about the changes in likelihood scores as the genetic algorithm progresses. You do not need to directly examine these files to see the output of your Garli runs, but they can provide very useful information about the efficiency and progress of your runs. Garli will also output two types of tree files when performing a standard ML search. One file (ending with .best.all.tre) includes the best tree found in each independent replicate search. On each line that contains a tree, these files will also display the log-likelihood score and estimated model parameter values. This file will also flag the best overall tree and identify any other replicate searches that found the same tree topology as the best search. The other tree file (ending with .best.tre) contains the single best tree found across all search replicates.
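When several analyses share a folder, it can help to sort the output files by type. The Python sketch below groups files by the four suffixes mentioned above. The exact file names that Garli writes may differ between versions, so the names created in the demonstration are hypothetical and only the suffixes come from this tutorial.

```python
import tempfile
from pathlib import Path

# Check ".screen.log" before ".log", since the former also ends with ".log".
SUFFIXES = (".best.all.tre", ".best.tre", ".screen.log", ".log")


def classify_outputs(directory, ofprefix):
    """Group files in directory that start with ofprefix by output suffix."""
    found = {suffix: [] for suffix in SUFFIXES}
    for path in sorted(Path(directory).iterdir()):
        if not path.name.startswith(ofprefix):
            continue
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                found[suffix].append(path.name)
                break
    return found


# Demonstration with empty stand-in files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "primates_ML.best.tre",
        "primates_ML.best.all.tre",
        "primates_ML.screen.log",
        "primates_ML.log00.log",
        "primates.nex",
    ):
        (Path(tmp) / name).touch()
    found = classify_outputs(tmp, "primates_ML")

for suffix, names in found.items():
    print(suffix, names)
```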
When have I done enough searches?
Because all searches in Garli are stochastic, there is no guarantee that any search finds the true maximum-likelihood topology and parameter values. Searches may become stuck on local optima, and this seems to be a common occurrence with large datasets. One rule of thumb for deciding how many independent replicate searches to conduct is to continue searching until at least one additional search recovers the same topology as the best overall result.
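This rule of thumb can be written as a simple check. The Python sketch below assumes you have already summarized each replicate search as a log-likelihood score and a label identifying its topology. Comparing tree topologies is outside the scope of the sketch. The function name, labels, and scores are hypothetical.

```python
def best_topology_replicated(results, min_hits=2):
    """Return True if the best-scoring topology was found at least min_hits times.

    results: list of (log_likelihood, topology_label) pairs, one per search
    replicate. Replicates with the same label are assumed to share a topology.
    """
    if not results:
        return False
    # Higher (less negative) log-likelihood is better.
    _, best_topology = max(results, key=lambda r: r[0])
    hits = sum(1 for _, topology in results if topology == best_topology)
    return hits >= min_hits


# Hypothetical scores and labels, for illustration only.
two_reps = [(-5432.1, "T1"), (-5440.7, "T2")]
three_reps = two_reps + [(-5432.3, "T1")]

print(best_topology_replicated(two_reps))
print(best_topology_replicated(three_reps))
```

If the check fails, increase searchreps (or run additional analyses) and test again with the combined results.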