tdg12 - 源码 - 源码 - 免费下载

开通博客赚积分

发布资源赚积分

tdg12

文件大小： unknow

源码售价： 5 个金币积分规则积分充值

充值1元得10金币

资源说明：This program is no longer being updated. Please use the implementation at https://github.com/tamuri/swmutsel

This program is no longer being updated. Please use the implementation at https://github.com/tamuri/swmutsel

TdG12: Estimating distribution of selection coefficients
--------------------------------------------------------

Program for estimating the distribution of selection coefficients (fitness effects) from phylogenetic data using a site-wise mutation-selection model.

This is an implementation of the model described in: Tamuri AU, dos Reis M, Goldstein RA. Estimating the Distribution of Selection Coefficients from Phylogenetic Data Using Sitewise Mutation-Selection Models. [Genetics vol. 190 no. 3 1101-1115](http://www.genetics.org/content/190/3/1101.full).

The data used in the paper is available in [etc/](https://github.com/tamuri/tdg12/tree/master/etc).

[Binary download](https://github.com/downloads/tamuri/tdg12/tdg12_mathbio.zip) of v1.0. Includes tdg12.jar (with dependencies) and the tutorial.

Tutorial
--------

### Introduction

This document describes how to use the TdG12 site-wise
mutation-selection model ('swMutSel0') to estimate the distribution of
selection coefficients (or 'fitness effects') from an alignment of
protein coding sequences.

### Requirements

The steps in this tutorial have been tested on Linux and Mac OS X 10.6+.
Our program does not estimate tree topology or perform branch length
optimisation, therefore we recommend the use of two popular tools for
this purpose: RAxML and PAML[^1]. Both have Windows versions available
for download, but they have not been tested by us. We assume that you
already have them installed and are able to run them on your computer.
Our software is written in Java and should run on all platforms that
have the Java Runtime Environment (JRE) (6 or later) installed. The JRE
is available for Windows, Linux and Mac OS X 10.6+.

### Installation

You can download the executable files for the program from
http://mathbio.nimr.mrc.ac.uk/. Download the tdg12.zip file and extract
the contents. The download contains the files required for this tutorial
and `tdg12.jar`, which is the Java binary file for the program.

### Citation

Tamuri AU, dos Reis M, Goldstein RA. Estimating the Distribution of Selection Coefficients from Phylogenetic Data Using Sitewise Mutation-Selection Models. Genetics vol. 190 no. 3 1101-1115.

An example analysis
-------------------

### Preparing the alignment and tree

For the purposes of this tutorial, we will be estimating the
distribution of selection coefficients of a set of mammalian
mitochondrial ATP8 protein coding genes. The download provides an
alignment, atp8.phyl (which was built using PRANK) and a tree, atp8.tree
(estimated by RAxML with branch lengths optimised using PAML's codeml).
Full details are available from the program's websites. The options used
for running RAxML were:

> `raxmlHPC -f a -x 12345 -p 12345 -N 100 -m GTRGAMMA -s atp8.phy -n atp8`

and the PAML codeml control file (codeml.ctl) is available in the
download. We optimised the branch lengths using the FMutSel0 model in
codeml.

### Getting estimates of global parameters from codeml

There are a number of global, site-invariant, parameters that are
required for the TdG12 model. They are:

1. τ (tau) - rate of multiple substitutions.

2. κ (kappa) - transition/transversion bias.

3. π (pi) - base nucleotide composition.

4. μ (mu) - branch scaling.

Each of these parameters can be estimated under the TdG12 model but this
requires significant computational resources and the use of a
distributed version of the software (not covered here). However, you can
get obtain good estimates for branch lengths, κ, π and μ
from the faster PAML codeml analysis. The results file (named 'mlc')
contains the new tree (with re-estimated branch lengths) and estimates
of κ and π. An estimate of μ can be calculated using
3 \* T\_dS / T, where T\_dS is the tree length in dS
(synonymous changes) units and T is the total tree length, also found
in the codeml 'mlc' results file. *Remember to use the tree from the
'mlc' file for the TdG12 program.*

Therefore, the estimates for the global parameters for the ATP8 gene
alignment are:

κ = 4.93022
π = {0.25299, 0.22268, 0.43860, 0.08572}
μ = 3 * T_dS / T = 3 * 70.7207 / 113.24814 = 1.8734

We choose a small number for τ, e.g. 1.0 \* 10^-2.

### Analysing the data using the site-wise mutation-selection model

You can now run the TdG12 program to estimate the site-wise fitness of
each amino acid in our alignment. The available options for running the
program are:

-t
: The tree file in NEWICK format (required). Remember to use the tree
supplied by PAML's codeml.

-s
: The protein coding alignment file in PHYLIP sequential format
(required).

-gc
: The genetic code, 'standard' or 'vertebrate\_mit' (required).

-tau
: Rate of multiple substitutions (required).

-kappa
: Transition/transversion bias (required).

-pi
: Comma-separated (with no spaces) base nucleotide frequencies
(T,C,A,G) (required).

-mu
: Branch scaling factor (required).

-useapprox
: Use the approximation for unobserved residues which allows for
faster computation (optional).

-site
: Location of the site to analyse (optional, default = all sites in
the alignment).

-optimruns
: The number of restarts for the optimisation routine (optional,
default = 1).

-threads
: The number of threads to use for processing in a multicore
environment (optional, default = 1).

You start the analysis of the set of ATP8 genes by running:

> `java -cp tdg12.jar tdg.Analyse -s atp8.phy -t atp8.tree -gc vertebrate_mit -tau 1e-2 -kappa 4.93 -pi 0.2530,0.2227,0.4386,0.0857 -mu 1.8734 > tdg.out`

You must add '`> tdg.out`' to redirect the analysis output to a file
named tdg.out.

#### Using multicore/multiple CPUs

If you are running the program on a computer with multicore or multiple
CPUs, you can specify the **-threads** option. Usually, this would be
the number of available cores less 1. For example, to utilise 3 cores you
would run:

> `java -cp tdg12.jar tdg.Analyse -s atp8.phy -t atp8.tree -gc vertebrate_mit -tau 1e-2 -kappa 7.8 -pi 0.25,0.25,0.25,0.25 -mu 2.3 ``-threads 3`` > tdg.out`

### Parsing the results and calculating the distribution

Once the program completes, the results saved in tdg.out need to be
processed to calculate the distribution of selection coefficients. The
output looks something like (truncated):

```
Site 1 - Residues: [1/20] { 1:(12, M), 2:(0, A), 3:(1, R), ... }
Site 1 - Optimisation run (267 evaluations). lnL = -4.939901E-5, Params = {0.0, ...}
Site 1 - Homogeneous model lnL: -4.939901011180276E-5
Site 1 - Fitness: { -13.06494, -18.20550, -16.32137, ... }
Site 1 - Pi: { 4.79931E-6, 5.79345E-8, 2.70189E-7, ... }
...
```

The output is quite verbose. Run the following command (specifying the
name of the result file with '-o tdg.out'):

> `java -cp tdg12.jar tdg.results.All -o tdg.out -gc vertebrate_mit -tau 1e-6 -kappa 7.8 -pi 0.25,0.25,0.25,0.25 -mu 2.3`

The global parameters should be the same as were specified when you run
the analysis. This command reads the results file and the global
parameters from the command-line, and writes the files:

F.txt
: Fitness values for each amino acid at each site

Q0.txt
: Neutral mutation matrix for entire alignment

S.txt
: Selection coefficients matrix for each site

QS.txt
: Mutation with selection matrix for each site

PiS.txt
: Codon frequencies at each site

PiAA.txt
: Amino acid frequencies at each site

distribution.mutations.csv
: the distribution of selection coefficients for mutations

distribution.substitutions.csv
: the distribution for substitutions.

The last two are comma-separated value files that can be opened in a
program like Excel to plot the distribution. The three columns in the
distribution files are (i) the histogram bins for _S_ (ii) all
mutations/subsitutions and (iii) non-synonymous mutations/substitutions.

![](https://raw.github.com/tamuri/tdg12/master/docs/tutorial/results/exact/plot.png)

Simulating data using the mutation-selection model
--------------------------------------------------

### Simulating a single set of fitnesses (for one site)

> `java -cp tdg12.jar tdg.sim.AlignmentSimulator -tree sim.tree -output out.phy -sites 100 -fitness 0.0,0.2,0.3,1.0,0.5,0.6,0.7,0.8 -characters A,R,N,D,C,Q,E,H -tau 0 -pi 0.25,0.25,0.25,0.25 -mu 1.0 -gc standard`

This command will write an alignment file 'out.phy' (in PHYLIP format)
simulating a site (100 times) on the tree 'sim.tree' with the specified
fitnesses for the specified characters. Characters that do not appear in
the -characters option are not observed at that site (i.e. have fitness
of -∞). Set the global parameters as required.

### Simulating multiple sets of fitnesses (for multiple sites)

Create a file with 20 fitnesses values on each line, each line
corresponding to a single location in the alignment. Each fitness must
be separated by a space and each site should be on a new line. For
example, to simulate an alignment with 5 sites:

```
-21 -21 -21 -21 -21 -21 -21 -21 -21 2.19 3.07 -21 2.39 -21 -21 -21 -21 -21 -21 -21
-21 -21 1.74 -21 -21 -21 -21 -21 -21 -21 -21 -21 -21 -21 -21 4.23 -21 -21 -21 -21
0.93 -21 -21 -21 -21 -21 -21 -21 -21 0.31 -21 -21 2.63 -21 -21 -21 4.11 -21 -21 6.39
2.75 -21 -21 -21 -21 -21 -21 -21 -21 0.92 -21 -21 1.71 -21 2.28 0.72 1.33 -21 -21 3.51
-21 -21 -21 -21 -21 -21 -21 -21 -21 -21 -21 -21 -21 -21 -21 0.79 -21 -21 -21 -21
```

Each line specifies the fitnesses of each amino acid in the canonical
IUPAC order. Residues that are unobserved at a site are given a fitness <-20. If this file is saved as 'F.txt', the alignment is generated
using the command:

> `java -cp tdg12.jar tdg.sim.AlignmentSimulator -tree sim.tree -output out.phy -fitnessfile F.txt -tau 0 -pi 0.25,0.25,0.25,0.25 -mu 1.0 -gc standard`

Colophon
--------

TdG12 uses the following libraries:

1. Colt Project (). For linear
algebra.

2. PAL: Phylogenetic Analysis Library
().
Reading/traversing/writing NEWICK trees and PHYLIP alignments.

3. Apache Commons Math (). For
optimisation.

4. Guava: Google Core Libraries
().

5. JCommander (). Parsing and managing
command-line options.

6. Simple Java HTTP server (). Used by
tdg.distributed.Slave.

7. Asynchronous Http Client library for Java
(). Used by
tdg.distributed.Master to make asynchronous HTTP requests to the
slaves.

[^1]: However, you can use other programs if you prefer.

部分文件列表（点击文件名可查看文件内容）

					
									本源码包内暂不包含可直接显示的源代码文件，请下载源码包。

κ =	4.93022
π =	{0.25299, 0.22268, 0.43860, 0.08572}
μ =	3 * T_dS / T = 3 * 70.7207 / 113.24814 = 1.8734