资源说明:Software and data used in Tamuri AU, dos Reis M, Hay AJ, Goldstein RA (2009) Identifying Changes in Selective Constraints: Host Shifts in Influenza. PLoS Comput Biol 5(11): e1000564 doi:10.1371/journal.pcbi.1000564
# TdG09 program: identifying changes in selective constraints This program is an implementation of the model described in: Tamuri AU, dos Reis M, Hay AJ, Goldstein RA (2009) Identifying Changes in Selective Constraints: Host Shifts in Influenza. _PLoS Comput Biol_ 5(11): e1000564. [doi:10.1371/journal.pcbi.1000564](http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000564) This phylogenetic model uses site-specific amino acid frequencies to distinguish patterns of substitution between two (or more) lineages or groups of taxa. We first estimate the site-specific amino acid frequencies at a given location in a protein alignment assuming that the pattern of substitution is homogeneous across all branches. We then estimate multiple sets of site-specific frequencies, allowing them to differ among different branches, producing a non-homogeneous model of evolutionary change. Using statistical tests, we then see whether the non-homogeneous model provides a significantly better fit to the data than the homogenous model. ## Tutorial 1. **Install Java:** The program requires a recent version of the Java Runtime (JRE), which, if not already installed, can be downloaded from [Oracle](http://www.oracle.com/technetwork/java/javase/downloads/index.html). Linux packages are usually available in the distribution's repository (e.g. `sudo apt-get install openjdk-7-jre` for Debian, Ubuntu etc. distributions). 2. **Download the program:** The latest version of the program is available for download from the [repository](https://github.com/tamuri/tdg09/tags). The download includes a compiled binary as well as a `build.xml` to compile from sources using [ant](http://ant.apache.org/). Unzip the download and check that the program works: ``` $ ls -F README.md build.xml dist/ etc/ lib/ src/ $ java -cp dist/tdg09.jar tdg09.Analyse Usage: java -cp tdg09.jar tdg09.Analyse [options] Options: * -alignment Alignment in PHYLIP format * -groups Group labels to partition tree e.g. Av Hu -threads Number of threads to use Default: 1 * -tree Tree in NEWICK format ``` 3. **Preparing your data:** The program requires a protein sequence alignment in [PHYLIP format](http://www.bioperl.org/wiki/PHYLIP_multiple_alignment_format) and the corresponding tree in [Newick format](http://en.wikipedia.org/wiki/Newick_format). The tree branch lengths should be optimised for an amino acid model, such as WAG, using a program such as [RAxML](http://sco.h-its.org/exelixis/software.html) or [PAML](http://abacus.gene.ucl.ac.uk/software/paml.html). Most importantly, sequence names must be prefixed by a two-letter identifier that is used to indicate its lineage/grouping. For example, to identify changes in selective constraints between avian and human flu viral proteins, we can prefix every sequence name with 'Av' or 'Hu' to indicate its lineage: ``` $ cat etc/H1.faa 434 566 Hu_HA_AAX56530_H1N1 MKAKLLVLLCAFTATYADTI... Hu_HA_AAY78939_H1N2 MKVKLLILLCTFTATYADTI... Av_HA_ABB19507_H1N6 MEAKLFVLFCTFTVLKADTI... Av_HA_ABB19518_H1N1 MEAKLFVLFCTFTALKADTI... ... ``` Both the sequence alignment and the tree must follow this convention. ``` $ cat etc/H1.tree ((Av_HA_ABB19607_H1N1:0.0852880,Av_HA_ABG88212_H1N1:0.1036700): 0.0248320,((Av_HA_ABB19618_H1N1:... ``` This is the only way by which the tdg09 program determines which of the different non-homogeneous models a particular tree branch should use. Example data sets of flu viral proteins (used in Tamuri *et al.* 2009) are included in the `etc/` directory. 4. **Running the program:** The command-line options for the program are: + -alignment : the sequence alignment file in PHYLIP format e.g. `etc/H1.faa` + -tree : the tree file in Newick format e.g. `etc/H1.tree` + -groups : the two-letter identifiers used to partition the sequences e.g. `Av Hu` + -threads : specify the numbers of CPU cores/threads to utilise e.g. `2` The program prints messages to standard out, so this should be captured using `>` or `tee`. We are now ready to run the program: ``` $ java -cp dist/tdg09.jar tdg09.Analyse -alignment etc/H1.faa \ -tree etc/H1.tree -groups Av Hu -threads 2 > H1_out.txt ``` or if you have 'tee' installed: ``` $ java -cp dist/tdg09.jar tdg09.Analyse -alignment etc/H1.faa \ -tree etc/H1.tree -groups Av Hu -threads 2 | tee H1_out.txt ``` 5. **Inspect the results:** In this example, program output is captured in `H1_out.txt`: ``` $ cat H1_out.txt StartTime: 2013-03-19 13:44:10.129 WorkingDirectory: /Users/Tester/Documents/tdg09 Options: -alignment etc/H1.faa -tree etc/H1.tree -groups Av Hu -threads 2 TreeFile: /Users/Tester/Documents/tdg09/etc/H1.tree AlignmentFile: /Users/Tester/Documents/tdg09/etc/H1.faa Alignment: SequenceCount: 434 SiteCount: 566 Groups: [Av, Hu] # The internal nodes of the tree are not labelled. Labelling... # Node 432 from [Av, Hu] resolved # Node 432 from [Av, Hu] resolved # Assuming that root of tree is in group [Av] # Switching from group [Av] to [Hu] at branch 432..431 LabelledTree: > (((((((((((Av_HA_ABB19607_H1N1:0.0852880,Av_HA_ABG88212_H1N1:0.1036700) Av:0.0248320,((Av_HA_ABB19618_H1N1:0.0910450,Av_HA_ABG88201_H1N1:0.0810480) ... Hu:0.0618410)Hu:0.0978830)Hu:0.0436020)Hu:0.0482480)Hu:0.0255390)Hu:0.0362960) Hu:0.0696670)Hu:0.0620790)Hu:0.0268645); ``` The output contains a "LabelledTree" that shows the inferred lineage for each ancestral node. This should be checked in a tree viewing program (such as [Dendroscope](http://www-ab.informatik.uni-tuebingen.de/software/dendroscope)) to make sure that the lineages are correct. If not, they can be modified and the analysis can be rerun with the new, custom-labelled, tree. The output continues: ``` # 2013-03-19 22:44:10.33 - site 1 complete. # 2013-03-19 22:44:11.366 - site 3 complete. # 2013-03-19 22:44:11.366 - site 4 complete. ... LrtResults: # Site, delta lnL, dof, LRT, FDR - [ 204, 21.850690, 3, 0.0000000, 0.0000003 ] - [ 169, 11.528011, 1, 0.0000016, 0.0001542 ] - [ 289, 10.733096, 2, 0.0000218, 0.0008550 ] - [ 252, 10.927903, 2, 0.0000180, 0.0008796 ] - [ 9, 8.225516, 1, 0.0000499, 0.0008895 ] - [ 300, 15.027774, 5, 0.0000144, 0.0009396 ] - [ 62, 8.261257, 1, 0.0000481, 0.0009423 ] - [ 303, 8.261836, 1, 0.0000480, 0.0010463 ] - [ 239, 9.647869, 2, 0.0000646, 0.0010545 ] - [ 315, 9.483944, 2, 0.0000761, 0.0011468 ] - [ 253, 8.262439, 1, 0.0000480, 0.0011764 ] ... FullResults: # Site, WAG+ssF params, WAG+ssF lnL, WAG+lssF params, WAG+lssF params, delta lnL, dof, LRT, FDR - [ 1, NA, NA, NA, NA, NA, NA, NA, NA ] - [ 2, 3, -21.488149, 5, -13.378828, 8.109321, 2, 0.0003007, 0.0025627 ] - [ 3, 2, -48.561153, 3, -47.763333, 0.797820, 1, 0.2065223, 0.3489515 ] - [ 4, 2, -24.137478, 3, -23.407924, 0.729554, 1, 0.2270721, 0.3708845 ] - [ 5, 2, -12.611075, 3, -12.456865, 0.154209, 1, 0.5786522, 0.6593943 ] - [ 6, 3, -23.856049, 5, -22.761440, 1.094609, 2, 0.3346706, 0.5084918 ] - [ 7, 2, -44.839152, 3, -43.761797, 1.077355, 1, 0.1421333, 0.2509740 ] - [ 8, NA, NA, NA, NA, NA, NA, NA, NA ] ... ``` Of interest are the "LrtResults" and "FullResults" tables. The LrtResults lists polymorphic sites (on which the non-homogeneous model was estimated) and orders them by the false discovery rate (a correction on the likelihood ratio test P-value required by multiple hypothesis testing). At a given FDR cutoff (e.g. 0.05), these sites are those at which the non-homogeneous model provides a statistically significant improvement over the homogenous model, indicating that the patterns of substitution are different between the different groups/lineages. The FullResults table lists further results from all sites. This includes the log-likelihood for the WAG+ssF (site-specific frequencies or *homogeneous model*) and WAG+lssF (lineage and site-specific frequencies or *non-homogeneous model*). Conserved locations are not analysed, so their entries are 'NA'. Finally, after the FullResults table, the output file contains details of the per-location results, such as the amino acid frequencies estimated by the ssF and lssF models. 6. **Analysing the results**: The output file is in [YAML](http://www.yaml.org/) format, which means it can be read by any other programming language that has a YAML parsing library. Here we show an example of analysing the results using the programming language [R](http://www.r-project.org/). The code is available in the `src/R/example.R` file. Start the R console, install and load the *yaml* library, then load the output file using the `yaml.load_file` function: ``` $ R R version 2.15.3 (2013-03-01) -- "Security Blanket" ... > install.packages("yaml") Installing package(s) into ‘/Users/Tester/Library/R/2.15/library’ (as ‘lib’ is unspecified) ... > library(yaml) > out <- yaml.load_file(input='/Users/Tester/Documents/tdg09/H1_out.txt') > summary(out) Length Class Mode StartTime 1 -none- character WorkingDirectory 1 -none- character Options 1 -none- character TreeFile 1 -none- character AlignmentFile 1 -none- character Alignment 2 -none- list Groups 2 -none- character LabelledTree 1 -none- character LrtResults 196 -none- list FullResults 566 -none- list ConservedPositions 2 -none- list SiteResults 566 -none- list EndTime 1 -none- character ``` To convert the R list objects into a type easier to work with, we convert the LrtResults and FullResults objects into data.frames: ``` > lrt_results <- as.data.frame(matrix(unlist(out$LrtResults), ncol=5, byrow=T)) > names(lrt_results) <- c("site", "deltaLnL", "dof", "lrt", "fdr") > head(lrt_results) site deltaLnL dof lrt fdr 1 204 21.850690 3 0.00e+00 0.0000003 2 169 11.528011 1 1.60e-06 0.0001542 3 289 10.733096 2 2.18e-05 0.0008550 4 252 10.927903 2 1.80e-05 0.0008796 5 9 8.225516 1 4.99e-05 0.0008895 6 300 15.027774 5 1.44e-05 0.0009396 > sum(lrt_results$fdr <= 0.05) # how many sites identified with FDR < 0.05? [1] 55 > full_results <- as.data.frame(matrix(unlist(out$FullResults), ncol=9, byrow=T)) > names(full_results) <- c("site", "ssfParams", "ssfLnL", "lssfParams", "lssfLnL", "deltaLnL", "dof", "lrt", "fdr") > head(full_results) site ssfParams ssfLnL lssfParams lssfLnL deltaLnL dof lrt fdr 1 1 NA NA NA NA NA NA NA NA 2 2 3 -21.488149 5 -13.378828 8.109321 2 0.0003007 0.0025627 3 3 2 -48.561153 3 -47.763333 0.79782 1 0.2065223 0.3489515 4 4 2 -24.137478 3 -23.407924 0.729554 1 0.2270721 0.3708845 5 5 2 -12.611075 3 -12.456865 0.154209 1 0.5786522 0.6593943 6 6 3 -23.856049 5 -22.76144 1.094609 2 0.3346706 0.5084918 ``` To produce a plot of FDR values by site: ``` > fdr <- as.numeric(levels(full_results$fdr)[full_results$fdr]) # FDR column should be numeric Warning message: NAs introduced by coercion > fdr[is.na(fdr)] <- 1.0 # conserved locations implicitly have no evidence of non-homogeneity > sites <- out$Alignment$SiteCount > plot_ranges <- split(seq(1, sites), cut(seq(1, sites), 5)) # split sites in plot into 5 rows > par(mfrow=c(5,1), mar=c(2.0,0.5,0.5,0.5)) for (p in 1:5) { plot(1 - fdr, xlim=c(plot_ranges[[p]][1], tail(plot_ranges[[p]], n=1)), ty='h', lwd=1, main='', xlab='', ylab='', yaxt='n', col="#1B9E77") lines(which(fdr <= 0.20), 1 - fdr[fdr <= 0.20], col="#D95F02", ty='h') abline(h=0.95, lty='dashed') points(which(fdr <= 0.05), 1 - fdr[fdr <= 0.05], pch=20, col="#DE2D26") } ``` This produces the plot show below. Bars drawn in orange indicate locations with FDR < 0.20, and those locations with FDR < 0.05 have a red dot at their value (we drew the plot with "1 - fdr" values so that smaller FDRs are taller). We can see from this plot a cluster of identified sites at locations 200-210. ![image](./figure_fdr_plot.png) The R code in `src/R/example.R` also provides an example of using the [phangorn](http://cran.r-project.org/web/packages/phangorn/index.html) and [ape](http://cran.r-project.org/web/packages/ape/) packages to simulate data under the homogeneous model for a particular site. We can then analyse the synthetic data with the TdG09 software, the parametric bootstrap providing significance of the non-homogeneous model on the original data. For example, here is the Monte Carlo distribution of Δ for 1000 parametric bootstrap replicates for flu protein HA (H1) site 204, showing that the homogenous model can be rejected in favour of the non-homogeneous model with P-value < 0.001. ![image](./figure_cox_plot.png)
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。