Resource description: Pipeline for BLAST analysis of high-throughput DNA sequencing data.

HTS-barcode-checker
===================

The correct taxonomic identification of internationally-traded biological materials is crucial for the effective enforcement of the [Convention on International Trade in Endangered Species of Wild Fauna and Flora](http://cites.org/). This project provides a pipeline that automates the putative taxonomic identification of DNA barcodes (e.g. as generated from confiscated materials) by chaining together the steps of:

1. DNA sequence similarity searching in public databases using BLAST
2. Taxonomic name reconciliation of the taxon names of returned, matching sequences with the names listed in the CITES "appendices" (which itemize species and higher taxa in which international trade is restricted).

Disclaimer
----------

Although the authors of this pipeline have taken care to consider exceptions such as incorrectly annotated sequence records in public databases, taxonomic synonyms, and ambiguities in the CITES appendices themselves, the user is advised that the results of this pipeline can in no way be construed as conclusive evidence for either positive or negative taxonomic identification of the contents of biological materials. The pipeline and the results it produces are provided for informational purposes only. To emphasize this point, we reproduce the disclaimer of the license under which this pipeline is released verbatim, below:

**THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.**

Installation instructions
-------------------------

### Dependencies

Irrespective of how the tool is installed (that is, whether as a command-line tool, a standalone web application, or as a Galaxy tool), the following dependencies need to be satisfied:

* **python**, version 2.7 or 3 - **Note**: check first whether Python is already installed.
* **bio-python** - for example `sudo pip install biopython`
* **beautiful-soup** - for example `sudo pip install BeautifulSoup`
* **requests** - for example `sudo pip install requests`
* **ncbi-blast+ 2.2.28** or higher when running local BLAST searches (recommended)

### Command-line tool

For command-line usage, the Python script [HTS-barcode-checker](src/HTS-barcode-checker) is provided in the src folder. Assuming the dependencies (listed above) are satisfied, there are no installation steps: the script can simply be run 'as is' with the command-line arguments described below. However, for both of the other usages described below (i.e. as web application or as Galaxy tool) it is recommended to have the command-line tool installed "system-wide", i.e. such that any user can invoke the script from the `$PATH`.

### Standalone web application

To install the pipeline as a locally-hosted web application, in addition to satisfying the dependencies listed above, the following steps must be taken:

* Place the Python script [HTS-barcode-checker](src/HTS-barcode-checker) in a location where it can be executed by the web server process.
* Place the default [CITES CSV database](resources/CITES_db.csv) in a location where it is readable by the web server process.
* Edit line 42 in the [HTS-barcode-checker](src/HTS-barcode-checker#L42) script: `resources` should point to the resource folder that comes with the git repository.

Given the number of different web server configurations that exist it is best to consult your local system administrator if you don't know how to do this. The general issue is that a web server process is typically run as a special user with limited rights. Hence, the server (and _any_ processes it can launch) may not be allowed to access certain folders, execute certain processes, and so on.

### Galaxy pipeline

To run the pipeline in Galaxy:

* The [HTS-barcode-checker.xml](galaxy/HTS-barcode-checker.xml) UI configuration file and the [HTS-barcode-checker.py](galaxy/HTS-barcode-checker.py) wrapper have to be placed in Galaxy's `tools` folder.
* The `tool_conf.xml` configuration file in the main folder of the Galaxy installation needs to be edited to include the HTS-barcode-checker, see the Galaxy [wiki](http://wiki.galaxyproject.org/Admin/Tools/Adding_Tools) for details.
* Finally, in order for the tool to work, the actual script [HTS-barcode-checker](src/HTS-barcode-checker) needs to be added to the system `$PATH`.
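
Whichever installation route is chosen, it can help to confirm that the dependencies and the `$PATH` requirement described above are actually met before a first run. The following is a minimal sketch for Python 3, not part of the pipeline itself. The import names (`Bio`, `requests`, `BeautifulSoup` or `bs4`) and the `blastn` executable name are assumptions about how the listed packages are exposed on a typical system; `blastn` only matters for local BLAST searches.

```python
import importlib.util
import shutil

# Assumed import names for the packages listed under "Dependencies".
# The pipeline itself may import Beautiful Soup under a different name.
PYTHON_MODULES = ["Bio", "requests"]
SOUP_MODULES = ["BeautifulSoup", "bs4"]  # either one counts as present
# blastn (from BLAST+) is an assumption and only needed for local BLAST runs.
EXECUTABLES = ["HTS-barcode-checker", "blastn"]


def is_importable(name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def missing_requirements(modules, any_of, executables):
    """Return readable names of every module or executable that was not found."""
    missing = [m for m in modules if not is_importable(m)]
    # The any_of group is satisfied as soon as one of its modules is found.
    if any_of and not any(is_importable(m) for m in any_of):
        missing.append(" or ".join(any_of))
    # shutil.which searches the directories on $PATH, like a shell would.
    missing += [e for e in executables if shutil.which(e) is None]
    return missing


if __name__ == "__main__":
    missing = missing_requirements(PYTHON_MODULES, SOUP_MODULES, EXECUTABLES)
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all requirements found")
```

For the web application and Galaxy routes, the check is only meaningful when run as the same user that the web server or Galaxy process runs as, since that user's `$PATH` and permissions may differ from your own.
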

General usage
-------------

The basic command to run the pipeline is:

    HTS-barcode-checker --input_file --output_file --CITES_db

Arguments:

* `--input_file` sequence(s) obtained from a mixture whose contents need to be identified
* `--output_file` location where to write the results of the identification
* `--CITES_db` location of the pre-computed CITES names database

This command will run BLAST searches of the provided input FASTA file(s) against the NCBI nucleotide database (by default), then cross-reference the returned taxon IDs with local databases of taxon IDs that were obtained by taxonomic name reconciliation of the names listed in CITES appendices with the NCBI taxonomy. Any matches are recorded in the output file, a CSV spreadsheet, which needs to be evaluated further by the user.

By default, the BLAST results are filtered according to the following criteria: a hit must have a minimum match percentage of 97%, a minimum match length of 100 bp and a maximum e-value of 0.05. These settings can be altered if needed with the advanced command options listed below.

By default, identification is done by submitting the BLAST request to NCBI GenBank. However, this can be slow and impractical for larger datasets, for which a local BLAST run is more practical. In order to run a local BLAST the NCBI BLAST+ tool needs to be installed and a local BLAST database (for example the NCBI nucleotide collection `nt`) needs to be set up. For more info on installing the BLAST+ tool see the [BLAST+](http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download) webpage. When set up correctly a local BLAST run can be specified with the `-lb` parameter. The invocation will then be:

    HTS-barcode-checker --input_file --output_file --CITES_db -lb

The pipeline flags critical issues that need to be investigated. In particular, in cases of taxonomic heterogeneity including CITES-listed species (i.e.
multiple distinct species matching the same sequence, with at least some species being CITES-listed) the pipeline warns about this by emitting a message such as:

    CRITICAL: X out of a total of Y distinct taxa for "query" are CITES-listed

where X and Y are counts of distinct taxa and "query" is the input sequence identifier that yielded this result. Care must be taken in the interpretation of such results, as they can be a source of both Type I and Type II errors (i.e. both false positives and false negatives).

Input data
----------

In a typical use case the input file contains high-throughput DNA sequencing reads for a locus commonly used in DNA barcoding (e.g. COI, matK, rbcL). To limit data volumes the user is advised to consider filtering out duplicate and poor quality reads as well as, possibly, clustering the reads a priori (e.g. using [octupus](http://octupus.sourceforge.net)) and picking an exemplar or computing a consensus for each cluster. An example file is provided in the data folder as _Test_data.fasta_.

Full command information
------------------------

Command line arguments:

    HTS-barcode-checker [-h] [-i fasta file] [-o output file] [-ba algorithm] [-bd database] [-lb] [-hs HS] [-mb] [-mi MI] [-mc MC] [-me ME] [-bl blacklist file [blacklist file ...]] [-cd CITES database file [CITES database file ...]] [-fd] [-ad] [-ah] [-l log level] [-lf log file]

All command line arguments and options can be provided in short or long form, as listed below:

    -h, --help

Show help message and exit

    -i , --input_file

Input data in FASTA format. The HTS-barcode-checker is limited to a set of 100 sequences when running an online BLAST.

    -o