Resource description: Unofficial repo for software vendoring or packaging purposes
0.  Availability
================

The source code for this package is available from
http://research-pub.gene.com/gmap.  License terms are provided in the
COPYING file.


1.  Building and installing GMAP and GSNAP
==========================================

Prerequisites: a Unix system (including Cygwin on Windows), a C
compiler, and Perl

Step 1: Set your site-specific variables by editing the file
config.site.  In particular, you should set appropriate values for
"prefix" and probably for "with_gmapdb", as explained in that file.
If you are compiling this package on a Macintosh, you may need to edit
CFLAGS to be

    CFLAGS = '-O3 -m64'

since Macintosh machines will make only 32-bit executables by default.

Step 2: Build, test, and install the programs, by running the
following GNU commands

    ./configure
    make
    make check   (optional)
    make install

Note 1: Instead of editing the config.site file in step 1, you may
type everything on the command line for the ./configure script in step
2, like this

    ./configure --prefix=/your/usr/local/path --with-gmapdb=/path/to/gmapdb

If you omit --with-gmapdb, it defaults to ${prefix}/share.  If you
omit --prefix, it defaults to /usr/local.  Note that on the command
line, it is "with-gmapdb" with a hyphen, but in a config.site file, it
is "with_gmapdb" with an underscore.

Note 2: If you want to keep your version of config.site or have
multiple versions, you can save the file to a different filename, and
then refer to it like this

    ./configure CONFIG_SITE=<config site file>

Note 3: GSNAP is designed for short reads of a limited length, and
relies upon a maximum read length variable MAX_READLENGTH defined at
compile time (default 250).  You may set this variable by providing it
to configure like this

    ./configure MAX_READLENGTH=<length>

or by defining it in your config.site file (or in the file provided to
configure as the value of CONFIG_SITE).  Or you may set the value of
MAX_READLENGTH as an environment variable before calling ./configure.
If you do not set MAX_READLENGTH, it will have the default value shown
when you run "./configure --help".  Note that MAX_READLENGTH applies
only to GSNAP.  GMAP, on the other hand, can process queries up to 1
million bp.

Note 4: GSNAP can read from gzip-compressed FASTA or FASTQ input
files.  This feature requires the zlib library to be present
(available from http://www.zlib.net).  The configure program will
detect the availability of zlib automatically.  However, to disable
this feature, you can add "--disable-zlib" to the ./configure command
or edit your config.site file to have the command "disable_zlib".

Note 5: GSNAP optionally supports the Goby input and output file
formats.  To implement this functionality, you need to obtain and
compile the libraries from http://campagnelab.org/software/goby.  If
the resulting header files are located in /path/to/goby/include and
the library files are in /path/to/goby/lib, you can then add the flag
"--with-goby=/path/to/goby" to your ./configure command or edit your
config.site file to have this directory as the value for "with_goby".


2.  Downloading a pre-built GMAP/GSNAP database
===============================================

A GMAP/GSNAP "database" is a set of genomic index files, representing
the genome in a hash table format.  You can use the programs
gmap_build or gmap_setup to build your own database (as described
below), but you can get started quickly by downloading a pre-built
GMAP/GSNAP database from the same place you obtained the GMAP program
(see above for URL).

Place the database in the GMAPDB directory you specified in the
config.site file when you built the gmap program.  You should include
a subdirectory for each GMAP database; for example, if you downloaded
a database called <genome>, your directory structure should look like
this

    /path/to/gmapdb/<genome>/
    /path/to/gmapdb/<genome>/<genome>.chromosome
    /path/to/gmapdb/<genome>/<genome>.chromosome.iit
    ...
    /path/to/gmapdb/<genome>/<genome>.version

Note that the GMAP database format has changed with the 2011-08-15
release.
Older versions of GMAP and GSNAP will not work with the newer
databases, but the current version of the programs is backward
compatible with the older databases.  Also, versions of GMAP and GSNAP
before 2008 may require symbolic links to work even with the older
databases.

The old databases have the index files .ref3offsets and
.ref3positions.  The new databases have the index files
.ref12153gammaptrs, .ref12153offsetscomp, and .ref153positions, if
built using a base size of 12, a k-mer size of 15, and skipping every
3 bp in the genome.  If the k-mer size is equal to the base size, then
the gammaptrs file will be absent.

Also, the name of the positions file has changed starting with the
2012-02-14 version.  Previously, the file was named
.ref12153positions, but it is now named .ref153positions, since the
contents are independent of the base size.  If you create a database
with a newer version of the package, and want older versions of GMAP
or GSNAP to work with this newer database, you will need to make a
symbolic link like this:

    ln -s <genome>.ref153positions <genome>.ref12153positions


3.  Setting up to build a GMAP/GSNAP database (one chromosome per FASTA entry)
==============================================================================

You can also build your own genomic database, using one of two utility
programs provided with this package: gmap_build (the newer, one-step
method) or gmap_setup (the older way that uses a Makefile and requires
multiple steps).

Note that the total sequence length in your database cannot exceed
2^32 = 4,294,967,296 (about 4 billion) bp.  This is because the format
uses 32-bit pointers.  If your total sequence provided to the utility
programs exceeds 4 billion bp, the programs will abort.

Below I use the terms "genome" and "chromosome", but the input
sequences can be anything you wish to align to, including transcripts
or small fragments.
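The 2^32 bp limit described above can be checked before running the
build tools.  The following is a minimal Python sketch, not part of
this package: the function names are hypothetical, and treating a
total of exactly 2^32 bp as acceptable is an assumption based on the
wording "cannot exceed".

```python
from typing import Iterable

# Limit stated in this README: 2^32 = 4,294,967,296 bp.
MAX_GENOME_BP = 2 ** 32


def total_fasta_length(lines: Iterable[str]) -> int:
    """Sum the sequence characters in FASTA text, skipping header lines."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # blank line or FASTA header
        total += len(line)
    return total


def fits_in_database(total_bp: int) -> bool:
    """Return True if the total length does not exceed the 2^32 bp limit.

    Assumption: a total of exactly 2^32 bp is treated as acceptable.
    """
    return total_bp <= MAX_GENOME_BP


if __name__ == "__main__":
    example = [">1", "ACGTACGT", "NNNN", ">2", "ACGT"]
    n = total_fasta_length(example)
    print(n, fits_in_database(n))
```

To check real input, pass an open file handle (or lines read through
gzip.open in text mode) to total_fasta_length.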
You will need to start with a set of FASTA files containing either
entire chromosomes or contigs that represent pieces of chromosomes.
If your FASTA entries each contain a single chromosome, and the
accession for each chromosome is the chromosome number/letter, you can
simply run this command

    gmap_build -d <genome> [-k <kmer size>] <fasta_files...>

which will build and install the GMAP index files in your default
GMAPDB location.

You can see the full usage of gmap_build by doing "gmap_build --help",
but here are some useful flags.  If your FASTA files are gzipped, you
can add the flag "-g" to gmap_build.  You can control the k-mer size
for the genomic index with the -k flag, which can range from 12 to 15.
The default value for -k is 15, but this requires your machine to have
4 GB of RAM to build the indices.  If you do not have 4 GB of RAM,
then you will need to reduce the value of -k or find another machine.
Here are the RAM requirements for building various indices:

    k-mer of 12: 64 MB
    k-mer of 13: 256 MB
    k-mer of 14: 1 GB
    k-mer of 15: 4 GB

These are the RAM requirements for building indices, but not to run
the GMAP/GSNAP programs once the indices are built, because the
genomic indices are compressed.  For example, the genomic index for a
k-mer of 15 gives a gammaptrs file of 64 MB and an offsetscomp file of
about 350 MB, much smaller than the 4 GB that would otherwise be
required.  Therefore, you may want to build your genomic index on a
computer with sufficient RAM, and distribute that index to be used by
computers with less RAM.

If you want to build your genomic databases with more than one k-mer
size, you can re-run gmap_build with different values of -k.  This
will overwrite only the identical files from the previous runs.  You
can then choose the k-mer size at run-time by using the -k flag for
either GMAP or GSNAP.


4.  Setting up to build a GMAP/GSNAP database (more complex cases)
==================================================================

If gmap_build works for you, you can skip to section 5.
Otherwise, if you have more complicated needs than gmap_build can
handle, there is a more general build tool called gmap_setup, which
creates a Makefile with this command

    gmap_setup -d <genome> [-k <kmer size>] <fasta_file>...

and then has you run a few make commands, based on the directions it
provides.  Again, you can type "gmap_setup --help" to see the full set
of options.

Note that the "..." above indicates that multiple files can be listed.
The files can be in any order, and the contigs can be in any order
within these files.  By default, the GMAP setup process will sort the
contigs and chromosomes into their appropriate "chrom" order.  For the
human genome, this order is 1, 2, ..., 10, 11, ..., 22, X, Y, M,
followed by all other chromosomes in numeric/alphabetical order.  If
you don't want this sort, provide the "-s none" flag to gmap_setup or
gmap_build.  Other sort options besides "none" and "chrom" are "alpha"
and "numeric-alpha".

We show the full set of gmap_setup options under item 4d below, but we
first discuss some specific situations for using the program.


4a.  Chromosomes represented as contig pieces
=============================================

If your FASTA entries consist of contigs, each of which has a mapping
to a chromosomal region in the header, you may need to add the -C flag
to gmap_setup, like this

    gmap_setup -d <genome> -C <fasta_file>...

Then gmap_setup will try to parse a chromosomal region from each
header.
The program knows how to parse the following patterns:

    chr=1:217281..257582  [may insert spaces around '=', or omit '=' character]
    chr=1                 [may insert spaces around '=', or omit '=' character]
    chromosome 1          [NCBI .mfa format]
    chromosome:NCBI35:22:1:49554710:1  [Ensembl format]
    /chromosome=2         [Celera format]
    /chromosome=2 /alignment=(88840247-88864134) /orientation=rev  [Celera format]
    chr1:217281..257582
    chr1                  [may insert spaces after 'chr']

If only the chromosome is specified, without coordinates, the program
will assign its own chromosomal coordinates by concatenating the
contigs within each chromosome.  If gmap_setup cannot figure out the
chromosome, it will assign it to chromosome "NA".


4b.  Using an MD file
=====================

Another possibility is that your FASTA entries consist of contigs,
each of which has mapping information in an external file.  Genomes
from NCBI typically include an ".md" file (like seq_contig.md) that
specifies the chromosomal coordinates for each contig.  To use this
information, provide the -M flag to gmap_setup, like this

    gmap_setup -d <genome> -M <mdfile> <fasta_file>...

The program will then try to parse the mdfile (which often changes
formats) and verify with you which columns contain the contig names
and chromosomal coordinates.


4c.  Compressed FASTA files or files requiring processing
=========================================================

If your genome files don't satisfy any of the cases above, you may
need to write a small script that pipes the sequences in FASTA format
to gmap_setup.  You can tell gmap_setup about your script with the -E
flag, like this

    gmap_setup -d <genome> -E 'gunzip -c chr*.gz'
    gmap_setup -d <genome> -E 'cat *.fa | ./add-chromosomal-info.pl'

You can think of the command as a Unix pipe for processing each FASTA
file before it is read by gmap_setup.


4d.  General use of gmap_setup program
======================================

Any of the steps above (4a, 4b, or 4c) will create a Makefile, called
Makefile.<genome>.
You will then use this Makefile to build a GMAP/GSNAP database.  You
will be prompted to use this Makefile through the following commands:

    make -f Makefile.<genome> coords
    make -f Makefile.<genome> gmapdb
    make -f Makefile.<genome> install

Note that older versions of GMAP allowed the building of genomic
databases containing lower-case characters by doing
"make -f Makefile.<genome> gmapdb_lc" or
"make -f Makefile.<genome> gmapdb_lc_masked", but these will not work
with GSNAP, and I am not certain if these still work with the most
recent GMAP either, so they are not currently supported.

The first step in using this Makefile is to create a file called
coords.<genome>.  You may manually edit this file, if you wish, before
proceeding with the rest of the Makefile steps.  The coords file
contains one contig per line, in the following format:

    <contig_name> <chromosomal_mapping>

where the chromosomal_mapping is in the form
<chromosome>:<start>..<end>.  Here are some examples:

    NT_077911.1 1:217281..257582
    NT_091704.1 22U:1..166566

If you want the contig to be inserted as its reverse complement, then
list the coordinates in the reverse direction (starting with the
higher number), like this:

    NT_039199.1 1:61563373..61273712

You may delete lines or comment them out with a '#' character, which
will effectively omit those contigs from your genome build.  You may
also change chromosomal assignments (in column 2) at this stage.

Note: Previous versions of GMAP allowed you to specify alternate
strains in column 3, but this feature added too much complexity and is
no longer supported.

You then will run "make -f Makefile.<genome> gmapdb".  This creates a
compressed version of the genome, in the file <genome>.genomecomp,
which can hold only the standard, upper-case A, C, G, T, N, and X
characters.  It converts all lower-case characters to upper-case, and
all non-ACGTNX characters to 'N'.  This command also creates a hash
table of the genome, with files that end with "gammaptrs",
"offsetscomp", and "positions".

Finally, running "make -f Makefile.<genome> install" will place all
database files in a subdirectory specified by your "-d" flag under the
directory specified either by the "-D" flag or, if not specified, the
value of --with-gmapdb you provided at configure time.


Running GMAP
============

To see the full set of options, type "gmap --help".  The following are
some common examples of usage.  For more examples, see the document
available at http://www.gene.com/share/gmap/paper/demo-slides.pdf

For each of the examples below, we assume that you have installed a
genome database called <genome> in your GMAPDB directory.  (If your
database is located elsewhere, you can specify the -D flag to gmap or
set the environment variable GMAPDB to point to that directory.)

* Mapping only: To map one or more cDNAs in a FASTA file onto a
  genome, run GMAP as follows:

    gmap -d <genome> <cdna_file>

* Mapping and alignment: If you want to map the cDNAs to a genome, and
  show the full alignment, provide the -A flag:

    gmap -d <genome> -A <cdna_file>

* Alignment only: To align one or more cDNAs in a FASTA file onto a
  given genomic segment (also in a FASTA file), use the -g flag
  instead of the -d flag:

    gmap -g <genome_segment> -A <cdna_file>

* Batch mode: If you have a large number of cDNAs to run, and you have
  sufficient RAM to run in batch mode, add the "-B 3", "-B 4", or
  "-B 5" option.  Details for these options are provided by running
  "gmap --help".

    gmap -d <genome> -B 5 -A <cdna_file>

* Multithreaded mode: If your machine has several processors, you can
  make batch mode run even faster by specifying multiple threads with
  the -t flag:

    gmap -d <genome> -B 5 -A -t <nthreads> <cdna_file>

  Note that with multiple threads, the output results will appear in
  random order, depending on which thread finishes its computation
  first.  If you wish your output to be in the same order as the input
  cDNA file, add the '-O' (letter O, not the number 0) flag to get
  ordered output.

  Guidelines: The -t flag specifies the number of computational
  threads.
  In addition, if your machine supports threads, GMAP also uses one
  thread for reading the input query sequences, and one thread for
  printing the output results.  Therefore, the total number of threads
  will be 2 plus the number you specify.  The program will work
  optimally if it uses one thread per available processor.  If you
  specify too many threads, you can cause your computer to thrash and
  slow down.  Note that other programs running on your computer also
  need processors.

* Compressed output: If you want to store the alignment results in a
  compressed format, use the -Z flag.  You can uncompress the results
  by using the gmap_uncompress.pl program:

    gmap -d <genome> -Z <cdna_file> > x
    cat x | gmap_uncompress


Building map files
==================

This package includes an implementation of interval index trees
(IITs), which permits efficient lookup of interval information.  The
gmap program also allows you (with its -m flag) to look up pre-mapped
annotation information that overlaps your query cDNA sequence.  These
interval index trees (or map files) are built using the iit_store
program included in this package.  To build a map file, do the
following:

Step 1: Put your map information for a given genome into a map file
with the following FASTA-like format:

    >label coords optional_tag
    optional_annotation (which may be zero, one, or multiple lines)

For example, the label may be an EST accession, with the coords
representing its genomic position.  Labels may be duplicated if
necessary.

The coords should be of the form

    chr:position
    chr:startposition..endposition

The term "chr:position" is equivalent to "chr:position..position".  If
you want to indicate that the interval is on the minus strand or
reverse direction, then <endposition> may be less than
<startposition>.

Tags are very general and can be used for a variety of purposes.

Step 2: Run iit_store on this map file as follows

    cat <mapfile> | iit_store -o <map name>

The program will create a file called <map name>.iit.
Now you can retrieve this information with iit_get

    iit_get <map name>.iit <coords>

where <coords> has the format "chr:position" or
"chr:startposition..endposition".  The iit_get program has other
capabilities, including the ability to retrieve information by label,
like this:

    iit_get <map name>.iit <label>
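The map file format and coordinate syntax described in this section
can be illustrated with a short Python sketch.  This is not part of
the package and does not reproduce how iit_store or iit_get parse
their input; the names Interval, parse_coords, and format_entry are
hypothetical, and the code only follows the format as written in this
README.

```python
from typing import NamedTuple, Sequence


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def reverse(self) -> bool:
        # An end position below the start position marks the minus
        # strand or reverse direction, as described above.
        return self.end < self.start


def parse_coords(coords: str) -> Interval:
    """Parse "chr:position" or "chr:startposition..endposition"."""
    chrom, sep, span = coords.rpartition(":")
    if not sep or not chrom:
        raise ValueError("expected chr:position, got %r" % coords)
    if ".." in span:
        start_text, end_text = span.split("..", 1)
    else:
        # "chr:position" is equivalent to "chr:position..position".
        start_text = end_text = span
    return Interval(chrom, int(start_text), int(end_text))


def format_entry(label: str, coords: str, tag: str = "",
                 annotation: Sequence[str] = ()) -> str:
    """Build one map file entry: a header line plus annotation lines."""
    parse_coords(coords)  # raises ValueError if coords are malformed
    header = " ".join(part for part in (">" + label, coords, tag) if part)
    return "\n".join([header, *annotation]) + "\n"


if __name__ == "__main__":
    print(parse_coords("chr1:217281..257582"))
    print(format_entry("EST1", "chr2:300..100", "tagA", ["first note"]), end="")
```

Entries produced this way can be concatenated into a map file and
piped to iit_store as in Step 2.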