GMAP-GSNAP
文件大小: unknow
源码售价: 5 个金币 积分规则     积分充值
资源说明:*UNOFFICIAL, UNMAINTAINED and OUTDATED*: This was an unofficial archive of GMAP-GSNAP releases. Please use the original website for current versions of the source code.
0.  Availability
============

The source code for this package is available from
http://research-pub.gene.com/gmap.  License terms are provided in the
COPYING file.


1.  Building and installing GMAP and GSNAP
==========================================

Prerequisites: a Unix system (including Cygwin on Windows), a C
compiler, and Perl

Step 1: Set your site-specific variables by editing the file
config.site.  In particular, you should set appropriate values for
"prefix" and probably for "with_gmapdb", as explained in that file.
If you are compiling this package on a Macintosh, you may need to edit
CFLAGS to be

CFLAGS = '-O3 -m64'

since Macintosh machines will make only 32-bit executables by default.


Step 2: Build, test, and install the programs, by running the
following GNU commands

    ./configure
    make
    make check   (optional)
    make install

Note 1: Instead of editing the config.site file in step 1, you may type
everything on the command line for the ./configure script in step 2,
like this

    ./configure --prefix=/your/usr/local/path --with-gmapdb=/path/to/gmapdb

If you omit --with-gmapdb, it defaults to ${prefix}/share.  If you
omit --prefix, it defaults to /usr/local.  Note that on the command
line, it is "with-gmapdb" with a hyphen, but in a config.site file,
it is "with_gmapdb" with an underscore.


Note 2: If you want to keep your version of config.site or have
multiple versions, you can save the file to a different filename, and
then refer to it like this

    ./configure CONFIG_SITE=


Note 3: GSNAP is designed for short reads of a limited length, and
uses a configure variable called MAX_READLENGTH (default 300) as a
guide to the maximum read length.  You may set this variable by
providing it to configure like this

    ./configure MAX_READLENGTH=

or by defining it in your config.site file (or in the file provided to
configure as the value of CONFIG_SITE).  Or you may set the value of
MAX_READLENGTH as an environment variable before calling ./configure.
If you do not set MAX_READLENGTH, it will have the default value shown
when you run "./configure --help".

Note that MAX_READLENGTH applies only to GSNAP.  GMAP, on the other
hand, can process queries up to 1 million bp.

Also, starting with version 2014-08-20, if your C compiler can
handle stack-based memory allocation using the alloca() function,
GSNAP ignores MAX_READLENGTH, and can handle reads longer than that
value.


Note 4: GSNAP can read from gzip-compressed FASTA or FASTQ input
files.  This feature requires the zlib library to be present
(available from http://www.zlib.net).  The configure program will
detect the availability of zlib automatically.  However, to disable
this feature, you can add "--disable-zlib" to the ./configure command
or edit your config.site file to have the command "disable_zlib".


Note 5: GSNAP can read from bzip2-compressed FASTA or FASTQ input
files.  This feature requires the bzlib library to be present.  The
configure program will detect the availability of bzlib automatically.
However, to disable this feature, you can add "--disable-bzlib" to the
./configure command or edit your config.site file to have the command
"disable_bzlib".


2.  Possible issues during compilation
======================================

Recent versions of GMAP and GSNAP use certain techniques to achieve
increased speed.  One of these techniques is special SIMD
(single-instruction, multiple data) instructions that can perform some
computations in parallel.  There are various levels of SIMD
instructions, and we use those from the SSE2 and SSE4.1 instruction
sets.  Most computers built within the last 10 years should support
SSE2 at least.  However, you may run into the following issues:


Compiler issue 1.  If you compile the code successfully, but when you
run the program, you see something that says "Illegal instruction",
then you must be running GMAP or GSNAP on a different computer than
the one where you compiled it.  You may be using a computer cluster
with a variety of different computer models.  Each time you start GMAP
or GSNAP, the program tests to see if the computer can run the same
set of SSE2 or SSE4.1 instructions as were available when the programs
were compiled.  (The programs also check for if the popcnt
instructions work, but popcnt is so widely implemented that they
generally do not cause any problems.)

In that case, you may need to compile your program for the lowest
common denominator by by providing --disable-avx, --disable-sse4.1, or
--disable-sse2 to ./configure as necessary.  Alternatively, your
computer cluster may have the ability to detect the capabilities of
each computer when it receives a job.  Then, you may want to create
different compiled versions of GMAP and GSNAP, and call the
appropriate binary for that particular job.  You will have to work
with your system administrator if you want to accomplish this.


Compiler issue 2.  The most recent versions of GSNAP (starting with
2013-10-01) build a suffix array to help with speed.  If your computer
does not have sufficient RAM, it is possible that the gmap_build 
step fails after printing something like this:

    SACA_K called with n = 3095693984, K = 5, level 0

If this happens, you either need to find a computer with more RAM, or
you can build a genome without a suffix array, by providing
--no-sarray to gmap_build.  The GSNAP program will still work with the
resulting genome index, but it won't achieve optimal speed.  GMAP does
not use the suffix array at all.  Also, large genomes of over 4
billion bp also will not create or use a suffix array.



3.  Downloading a pre-built GMAP/GSNAP database
===============================================

A GMAP/GSNAP "database" is a set of genomic index files, representing
the genome in a hash table format.  You can use the gmap_build program
to build your own database (as described below), but you can started
quickly by downloading a pre-built GMAP/GSNAP database from the same
place you obtained the GMAP program (see above for URL).

Place the database in the GMAPDB directory you specified in the
config.site file when you built the gmap program.  You should include
a subdirectory for each GMAP database; for example, if you downloaded
a database called , your directory structure should look like
this

    /path/to/gmapdb//
    /path/to/gmapdb//.chromosome
    /path/to/gmapdb//.chromosome.iit
    ...
    /path/to/gmapdb//.version

Note that the GMAP database format has changed with the 2013-10-01
release to add a suffix array and other new file formats.  However,
GMAP and GSNAP are both backwards and forwards compatible, meaning
that new source code will work with older genome indices and that old
source code will work with newer genome indices.  Mixing up the two,
though, will result in slower running speed.  To achieve optimal
speed, you should use both the latest source code and rebuild your
genome indices with the latest gmap_build program.  You can tell if
your database has the most recent format if it has files of the type
.sarray, .ref12153bitpackptrs, and
.ref12153bitpackcomp.

Note that the hg19 database provided on the Web site lacks the suffix
array, although GMAP and GSNAP will work fine, albeit more slowly.
The suffix array for hg19 requires 15 GB of disk space.  If you want
the suffix array, you will have to build it yourself, as discussed in
the next section.



4.  Setting up to build a GMAP/GSNAP database (one chromosome per FASTA entry)
==============================================================================

You can also build your own genomic database, using the gmap_build
program provided with this package.  (Note: the gmap_setup
method from older versions is no longer provided or supported.)

Previous versions limited the total sequence length in your database
to 2^32 = 4,294,967,296 (about 4 billion) bp.  However, starting with
version 2013-04-01, a total "genome" may now contain up to 2^64 bp.
However, each individual chromosome is still limited to 2^32 bp.  The
utility programs below will automatically recognize when a genome is
larger than 2^32 bp and build the index files accordingly.  Below I
use the "genome" and "chromosome", but the input sequences can be
anything you wish to align to, including transcripts or small
fragments.

You will need to start with a set of FASTA files containing either
entire chromosomes or contigs that represent pieces of chromosomes.
For example, for the human genome, you can retrieve all of the FASTA
files under the ftp directory at

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

If your FASTA entries each contain a single chromosome, and the
accession for each chromosome is the chromosome number/letter, you can
simply run this command

    gmap_build -d  [-k ] 

which will build and install the GMAP index files in your default
GMAPDB location.

You can see the full usage of gmap_build by doing "gmap_build --help",
but here are some useful flags.  If your FASTA files are compressed
using gzip, you can add the flag "-g" to gmap_build.  You can control
the k-mer size for the genomic index with the -k flag, which can range
from 12 to 15.  The default value for -k is 15, but this requires your
machine to have 4 GB of RAM to build the indices.  If you do not have
4 GB of RAM, then you will need to reduce the value of -k or find
another machine.  Here are the RAM requirements for building various
indices:

    k-mer of 12: 64 MB
    k-mer of 13: 256 MB
    k-mer of 14: 1 GB
    k-mer of 15: 4 GB

These are the RAM requirements for building indices, but not to run
the GMAP/GSNAP programs once the indices are built, because the
genomic indices are compressed.  For example, the genomic index for a
k-mer of 15 gives a gammaptrs file of 64 MB and an offsetscomp file of
about 350 MB, much smaller than the 4 GB that would otherwise be
required.  Therefore, you may want to build your genomic index on a
computer with sufficient RAM, and distribute that index to be used by
computers with less RAM.

The amount of compression can be controlled using the -b or --basesize
parameter to gmap_build.  By default, the value for k-mer size is 15,
and the value for basesize is 12.  If you select a different value for
k-mer size, then basesize is made by default to be equal to that k-mer
size.

If you want to build your genomic databases with more than one k-mer
size, you can re-run gmap_build with different values of -k.  This
will overwrite only the identical files from the previous runs.  You
can then choose the k-mer size at run-time by using the -k flag for
either GMAP or GSNAP.

Finally, you can provide information to gmap_build that certain
chromosomes are circular, with the -c or --circular flag.  The value
for these flags is a list of chromosomes, separated by commas.  The
gmap_build program will then allow GSNAP and GMAP to align reads
across the ends of the chromosome.  For example, the mitochondrial
genome in human beings is circular.



5.  Setting up to build a GMAP/GSNAP database (more complex cases)
==================================================================

If you are indexing a genome, where each chromosome is contained in a
separate FASTA entry, or a set of contigs, where the contigs are
independent of each other, then you can just skip to section 6.
Otherwise, if you have more complicated needs, you may need to add
options to gmap_build.

Note that the term

    ... 

above indicates that multiple files can be listed.  The files can be
in any order, and the contigs can be in any order within these files.
By default, the gmap_build process will sort the contigs and
chromosomes into their appropriate "chrom" order.  For the human
genome, this order is 1, 2, ..., 10, 11, ..., 22, X, Y, M, followed by
all other chromosomes in numeric/alphabetical order.  If you don't
want this sort, provide the "-s none" flag to gmap_build.  Other sort
options besides "none" and "chrom" are "alpha" and "numeric-alpha".

You can type "gmap_build --help" to see the full set of options.  We
discuss some specific situations below.


5a.  Contigs are mapped to chromosomes
======================================

If your FASTA entries consist of contigs, each of which has a mapping
to a chromosomal region in the header, you may add the -C (or
--contigs-are-mapped) flag to gmap_build, like this

    gmap_build -d  -C ...

Then gmap_build will try to parse a chromosomal region from each
header.  The program knows how to parse the following patterns:

    chr=1:217281..257582  [may insert spaces around '=', or omit '=' character]
    chr=1                 [may insert spaces around '=', or omit '=' character]
    chromosome 1                                                         [NCBI .mfa format]
    chromosome:NCBI35:22:1:49554710:1                                    [Ensembl format]
    /chromosome=2                                                        [Celera format]
    /chromosome=2 /alignment=(88840247-88864134) /orientation=rev        [Celera format]
    chr1:217281..257582
    chr1                  [may insert spaces after 'chr']

If only the chromosome is specified, without coordinates, the program
will assign its own chromosomal coordinates by concatenating the
contigs within each chromosome.  If gmap_build cannot figure out the
chromosome, it will assign it to chromosome "NA".


5b.  Using an MD file
=====================

Another possibility is that your FASTA entries consist of contigs that
mapped to chromosomes, where the mapping information is in an external
file.  Genomes from NCBI typically include an ".md" file (like
seq_contig.md) that specifies the chromosomal coordinates for each
contig.  To use this information, provide the -M (or --mdfile) flag to
gmap_build like this

    gmap_build -d  -M  ...

The program will then try to parse the mdfile (which often changes
formats) and verify with you which columns contain the contig names
and chromosomal coordinates.


5c.  Compressed FASTA files or files requiring processing
=========================================================

If your genome files are compressed using gzip, you can simply add the
flag "-g" to gmap_build.  But if your genome files require some other
type of processing, you may need to write a small script that pipes
the sequences in FASTA format to gmap_build.  You can tell gmap_build
about your script with the -E (or --fasta-pipe) flag, like this

    gmap_build -d  -E 'gunzip -c chr*.gz'
    gmap_build -d  -E 'cat *.fa | ./add-chromosomal-info.pl'

You can think of the command as a Unix pipe for processing each FASTA
file before it is read by gmap_build.



6.  Running GMAP
================

To see the full set of options, type "gmap --help".  The following are
some common examples of usage.  For more examples, see the document
available at http://www.gene.com/share/gmap/paper/demo-slides.pdf

For each of the examples below, we assume that you have installed a
genome database called  in your GMAPDB directory.  (If your
database is located elsewhere, you can specify the -D flag to gmap or
set the environment variable GMAPDB to point to that directory.)

* Mapping only: To map one or more cDNAs in a FASTA file onto a
  genome, run GMAP as follows:

    gmap -d  

* Mapping and alignment: If you want to map the cDNAs to a genome,
  and show the full alignment, provide the -A flag:

    gmap -d  -A 


* Alignment only: To align one or more cDNAs in a FASTA file onto a
  given genomic segment (also in a FASTA file), use the -g flag
  instead of the -d flag:

    gmap -g  -A 


* Batch mode: If you have a large number of cDNAs to run, and you have
  sufficient RAM to run in batch mode, add the "-B 3", "-B 4", or "-B
  5" option.  Details for these options are provided by running "gmap
  --help".

    gmap -d  -B 5 -A 


* Multithreaded mode: If your machine has several processors, you can
  make batch mode run even faster by specifying multiple threads with
  the -t flag:

    gmap -d  -B 5 -A -t  

  Note that with multiple threads, the output results will appear in
  random order, depending on which thread finishes its computation
  first.  If you wish your output to be in the same order as the 
  input cDNA file, add the '-O' (letter O, not the number 0) flag
  to get ordered output.

  Guidelines: The -t flag specifies the number of computational
  threads.  In addition, if your machine supports threads, GMAP also
  uses one thread for reading the input query sequences, and one
  thread for printing the output results.  Therefore, the total number
  of threads will be 2 plus the number you specify.  The program will
  work optimally if it uses one thread per available processor.  If
  you specify too many threads, you can cause your computer to thrash
  and slow down.  Note that other programs running on your computer
  also need processors.


* Compressed output: If you want to store the alignment results in a
  compressed format, use the -Z flag.  You can uncompress the results
  by using the gmap_uncompress.pl program:

    gmap -d  -Z  > x
    cat x | gmap_uncompress


Note that gmap is written for small genomes (less than 2^32 bp in
total length).  With large genomes, there is an equivalent program
called gmapl, which you should run instead of gmap.  The gmapl program
is equivalent to gmap, and is based on the same source code, but is
compiled to use 64-bit index files instead of 32-bit files.  The gmap
and gmapl programs will detect whether the genomes are the correct
size, and will exit if you try to run them on the wrong-sized genomes.



7.  Building map files
======================

This package includes an implementation of interval index trees
(IITs), which permits efficient lookup of interval information.  The
gmap program also allows you (with its -m flag) to look up pre-mapped
annotation information that overlaps your query cDNA sequence.  These
interval index trees (or map files) are built using the iit_store
program included in this package.  To build a map file, do the
following:

Step 1: Put your map information for a given genome into a map file
with the following FASTA-like format:
   
    >label coords optional_tag
    optional_annotation (which may be zero, one, or multiple lines)

For example, the label may be an EST accession, with the coords
representing its genomic position.  Labels may be duplicated if
necessary.

The coords should be of the form

    chr:position
    chr:startposition..endposition

The term "chr:position" is equivalent to "chr:position..position".  If
you want to indicate that the interval is on the minus strand or
reverse direction, then  may be less than
.

Tags are very general and can be used for a variety of purposes.  For
example, you could 


Step 2: Run iit_store on this map file as follows

    cat  | iit_store -o 

The program will create a file called .iit.

Now you can retrieve this information with iit_get

    iit_get .iit 

where  has the format "chr:position" or
"chr:startposition..endposition".  The iit_get program has other
capabilities, including the ability to retrieve information by label,
like this:

    iit_get .iit 

本源码包内暂不包含可直接显示的源代码文件,请下载源码包。