gblite.pl
INTRODUCTION

	gblite.pl is a program for retrieving batches of sequences by
	taxonomic clade and molecule type. It is meant to be simple, fast,
	and portable, but not fully-featured. It uses AB-BLAST (formerly
	WU-BLAST) programs for handling sequence data and SQLite for
	everything else.

LICENSE

	Public Domain

INSTALLATION INSTRUCTIONS

	It is assumed you have a standard Unix/Linux installation with Perl
	(built with threading).

	1. Install SQLite (free from http://www.sqlite.org and various package
	management solutions). sqlite3 must be in the executable PATH.

	2. Install AB-BLAST (free for academic use from http://blast.advbiocomp.com/).
	Both xdformat and xdget must be in the executable PATH.

	3. Download GenBank files (ftp://ftp.ncbi.nih.gov/genbank/) and put
	them in a single directory. gblite.pl looks for gb*seq.gz files in
	the directory, so keep them compressed.

	4. Download the taxonomy tar-ball (ftp://ftp.ncbi.nih.gov/pub/taxonomy/).
	The names.dmp and nodes.dmp files are required.

	5. Set the GBLITE environment variable to point to a directory where
	you want to store the database. Make sure this has enough space. The
	SQLite and AB-BLAST databases will be stored in this location.

	6. To build the database, enter a command similar to the one below,
	where GENBANK_DIR is the directory of GenBank files, TAXONOMY_DIR is
	the uncompressed taxonomy tar-ball, and BUILD_DIR is some place to
	keep the temporary build files. During the build procedure, there
	will be a variety of diagnostic messages, which may be useful if
	something goes wrong (which is why I have tee-ed the output to a log
	file). If you have a computer with several CPUs, you can speed up
	the build with the -p option (shown below for 8 CPUs).
	
		gblite.pl -p 8 build GENBANK_DIR TAXONOMY_DIR BUILD_DIR | tee logfile

	7. If the build does not complete for some reason, you can resume
	from where you left off without having to recompute everything. Just
	issue the same command again (you may have to remove some of the
	most recent build files if the build died in the middle of a gzip).
	Ultimately, you are looking for the following message at the end.
	
		*** gblite build complete ****
	
	8. After the build completes, you can remove the $GBLITE/build
	directory as well as the GenBank and taxonomy directories, though it
	may be best to keep them around in case you need to rebuild.
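
	The checks implied by the steps above can be sketched as a few shell
	functions. This is only a sketch, not part of gblite.pl: the function
	names are hypothetical, and it tests only what the instructions state
	(tools in the PATH, gb*seq.gz files, names.dmp and nodes.dmp, and the
	completion message in the log).

```shell
#!/bin/sh
# Pre-flight and post-build checks for a gblite build.
# All function names here are hypothetical helpers, not part of gblite.pl.

# Print each named command that is missing from PATH; return 1 if any is missing.
missing_tools() {
	status=0
	for tool in "$@"; do
		if ! command -v "$tool" >/dev/null 2>&1; then
			echo "missing from PATH: $tool"
			status=1
		fi
	done
	return $status
}

# Succeed if the given directory holds at least one gb*seq.gz file.
has_genbank_files() {
	for f in "$1"/gb*seq.gz; do
		[ -e "$f" ] && return 0
	done
	return 1
}

# Succeed if the given directory holds both names.dmp and nodes.dmp.
has_taxonomy_files() {
	[ -f "$1/names.dmp" ] && [ -f "$1/nodes.dmp" ]
}

# Succeed if the build log contains the completion message.
build_complete() {
	grep -q 'gblite build complete' "$1"
}

# Example use (directory names are placeholders):
#   missing_tools perl sqlite3 xdformat xdget
#   [ -d "$GBLITE" ] || echo "GBLITE is not set to a directory"
#   has_genbank_files GENBANK_DIR || echo "no gb*seq.gz files found"
#   has_taxonomy_files TAXONOMY_DIR || echo "names.dmp or nodes.dmp missing"
#   build_complete logfile && echo "build finished"
```

	Running the first four checks before step 6 catches the most common
	setup mistakes before a multi-hour build starts.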
	
RETRIEVING SEQUENCES

	You must specify whether the batch of sequences is nucleotide or
	protein. By default, nucleotide is assumed. Use the -a option for
	protein. The -m option is used to specify different kinds of
	nucleotide sequences.
	
	It is a good idea to perform a count before retrieving sequences in
	case you enter the criteria incorrectly (e.g. so you don't end up
	with millions of sequences when you expected hundreds).
	
	In the following example, I am counting all primate nucleotide
	sequences (-c option). The Primates have taxonomy identifier 9443.
	You can also use the taxonomic name rather than the numeric
	identifier, so the following commands are equivalent.
	
		gblite.pl -c -t 9443 search
		gblite.pl -ct Primates search
	
	Here, I am limiting the search to the mRNA moltype using the -m
	option.
	
		gblite.pl -c -t 9443 -m mRNA search
	
	For a complete list of moltypes, examine the SQLite database.
	
		sqlite3 $GBLITE/gblite.db
		select distinct moltype from GenBank;
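
	The sqlite3 command-line shell also accepts an SQL statement as an
	argument after the database file, so the same list can be produced
	without an interactive session. A minimal sketch: list_moltypes is a
	hypothetical helper name, and the "order by" is added only to make
	the output easier to read.

```shell
# List the distinct moltypes in one non-interactive sqlite3 call.
# list_moltypes is a hypothetical helper; the table and column names
# are taken from the query shown above.
list_moltypes() {
	# Use the database path given as $1, or fall back to $GBLITE/gblite.db.
	db="${1:-$GBLITE/gblite.db}"
	sqlite3 "$db" 'select distinct moltype from GenBank order by moltype;'
}

# Example use:
#   list_moltypes
#   list_moltypes /other/path/gblite.db
```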

	To retrieve sequences, remove the -c option.
	
		gblite.pl -t 9443 -m mRNA search > primate-mRNAs
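
	The count-before-retrieve advice can be wrapped in a small guard. This
	is a sketch under one assumption that the text above does not confirm:
	that the -c option prints the count as a bare integer on standard
	output. The name safe_fetch and the limit argument are hypothetical.

```shell
# Retrieve sequences only if the count is at or below a limit.
# Assumption: `gblite.pl -c ... search` prints a bare integer count.
# safe_fetch is a hypothetical helper, not part of gblite.pl.
safe_fetch() {
	limit="$1"; shift
	# Run the count first, passing the remaining options through unchanged.
	count=$(gblite.pl -c "$@" search) || return 1
	# Refuse to continue if the output is not a plain non-negative integer.
	case "$count" in
		''|*[!0-9]*)
			echo "unexpected count output: $count" >&2
			return 1 ;;
	esac
	if [ "$count" -gt "$limit" ]; then
		echo "refusing: $count sequences exceeds limit of $limit" >&2
		return 1
	fi
	# The count is acceptable, so repeat the search without -c.
	gblite.pl "$@" search
}

# Example use:
#   safe_fetch 100000 -t 9443 -m mRNA > primate-mRNAs
```

	If the count format differs from the assumption, the guard stops with
	an error rather than retrieving anything.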
	
NOTES

	Some sequences (like patents) may not have a taxonomy id. These are
	given a fake taxid of -1.

	The taxonomy database is not always synchronized with the rest of
	GenBank. For this reason, some sequences have valid taxids (derived
	from the GenBank entry) but no species name or parent taxid (from
	the Taxonomy database).
	
	The version number for sequences is not used. Why? There is only one
	version for each release, so any given accession number is unique to
	a particular release.
	
	What about daily updates? gblite.pl does not do incremental updates.
	One reason is that it's much easier to reference (and reproduce) a
	particular release than a release with a certain number of updates.
	
	What about other sequence formats? Well, I'd like to report GFF3
	some day, but that day is not today.


RELEASE NOTES

	Version 2008-02-08:
		GenBank 161 clean build. 41 GB. ~3 hours on 8 x 3.0 GHz Mac Pro.
		GenBank 163...
