资源说明:
# CCLS Sequencing This is simply a collection of notes and scripts. And now, other developer's source code. [RINS Package](http://khavarilab.stanford.edu/resources.html) Eventually, I'd prefer to understand exactly what RINS and the collection of apps that it uses does and then replicate in ruby with testing. RINS includes some examples, but none have worked for me, so far. Parameters have changed. Databases are corrupt. Executables are seg faulting. Processing "chopped" files finds nothing. Parafly java crashes... ( removed --compatible_path_extension) blastn output changed? modified blastn_cleanup.pl and write_results.pl I have begun including source code and making modifications to allow for the complete install of Blat, Blast, Bowtie, Bowtie2, Trinity and RINS from this repository. It should be clearly noted that the aforementioned software packages ARE NOT MINE and belong to those that wrote them. I created the root Makefile to control the making and installing of these packages on MY MACHINE (Mac Pro 10.6.8) and so far this works. Used GCC 421 ( No longer including Trinity. Installation of it is straight forward. ) ( No longer including Blast. Installation of it is straight forward. ) ( No longer including Blat. Now part of Kent Utils and installation of it is straight forward. ) ( No longer including Bowtie. Latest version is a MacPort. ) [RINS Package](http://khavarilab.stanford.edu/resources.html) Not including the virus index. [RINS Data (not included)](https://s3.amazonaws.com/changseq/kqu/rins/rins.tar.gz) [BLAST 2.2.27](http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download) ( 'make'ing this generates about 4000 additional files. Odd. ) [Blat 34](http://users.soe.ucsc.edu/~kent/src/) THIS IS NOT THE LATEST VERSION. [Blat 35](http://users.soe.ucsc.edu/~kent/src/) [Bowtie 0.12.8 and Bowtie 2.0.0 beta 7](http://bowtie-bio.sourceforge.net/) [Trinity 2012-06-08](http://trinityrnaseq.sourceforge.net) Not including the ~50MB of sample data [Trinity 2012-10-05](http://trinityrnaseq.sourceforge.net) Not including the ~50MB of sample data [BWA 0.6.2](http://sourceforge.net/projects/bio-bwa/files/) [PRICE 20120527](http://derisilab.ucsf.edu/software/price/) [Velvet](http://www.ebi.ac.uk/~zerbino/velvet/) [VICUNA](http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/vicuna) [MIRA](http://www.chevreux.org/projects_mira.html) [Cufflinks](http://cufflinks.cbcb.umd.edu) [Tophat](http://tophat.cbcb.umd.edu/) [BOOST](http://www.boost.org/) [SAMtools](http://samtools.sourceforge.net/) [BioRuby](http://bioruby.org/) ## Indexes Used Manually downloaded, but then for some reason couldn't actually use? ftp://ftp.ncbi.nlm.nih.gov/blast/db/ ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz So used update\_blastdb.pl to get them update_blastdb.pl --decompress nt update_blastdb.pl --decompress nr This will download them to whereever you are so make sure you have ~50GB of space! http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit twoBitToFa hg19.2bit hg19.fa # generate ebwt files bowtie-build hg19.fa hg19 The big rins.tar.gz includes hg18 and virus ## General modifications Changed all perl scripts she-bang to use environment selected perl. Blat doesn't have any perl, so this was just Bowtie, Bowtie2, Blast and Trinity. From ... #!/usr/bin/perl ... to ... #!/usr/bin/env perl ## Included BLAT 34 modifications required for me to compile blatSrc34/jkOwnLib/gfPcrLib.c errAbort(buf); ... changed to ... errAbort("%s", buf); blatSrc34/lib/errCatch.c warn(errCatch->message->string); ... changed to ... warn("%s",errCatch->message->string); blatSrc34/lib/htmlPage.c warn(errCatch->message->string); ... changed to ... warn("%s", errCatch->message->string); ## Trinity 2012-06-08 modifications ( scripts to be copied in via Makefile ) SAM\_filter\_out\_unmapped\_reads.pl was modified due to some test data triggering a divide by zero error. Add an empty target to a Makefile.in. I doesn't seem needed on my Mac Pro, just my MacBook Pro? Different version of make? XCode? > trinityrnaseq_r2012-06-08/trinity-plugins/jellyfish-1.1.5/Makefile.in doc/jellyfish.man: # this is just an empty placeholder In addition, I think that the latest Trinity and its changes are responsible for the contig naming changes. Its output file, Trinity.fasta, is just the short name comp5_c0_seq2 versus the long name style in the examples comp0_c0_seq1_FPKM_all:1241158.424_FPKM_rel:626919.264_len:482_path:[1032,746,1102] This difference has led to the modifications of the parsing in 2 of the scripts and I think that that is causing some problems. This also effected the write\_results.pl script in another place. The sequences weren't included due to this change in name. The hash then didn't contain any matching keys. Also, many of these scripts seem to expect a "standard" sequence name style of having a trailing /1 or /2 for the respective left and right lanes. Our data is not like this. ### TODO Is this really a standard? Change our sequence names or modify all of the scripts to work with both style? ## RINS modifications ( scripts to be copied in via Makefile ) Modified rins.pl for several reasons. One major change is to the parameters passed to Trinity. It appears that this was written for trinityrnaseq-r20110519. Removed ... --run_butterfly : no longer needed (or allowed) as is default --compatible_path_extension : triggers java exception as is invalid Exception in thread "main" java.lang.NullPointerException at gnu.getopt.Getopt.checkLongOption(Getopt.java:869) at gnu.getopt.Getopt.getopt(Getopt.java:1119) at TransAssembly_allProbPaths.main(TransAssembly_allProbPaths.java:193) Not valid since trinityrnaseq-r20110519 ( Butterfly/src/src/TransAssembly_allProbPaths.java ) Changed ... --paired_fragment_length changed to --group_pairs_distance Although I don't know that that was the right thing to do. Added ... --JM 1G : or some size 10G caused a lot of disk activity as I don't have 10G of memory. Downsized to 1G and the test_sample run was done in 20 minutes vs 12 hours! Some file checking as been added. Many defaults are also now set and are unnecessary at the command line or in the config file. Modified blastn\_cleanup.pl and write\_result.pl as the parsing no longer works. It appears that the field content has changed? ## INSTALLATION Requirements may include libpng, coreutils, texlive sudo port install autoconf automake libpng coreutils texlive make make install This will create ~/RINS\_BASE with all of the scripts and binaries in it. You will need to modify your PATH to include this. Feel free to rename it. setenv PATH ${HOME}/RINS\_BASE/bin:${PATH} ### TODO Make FASTA copies of our FASTQ data and use it to avoid all the copying and converting. ### TODO blat can run for quite a long time without ever producing any output. Make it more verbose. ( using -dots=10000 is helpfulish ) ### TODO Add testing. ### TODO Merged PathSeq fasta files for all\_bacterial and virus\_all Blat can't use it as it is too big (>4GB). Using it to create a blastdb makeblastdb -dbtype nucl -in all_bacterial_and_viral.fa -out all_bacterial_and_viral -title all_bacterial_and_viral -parse_seqids Having trouble. Generated db isn't searchable. Think I need the parse seqids option... Yes. -parse\_seqids is really needed when creating from a fasta file. Otherwise, new ones are made up which serves no purpose. ### NOTE Always use -parse\_seqids option when making a blastdb from a fasta file. Otherwise the sequence names in the fasta file are not used and new ones are made up. makeblastdb -dbtype nucl -in fallon_unknown_non_human.fasta -out fallon_unknown_non_human -title fallon_unknown_non_human -parse_seqids ### NOTE How to dump a blast database into a fasta file... blastdbcmd -db SOMEBLASTDB -entry all > MYNEWFASTAFILE blastdbcmd -db nt -entry all -outfmt '%a,%g,%T' -out nt.accession_gi_taxid.csv The sequence titles contain nearly every character from commas to colons, so I'm using a special character to delineate the fields. "option 8" on a Mac produces a simple dot which does not appear to be used in any of the 50 million or so sequence titles, although for what I'm doing with the file, it is unnecessary. I may use this to create a list of gi numbers to not include in blastn output by using the -negative\_gilist. (using the dot seemed to anger the terminal? perhaps don't use.) blastdbcmd -db nt -entry all -outfmt '%a•%g•%t' -out nt.accession_gi_title.csv ### TODO Bowtie2 won't run on my MacPro Executing ... bowtie2 -p 6 -f /Volumes/cube/working/indexes/hg19 leftlane_not_hg18.fa -S leftlane_not_hg18_hg19.sam --un leftlane_not_hg18_hg19.fa bowtie2-align(20745) malloc: *** mmap(size=965771264) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug Out of memory allocating the ebwt[] array for the Bowtie index. Please try again on a computer with more memory. Error: Encountered internal Bowtie 2 exception (#1) Command: /Users/jakewendt/RINS_BASE/bin/bowtie2-align --wrapper basic-0 -p 6 -f --passthrough /Volumes/cube/working/indexes/hg19 leftlane_not_hg18.fa I downloaded the mac binaries and they run fine. ## SUPER NOTE Some scripts have the expectation of a trailing lane identifier (either /1 or /2) Our data is not like this. It will have something like "\_1:Y:0:CGTACG" To replace our special lane identifier with the "standard" try something like ... cat leftlane.fa | sed 's/_([12]).*$/\/\1/' > relaned.fa The underscore is added by the fastq2fasta.pl script so if using fastq try ... cat leftlane.fq | sed 's/ ([12]).*$/\/\1/' > relaned.fq In addition, some scripts, samtools in particular, REQUIRE that the sequence names DO NOT begin with an @ symbol. cat raw.fa | sed 's/^>@/>/' > renamed.fa These @ symbols are being left behind by RINS' fastq2fasta.pl script The problem with both of these problems is that they happen quietly in the night. Trinity's Chrysalis complains about the lane names too. util/scaffold_iworm_contigs.pl warning, ignoring read: HWI-ST977:132:C09W8ACXX:7:1101:10003:190695_2:N:0:GTGGCC since cannot decipher if /1 or /2 of a pair. Use of uninitialized value $core_acc in string ne at scaffold_iworm_contigs.pl line 64, <$fh> line 1. Use of uninitialized value $frag_end in hash element at scaffold_iworm_contigs.pl line 71, <$fh> line 1. ### TODO Can't make trinity r2012-10-05 on my MacPro or Steve's It will generate many, seemingly random, ... /bin/sh: fork: Resource temporarily unavailable Solution. Well, not quite. These failures appear to occur during the compilation of the coreutils. The script actually deals with the failed compilation so this newer script is not required. However, Chrysalis has a path hard coded in its code so it is expecting a sort command in trinity-plugins/coreutils/bin. The sole purpose is to allow or deal with a sort command which can do so --parallel. sudo port install coreutils will install the same, newer actually, however it will be gsort. The contents of /opt/local/libexec/gnubin/ will contain links to these "g" utils. We could install this port, and manually copy gsort to our path and then this script will notice it and not run the part that will fail. ### TODO Change all of the perl shebang lines to "/usr/bin/env perl" ### TODO Try bwa? ### TODO Try some other assemblers .... Price Velvet http://www.ebi.ac.uk/~zerbino/velvet/ https://github.com/dzerbino/velvet Ray/RayPlatform I compiled and ran this. Seemed to work, but was taking forever so it was killed. Perhaps give it another go on a big machine. http://denovoassembler.sourceforge.net/ https://github.com/sebhtml/ray.git https://github.com/sebhtml/RayPlatform ( have "ray" and "RayPlatform" parallel. "ray" has a link to this expected location of "RayPlatform". ) If it works, add them to our repo as a submodule ... git submodule add https://github.com/sebhtml/ray.git git submodule add https://github.com/sebhtml/RayPlatform.git submodules suck. They get "dirty" and would need cleaned before commit just going to download the code locally. mira Cufflinks Downloaded and untarred source code ./configure --prefix=/Users/jakewendt/RINS_BASE failed as needed bamlib sudo port install samtools configure ran but don't think that it is the latest bamlib make failed as no Eigen Downloaded Eigen. Unbzipped. Untarred. Copied Eigen to include cp -r Eigen ~/RINS_BASE/include/ had to reconfigure with ./configure --prefix=/Users/jakewendt/RINS_BASE --with-eigen=/Users/jakewendt/RINS_BASE working .... make make install rehash cufflinks ./test_data.sam ### Notable Links for Reference I was researching filtering and repeat masking ... http://www.animalgenome.org/bioinfo/resources/manuals/blast2.2.24/ http://www.girinst.org/server/RepBase/index.php http://www.repeatmasker.org/ http://www.clarkfrancis.com/blast/Blast_Nmer_Masking.html http://www.compbio.ox.ac.uk/analysis_tools/BLAST/BLAST_filtering.shtml Even looked into creating a megablast index http://bioinformatics.oxfordjournals.org/content/early/2008/06/23/bioinformatics.btn322.full.pdf ---------- Copyright (c) 2012 [Jake Wendt], released under the MIT license
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。