资源说明:A series of scripts to process Illumina Bisulfite-Seq Data
Bisulfite-seq-pipeline verson 0.1 Released under the GPLv3 Written by Aaron Statham (a.statham@garvan.org.au) A PhD student in the Cancer Epigenetics lab at the Garvan Institute of Medical Research, Sydney, Australia Requires: * bowtie (http://bowtie-bio.sourceforge.net/index.shtml - tested with version 0.11.3) * sqlite3 (tested with version 3.6.22) * samtools (from http://samtools.sourceforge.net/ - tested with version 0.1.7-6 r530) * R (http://www.r-project.org/ - tested with version 2.11.0) * BSgenome package for R (http://bioconductor.org/ - tested with version 1.16.1) NOTE: At the moment, only the Paired End Bisulfite sequencing pipeline is in working order Workflow -------- 1. Creating indexes (once off for each reference organism) * Edit the paths in data/genome.config to point to your paths of bowtie-build and the Bisulfite-seq-pipeline installation path * Create a directory to house your genome files * Populate this directory with all the fasta files of your reference genome you wish to align against - Files must use the .fa extension - Single sequence per file - The fasta header line must match the filename eg "chr1.fa" must have the header ">chr1" * In this directory run the make_bis_ref.sh script with your genome.conf as an argument eg cd ~/data/genome/hg18 cp ~/workspace/Bisulfite-seq-pipeline/data/genome.config . ~/workspace/Bisulfite-seq-pipeline/make_bis_ref.sh genome.conf -> This script in silico bisulfite converts each fasta file for both the plus and minus strands of the genome, and places the output into the plusU/ and minusU/ directories respectively. The bowtie-build program is then called to create separate bowtie indexes for each strand (This may take some hours and hard drive space - an index of the human hg18 reference uses ). Additionally some extra files are created for downstream applications: reflist - contains the name of each reference chromosome CpGsites.csv - contains the position of every CpG site in the reference genome in "chromosome,position" format samdir - a generic header section for SAM format files which contains the name and size of each chromosome (used during alignment) 2. Creating a config file for each lane of alignment * Copy the template in data/Bis-PE.template.config to a directory that will house the alignment results * Edit the options to suit your experiment PIPELINE_PATH - directory where this pipeline is installed GENOME_PATH - directory where your reference genome is housed (where make_bis_ref.sh was executed - must be an absolute path, tilde expansion will not work BOWTIE_PARAMS - paramaters passed to bowtie during alignment (see the bowtie manual page for explanations) MAX_MM - maximum number of mismatches from the genome allowable (in PE mode, the sum of mismatches in both reads is used) MIN_MM_DIFF - how many mismatches better than then the next best reported matches to be considered a unique match READ_LENGTH - the read length used (the same for both read 1 and read 2) FASTQ1 - relative path to the gzipped read 1 fastq file FASTQ2 - relative path to the gzipped read 2 fastq file PROJECT - name of the project, all output files will use this as their prefix 3. Alignment * Run Bis-seq-pipeline-PE.sh with your config file created above as its argument -> This script goes through multiple steps and will take hours to run (A 75bp PE experiment took ~6 hours to map to the human genome hg18 reference using 7 cores during the bowtie step). A sample output (on only 250,000 reads) would look like: ~/workspace/Bisulfite-seq-pipeline/scripts/Bis-seq-pipeline-PE.sh test.config Thu Jun 10 20:27:04 EST 2010 - Reading config file: test.config Thu Jun 10 20:27:04 EST 2010 - Initialising the database Thu Jun 10 20:27:04 EST 2010 - Importing reads into the database Thu Jun 10 20:27:13 EST 2010 - Bisulfite converting reads Thu Jun 10 20:27:14 EST 2010 - Bowtie mapping against forward strand Thu Jun 10 20:28:37 EST 2010 - Bowtie mapping against reverse strand Thu Jun 10 20:30:04 EST 2010 - Getting the reference sequence of reads mapping positions Thu Jun 10 20:52:43 EST 2010 - Adjusting number of mismatches for C->T errors in mapping Thu Jun 10 20:53:03 EST 2010 - Combining forward and reverse mappings and filtering out duplicate mappings Thu Jun 10 20:53:08 EST 2010 - Exporting database to BAM files [samopen] SAM header is present: 25 sequences. [samopen] SAM header is present: 25 sequences. Thu Jun 10 20:53:27 EST 2010 - Creating coverage bed and GenomeData files Thu Jun 10 20:53:36 EST 2010 - Determining context of C residues Thu Jun 10 21:16:06 EST 2010 - Determining conversion % Thu Jun 10 21:16:16 EST 2010 - Creating mapping log 4. Outputs * Alignments to the plus and minus strands are exported in separate BAM format files - sorted and indexed for downstream analysis or viewing using samtools tview. * Similarly, alignments are exported in separate bed files, and converted into BSgenome GenomeData objects for analysis in R. * A summary of alignment success is placed in the mapping.log file, for example: # reads processed: 250000 # reads with at least one reported alignment: 149378 # reads that failed to align: 100622 # reads with alignments suppressed due to -m: 18154 Reported 131224 alignments to 1 output stream(s) * Also, an analysis of the occurence and context of C residues is created in the .context file, for example: 135 AA 65 AC 23987 AG 69 AT 30149 CA 19427 CC 320437 CG 24449 CT 135 GA 108 GC 18723 GG 74 GT 146 TA 141 TC 30154 TG 124 TT No of C residues in the read sequences: 468323 No of C residues in the reference sequences: 4608038 * In this sample, 468323 C residues occurred in the aligned read sequences, whereas 4608038 C residues appear in the corresponding reference locations, giving a bisulfite conversion efficiency of 89.83%. However, 320437 of these C residues occur in the CpG context where DNA methylation mostly occurs in mammals, so the non-CpG conversion efficienty is actually 96.79%. 5. Downstream steps * If multiple lanes of the same sample have been run, the BAM files can be pooled by using samtools merge * A CSV of the number of A,C,G,T and N calls on each strand from the plus and minus strand BAM files can be generated by running CpGcalls.sh with the same config file as used during alignment. TODO ---- * Go back and make single ended sequencing work * Lots more documentation * A few more optimisations can be made to avoid so much disk i/o * You tell me!
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。