Bisulfite-seq-pipeline
文件大小: unknow
源码售价: 5 个金币 积分规则     积分充值
资源说明:A series of scripts to process Illumina Bisulfite-Seq Data
Bisulfite-seq-pipeline verson 0.1
Released under the GPLv3
Written by Aaron Statham (a.statham@garvan.org.au)
A PhD student in the Cancer Epigenetics lab at the Garvan Institute of Medical Research, Sydney, Australia

Requires:
* bowtie (http://bowtie-bio.sourceforge.net/index.shtml - tested with version 0.11.3)
* sqlite3 (tested with version 3.6.22)
* samtools (from http://samtools.sourceforge.net/ - tested with version 0.1.7-6 r530)
* R (http://www.r-project.org/ - tested with version 2.11.0)
* BSgenome package for R (http://bioconductor.org/ - tested with version 1.16.1) 

NOTE: At the moment, only the Paired End Bisulfite sequencing pipeline is in working order


Workflow
--------

1. Creating indexes (once off for each reference organism)

  * Edit the paths in data/genome.config to point to your paths of bowtie-build and the Bisulfite-seq-pipeline installation path
  * Create a directory to house your genome files
  * Populate this directory with all the fasta files of your reference genome you wish to align against
    - Files must use the .fa extension
    - Single sequence per file
    - The fasta header line must match the filename eg "chr1.fa" must have the header ">chr1"
  * In this directory run the make_bis_ref.sh script with your genome.conf as an argument eg
  
    cd ~/data/genome/hg18
    cp ~/workspace/Bisulfite-seq-pipeline/data/genome.config .
    ~/workspace/Bisulfite-seq-pipeline/make_bis_ref.sh genome.conf
  
  -> This script in silico bisulfite converts each fasta file for both the plus and minus strands of the genome, and places the output into the plusU/ and minusU/ directories respectively. The bowtie-build program is then called to create separate bowtie indexes for each strand (This may take some hours and hard drive space - an index of the human hg18 reference uses ). Additionally some extra files are created for downstream applications:
     
     reflist      - contains the name of each reference chromosome 
     CpGsites.csv - contains the position of every CpG site in the reference genome in "chromosome,position" format
     samdir       - a generic header section for SAM format files which contains the name and size of each chromosome (used during alignment)


2. Creating a config file for each lane of alignment

  * Copy the template in data/Bis-PE.template.config to a directory that will house the alignment results
  * Edit the options to suit your experiment

    PIPELINE_PATH - directory where this pipeline is installed
    GENOME_PATH   - directory where your reference genome is housed (where make_bis_ref.sh was executed - must be an absolute path, tilde expansion will not work
    BOWTIE_PARAMS - paramaters passed to bowtie during alignment (see the bowtie manual page for explanations)
    MAX_MM        - maximum number of mismatches from the genome allowable (in PE mode, the sum of mismatches in both reads is used)
    MIN_MM_DIFF   - how many mismatches better than then the next best reported matches to be considered a unique match
    READ_LENGTH   - the read length used (the same for both read 1 and read 2)
    FASTQ1        - relative path to the gzipped read 1 fastq file
    FASTQ2        - relative path to the gzipped read 2 fastq file
    PROJECT       - name of the project, all output files will use this as their prefix


3. Alignment

  * Run Bis-seq-pipeline-PE.sh with your config file created above as its argument

  -> This script goes through multiple steps and will take hours to run (A 75bp PE experiment took ~6 hours to map to the human genome hg18 reference using 7 cores during the bowtie step). A sample output (on only 250,000 reads) would look like:

     ~/workspace/Bisulfite-seq-pipeline/scripts/Bis-seq-pipeline-PE.sh test.config     
     Thu Jun 10 20:27:04 EST 2010 - Reading config file: test.config
     Thu Jun 10 20:27:04 EST 2010 - Initialising the database
     Thu Jun 10 20:27:04 EST 2010 - Importing reads into the database
     Thu Jun 10 20:27:13 EST 2010 - Bisulfite converting reads
     Thu Jun 10 20:27:14 EST 2010 - Bowtie mapping against forward strand
     Thu Jun 10 20:28:37 EST 2010 - Bowtie mapping against reverse strand
     Thu Jun 10 20:30:04 EST 2010 - Getting the reference sequence of reads mapping positions
     Thu Jun 10 20:52:43 EST 2010 - Adjusting number of mismatches for C->T errors in mapping
     Thu Jun 10 20:53:03 EST 2010 - Combining  forward and reverse mappings and filtering out duplicate mappings
     Thu Jun 10 20:53:08 EST 2010 - Exporting database to BAM files
     [samopen] SAM header is present: 25 sequences.
     [samopen] SAM header is present: 25 sequences.
     Thu Jun 10 20:53:27 EST 2010 - Creating coverage bed and GenomeData files
     Thu Jun 10 20:53:36 EST 2010 - Determining context of C residues
     Thu Jun 10 21:16:06 EST 2010 - Determining conversion %
     Thu Jun 10 21:16:16 EST 2010 - Creating mapping log
  

4. Outputs

  * Alignments to the plus and minus strands are exported in separate BAM format files - sorted and indexed for downstream analysis or viewing using samtools tview.
  * Similarly, alignments are exported in separate bed files, and converted into BSgenome GenomeData objects for analysis in R.
  * A summary of alignment success is placed in the mapping.log file, for example:
  
    # reads processed: 250000
    # reads with at least one reported alignment: 149378
    # reads that failed to align: 100622
    # reads with alignments suppressed due to -m: 18154
    Reported 131224 alignments to 1 output stream(s)
  
  * Also, an analysis of the occurence and context of C residues is created in the .context file, for example:
  
        135 AA
         65 AC
      23987 AG
         69 AT
      30149 CA
      19427 CC
     320437 CG
      24449 CT
        135 GA
        108 GC
      18723 GG
         74 GT
        146 TA
        141 TC
      30154 TG
        124 TT
    No of C residues in the read sequences: 468323
    No of C residues in the reference sequences: 4608038
    
  * In this sample, 468323 C residues occurred in the aligned read sequences, whereas 4608038 C residues appear in the corresponding reference locations, giving a bisulfite conversion efficiency of 89.83%. However, 320437 of these C residues occur in the CpG context where DNA methylation mostly occurs in mammals, so the non-CpG conversion efficienty is actually 96.79%.


5. Downstream steps

  * If multiple lanes of the same sample have been run, the BAM files can be pooled by using samtools merge
  * A CSV of the number of A,C,G,T and N calls on each strand from the plus and minus strand BAM files can be generated by running CpGcalls.sh with the same config file as used during alignment.

  
TODO
----

  * Go back and make single ended sequencing work
  * Lots more documentation
  * A few more optimisations can be made to avoid so much disk i/o
  * You tell me!

 

本源码包内暂不包含可直接显示的源代码文件,请下载源码包。