Resource description: A new implementation of the PANGEA pipeline for metagenomics with multiple classification methods and consensus analysis
#PANGEA+

A new implementation of the PANGEA pipeline for faster and more accurate metagenomics, with multiple classification methods and consensus analysis.

#Download

(LINUX):

    wget https://github.com/Bioinfo-Tools/PANGEA-plus/tarball/master -O BioinfoTools_PANGEA-plus.tar.gz

(MAC):

    curl -L https://github.com/Bioinfo-Tools/PANGEA-plus/tarball/master -o BioinfoTools_PANGEA-plus.tar.gz

#Extract the files:

    tar -xvf BioinfoTools_PANGEA-plus.tar.gz

Your working directory should be set to the PANGEA-plus directory.

    cd BioinfoTools_PANGEA-plus
    export PANGEAWD=$PWD

#Install parallel BLAST (for High Performance Computing clusters)

    cd $PANGEAWD/Classify/Runblast
    sh install_MPI_blast.sh

#Trimming your input sequences

    cd $PANGEAWD/Trim
    perl trim2.3.pl -a ../input_A.txt -b ../input_B.txt -g 100

where:

    perl trim2.3.pl ...
    -a   raw Illumina input file, read 1
    -b   raw Illumina input file, read 2 (if any)
    -g   size of the gap between paired ends (if any)
    -t   truncate size (if any)
    -q   quality file (in case of FASTA input)
    -qc  quality cutoff value
    -lc  minimum length

Supported formats: FASTA, FASTQ and QSEQ.
Results will be saved in the $PANGEAWD/output/trim2 folder.

#Download / Install Blast

    cd $PANGEAWD/Classify/Runblast
    sh install_blast.sh

#Download NCBI database for classification

    cd $PANGEAWD
    wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
    gunzip nt.gz

#Format the database

    $PANGEAWD/Classify/Runblast/makeblastdb -in $PANGEAWD/nt -out $PANGEAWD/nt -dbtype nucl

#Classify your sequences using parallel BLAST search

    cd $PANGEAWD/Classify/Runblast

Example of parallel BLAST (MPI-blastn) executed in a PBS/Torque/Maui HPC cluster:

Use the example submission script available in the $PANGEAWD/Scripts directory.
*EDIT THE FILE submit_MPI-blast.job FIRST!

    qsub $PANGEAWD/Scripts/submit_MPI-blast.job

where: input.fasta refers to your sequences after trimming.

To run parallel BLAST on multiple input files at the same time:
*EDIT THE FILE submit_multiple_MPI-blast.job FIRST! Follow the instructions in the file.
*Replace ./dir/ with your input directory and change the values of these parameters before running: "database="; "total_processes="; "nodes="

    for i in ./dir/*.fasta; do qsub submit_multiple_MPI-blast.job -v in="$i",out="$i.txt",database=database_name,nodes=4,total_processes=16; done

where:

    ./dir/            is your input sequences directory
    nodes=            is the number of requested nodes
    total_processes=  is the total number of processes requested
    database=         is the name of the database

The output files will have the same names as your inputs, but with a .txt suffix.

Example using your own blastn installation:

    export PATH=$PANGEAWD/Classify/Runblast:$PATH
    blastn -query input.fasta -db database.formated -outfmt 6 -out blast_output.txt

#Parse the taxonomic classification based on the NCBI taxonomy databases

Running NCBI-taxcollector:

    cd $PANGEAWD/Tax_class
    make all
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.tar.gz
    tar -xvf taxdump.tar.gz
    tar -xvf gi_taxid_nucl.dmp.tar.gz
    ./tax_class -c
    perl NCBI-taxcollector-0.01.pl -f $PANGEAWD/parallel_output.txt -o $PANGEAWD/parallel_output_class.txt > report.txt

where:

    parallel_output.txt        is the file of mpiblastn results
    parallel_output_class.txt  is the parsed and classified output generated by this program

#Classify your sequences using RDP Classifier search

    cd $PANGEAWD/Classify/RunRDP/
    sh install_RDPClassifier.sh
    java -Xmx1g -jar rdp_classifier-2.5.jar -q $PANGEAWD/input_trimmed.txt -o output_rdp.txt

Where:

    -q  refers to the query file.
    -o  refers to the output file.

#Classify your sequences using parallel SOAP Aligner search

Format your database:

    cd $PANGEAWD/Classify/Runsoap/soap2.21release
    ./2bwt-builder $PANGEAWD/database.fasta

Run sequence search:

    ./soap -a $PANGEAWD/input_trimmed.fasta -D $PANGEAWD/database.fasta.index -o $PANGEAWD/output_soap.txt -p 8 -M 4

Where:

    -D  Prefix name for reference index [*.index].
    -a  Query file, for SE reads alignment or one end of PE reads.
    -b  Query b file, one end of PE reads.
    -o  Output file
    -p  Number of threads
    -M  INT  Match mode for each read or the seed part of a read, which shouldn't contain more than 2 mismatches, [4]
        0: exact match only
        1: 1 mismatch match only
        2: 2 mismatch match only
        3: [gap] (coming soon)
        4: find the best hits

#Run Consensus Analysis

    cd $PANGEAWD/Consensus
    perl Consensus_BLAST_SOAP_RDP-1.1.pl -b output_blast_class.txt -r output_rdp.txt -o output_consensus.txt

Where:

    -b  Classification results (Blast) parsed by NCBI-taxcollector
    -r  Classification results (RDP)
    -s  Classification results (SOAP2)
    -o  Output file (txt)

The output should look like this:

    S000008953 [0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6; 92.61 1435 81 23 29 1452 1 1421 0.0 2039 #Matches found: 4
    S000010870 [0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6; 91.78 1435 90 26 49 1469 1 1421 0.0 1971 #Matches found: 4
    S000014058 [0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6; 92.20 1435 88 22 29 1453 1 1421 0.0 2008 #Matches found: 4
    S000016099 [0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6; 91.66 1438 86 29 49 1469 1 1421 0.0 1960 #Matches found: 4

#Cluster your results by identity:

Example for 80% identity*:

    perl $PANGEAWD/Megaclust/megaclust2.pl -i $PANGEAWD/output_consensus.txt -o $PANGEAWD/output_consensus.megaclust_80_hits.txt -b 100 -s 80 -e 1e-20

*More examples and automatic scripts at $PANGEAWD/Scripts

#Generate summary table for classified results:

Example for Domain level (80%) similarity*:

    perl $PANGEAWD/Megaclustable/megaclustable.pl -m $PANGEAWD/output_consensus.megaclust_80_hits.txt -t 0 -o $PANGEAWD/results/megaclustable/DomainTable.txt

*Note: in the -m option you should list all the output files generated by the megaclust execution for every sample.
More examples and automatic scripts at $PANGEAWD/Scripts.

The classification output should look like this:

                       1      2      3      4      5      6      7      8      9     10
    Bacteria         479      4     32   7507  11977  13245   2129  11222    539   2411
    Eukaryota          1      4      5      5      2     17     78      3     10      3
    Archaea            1      0      0      0      0      0      0      0      0      1
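
The summary table shown above can be post-processed with standard tools. The following is only a minimal sketch and is not part of PANGEA+: it assumes the table is whitespace-separated, that the first row is a header of column numbers, and that every other row starts with a taxon name followed by its counts. The file name `DomainTable_example.txt` and the function `sum_taxa` are hypothetical, introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical sample file reproducing the table above; a real run would
# point at the file written by megaclustable.pl instead.
cat > DomainTable_example.txt <<'EOF'
1 2 3 4 5 6 7 8 9 10
Bacteria 479 4 32 7507 11977 13245 2129 11222 539 2411
Eukaryota 1 4 5 5 2 17 78 3 10 3
Archaea 1 0 0 0 0 0 0 0 0 1
EOF

# sum_taxa FILE: print each taxon followed by its count summed over all columns.
# Assumes row 1 is a header and column 1 of every other row is the taxon name.
sum_taxa() {
    awk 'NR > 1 {
        total = 0
        for (i = 2; i <= NF; i++) total += $i
        print $1, total
    }' "$1"
}

sum_taxa DomainTable_example.txt
```

For the example table this prints `Bacteria 49545`, `Eukaryota 128` and `Archaea 2`, one per line. If the real table uses a different delimiter or header layout, the `awk` field handling would need to be adjusted accordingly.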