Hadoopizer Help
===============

Overview
--------

Hadoopizer is a generic tool for the parallelisation of bioinformatics analysis in the cloud using the MapReduce paradigm. The source code is publicly available at http://github.com/genouest/hadoopizer

Installation
------------

Download the latest version of Hadoopizer from the official website (http://github.com/genouest/hadoopizer/downloads). Copy the hadoopizer.jar file somewhere on the master node of your Hadoop cluster. Hadoopizer has the same dependencies as Hadoop, so it is usable directly on any machine where Hadoop is correctly installed.

Creating a job
--------------

### Writing an XML config file

The first thing to do when you want to create a Hadoopizer job is to write an XML file describing the command you would like to run. Take a look at the following example (the command line and URLs are the ones used throughout this document; the element and attribute names in this sketch are illustrative, so check the Hadoopizer distribution for the exact syntax):

    <!-- Job description sketch; element and attribute names are illustrative -->
    <job>
        <command>mapper -query ${query} -db ${db} -out ${res}</command>
        <input id="query" split="true">
            <url>/local/foo/bar/myfile.fastq</url>
            <format>fastq</format>
        </input>
        <input id="db">
            <url>/local/foo/bar/mydb.fasta</url>
        </input>
        <output id="res">
            <url>/local/foo/bar/output/</url>
            <format>sam</format>
        </output>
    </job>

In this example, we want to launch a mapping command. There are 2 input files: 'query' and 'db'. The 'query' file is in the fastq format and will be split across the compute nodes. The 'db' input will be automatically copied to each compute node. The result of the command line is in the sam format and will be placed in the /local/foo/bar/output/ directory.

### Launching the job

Once you have written your XML file, launching the job is simple. Just log in to your Hadoop cluster master node and issue the following command:

    hadoop jar hadoopizer.jar -c your_config_file.xml -w hdfs://your_hdfs_master_node/a_temp_folder

The last option specifies a directory on your HDFS filesystem where Hadoopizer will write some temporary data. It must be a directory that does not already exist, and it is safe to delete it once the job is finished.
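The temporary folder is ordinary HDFS data, so once the job has finished you can remove it with Hadoop's standard filesystem shell. A minimal sketch, assuming a Hadoop 2+ installation (on Hadoop 1.x the equivalent command is 'hadoop fs -rmr'):

    # Remove Hadoopizer's temporary data once the job is finished
    # (use 'hadoop fs -rmr <path>' on Hadoop 1.x instead)
    hadoop fs -rm -r hdfs://your_hdfs_master_node/a_temp_folder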
Advanced usage
--------------

### Supported protocols for data transfers

Hadoopizer is able to read and write data from/to several kinds of filesystems. The protocols supported so far are:

- local
- HDFS

S3, http or ftp could be implemented soon. In the configuration file, just write the URLs according to the data location. For example, the 'db' input can point to a local file:

    <url>/local/foo/bar/mydb.fasta</url>

or to a file stored on HDFS:

    <url>hdfs://your_hdfs_master_node/foo/bar/mydb.fasta</url>

This works for data output too.

### Sequence files

If you want to reuse some output data as the input of another job, you can improve performance by writing the output in a Hadoop-specific binary format that offers better I/O performance. To do this, add the sequence option in your XML file (shown here as an attribute on the output element, which is an assumption):

    <output id="res" sequence="true">
        <url>/local/foo/bar/output/</url>
        <format>sam</format>
    </output>

In this example, only the sam file will be written in this binary format. For better performance, you can also write this file directly to HDFS by replacing the local URL (/local/foo/bar/output/) with an HDFS one:

    <url>hdfs://your_hdfs_master_node/foo/bar/output/</url>

Then, to use this data in another job, add an input element that looks like this:

    <input id="query" split="true" sequence="true">
        <url>hdfs://your_hdfs_master_node/foo/bar/output/</url>
    </input>

### Multiple output data

It is possible to specify several output files for your command line. To do this, simply write one output element for each output file:

    <command>mapper -query ${q} -db ${db} -out ${res}; wc -l ${res} > ${count}</command>
    [...]
    <output id="res">
        <url>/local/foo/bar/output/</url>
        <format>sam</format>
    </output>
    <output id="count">
        <url>/local/foo/bar/output/</url>
    </output>

Both output files will be placed in the output directory (/local/foo/bar/output/).

### Multiple input data

In some cases, you need to split input data coming from different files. One frequent use case in bioinformatics is the analysis of paired-end sequences: you start from 2 files that need to be read synchronously, because each line of file 1 needs data from the corresponding line of file 2. This is possible in Hadoopizer by specifying multiple URLs in the input element:

    <input id="query" split="true">
        <url>/local/foo/bar/myfile1.fastq</url>
        <url>/local/foo/bar/myfile2.fastq</url>
        <format>fastq</format>
    </input>

This will add a supplementary map-reduce job before the execution of the command specified in the config file. During this step, all the data will be joined and placed in a temporary file. If you plan to use the same groups of input files for several jobs, see the 'Reusing multiple input data' section below.

When using multiple input files, you have to write where each file will be used in the command line:

    <command>mapper -q1 ${query#1} -q2 ${query#2} -db ${db} -out ${res}</command>

### Reusing multiple input data

If you want to reuse the same group of input files for several jobs, you can refer to the temporary file containing the already joined data. Assuming you are using the command line given above to launch Hadoopizer, this file is created in hdfs://your_hdfs_master_node/a_temp_folder/temp_joined_data/part-r-00000. You can reuse this data like this:

    <input id="query" split="true">
        <url>hdfs://your_hdfs_master_node/a_temp_folder/temp_joined_data/part-r-00000</url>
        <url>hdfs://your_hdfs_master_node/a_temp_folder/temp_joined_data/part-r-00000</url>
    </input>

Yes, you have to write the same URL twice: this is how Hadoopizer knows that the file contains data coming from 2 original files.

### Compression

By default, data is compressed for all the transfers during the map-reduce steps. It is also possible to automatically compress the output data by adding the compressor setting (shown here as an element inside the output, which is an assumption):

    <output id="res">
        <url>hdfs://your_hdfs_master_node/foo/bar/output/</url>
        <compressor>gzip</compressor>
        <format>sam</format>
    </output>

Different compressors are available: gzip and bzip2. You can also use compressed data as the input of your job; in this case, Hadoopizer will automatically detect it and decompress it on the fly. Be warned that, depending on the compression format of the input files, Hadoopizer may not be able to perform the splitting of your data. In this case, all the data will be sent to a single compute node.

### Hadoop options

It is possible to add some Hadoop options directly within the config file, as in the sketch below (the wrapping element names are illustrative; mapred.child.java.opts is a standard Hadoop property controlling the options of the child task JVMs):

    <hadoop>
        <!-- Give 1 GB of memory to each child task JVM -->
        <config key="mapred.child.java.opts">-Xmx1024m</config>
    </hadoop>

This way you can easily adapt your Hadoop cluster settings (size of data chunks, number of reduce tasks, ...) to the kind of analysis you are performing.

### Input path autocomplete mode

Sometimes you need to write in a command line a path referring to multiple files with the same prefix but different extensions. This situation happens for example with the database parameter of blast command lines, where you can write the following option: -db /local/foo/bar/mydb. In this example, /local/foo/bar/mydb refers to several files in /local/foo/bar/: mydb.pal, mydb.ppi, mydb.pin, ... Using the 'autocomplete' attribute, we tell Hadoopizer to consider all the files beginning with the 'mydb' prefix:

    <command>blast -query ${q} -db ${db} -out ${res}</command>
    [...]
    <input id="db" autocomplete="true">
        <url>/local/foo/bar/mydb</url>
    </input>
    [...]

### Deploying software

If you want to use software that is not available on the compute nodes, you can automatically deploy it while launching your Hadoopizer job. To do so, first prepare an archive containing the binaries you would like to deploy; you can organize its content as you want. Then, when launching your Hadoopizer job, add the following option: -b /path/to/your/binary/archive.tar.gz

The archive will then be extracted in a directory named 'binaries' in the work directory of each node. To use it, simply adapt your XML file as follows:

    <command>binaries/your_binary -some ${options}</command>
    [...]
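Putting it all together, a job that deploys its own binaries can be launched by combining the options described above. A minimal sketch ('blast_job.xml' is a hypothetical config file name):

    # Launch a Hadoopizer job, deploying the binary archive on each compute node
    # ('blast_job.xml' is a hypothetical config file name)
    hadoop jar hadoopizer.jar -c blast_job.xml \
        -w hdfs://your_hdfs_master_node/a_temp_folder \
        -b /path/to/your/binary/archive.tar.gz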