Getting Started With Hadoop On Demand (HOD)
===========================================

1. Pre-requisites:
==================

Hardware:
HOD requires a minimum of 3 nodes configured through a resource manager.

Software:
The following components are assumed to be installed before using HOD:
* Torque:
  (http://www.clusterresources.com/pages/products/torque-resource-manager.php)
  Currently HOD supports Torque out of the box. We assume that you are
  familiar with configuring Torque. You can get information about this
  from the following link:
  http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki
* Python (http://www.python.org/)
  We require version 2.5.1 of Python.

The following components can be optionally installed for getting better
functionality from HOD:
* Twisted Python: This can be used for improving the scalability of HOD
  (http://twistedmatrix.com/trac/)
* Hadoop: HOD can automatically distribute Hadoop to all nodes in the
  cluster. However, it can also use a pre-installed version of Hadoop,
  if it is available on all nodes in the cluster.
  (http://hadoop.apache.org/core)
  HOD currently supports Hadoop 0.15 and above.

NOTE: HOD configuration requires the location of installs of these
components to be the same on all nodes in the cluster. Configuration is
also simpler if the install locations are the same on the submit nodes.
2. Resource Manager Configuration Pre-requisites:
=================================================

For using HOD with Torque:
* Install Torque components: pbs_server on a head node, pbs_moms on all
  compute nodes, and PBS client tools on all compute nodes and submit
  nodes.
* Create a queue for submitting jobs on the pbs_server.
* Specify a name for all nodes in the cluster by setting a 'node
  property' on each of them. This can be done by using the 'qmgr'
  command. For example (where 'node1' stands for the hostname of a
  compute node):
  qmgr -c "set node node1 properties=cluster-name"
* Ensure that jobs can be submitted to the nodes. This can be done by
  using the 'qsub' command. For example:
  echo "sleep 30" | qsub -l nodes=3
  (A short verification sketch follows this section.)
* More information about setting up Torque can be found by referring
  to the documentation under:
  http://www.clusterresources.com/pages/products/torque-resource-manager.php
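
To verify the setup described above, you can check that the node property
is visible and that a test job runs to completion. The following is a
minimal sketch using standard Torque client tools; 'cluster-name' is the
example property used above:

  $ pbsnodes -a                          (each node should list cluster-name
                                          under its properties)
  $ echo "sleep 30" | qsub -l nodes=3
  $ qstat                                (the test job should appear, and
                                          eventually complete)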
3. Setting up HOD:
==================

* HOD is available under the 'contrib' section of Hadoop under the root
  directory 'hod'.
* Distribute the files under this directory to all the nodes in the
  cluster. Note that the location where the files are copied should be
  the same on all the nodes.
* On the node from which you want to run hod, edit the file hodrc,
  which can be found in the <install dir>/conf directory. This file
  contains the minimal set of values required for running hod.
* Specify values suitable to your environment for the following
  variables defined in the configuration file. Note that some of these
  variables are defined at more than one place in the file.
  * ${JAVA_HOME}: Location of Java for Hadoop. Hadoop supports Sun JDK
    1.5.x.
  * ${CLUSTER_NAME}: Name of the cluster, as specified in the
    'node property' mentioned in the resource manager configuration.
  * ${HADOOP_HOME}: Location of the Hadoop installation on the compute
    and submit nodes.
  * ${RM_QUEUE}: Queue configured for submitting jobs in the resource
    manager configuration.
  * ${RM_HOME}: Location of the resource manager installation on the
    compute and submit nodes.
* The following environment variables *may* need to be set depending on
  your environment. These variables must be defined where you run the
  HOD client, and also be specified in the HOD configuration file as the
  value of the key resource_manager.env-vars. Multiple variables can be
  specified as a comma separated list of key=value pairs.
  * HOD_PYTHON_HOME: If you install Python to a non-default location
    on the compute nodes or submit nodes, then this variable must be
    defined to point to the python executable in the non-standard
    location.
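
  For example, assuming Python is installed under the hypothetical
  prefix /opt/python-2.5.1, you would both export the variable in the
  shell from which the HOD client is run:

    $ export HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python

  and list it in the configuration file:

    env-vars = HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python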
NOTE:
You can also review other configuration options in the file and
modify them to suit your needs. Refer to the file config.txt for
information about the HOD configuration.
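
As an illustration, a fragment of hodrc with some of the variables above
filled in might look like the following. The section and key names are
based on the sample hodrc shipped with HOD (see config.txt for the
authoritative list); the paths and queue name are hypothetical and must
be adapted to your site, and only a subset of the variables is shown:

  [hod]
  java-home    = /usr/lib/jvm/java-1.5.0

  [resource_manager]
  queue        = batch
  batch-home   = /usr/local/torque
  env-vars     = HOD_PYTHON_HOME=/usr/local/bin/python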
4. Running HOD:
===============

4.1 Overview:
-------------
A typical HOD session involves at least three steps: allocate, run
Hadoop jobs, deallocate.

4.1.1 Operation allocate
------------------------
The allocate operation is used to allocate a set of nodes and install and
provision Hadoop on them. It has the following syntax:

  hod -c config_file -t hadoop_tarball_location -o "allocate cluster_dir number_of_nodes"

The hadoop_tarball_location must be a location on a shared file system
accessible from all nodes in the cluster. Note that the cluster_dir must
exist before running the command. If the command completes successfully,
cluster_dir/hadoop-site.xml will be generated and will contain information
about the allocated cluster's JobTracker and NameNode.

For example, the following command uses a hodrc file in ~/hod-config/hodrc and
allocates Hadoop (provided by the tarball ~/share/hadoop.tar.gz) on 10 nodes,
storing the generated Hadoop configuration in a directory named
~/hadoop-cluster:

  $ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz -o "allocate ~/hadoop-cluster 10"

HOD also supports an environment variable called HOD_CONF_DIR. If this is
defined, HOD will look for a default hodrc file at $HOD_CONF_DIR/hodrc.
Defining this allows the above command to also be run as follows:

  $ export HOD_CONF_DIR=~/hod-config
  $ hod -t ~/share/hadoop.tar.gz -o "allocate ~/hadoop-cluster 10"
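
Once an allocation succeeds, the generated configuration can be inspected
to find the cluster's contact points. For example:

  $ cat ~/hadoop-cluster/hadoop-site.xml

Look for the mapred.job.tracker and fs.default.name properties, which
point at the allocated cluster's JobTracker and NameNode respectively.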

4.1.2 Running Hadoop jobs using the allocated cluster
-----------------------------------------------------
Now, one can run Hadoop jobs using the allocated cluster in the usual manner:

  hadoop --config cluster_dir hadoop_command hadoop_command_args

Continuing our example, the following command will run a wordcount example on
the allocated cluster:

  $ hadoop --config ~/hadoop-cluster jar /path/to/hadoop/hadoop-examples.jar wordcount /path/to/input /path/to/output
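
Any other Hadoop command can be run the same way by pointing --config at
the cluster directory. For instance, to list the root of the allocated
cluster's HDFS:

  $ hadoop --config ~/hadoop-cluster dfs -ls /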

4.1.3 Operation deallocate
--------------------------
The deallocate operation is used to release an allocated cluster. When
finished with a cluster, deallocate must be run so that the nodes become free
for others to use. The deallocate operation has the following syntax:

  hod -o "deallocate cluster_dir"

Continuing our example, the following command will deallocate the cluster:

  $ hod -o "deallocate ~/hadoop-cluster"

4.2 Command Line Options
------------------------
This section covers the major command line options available via the hod
command:

--help
Prints the help message showing the basic options.

--verbose-help
All configuration options provided in the hodrc file can be passed on the
command line, using the syntax --section_name.option_name[=value]. When
provided this way, the value given on the command line overrides the option
in hodrc. The verbose-help command lists all the available options in
the hodrc file. This is also a convenient way to see the meaning of the
configuration options.
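
For example, assuming a queue named 'testqueue' exists in the resource
manager, the configured queue could be overridden for a single allocation
as follows (the key mirrors the 'queue' option of the [resource_manager]
section):

  $ hod -c ~/hod-config/hodrc --resource_manager.queue=testqueue -o "allocate ~/hadoop-cluster 10"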

-c config_file
Provides the configuration file to use. Can be used with all other options of
HOD. Alternatively, the HOD_CONF_DIR environment variable can be defined to
specify a directory that contains a file named hodrc, alleviating the need to
specify the configuration file in each HOD command.

-b 1|2|3|4
Enables the given debug level. Can be used with all other options of HOD. 4 is
the most verbose.
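
For instance, to run the list operation at the most verbose debug level:

  $ hod -b 4 -c ~/hod-config/hodrc -o "list"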
- -o "help"
- Lists the operations available in the operation mode.
- -o "allocate cluster_dir number_of_nodes"
- Allocates a cluster on the given number of cluster nodes, and store the
- allocation information in cluster_dir for use with subsequent hadoop commands.
- Note that the cluster_dir must exist before running the command.
- -o "list"
- Lists the clusters allocated by this user. Information provided includes the
- Torque job id corresponding to the cluster, the cluster directory where the
- allocation information is stored, and whether the Map/Reduce daemon is still
- active or not.
- -o "info cluster_dir"
- Lists information about the cluster whose allocation information is stored in
- the specified cluster directory.
- -o "deallocate cluster_dir"
- Deallocates the cluster whose allocation information is stored in the
- specified cluster directory.
- -t hadoop_tarball
- Provisions Hadoop from the given tar.gz file. This option is only applicable
- to the allocate operation. For better distribution performance it is
- recommended that the Hadoop tarball contain only the libraries and binaries,
- and not the source or documentation.
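
For example, a trimmed tarball can be built from an unpacked release by
excluding the source and documentation trees. The directory names below
assume a stock hadoop-0.15.0 layout and may differ for your release:

  $ tar -C ~/releases -czf ~/share/hadoop.tar.gz \
        --exclude='hadoop-0.15.0/src' --exclude='hadoop-0.15.0/docs' \
        hadoop-0.15.0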

-Mkey1=value1 -Mkey2=value2
Provides configuration parameters for the provisioned Map/Reduce daemons
(JobTracker and TaskTrackers). A hadoop-site.xml is generated with these
values on the cluster nodes.

-Hkey1=value1 -Hkey2=value2
Provides configuration parameters for the provisioned HDFS daemons (NameNode
and DataNodes). A hadoop-site.xml is generated with these values on the
cluster nodes.

-Ckey1=value1 -Ckey2=value2
Provides configuration parameters for the client from where jobs can be
submitted. A hadoop-site.xml is generated with these values on the submit
node.
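
Putting these options together, the following sketch allocates a cluster
while overriding one Map/Reduce and one HDFS parameter. The property names
(mapred.reduce.tasks, dfs.block.size) are standard Hadoop keys of this
era, and the values are only examples:

  $ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz \
        -Mmapred.reduce.tasks=2 -Hdfs.block.size=134217728 \
        -o "allocate ~/hadoop-cluster 10"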