README for “Hadoop Data Warehousing with Hive”

Strata + Hadoop World 2012 Tutorial Exercises

Dean Wampler
academy@thinkbiganalytics.com
@thinkBigA

Welcome! Please follow these instructions to download the tutorial presentation and exercises.

About this Hive Tutorial

This Hive Tutorial is adapted from a longer Think Big Academy course on Hive. (The Academy is the education arm of Think Big Analytics.) We offer various public and private courses on Hadoop programming, Hive, Pig, etc. We also provide consulting on Big Data problems and their solutions, especially using Hadoop. If you want to learn more, visit thinkbiganalytics.com or send us an email.

We’ll log into Amazon Elastic MapReduce (EMR) clusters[1] to do the exercises. Feel free to pair program with a neighbor, if you want.

NOTE: The exercises should work with any version of Hive, v0.7.1 or later.

Getting Started

Download the following zip file, which contains a PDF of the tutorial presentation, the exercises, the data used for the exercises, and a Hive cheat sheet:

Unzip tutorial.zip in a convenient place on your laptop.

If you are on Windows, you’ll need the SSH client application PuTTY to log into the EMR servers. You can download and install it from here:
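
If you prefer working at the Windows command prompt, PuTTY’s companion command-line tool plink accepts essentially the same arguments as ssh. A minimal sketch (the host name is a placeholder; use the server assigned to you below):

plink hadoop@ec2-NN-NN-NNN-NNN.compute-1.amazonaws.com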

Manifest for Tutorial Zip File

Item                                 Whazzat?
README.html                          What you’re reading!
ThinkBigAcademy-Hive-Tutorial.pdf    The tutorial presentation.
exercises                            The exercises we’ll use. They are also installed on the clusters, but you’ll open them “locally” in an editor, then use copy and paste.
data                                 The data files we’ll use. They are here only for your reference later. We’ll use the copies already on the clusters.
HiveCheatSheat.html                  A Hive cheat sheet.
exercises/.hiverc                    Drop this file in the home directory on any machine where you will normally run the hive command-line interface (CLI). Hive runs the commands it contains when it starts, so this file is a great place to put commands you always run on startup, such as property settings. It’s already on the cluster; see the sample below.
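
As a rough illustration, a .hiverc might contain settings like the following. These are real Hive properties, but they are examples of the idea, not necessarily what the tutorial’s .hiverc contains:

-- Show column headers when printing query results.
SET hive.cli.print.header=true;
-- Let Hive run small jobs locally instead of launching MapReduce jobs.
SET hive.exec.mode.local.auto=true;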

Log into one of the Amazon Elastic MapReduce Clusters

We have several EMR clusters running, and you’ll log into one of them according to the first one or two letters of your last name, using the following table[2]:

Letters   Server Name                                  JobFlow ID
A         ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Ba - Bh   ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Bi - Bz   ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Ca - Ch   ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Ci - Cz   ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
D         ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
E - F     ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
G         ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
H         ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
I - J     ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
K - L     ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Ma - Mh   ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Mi - Mz   ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
N - P     ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Q - R     ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Sa - Sh   ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Si - Sz   ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
T - V     ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Wa - Wh   ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK
Wi - Z    ec2-50-19-185-170.compute-1.amazonaws.com    j-1R3E26P0T3IBK

(We’ll explain the JobFlow ID later.)
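
If you’re curious in the meantime, and assuming you have the 2012-era Ruby elastic-mapreduce command-line client installed on your laptop, you could list and inspect job flows roughly like this (the exact flags here are an assumption and may differ in your client version):

# List the active job flows (clusters) in your account.
elastic-mapreduce --list --active
# Show the details of one job flow by its ID.
elastic-mapreduce --describe --jobflow j-1R3E26P0T3IBK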

Once you have picked the correct server, use the following ssh command on Linux or Mac OS X, or the equivalent PuTTY command on Windows, to log into your server. You’ll be the user hadoop:

ssh hadoop@ec2-NN-NN-NNN-NNN.compute-1.amazonaws.com

The password is:

strata

Finally, since you are sharing the primary user account on the cluster, create a personal work directory with mkdir for any file editing you’ll do today. Pick a directory name without spaces, e.g., something like a typical user name. You will use that same name for another purpose shortly, as we’ll see. After creating the directory, change to it with the cd command:

mkdir myusername
cd myusername
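
Once you’re in your work directory, a quick sanity check that the Hive CLI works is a one-off query with hive -e; this example just lists the tables in the default database (on a fresh cluster there may be none yet):

hive -e 'SHOW TABLES;'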

Please don’t break anything! ;^) Remember, you’re sharing this cluster.

Feel free to snoop around if you’re waiting for others. Note that all the Hadoop software is installed in the hadoop user’s $HOME directory, /home/hadoop.
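
For example, you could start with a long listing of that directory (the exact contents will vary by cluster):

ls -l /home/hadoop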

Quick Cheat Sheet on Linux Shell Commands

If you’re not accustomed to the Linux or Mac OS X bash shell, here are a few hints[3]:

Print your current working directory

pwd

List the contents of a directory

Add the -l option to show a longer listing with more information. If you omit the directory, the current directory is used:

ls some-directory
ls -l some-directory

Change to a different directory

Four variants, using i) an absolute path, ii) a subdirectory of the current directory, iii) the parent directory of the current directory, and iv) your home directory, respectively:

cd /home/hadoop
cd exercises
cd ..
cd ~

Page through the contents of a file

Hit the space bar to page, q to quit:

more some-file

Dump the contents without paging

I.e., “concatenate” or “cat” the file:

cat some-file

For More Information

For more information on Amazon Elastic MapReduce commands, see the Quick Reference Guide and the Developer Guide.

For more details on Hive, see Programming Hive or the Hive Wiki.


  1. Visit The AWS EMR Page and the EMR Documentation page for more information about EMR.  ↩

  2. I used the following information to determine a good distribution of users across these clusters. Note that these EMR clusters will only be available during the time of the tutorial.  ↩

  3. You should learn how to use bash if you want to use Hadoop.  ↩

