README for “Hadoop Data Warehousing with Hive”
Strata + Hadoop World 2012 Tutorial Exercises
Dean Wampler, academy@thinkbiganalytics.com, @thinkBigA

Welcome! Please follow these instructions to download the tutorial presentation and exercises.
About this Hive Tutorial
This Hive Tutorial is adapted from a longer Think Big Academy course on Hive. (The Academy is the education arm of Think Big Analytics.) We offer various public and private courses on Hadoop programming, Hive, Pig, etc. We also provide consulting on Big Data problems and their solutions, especially using Hadoop. If you want to learn more, visit thinkbiganalytics.com or send us email.
We’ll log into Amazon Elastic MapReduce (EMR) clusters[1] to do the exercises. Feel free to pair program with a neighbor, if you want.
NOTE: The exercises should work with any version of Hive, v0.7.1 or later.
Getting Started
Download the tutorial zip file, tutorial.zip, which contains a PDF of the tutorial presentation, the exercises, the data used for the exercises, and a Hive cheat sheet.
Unzip tutorial.zip in a convenient place on your laptop.
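For example, on Linux or Mac OS X you can unzip it from a terminal, assuming you downloaded the file to your current directory:

unzip tutorial.zip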
If you are on Windows, you’ll need the ssh client application PuTTY to log into the EMR servers. You can download and install it from the PuTTY web site.
Manifest for Tutorial Zip File

Item                               Whazzat?
README.html                        What you’re reading!
ThinkBigAcademy-Hive-Tutorial.pdf  The tutorial presentation.
exercises                          The exercises we’ll use. They are also installed on the clusters, but you’ll open them “locally” in an editor, then use copy and paste.
data                               The data files we’ll use. They are here only for your reference later. We’ll use the copies already on the clusters.
HiveCheatSheat.html                A Hive cheat sheet.
exercises/.hiverc                  Drop this file in the home directory on any machine where you will normally run the hive command-line interface (CLI). Hive will run the commands it contains when it starts. This file is a great place to put commands you always run on startup, such as property settings. Already on the cluster.
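For illustration, a .hiverc might contain property settings like these (hypothetical contents; the actual file shipped in the zip may differ):

-- Hypothetical example .hiverc; not necessarily the file in the zip.
-- Print column headers in query results.
SET hive.cli.print.header=true;
-- Show the current database name in the CLI prompt (Hive 0.8+).
SET hive.cli.print.current.db=true;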
Log into one of the Amazon Elastic MapReduce Clusters

We have several EMR clusters running, and you’ll log into one of them according to the first one or two letters of your last name, using the following table[2]:
Letters    Server Name                                 JobFlow ID
A          ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Ba - Bh    ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Bi - Bz    ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Ca - Ch    ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Ci - Cz    ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
D          ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
E - F      ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
G          ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
H          ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
I - J      ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
K - L      ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Ma - Mh    ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Mi - Mz    ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
N - P      ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Q - R      ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Sa - Sh    ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Si - Sz    ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
T - V      ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Wa - Wh    ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
Wi - Z     ec2-50-19-185-170.compute-1.amazonaws.com   j-1R3E26P0T3IBK
(We’ll explain the JobFlow ID later.)
Once you have picked the correct server, use the following ssh command on Linux or Mac OS X, or the equivalent putty command, to log into your server. You’ll be the user hadoop:

ssh hadoop@ec2-NN-NN-NNN-NNN.compute-1.amazonaws.com

The password is: strata
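If you prefer to launch PuTTY from the Windows command line rather than through its GUI, the rough equivalent (assuming putty.exe is on your PATH) is:

putty -ssh hadoop@ec2-NN-NN-NNN-NNN.compute-1.amazonaws.com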
Finally, since you are sharing the primary user account on the cluster, create a personal work directory with mkdir for any file editing you’ll do today. Pick a directory name without spaces, e.g., like a typical user name. You will use that same name for another purpose shortly, as we’ll see. After creating it, change to that directory with the cd command:

mkdir myusername
cd myusername
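As an optional sanity check (our suggestion, not a step from the original instructions), you can confirm that the Hive CLI works by running a trivial query with its -e option:

hive -e 'SHOW TABLES;'

If Hive is set up correctly, this prints the table list (possibly empty) and exits.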
Please don’t break anything! ;^) Remember, you’re sharing this cluster.
Feel free to snoop around if you’re waiting for others. Note that all the Hadoop software is installed in the hadoop user’s $HOME directory, /home/hadoop.

Quick Cheat Sheet on Linux Shell Commands
If you’re not accustomed to the Linux or Mac OS X bash shell, here are a few hints[3]:

Print your current working directory:

pwd

List the contents of a directory. Add the -l option to show a longer listing with more information. If you omit the directory, the current directory is used:

ls some-directory
ls -l some-directory

Change to a different directory. Four variants, using i) an absolute path, ii) a subdirectory of the current directory, iii) the parent directory of the current directory, and iv) your home directory:

cd /home/hadoop
cd exercises
cd ..
cd ~

Page through the contents of a file. Hit the space bar to page, q to quit:

more some-file

Dump the contents without paging, i.e., “concatenate” or “cat” the file:

cat some-file
For More Information
For more information on Amazon Elastic MapReduce commands, see the Quick Reference Guide and the Developer Guide.
For more details on Hive, see Programming Hive or the Hive Wiki.
[1] Visit The AWS EMR Page and the EMR Documentation page for more information about EMR.
[2] I used the following information to determine a good distribution of users across these clusters. Note that these EMR clusters will only be available during the time of the tutorial.
[3] You should learn how to use bash if you want to use Hadoop.