Monday, October 28, 2013

Hadoop Hands-On exercise with Hortonworks: Setting up Hadoop sandbox and loading the data

Requirements:
1. You have virtualization software such as VirtualBox, VMware, or Hyper-V installed.
2. You have downloaded the appropriate sandbox image from here.
3. You have at least 4 GB of RAM.
4. Optional: To get familiar with a few basic concepts referred to in this post (like HDFS and nodes), you might like to read the previous post here.

Let's get started!

Load the downloaded file, which is "Hortonworks+Sandbox+1.3+VirtualBox+RC6.ova" in my case, and start the virtual machine.


After a successful boot, you should see something similar to the following:

 
Now open a browser (preferably Chrome, Safari, or Internet Explorer) and enter the URL "http://127.0.0.1:8888/". You will get a registration form where you can enter your details and click submit.

The next web page is something super-cool! It brings many leading big data projects together on a single page. It is called Hue, a graphical interface for Hadoop and related projects. It should look something like the following:


The page is split into two panes. The left pane is collapsible, so go ahead and hide it. All you should see now is the right pane, which is Hue.

Now you can download the test data (11 MB) from here.

Once you have downloaded the file, go to the file browser (the 5th icon from the left).

You should see a file browser like the following. Go to upload > files and upload the file NYSE-2000-2001.tar.gz. Note that you don't need to unzip or untar the file.
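If you are curious about what is inside the archive before uploading it, a minimal Python sketch like the one below lists its members on your local machine. The helper name list_archive_members is made up for illustration, and the path assumes the download sits in the current directory.

```python
import os
import tarfile


def list_archive_members(path):
    """Return (name, size_in_bytes) for every regular file in a .tar.gz archive."""
    with tarfile.open(path, mode="r:gz") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


if __name__ == "__main__":
    # Assumption: the downloaded file is in the current working directory.
    path = "NYSE-2000-2001.tar.gz"
    if os.path.exists(path):
        for name, size in list_archive_members(path):
            print(f"{size:>12}  {name}")
```

This only reads the archive; nothing is extracted, so the file you upload stays unchanged.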

Once you upload the data, you should see the following, which is similar to the output of "ls -l" on Linux. It lists all files in HDFS along with their size, user, group, permissions, and modification timestamp.

You can click on the file "NYSE-2000-2001.tar.gz" to see its data in tabular format, like the following. At the top you will see the path where the file is stored. On the left, a few more operations are available, such as viewing the file as binary or downloading it.
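To make the "ls -l" comparison concrete, here is a small Python sketch that builds the same kind of line for a file on your local disk. It works on the local file system only, not on HDFS, and the function name describe is just an illustrative choice.

```python
import os
import stat
import time


def describe(path):
    """Build one 'ls -l'-style line for a local file."""
    info = os.stat(path)
    permissions = stat.filemode(info.st_mode)  # permission string such as -rw-r--r--
    modified = time.strftime("%Y-%m-%d %H:%M", time.localtime(info.st_mtime))
    # Numeric user and group ids are shown instead of names to stay portable.
    return (
        f"{permissions} {info.st_uid} {info.st_gid} "
        f"{info.st_size:>10} {modified} {os.path.basename(path)}"
    )


if __name__ == "__main__":
    # Describe every entry in the current directory.
    for entry in sorted(os.listdir(".")):
        print(describe(entry))
```

The columns match what the Hue file browser shows: permissions, owner, group, size, timestamp, and name.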


We have now loaded the data into HDFS successfully. The next step is loading the data into HCatalog so that it can be queried with Hive or Pig. HCatalog manages metadata such as the table schema so that it can be shared with other programs.

Let's use an analogy to understand exactly what is going on here and why we are doing it.
  • Say you are trying to load a dataset from your local machine onto a server (the sandbox here).
  • You upload and store the data on the server's file system (HDFS in our case).
  • You load the data into a database (here this is somewhat similar to HCatalog, which maintains schema-like information).
  • You use SQL to query the data (here we will use Pig and Hive).
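The analogy can be sketched in a few lines of Python using the built-in sqlite3 module. This is only an illustration of the "store, describe with a schema, then query" idea, not how HCatalog or Hive work internally. The column names and the sample rows are invented for the example and are not taken from the NYSE file.

```python
import csv
import io
import sqlite3

# Invented sample rows standing in for an uploaded file (not the real NYSE data).
RAW_FILE = "AAA,2000-01-03,10.5\nAAA,2000-01-04,11.0\nBBB,2000-01-03,20.0\n"


def load_and_query(raw_text):
    """Declare a schema, load raw rows into it, and run one SQL query."""
    conn = sqlite3.connect(":memory:")
    # Declaring the schema is the role HCatalog plays in the sandbox.
    conn.execute(
        "CREATE TABLE prices (symbol TEXT, trade_date TEXT, close_price REAL)"
    )
    rows = [
        (symbol, trade_date, float(price))
        for symbol, trade_date, price in csv.reader(io.StringIO(raw_text))
    ]
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
    # Querying is the role Hive or Pig plays in the sandbox.
    result = conn.execute(
        "SELECT symbol, AVG(close_price) FROM prices "
        "GROUP BY symbol ORDER BY symbol"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for symbol, average in load_and_query(RAW_FILE):
        print(symbol, average)
```

The raw text plays the part of the file in HDFS: it is just bytes until a schema tells the query engine how to read it.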

Click on HCatalog.



Click on "Create a new table from a file". You will get a few input fields. Enter a table name and choose the same file (NYSE-2000-2001.tar.gz). In my case the "choose a file" option was not visible in Firefox, but it worked fine in Chrome.


Once the file is loaded, you can change the column data type from double to float, and then click "create table".
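Switching from double to float trades precision for space: in Hive a FLOAT is a 4-byte single-precision number and a DOUBLE is an 8-byte double-precision number. The Python sketch below shows what that rounding looks like by passing a value through a 32-bit float. The helper name as_float32 and the sample price are made up for illustration.

```python
import struct


def as_float32(value):
    """Round-trip a Python float (a 64-bit double) through a 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


if __name__ == "__main__":
    price = 61.88  # made-up sample value
    print("as double:", repr(price))
    print("as float: ", repr(as_float32(price)))
```

For stock prices with two decimal places the difference is tiny, which is why float is usually good enough for this exercise.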

Tutorial Summary:

At this point we have set up a self-contained virtual machine with a single node. An easy-to-use graphical user interface (Hue) is available as soon as the virtual machine is up. Using the file browser, we uploaded a data set into HDFS. To make the data set accessible for querying with Hive or Pig, we loaded it into HCatalog. In the next tutorial we will try to query the data.

Note: This post is my implementation of the Hortonworks tutorial, written for learning purposes.
