Massive data sets can be very difficult to analyze and query, and traditional mechanisms are not good tools for processing them. Cloud computing is one of the best solutions for processing huge data repositories. This is the first part of a fast-forward tutorial series published for anyone who needs to set up and run a fully distributed Hadoop cluster. The tutorial uses four Ubuntu Linux nodes to set the cluster up, but it is no big deal to apply this to other operating systems, such as other Linux flavors or MS Windows. The tutorial will be continued to cover a real-world huge-data-processing scenario. I’ve drawn widely on Chuck Lam’s book “Hadoop in Action” in this tutorial.
Preparing the Play Ground
First of all, I installed 4 instances of Ubuntu 11.04 Linux on my MacBook Pro (Lion) using VirtualBox; each one has 512 MB of RAM and 20 GB of hard disk space. You can set up your cluster with more Linux machines if you use dedicated machines or have more resources on the host machine. The following snapshot shows the nodes’ specifications:
I’ve also used the Bridged network adapter mode. The others didn’t work for me.
I addressed the Linux machines with the following names and IPs:
As you can see, I have 2 client nodes. You can have more, as I mentioned.
Make sure all nodes are configured within the same local area network and can ping each other.
Also make sure you already have a Java runtime installed on all Linux VMs.
Make sure you have an exported environment variable that locates the Java home:
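For example, on Ubuntu it might look like the following (the path shown is just an example; point it at wherever Java is actually installed on your machine):

```shell
# Append these lines to ~/.bashrc so they survive re-login.
# /usr/lib/jvm/java-6-sun is an example path; adjust it to your installation.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=$PATH:$JAVA_HOME/bin
```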
We will need to let one node access another. This access is from a user account on one node to a user account on the target machine. For Hadoop, the accounts should have the same username on all of the nodes (I use amirsedighi in this tutorial).
Hadoop uses SSH for communicating between machines. Here I used amirsedighi with the same password for all machines.
SSH uses standard public-key cryptography to create a pair of keys for user verification—one public, one private. The public key is stored on every node in the cluster, while the private key stays on the master node, which uses it to prove its identity when logging in to a remote machine. With both pieces of information, the target machine can validate the login attempt. Since all machines should be able to communicate with the master node (Alpha) over a trusted SSH channel, check whether SSH is installed on your nodes:
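A quick way to check from a terminal:

```shell
# Prints the path to the ssh client if it is installed;
# otherwise tells you it is missing.
which ssh || echo "ssh is not installed"
```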
If you get an error message, just install OpenSSH (www.openssh.com). It’s best to do this with the OS’s default package manager (I used Synaptic).
Use the master node (Alpha) to generate an SSH key pair with the following command:
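A typical invocation looks like this; an empty passphrase lets Hadoop open connections without prompting (the key location is the ssh-keygen default):

```shell
# Create the key pair only if one does not already exist.
# -P "" sets an empty passphrase so logins are non-interactive.
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
```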
Now you need to copy the generated public key from Alpha to the other nodes (Beta, Delta, Gamma).
You need a .ssh folder in the common user’s home directory on all nodes; create it if it does not exist. Then go to the nodes and run the following command:
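A minimal sketch of the copy step, assuming the key was generated as ~/.ssh/id_rsa.pub on Alpha and every node has the common user amirsedighi (the hostnames are the ones used in this tutorial):

```shell
# Run on Alpha: append the master's public key to each node's
# authorized_keys so Alpha can log in without a password.
for node in beta delta gamma; do
  cat ~/.ssh/id_rsa.pub | ssh amirsedighi@$node 'cat >> ~/.ssh/authorized_keys'
done
```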
Now you need to run two more commands in the terminal: “chmod 700 .ssh” and “chmod 600 .ssh/authorized_keys”. SSH refuses key-based logins if these permissions are more permissive.
You should now be able to connect to Beta, Delta and Gamma without entering a password, using the following commands:
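For example (replace the hostnames with your own):

```shell
# Each command should print the remote machine's hostname
# without asking for a password.
ssh beta hostname
ssh delta hostname
ssh gamma hostname
```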
Setting Hadoop Up
Just download the latest version of Hadoop from the Apache download page. Follow the instructions and install it wherever you want; I installed it into the home directory of the common Hadoop user (amirsedighi). I use the following version on my cluster:
You need to export an environment variable that locates the Hadoop folder.
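For example (the version number and path are illustrative; use the release you actually downloaded and wherever you unpacked it):

```shell
# Append to ~/.bashrc; hadoop-0.20.2 is an example version.
export HADOOP_HOME=/home/amirsedighi/hadoop-0.20.2
export PATH=$PATH:$HADOOP_HOME/bin
```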
Find the conf folder under $HADOOP_HOME. It contains the Hadoop configuration files.
In core-site.xml and mapred-site.xml we specify the hostname and port of the NameNode and the JobTracker, respectively. In hdfs-site.xml we can specify the default replication factor for HDFS, which should be no greater than the number of data nodes in the cluster.
The following snapshot shows the files you should modify to set up your cluster:
You just need to modify the core-site.xml, mapred-site.xml, masters and slaves files, nothing more; additionally, you can set the replication factor in hdfs-site.xml if you need to change it:
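As a sketch, the relevant entries for this cluster might look like the following (Alpha acts as both NameNode and JobTracker; ports 9000 and 9001 and the replication factor of 2 are example choices, not values taken from the snapshots above). The masters file would list the master hostname and the slaves file the worker hostnames, one per line.

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://alpha:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>alpha:9001</value>
  </property>
</configuration>

<!-- hdfs-site.xml (optional) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```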
You should do the same across all the nodes in your cluster. The easiest way is to copy the configured Hadoop folder to the same place on every machine.
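One way to do that from Alpha (the folder name is an example; use your actual Hadoop directory):

```shell
# Replicate the configured Hadoop installation to every node,
# keeping the same path on each machine.
for node in beta delta gamma; do
  scp -r ~/hadoop-0.20.2 amirsedighi@$node:~/
done
```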
You are almost ready to start Hadoop, but first you’ll need to format HDFS using the following command:
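Run this once on the master node:

```shell
# Formats the NameNode's metadata store. Run only on Alpha, and
# only once: reformatting later destroys all data in HDFS.
hadoop namenode -format
```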
You can now launch the daemons with the start-all.sh script:
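From the master node:

```shell
# Starts HDFS (NameNode, DataNodes) and MapReduce (JobTracker,
# TaskTrackers) across the machines listed in masters and slaves.
start-all.sh
```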
The Java jps command lists the running daemons so you can verify the setup was successful:
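On the master you would expect something like the following; the process IDs are examples, and on the slave nodes you would see DataNode and TaskTracker instead:

```shell
jps
# Example master-node output (PIDs will differ):
# 12345 NameNode
# 12346 SecondaryNameNode
# 12347 JobTracker
# 12348 Jps
```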
You have a functioning cluster!
Hadoop provides a Web UI. The browser interface lets you access the information you need much faster than digging through logs and directories. The NameNode hosts a general report on port 50070, giving you an overview of the state of your cluster’s HDFS:
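For the cluster in this tutorial, that means pointing a browser at the master node, for example:

```shell
# Open the NameNode status page in a browser, or fetch it from a shell:
curl http://alpha:50070/
```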