Huge Data Processing Using a Hadoop Cluster – Part 1

Introduction
Massive data sets can be very difficult to analyze and query, and traditional mechanisms are not good tools for processing them. Cloud computing is one of the best solutions for processing huge data repositories. This is the first part of a fast, straightforward tutorial series published for anyone who needs to set up and run a fully distributed Hadoop cluster. The tutorial uses four Ubuntu Linux nodes to set the cluster up, but it is no big deal to apply it to other operating systems, such as other Linux flavors or MS Windows. The tutorial will be continued to cover a real-world huge-data processing scenario. I have used Chuck Lam's book "Hadoop in Action" widely in this tutorial.

Preparing the Play Ground
First of all, I installed four instances of Ubuntu 11.04 Linux on my MacBook Pro (Lion) using VirtualBox; each one has 512 MB of RAM and 20 GB of hard disk space. You can set up your cloud with more Linux machines if you use dedicated machines or have more resources on the host machine. The following snapshot shows the node specifications:
VirtualBox and the specifications of the nodes.

I have also used the Bridged network adapter mode; the other modes did not work for me.
The network adapter of the nodes.

I addressed the Linux machines with the following names and IPs:

  • Alpha (192.168.200.201) – The master node of the cluster and host of the NameNode and JobTracker daemons
  • Beta (192.168.200.202) – The server that hosts the Secondary NameNode daemon
  • Delta (192.168.200.203) – A slave box of the cluster running both the DataNode and TaskTracker daemons
  • Gamma (192.168.200.204) – Another slave box of the cluster running both the DataNode and TaskTracker daemons

As you can see, I have two slave nodes. You can have more, as mentioned above.

Be sure all nodes are configured within the same local area network and can ping each other.
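
One simple way to make the node names resolvable from every machine is to map them in /etc/hosts. The entries below are only a sketch based on the names and IPs listed above; adjust them to your own addressing:

    # /etc/hosts on every node (Alpha, Beta, Delta and Gamma)
    192.168.200.201    Alpha
    192.168.200.202    Beta
    192.168.200.203    Delta
    192.168.200.204    Gamma

After that, a plain "ping Beta" from Alpha (and the other way around) should get an answer.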

Java Runtime
Also be sure you already have a Java runtime installed on all of the Linux VMs.
Checking the Java version.

Be sure you have an exported environment variable that locates the Java home:
Be sure you have exported a JAVA_HOME variable.
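
As a sketch, checking the runtime and exporting JAVA_HOME from the common user's ~/.bashrc could look like the lines below; the JDK path is an assumption that depends on which Java package you installed:

    # run on every node: check that a Java runtime is available
    java -version

    # ~/.bashrc of the common Hadoop user (adjust the path to your JDK)
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
    export PATH=$JAVA_HOME/bin:$PATH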

Common User
We will need to let one node access another. This access is from a user account on one node to a user account on the target machine. For Hadoop, the accounts should have the same username on all of the nodes (I use amirsedighi in this tutorial).
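
If the account does not exist yet, creating it is a one-liner on each node (amirsedighi is simply the name used throughout this tutorial):

    sudo adduser amirsedighi    # run on Alpha, Beta, Delta and Gamma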

SSH Communication
Hadoop uses SSH for communicating between machines. Here I used amirsedighi with the same password on all machines.
SSH uses standard public-key cryptography to create a pair of keys for user verification: one public, one private. The public key is distributed to every node in the cluster, while the private key stays on the master node; when the master attempts to access a remote machine, it proves possession of the private key, and with both pieces of information the target machine can validate the login attempt. As all machines should be able to have a trusted SSH communication with the master node (Alpha), first check whether SSH is installed on your nodes:
Checking whether SSH is installed or not.
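
A quick sketch of that check (the exact locations may differ on your distribution):

    which ssh         # the SSH client
    which sshd        # the SSH server daemon
    which ssh-keygen  # the key generation tool

If each command prints a path, the corresponding piece is installed.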

If you get any error message, just install OpenSSH (www.openssh.com). It is better to do this with the OS default package manager (I used Synaptic).

Use the master node (Alpha) to generate an SSH key pair with the following command:
Generating an SSH key on the master node (Alpha).
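
A minimal sketch of that step, run as the common user on Alpha; leaving the passphrase empty is what makes the later logins password-free, so decide whether that trade-off suits you:

    ssh-keygen -t rsa
    # accept the default location (~/.ssh/id_rsa) and leave the passphrase
    # empty; the public key ends up in ~/.ssh/id_rsa.pub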

Now you need to copy the generated public key from Alpha to the other nodes (Beta, Delta and Gamma).
Copying the SSH key from the master to the other nodes.
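
For example with scp; the user name and hostnames come from this tutorial, and the target file name master_key.pub is just an arbitrary choice for the next step:

    scp ~/.ssh/id_rsa.pub amirsedighi@Beta:~/master_key.pub
    scp ~/.ssh/id_rsa.pub amirsedighi@Delta:~/master_key.pub
    scp ~/.ssh/id_rsa.pub amirsedighi@Gamma:~/master_key.pub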

You need to have a .ssh folder in the home folder of the common user on all nodes; create it if it does not exist yet. Then go to each of the other nodes and run the following command:
Copying the SSH key into the .ssh folder of all nodes except the master.
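
On each of Beta, Delta and Gamma the copied public key has to end up in ~/.ssh/authorized_keys; a sketch that matches the master_key.pub name used in the copy step:

    mkdir -p ~/.ssh
    cat ~/master_key.pub >> ~/.ssh/authorized_keys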

Now you need to run two more commands in a terminal on each of those nodes: “chmod 700 .ssh” and “chmod 600 .ssh/authorized_keys”.

You should now be able to connect from Alpha to Beta, Delta and Gamma without giving a password, using the following commands:
Checking whether SSH works correctly.
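
For example, from Alpha as the common user:

    ssh Beta
    ssh Delta
    ssh Gamma
    # each login should succeed without asking for a password; type exit to come back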

Setting Hadoop Up
Just download the latest version of Hadoop from the Apache download page. Follow the instructions and install it wherever you want; I installed it into the home folder of the common Hadoop user (amirsedighi). I use the following version on my cloud:
The version of Hadoop used in this tutorial.

You need to export an environment variable that locates the Hadoop folder.
Hadoop home.
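
A sketch of the unpacking and the environment setup; hadoop-x.y.z stands for whatever release you downloaded, and the paths assume the common user's home folder:

    # unpack the downloaded release into the common user's home folder
    tar xzf hadoop-x.y.z.tar.gz -C ~

    # ~/.bashrc of the common Hadoop user
    export HADOOP_HOME=~/hadoop-x.y.z
    export PATH=$HADOOP_HOME/bin:$PATH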

Find the conf folder under $HADOOP_HOME; the Hadoop configuration files live there.
In core-site.xml and mapred-site.xml we specify the hostname and port of the NameNode and the JobTracker, respectively. In hdfs-site.xml we specify the default replication factor for HDFS; since this cluster has only two DataNodes (Delta and Gamma), a factor higher than two makes no sense here.
The following snapshot shows the files you should modify to set up your cloud:
Hadoop configuration files.

You only need to modify the core-site.xml, mapred-site.xml, masters and slaves files and nothing more; in addition, you can set the replication factor in hdfs-site.xml if you need to change it. A sketch of these files is shown below:
Hadoop configuration files.
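
Here is a minimal sketch of those five files for this cluster, using the 0.20-era property names from Hadoop in Action; the ports 9000 and 9001 are conventional choices, not requirements:

    conf/core-site.xml (location of the NameNode):
        <?xml version="1.0"?>
        <configuration>
          <property>
            <name>fs.default.name</name>
            <value>hdfs://Alpha:9000</value>
          </property>
        </configuration>

    conf/mapred-site.xml (location of the JobTracker):
        <?xml version="1.0"?>
        <configuration>
          <property>
            <name>mapred.job.tracker</name>
            <value>Alpha:9001</value>
          </property>
        </configuration>

    conf/hdfs-site.xml (HDFS replication factor):
        <?xml version="1.0"?>
        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>2</value>
          </property>
        </configuration>

    conf/masters (the host of the Secondary NameNode):
        Beta

    conf/slaves (one slave hostname per line):
        Delta
        Gamma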

You should do the same across all the nodes in your cluster. The easiest way is to copy the configured Hadoop folder to the same place on every machine.
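
For example, again with scp and the passwordless SSH set up earlier (hadoop-x.y.z is the placeholder from above):

    scp -r ~/hadoop-x.y.z amirsedighi@Beta:~
    scp -r ~/hadoop-x.y.z amirsedighi@Delta:~
    scp -r ~/hadoop-x.y.z amirsedighi@Gamma:~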

Running Hadoop
You are almost ready to start Hadoop, but first you need to format HDFS by using the following command:
Formatting HDFS.
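
The format step is run once, on the master node (Alpha) only; for the 0.20-era releases the command is:

    $HADOOP_HOME/bin/hadoop namenode -format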

You can now launch the daemons using the start-all.sh script:
Starting Hadoop up.
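
Again on Alpha:

    $HADOOP_HOME/bin/start-all.sh
    # starts the NameNode and JobTracker locally, the Secondary NameNode on the
    # host listed in conf/masters, and a DataNode plus TaskTracker on every host
    # listed in conf/slaves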

The Java jps command will list the running daemons, so you can verify that the setup was successful:
jps on Alpha
jps on Beta
jps on Delta
jps on Gamma
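
Given the roles described at the beginning, you would expect roughly the following daemons per node (besides the Jps process itself; the process IDs will of course differ):

    Alpha: NameNode, JobTracker
    Beta:  SecondaryNameNode
    Delta: DataNode, TaskTracker
    Gamma: DataNode, TaskTracker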

You have a functioning cluster!

Web UI
Hadoop provides a web UI. The browser interface lets you reach the information you want much faster than digging through logs and directories. The NameNode hosts a general report on port 50070; it gives you an overview of the state of your cluster's HDFS:
Hadoop Web UI.
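
With the hostnames of this tutorial that is simply:

    http://Alpha:50070/    # HDFS / NameNode status report
    # the JobTracker serves a similar MapReduce report, typically on port 50030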

Next Steps
What we will see in the next parts of this tutorial is processing huge files using HDFS and Google's MapReduce model.
