It always seemed strange to me how Google searches through millions of pages in the blink of an eye. I had heard about the MapReduce algorithm and Google's cloud-computing server farms, but it still didn't make sense to me, because a Google search was even faster than local searches I ran against the fastest RDBMSs. Essentially, relational databases simply can't probe gigantic result sets as fast as a Google search does.
After a while, while chatting with an old friend, Ara Abrahamian, I became familiar with Apache Hadoop. He introduced me to a fantastic book, “Hadoop in Action”, which he was one of the reviewers of. I couldn't stop reading it. I want to quote the part of the book that explains where the name Hadoop comes from:
When naming software projects, Doug Cutting seems to have been inspired by his family. Lucene is his wife’s middle name, and her maternal grandmother’s first name. His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop. Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
I think Hadoop is the most practical enterprise framework I've ever seen in the Java stack. It works at the enterprise level of business, and it is very easy to set up on your local machine or across distributed servers. I was amazed at how much performance it delivers just by running a few simple terminal commands.
Today we are all surrounded by raw, unformatted information. Hadoop is the right tool for organizing huge amounts of it and extracting clear facts from millions of lines of documents. So I think it will capture a big share of the software market and become a complement to every enterprise information system.
I wanted to use Hadoop in practice, so I asked a friend at DatisPars to define a project: gather the log files of different Apache web servers from a number of Linux/Windows servers at an ISP and merge them all into one huge map-reduced file stored in the Hadoop cluster's HDFS. At first I started with the standalone and pseudo-distributed modes, which are more productive during the development phase. It was awesome, but I wasn't satisfied, so I moved on to a number of Linux servers installed in VirtualBox on my Mac. (Ubuntu seems to perform better when hosted on Mac OS X.) Working with Hadoop feels like the kind of computing we've seen in black-and-white movies from the 1950s: a huge amount of processing, specialized in transforming and crunching digits.
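For reference, the pseudo-distributed mode I mention above mostly comes down to a few lines of XML in Hadoop's `conf` directory. A minimal sketch, using the common defaults from the Hadoop documentation of that era (the hostnames and ports are the usual examples, adjust them for your own machine):

```xml
<!-- conf/core-site.xml: point the filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: one machine, so keep a single replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run the JobTracker locally too -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

With this in place, all the daemons run on one box but talk to each other over sockets just as they would on a real cluster, which is exactly what makes this mode so handy for development.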
Our goal is to get an overall picture of what happens across the different application servers by analyzing their huge log files with MapReduce and centralizing the result on HDFS for further processing, such as counting specific exceptions, error codes, and warnings.
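To make the idea concrete, here is a minimal sketch of the map and reduce phases for counting log levels, in plain Java with no Hadoop dependency so the shape is easy to see. In the real job these would live in Hadoop `Mapper`/`Reducer` subclasses; the class, method names, and sample log lines here are purely illustrative, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LogLevelCount {

    // Map phase: for one log line, emit a (level, 1) pair for each
    // level token it mentions. ERROR/WARN/INFO are example tokens;
    // a real mapper would parse the server's actual log format.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String level : new String[] {"ERROR", "WARN", "INFO"}) {
            if (line.contains(level)) {
                pairs.add(Map.entry(level, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum all the counts emitted for the same key.
    // Hadoop would group the pairs by key and ship them to reducers;
    // here we just merge them into one map.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] log = {
            "2010-05-01 10:02:11 ERROR connection refused",
            "2010-05-01 10:02:15 WARN slow response",
            "2010-05-02 09:41:03 ERROR request timeout"
        };
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : log) {
            emitted.addAll(map(line));
        }
        System.out.println(reduce(emitted));
    }
}
```

The point of the split is that `map` only ever looks at one line, so Hadoop can run thousands of mappers in parallel across the cluster, while `reduce` only ever sums numbers for one key, so the final counting parallelizes just as well.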