How To Customize “Hibernate Order By”

There was an issue with ordering date fields that were stored as strings on the database side. I needed to fix it with minimal changes.
So I tried to apply a type cast to the string fields that contain dates, but it didn’t work for me.
Finally I overrode the Hibernate Order criterion, and it works like a charm.

The following class extends the Hibernate Order class and overrides the behavior I needed to customize…

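package com.hyperbase.core.util;

import org.hibernate.Criteria;
import org.hibernate.HibernateException;
import org.hibernate.criterion.CriteriaQuery;
import org.hibernate.criterion.Order;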

/**
 * Created by IntelliJ IDEA.
 * User: amirsedighi
 * Date: May 15, 2012
 * Time: 6:59:38 PM
 * To change this template use File | Settings | File Templates.
 */

public class CustomizedOrderBy extends Order {
    private String sqlExpression;

    protected CustomizedOrderBy(String sqlExpression) {
        super(sqlExpression, true);
        this.sqlExpression = sqlExpression;
    }

    public String toSqlString(Criteria criteria, CriteriaQuery criteriaQuery) throws HibernateException {
        return sqlExpression;
    }

    public static Order sqlFormula(String sqlFormula) {
        return new CustomizedOrderBy(sqlFormula);
    }

    public String toString() {
        return sqlExpression;
    }

}

Using it is also very simple:

            Criteria criteria = hbSession.getPublishModeSession().createCriteria("User_" + table.getGeneratedID());
            if (sortBy != null){
                if (sortDir.equalsIgnoreCase("asc")){
                    if (sortBy.getType().equalsIgnoreCase("date")){
                        criteria.addOrder(CustomizedOrderBy.sqlFormula("convert( datetime, "+sortBy.getInnerName()+"  , 101) "+ sortDir));
                    } else {
                        criteria.addOrder(Order.asc(sortBy.getInnerName()));
                    }

                } else if (sortDir.equalsIgnoreCase("desc")) {
                    if (sortBy.getType().equalsIgnoreCase("date")){
                        criteria.addOrder(CustomizedOrderBy.sqlFormula("convert( datetime, "+sortBy.getInnerName()+"  , 101) "+ sortDir));
                    } else {
                        criteria.addOrder(Order.desc(sortBy.getInnerName()));
                    }

                }
            }

I like it.
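
The snippet above concatenates the column name and the sort direction straight into SQL. The following is a minimal sketch, assuming a hypothetical helper class named OrderFormula (it is not part of Hibernate or of the original code), of building the same formula string while only accepting a plain identifier and "asc" or "desc":

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Hibernate: builds the ORDER BY fragment for a date stored as text. */
final class OrderFormula {
    // Assumption: column names are plain identifiers; anything else is rejected.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    private OrderFormula() {}

    static String dateOrder(String column, String direction) {
        if (column == null || !IDENTIFIER.matcher(column).matches()) {
            throw new IllegalArgumentException("Unexpected column name: " + column);
        }
        String dir = direction == null ? "" : direction.trim().toLowerCase(Locale.ROOT);
        if (!dir.equals("asc") && !dir.equals("desc")) {
            throw new IllegalArgumentException("Unexpected sort direction: " + direction);
        }
        // Style 101 is the SQL Server mm/dd/yyyy format used in the post.
        return "convert(datetime, " + column + ", 101) " + dir;
    }
}
```

The returned string could then be handed to CustomizedOrderBy.sqlFormula(...), so one call would cover both the ascending and the descending branch for date columns.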

Posted in Java | Leave a comment

Huge Data Processing Applying Hadoop Cluster – Part 4

Finally we are getting close to the project’s target: finding spam tweets in Twitter logs and finding out whether spam occurs more often in advertisements.

In the previous posts of this series we learned how to set up and deploy a Hadoop cluster. Then we developed a GetLogs command for Hadoop that imports files from the nodes and creates a merged file in HDFS. We also learned how to count the occurrences of each tweet in the log file. In this post we will learn how to count spam words such as ‘F’ words. We will also need to count all the words of the documents for an overall comparison across different situations.

Preparing
First of all we need a prepared HDFS. Run the following commands to import all the tweets placed in the ‘input’ folder into HDFS. You don’t need this if you already did it in the previous posts.

bin/hadoop fs -mkdir /tweets
bin/hadoop jar GetLogs/getlogs.jar com.hexican.hadoop.GetLogs \
    input /tweets/giantTweet.txt

WordCount
WordCount is probably the simplest Hadoop command. It is also the base of all the MapReduce classes in this series.

package com.hexican.hadoop;

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            Set<String> nullWords = new HashSet<String>();
            nullWords.add("null");
            // add whatever you want to be rejected

            StringTokenizer itr = new StringTokenizer(value.toString()); // Tokenizing 

            while (itr.hasMoreTokens()) {
                String nextToken = itr.nextToken(); // Read the next token as a String.
                if (nullWords.contains(nextToken.toLowerCase())){ // ignore null words.
                    continue;
                }
                word.set(nextToken); // Move the word into a Text object
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }

            result.set(sum);
            context.write(key, result); // Output count of each token
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <inFile> <outFile>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Just as with GetLogs and TweetCount, get ready by creating the WordCount, WordCount/src and WordCount/classes folders. Then run the following commands to build the jar file:

javac -classpath hadoop-core-0.20.203.0.jar:lib/commons-cli-1.2.jar -d WordCount/classes/ WordCount/src/WordCount.java
jar -cvf WordCount/wordcount.jar -C WordCount/classes/ .

Note: I’ve assumed you run everything in the hadoop folder.
Now you should have wordcount.jar in the WordCount folder.

We need another class for counting spam words. I called it SpamCount. This is just another simple MapReduce class.

package com.hexican.hadoop;

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class SpamCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString()); /* Tokenizing */
            Set<String> spams = new HashSet<String>();

            //            Spam words,	    Just add whatever you want.
            spams.add("-online");
            spams.add("4u");
            spams.add("adipex");
            spams.add("advicer");
            spams.add("ass");
            spams.add("baccarrat");
            spams.add("blackjack");
            spams.add("bllogspot");
            spams.add("booker");
            spams.add("byob");
            spams.add("car-rental-e-site");
            spams.add("car-rentals-e-site");
            spams.add("carisoprodol");
            spams.add("casino");
            spams.add("casinos");
            spams.add("celebrity");
            spams.add("chatroom");
            spams.add("cialis");
            spams.add("coolcoolhu");
            spams.add("coolhu");
            spams.add("credit-card-debt");
            spams.add("credit-report-4u");
            spams.add("cute");
            spams.add("cutes");
            spams.add("cwas");
            spams.add("cyclen");
            spams.add("cyclobenzaprine");
            spams.add("dating-e-site");
            spams.add("dating");
            spams.add("date");
            spams.add("day-trading");
            spams.add("debt-consolidation");
            spams.add("debt-consolidation-consultant");
            spams.add("discreetordering");
            spams.add("duty-free");
            spams.add("dutyfree");
            spams.add("equityloans");
            spams.add("fioricet");
            spams.add("flowers-leading-site");
            spams.add("freenet-shopping");
            spams.add("freenet");
            spams.add("fuc");
            spams.add("fuck");
            spams.add("gambling");
            spams.add("girl");
            spams.add("girls");
            spams.add("hair-loss");
            spams.add("health-insurancedeals-4u");
            spams.add("homeequityloans");
            spams.add("homefinance");
            spams.add("holdem");
            spams.add("holdempoker");
            spams.add("holdemsoftware");
            spams.add("holdemtexasturbowilson");
            spams.add("hotel-dealse-site");
            spams.add("hotele-site");
            spams.add("hotelse-site");
            spams.add("incest");
            spams.add("insurancedeals-4u");
            spams.add("jrcreations");
            spams.add("levitra");
            spams.add("macinstruct");
            spams.add("mortgage-4-u");
            spams.add("mortgagequotes");
            spams.add("online-gambling");
            spams.add("onlinegambling-4u");
            spams.add("ottawavalleyag");
            spams.add("ownsthis");
            spams.add("palm-texas-holdem-game");
            spams.add("paxil");
            spams.add("penis");
            spams.add("pharmacy");
            spams.add("phentermine");
            spams.add("poker-chip");
            spams.add("poze");
            spams.add("pussy");
            spams.add("punk");
            spams.add("rental-car-e-site");
            spams.add("ringtones");
            spams.add("roulette");
            spams.add("shemale");
            spams.add("shoes");
            spams.add("slot-machine");
            spams.add("strip");
            spams.add("strios");
            spams.add("texas-holdem");
            spams.add("thorcarlson");
            spams.add("top-site");
            spams.add("top-e-site");
            spams.add("tramadol");
            spams.add("trim-spa");
            spams.add("ultram");
            spams.add("valeofglamorganconservatives");
            spams.add("viagra");
            spams.add("vioxx");
            spams.add("xanax");
            spams.add("zolus");
            spams.add("کنسرت");
            spams.add("missing-and-abused-kids");
            spams.add("s p a n k e");
            spams.add("fook");
            Long totalWords = 0L;
            while (itr.hasMoreTokens()) {
                totalWords++;
                String nextToken = itr.nextToken(); // Read the next token as a String.

                if (spams.contains(nextToken.toLowerCase())){  // collect spams.
                    word.set(nextToken.toLowerCase()); // Move the spam word into a Text object
                    context.write(word, one);
                }
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }

            result.set(sum);
            context.write(key, result); // Output count of each token
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: spamcount <inFile> <outFile>");
            System.exit(2);
        }
        Job job = new Job(conf, "spam count");
        job.setJarByClass(SpamCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
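
Two details of the mapper above are worth a second look: the spam set is rebuilt on every map() call, and StringTokenizer splits on whitespace only, so a token such as "viagra!" would not match. The following is a plain-Java sketch of the matching step (no Hadoop classes). The class name SpamMatcher and the shortened word list are made up for illustration, and entries that start with punctuation, such as "-online", would need the same normalization applied to the list itself.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical plain-Java sketch of the spam matching step; no Hadoop classes involved. */
final class SpamMatcher {
    // Built once instead of on every map() call; shortened list for illustration.
    private static final Set<String> SPAMS = Collections.unmodifiableSet(
            new HashSet<String>(Arrays.asList("casino", "viagra", "cute", "punk")));

    private SpamMatcher() {}

    /** Lowercases a token and trims leading/trailing characters that are not letters or digits. */
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT)
                .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    /** Counts spam words in one line of text, keyed by the normalized word. */
    static Map<String, Integer> countSpams(String line) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            String word = normalize(itr.nextToken());
            if (SPAMS.contains(word)) {
                Integer old = counts.get(word);
                counts.put(word, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }
}
```

In a real mapper the same idea would mean keeping the set in a static field and normalizing each token before the contains() check.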

Do the same as previous classes and build spamcount.jar.

javac -classpath hadoop-core-0.20.203.0.jar:lib/commons-cli-1.2.jar \
    -d SpamCount/classes/ SpamCount/src/SpamCount.java
jar -cvf SpamCount/spamcount.jar -C SpamCount/classes/ .

Running MapReduce Commands
We’ve created a big merged file from all the tweets. The file is placed at the following path:

/tweets/giantTweet.txt

We use it as the input file for the two MapReduce commands we built:

bin/hadoop jar WordCount/wordcount.jar com.hexican.hadoop.WordCount \
    /tweets/giantTweet.txt /tweets/giantTweet_wordCounted
bin/hadoop jar SpamCount/spamcount.jar com.hexican.hadoop.SpamCount \
    /tweets/giantTweet.txt /tweets/giantTweet_spamCounted

The results can be found at the following paths in HDFS:

/tweets/giantTweet_wordCounted/part-r-00000
/tweets/giantTweet_spamCounted/part-r-00000

At this point I need to count them all, so I fetch them with the following Hadoop file system commands:

bin/hadoop fs -get /tweets/giantTweet_wordCounted/part-r-00000 result/giantTweet_wc.txt
bin/hadoop fs -get /tweets/giantTweet_spamCounted/part-r-00000 result/giantTweet_sc.txt
bin/hadoop fs -get /tweets/giantTweet.txt result/.

I assumed you’ve created the ‘result’ folder in the hadoop folder. That is where we put the MapReduce results.

Have a look at them. giantTweet_wc.txt contains every word along with its count. giantTweet_sc.txt contains every spam word that occurred in giantTweet.txt along with its count. giantTweet.txt contains all the tweets merged into one file.

Count their lines with the following Linux command:

wc -l [fileName]

Now you can tune TweetCount by setting the proper threshold for finding advertisements (the minimum count checked in its reducer). I tried it with 3, 10, 15 and 20 as the minimum occurrence of each tweet. I assumed that repeated tweets are some kind of advertisement.

bin/hadoop jar TweetCount/tweetcount.jar com.hexican.hadoop.TweetCount /tweets/giantTweet.txt /tweets/ads

Also repeat the word and spam counting on the TweetCount results with the following commands:

bin/hadoop jar WordCount/wordcount.jar com.hexican.hadoop.WordCount \
    /tweets/ads/part-r-00000 /tweets/ads/wordCounted
bin/hadoop jar SpamCount/spamcount.jar com.hexican.hadoop.SpamCount \
    /tweets/ads/part-r-00000 /tweets/ads/spamCounted

Then get the results and count the lines just as you did for giantTweet.txt.

bin/hadoop fs -get /tweets/ads/part-r-00000 result/ads.txt
bin/hadoop fs -get /tweets/ads/wordCounted/part-r-00000 result/ads_wc.txt
bin/hadoop fs -get /tweets/ads/spamCounted/part-r-00000 result/ads_sc.txt
wc -l result/ads.txt
wc -l result/ads_wc.txt
wc -l result/ads_sc.txt

The following table shows that spam words are mostly found in massively sent tweets such as advertisements.

              Occ=1    Occ>3   Occ>10   Occ>15   Occ>20
Tweets No.    54431     1164      354      230      159
Words No.    203010     8034     2674     1857     1422
Spams No.      1984     1289     1265     1152     1051
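
One way to read the table is to divide the spam count by the number of tweet lines in each column. The sketch below does just that. The class name SpamDensity is made up for this example, the figures are copied from the table, and the ratio is only a rough indicator because the spam row is a total count while the tweet row counts lines of the result files.

```java
/** Hypothetical helper: spam words per tweet line for each column of the table. */
final class SpamDensity {
    static final String[] LABELS = {"Occ=1", "Occ>3", "Occ>10", "Occ>15", "Occ>20"};
    static final int[] TWEETS = {54431, 1164, 354, 230, 159};
    static final int[] SPAMS = {1984, 1289, 1265, 1152, 1051};

    private SpamDensity() {}

    /** Spam words divided by tweet lines for the given column index (0 to 4). */
    static double spamsPerTweet(int column) {
        return (double) SPAMS[column] / TWEETS[column];
    }

    public static void main(String[] args) {
        for (int i = 0; i < LABELS.length; i++) {
            System.out.printf("%-7s %.2f spam words per tweet line%n", LABELS[i], spamsPerTweet(i));
        }
    }
}
```

The ratio climbs from roughly 0.04 for all tweets to roughly 6.6 for tweets repeated more than 20 times.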

The following chart was drawn from the data in the table above:

Note: The Spams No. row contains the total number of spam words. The following is the result of counting spam words in the content of tweets that were each repeated more than 20 times (the red cell):

Date	1
Fuc	1
PUSSY	1
ass	1
cute	3
date	2
girl	1
missing-and-abused-kids	1
punk	2

Then I counted exactly how many times each spam word was repeated in its TweetCount result.

So the real occurrence of each one is:

Spam                      MapReduce   Repeated Occurrence
Date                      1           24
Fuc                       1           188
PUSSY                     1           30
ass                       1           21
cute                      3           96
date                      1           21
date                      1           22
girl                      1           191
missing-and-abused-kids   1           213
punk                      1           26
punk                      1           27
SUM                                   1051
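
The SUM row can be checked by multiplying, for every row, the count found by MapReduce by the number of times the tweet was repeated, and adding the products up. A small sketch of that arithmetic (the class name SpamSum is made up, the numbers are copied from the table):

```java
/** Hypothetical check of the SUM row: spam count per tweet times how often that tweet was repeated. */
final class SpamSum {
    // Rows copied from the table: {count found by MapReduce, repetitions of the tweet}.
    static final int[][] ROWS = {
        {1, 24}, {1, 188}, {1, 30}, {1, 21}, {3, 96}, {1, 21},
        {1, 22}, {1, 191}, {1, 213}, {1, 26}, {1, 27}
    };

    private SpamSum() {}

    static int total() {
        int sum = 0;
        for (int[] row : ROWS) {
            sum += row[0] * row[1];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(total()); // 1051, matching the SUM row
    }
}
```

Only the “cute” row contributes more than its repetition count (3 × 96 = 288), because the word appears three times in one tweet.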

“cute” is repeated 3 times in a single tweet.

cat ads.txt | grep ' cute '
       The saying "U cute 2 be" is soooooooo funny 2 me! U cute 2 be big, or u cute 2 be dark skin....lma	96

As you can see, “punk” appears in two different tweets:

cat ads.txt | grep ' punk '
	Get to it! LOL RT @cesleyb: Im about to show this phony punk a fook!! RT @mrsdarian: @cesleyb you're a FOOL! Smh..	26
	Get to it! LOL RT @cesleyb: Im about to show this phony punk a fook!! RT @mrsdarian: @cesleyb you're a FOOL! Smh..	27

It seems that the more a tweet is repeated, the more spam words it has! We assumed that repeated tweets are mostly advertisements.

OK, next I am going to optimize the procedures and make them all more automated.

Posted in Cloud Computing, Java, Software Engineering | Leave a comment

Getting Familiar with Alternative Technologies

My professional work as a software developer started at a time when DOS was the only available OS. I didn’t even use SQL. Pascal was the only rapid development language. Applications didn’t have any defined architecture; everything was written in a single messed-up layer. The LAN was a brand-new concept. Everything has changed over the years, and a huge improvement has happened. In my opinion the most significant improvement is not in development methodologies or platforms. Developers now have a choice, and this is the biggest improvement.

I think the time of developing a medium-size software application with just a single development language is coming to an end. We are entering a new era in software development that is based on combining a number of technologies to produce a single solution, because we have a choice. We are surrounded by good, mature software technologies for every purpose.

At first glance, it doesn’t seem like we would need anything beyond a design and a development tool set. But when we begin to understand real circumstances and modern high-profile requirements, we see that it is not enough to choose a single enterprise platform such as .Net or Java. The time of multipurpose technologies has gone. This is the age of specialized tools.

I believe a perfect solution is usually a good combination of several software languages and technologies, each of which is one of the best for its specific purpose. A combination of different technologies such as C++, Python, Java and JavaScript is just one good example.

At a higher level we can also use thousands of open-source products as parts of our solutions. At the OS level, too, we have a wide range of operating systems, each customized and optimized for specific purposes.

Something interesting in this story is the chance to use new large-scale data-processing services built on cloud computing and NoSQL databases instead of conventional data processing with RDBMSs. For instance, Apache Hadoop is a versatile tool that can be used for many different kinds of data processing. Adding its capabilities is the shortest way to bring enterprise scalability to your solution.

What all of this means to software developers is that if they are working in the enterprise application world, then this is the time to start getting familiar with alternative technologies.

Posted in Software Engineering, Software Market Demands | 1 Comment

Huge Data Processing Applying Hadoop Cluster – Part 3

MapReduce
In the previous posts we deployed a Hadoop cluster, then we developed a Java class that can be used as an additional Hadoop command for gathering tweets from different log files and merging them all into a single file in HDFS.
Now we need to process the huge file with the MapReduce algorithm. MapReduce is about manipulating key/value pairs. In this post we try to understand how MapReduce works. We will also develop a new command for Hadoop to detect the most repeated tweets. We assume that spam usually occurs frequently, so we develop a MapReduce job for detecting spam. Essentially we have defined two approaches for detecting spam. This is one of them; the other is looking for reserved words.

MapReduce is really just a data processing method. The most interesting thing about MapReduce is the ability to run it across multiple computers (nodes).

MapReduce has two main building blocks called “mappers” and “reducers”. It is possible to move processing to each node just by changing configuration. This gives the MapReduce model brilliant scalability. Consider how we’ve used MapReduce on a simple statement.

How MapReduce Works
MapReduce uses “lists” and key/value pairs as data primitives. The keys and values are usually strings. The input to our application is tweets, as discussed. Tweets are close to key/value pairs already: the user name of the person who tweeted can be used as the key. This gives us a great opportunity to choose the proper key according to our business policy.

1. In our project the tweets are structured as a list of key/value pairs, list(<k1, v1>). The input format for processing the tweets in the large file we have created is list(<line offset, log line>).

2. The list of (key/value) pairs is broken up and each individual pair, <k1, v1>, is processed by calling the map function of the mapper. The mapper we will develop here transforms each <k1, v1> into a list of <k2, v2>. For spam detection, our mapper takes <line offset, log line>. Its output should be a list of <tweet, count>. The output can also be simpler. The counts will be aggregated in a later stage, so we can output a list of <tweet, 1> with repeated pairs and let it be aggregated later. So either of the following results is possible for the first step:
<"Hey, I am a just an annoying spam", 3>
OR
<"Hey, I am a just an annoying spam", 1>
<"Hey, I am a just an annoying spam", 1>
<"Hey, I am a just an annoying spam", 1>

The second approach is much easier to develop, while the first has some performance benefits. Regardless of the approach we choose, we are able to calculate the occurrence of any tweet, and this is what we were looking for.

3. The output of all mappers is aggregated into one giant list of <k2, v2> pairs. All pairs sharing the same k2 are grouped together into a new (key/value) pair, <k2, list(v2)>.
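
The grouping in step 3 can be sketched in plain Java without Hadoop. This is only an in-memory illustration with a made-up class name (GroupByKey); the real framework does the grouping between the map and reduce phases across nodes. It also shows that both mapper output styles from step 2 lead to the same total.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of step 3: grouping <k2, v2> pairs into <k2, list(v2)>. */
final class GroupByKey {
    private GroupByKey() {}

    /** Collects the values of all pairs that share the same key, in arrival order. */
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    /** What a summing reducer does with one list(v2). */
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    public static void main(String[] args) {
        String tweet = "Hey, I am a just an annoying spam";
        // Second approach: three <tweet, 1> pairs. First approach: one <tweet, 3> pair.
        List<Integer> ones = group(Arrays.asList(pair(tweet, 1), pair(tweet, 1), pair(tweet, 1))).get(tweet);
        List<Integer> preAggregated = group(Arrays.asList(pair(tweet, 3))).get(tweet);
        System.out.println(sum(ones) + " " + sum(preAggregated)); // both sums are 3
    }
}
```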

The following pseudo-code shows the map and reduce functions for a word counter:

map(String filename, String document) {
    List<String> tokenizedList = tokenize(document);
    for(token: tokenizedList) {
        doSomething ((String)token, (Integer) 1);
    }
}

reduce(String token, List<Integer> values) {
	Integer sum=0;
	for(value:values){
		sum = sum + value;
	}
	doSomething ((String) token, (Integer) sum);
}
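
For readers who want to run the pseudo-code, here is a minimal in-memory version in plain Java. It is only a sketch: the class name LocalWordCount is made up, and the doSomething calls of the pseudo-code are replaced by returning collections, so it mimics the data flow of MapReduce without any distribution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical in-memory version of the pseudo-code; it only mimics the data flow of MapReduce. */
final class LocalWordCount {
    private LocalWordCount() {}

    /** map: emit every token of the document (each one stands for a <token, 1> pair). */
    static List<String> map(String document) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(document);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    /** reduce: add up the 1s emitted for each token. */
    static Map<String, Integer> reduce(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String token : tokens) {
            Integer old = counts.get(token);
            counts.put(token, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map("spam ham spam"))); // prints {ham=1, spam=2}
    }
}
```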

How Could We Recognize Spam Tweets?
The following image is a snapshot of a Twitter log file. It is possible to get such files with the Twitter API.

The tweet logs that I am going to process are two huge text files. It is really difficult to tell what is spam and what is not when you read through them. The log files are really bizarre. As I mentioned, I assume that tweets containing certain special words, such as Viagra, are spam. But this is not enough. We often receive a lot of advertising tweets with no special words. I use a simple rule for finding this second group: I believe that a tweet that has been sent frequently is likely to be spam. The following code is a simple sample that detects spam by processing the giant log file we created by merging the Twitter log files.

package com.hexican.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class TweetCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

      StringTokenizer itr = new StringTokenizer(value.toString(),"\n\r\f"); // Tokenize using newline

      while (itr.hasMoreTokens()) {
        String nextToken = itr.nextToken(); // Read the next line as a String.
        if (nextToken.trim().length() <= 80) {  // Ignore empty or very short lines.
          continue;
        }

        word.set(nextToken.substring(60, nextToken.length()-1)); // Move the tweet content into a Text object
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      if (sum > 10) {  // Treat whatever is repeated more than 10 times as spam
        result.set(sum);
        context.write(key, result);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: tweetcount [inFile] [outFile]");
      System.exit(2);
    }
    Job job = new Job(conf, "Tweet (Spam) count");
    job.setJarByClass(TweetCount.class);
    job.setMapperClass(TokenizerMapper.class);
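    // No combiner here: IntSumReducer drops keys with sum <= 10, so running it on
    // partial map output could discard counts that only pass the threshold once merged.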
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

This is just a modified version of WordCount, a Hadoop sample code.
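
One thing to watch with a reducer that filters as well as sums: if the threshold is applied to partial counts (which is what a combiner would do), a tweet can be dropped even though its total passes. The plain-Java sketch below illustrates this with the minimum occurrence as a parameter. The class name ThresholdFilter is made up, and no Hadoop classes are involved; in the Hadoop job itself the value could be passed through the job Configuration (setInt/getInt) instead of being hard-coded, which is not shown here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the threshold step, with the minimum occurrence as a parameter. */
final class ThresholdFilter {
    private ThresholdFilter() {}

    /** Keeps only entries whose count is greater than minOccurrence. */
    static Map<String, Integer> filter(Map<String, Integer> counts, int minOccurrence) {
        Map<String, Integer> kept = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > minOccurrence) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    /** Adds the counts of two partial results (for example, the outputs of two mappers). */
    static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> merged = new TreeMap<String, Integer>(a);
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            Integer old = merged.get(e.getKey());
            merged.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
        }
        return merged;
    }
}
```

With a tweet counted 6 times in each of two partial results, merging first and then filtering with a threshold of 10 keeps it with a count of 12, while filtering each partial result first leaves nothing to merge.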

To use it, go to the hadoop folder on the Alpha machine (NameNode) and follow these steps:
Create the TweetCount, TweetCount/src and TweetCount/classes folders, just as for the GetLogs code.
Then create TweetCount/src/TweetCount.java from the code above.
Now we need to compile TweetCount.

The following command creates a jar file from the TweetCount class.

And finally we can run TweetCount over the giant log file from the previous post.

The result is interesting! It seems our code catches spam tweets very well:

This can be optimized to work better. In the next post we will add a reserved-word-based spam detection method to use more of Hadoop’s data processing power.

Posted in Cloud Computing, Java, Software Engineering | Leave a comment

Huge Data Processing Applying Hadoop Cluster – Part 2

The previous post showed us how to set up and deploy a real Hadoop cluster. As I mentioned, the target of this little project is to port a number of huge log files, such as Twitter logs, into a cluster for later processing. So at this point we need to put a number of huge files into the HDFS we created previously.

The following code lets us pull the source files from a certain folder and merge them all into one file while putting them into HDFS:

package com.hexican.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;

public class GetLogs {
    public static void main(String[] args) throws IOException {
        if(args.length != 2) {
            System.out.println("Usage: GetLogs [FolderName] [MergedFile]");
            System.exit(1);
        }

        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);
        int filesProcessed = 0;

        Path inputDir = new Path(args[0]);
        Path hdfsFile = new Path(args[1]);

        try {
            FileStatus[] inputFiles = local.listStatus(inputDir);
            FSDataOutputStream out = hdfs.create(hdfsFile);
            for(int i = 0; i < inputFiles.length; i++) {
                if(!inputFiles[i].isDir()) {
                    System.out.println("\t Adding and Merging... <" + inputFiles[i].getPath().getName() + ">");
                    FSDataInputStream in = local.open(inputFiles[i].getPath());

                    byte buffer[] = new byte[256];
                    int bytesRead = 0;
                    while ((bytesRead = in.read(buffer)) > 0) {
                        out.write(buffer, 0, bytesRead);
                    }
                    filesProcessed++;
                    in.close();
                }
            }
            out.close();
            System.out.println("\n " + filesProcessed + " file successfully added  and merged into [" + hdfsFile.getName() + "].");
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}
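
To try the merge logic without a cluster, the same loop can be sketched against the local file system with java.nio (Java 7 or later). This is not the HDFS code above, only a local stand-in with a made-up class name (LocalMerge). It assumes the merged file lives outside the input folder, and the order in which files are appended is whatever the directory listing returns.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical local-only counterpart of GetLogs: merges the files of a folder into one file. */
final class LocalMerge {
    private LocalMerge() {}

    /** Appends every regular file in inputDir to mergedFile and returns how many files were merged. */
    static int merge(Path inputDir, Path mergedFile) throws IOException {
        int filesProcessed = 0;
        // try-with-resources closes both resources even if a copy fails half way
        try (OutputStream out = Files.newOutputStream(mergedFile);
             DirectoryStream<Path> files = Files.newDirectoryStream(inputDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    Files.copy(file, out); // streams the file content into the merged output
                    filesProcessed++;
                }
            }
        }
        return filesProcessed;
    }
}
```

The try-with-resources block plays the role of the explicit in.close() and out.close() calls in GetLogs, with the difference that the streams are also closed when an exception is thrown.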

Create GetLogs.java and save it in the GetLogs/src folder. You have to create the following folders:

Now you can compile it and create the getlogs.jar file. Just run the following commands:

The following is the command you need to run GetLogs:

./hadoop jar ../GetLogs.jar com.hexican.hadoop.GetLogs ~/twitter /twitterLogs/allTweets.log

[FolderName] is located on your local machine. This is where you put the source files to be processed. [MergedFile] is a huge file placed in HDFS, which we use for the map-reduce process.

The following shows how to use the GetLogs class to load files into HDFS and merge them at the same time.

I’ve added 1.6 MB of files from the Twitter folder:

After running GetLogs, we have all the log files merged into a single file:

In the next step we will find spam by map-reducing the merged log files.

Posted in Cloud Computing, Java, Linux | 2 Comments

Developing Objective-C with Pleasure

Java developers know how to develop with pleasure using IntelliJ IDEA. A year ago I tried to develop an iPhone application with Objective-C and Xcode. I made it after a lot of struggles with Xcode, and I kept wondering why Apple didn’t borrow good ideas from a more comfortable IDE such as IntelliJ IDEA. After moving to Xcode 4 I was really disappointed. It looked like Apple still had a long way to go to produce something comfortable and well designed for modern development. I found Objective-C powerful, but the IDE was really poor. Now AppCode could be the tool many developers were looking for: an IDE for Apple developers with the JetBrains approach.

Posted in Mac OS X, Objective C | Leave a comment

Hotspot, Alfredo, Cmd+N

I have respect for Alfredo, the gadget that improves Mac OS X hotspot behaviors. It works much better than the default Mac OS search tool.
Also, ‘Command’ + ‘N’ and ‘Command’ + ‘Shift’ + ‘N’ are the keys that we developers use while coding in IntelliJ IDEA to do the same kind of search through classes and files.
It seems quite possible to use one integrated gadget to search through more abstract things such as programming concepts. In that case I would prefer Alfredo.

Posted in Mac OS X | Leave a comment

Huge Data Processing Applying Hadoop Cluster – Part 1

Introduction
Massive data can be very difficult to analyze and query, and traditional mechanisms are not good tools for processing it. Cloud computing is one of the best solutions for processing huge data repositories. This is the first part of a fast-forward series of tutorials for anyone who needs to set up and run a fully distributed Hadoop cluster. The tutorial uses four Ubuntu Linux nodes to set the cluster up. It is no big deal to apply it to other operating systems such as other Linux flavors or MS Windows. The series will continue with a real-world huge data processing scenario. I’ve relied heavily on Chuck Lam’s book “Hadoop In Action” in this tutorial.

Preparing the Play Ground
First of all I installed 4 instances of Ubuntu 11.04 Linux on my MacBook Pro (Lion) using Sun VirtualBox. Each one has 512 MB of RAM and 20 GB of hard disk space. You can set up your cloud with more Linux machines if you use dedicated machines or have more resources on the host machine. The following snapshot shows the node specifications:
Virtual Box and the specification of the nodes.

I’ve also used the Bridged network adapter mode. The other modes didn’t work for me.
The network adapter of the nodes.

I assigned the following names and IPs to the Linux machines:

  • Alpha (192.168.200.201) – The master node of the cluster and host of the NameNode and JobTracker daemons
  • Beta (192.168.200.202) – The server that hosts the Secondary NameNode daemon
  • Delta (192.168.200.203) – The slave box of the cluster running both DataNode and TaskTracker daemons
  • Gamma (192.168.200.204) – Another slave box of the cluster running both DataNode and TaskTracker daemons

    As you can see, I have 2 slave nodes. You can have more, as I mentioned.

    Be sure all your nodes are configured within one local area network and can ping each other.

    Java Runtime
    Also be sure you already have the Java runtime installed on all Linux VMs.
    Checking Java version.

    Be sure you have exported an environment variable that points to the Java home:
    Be sure you have exported a JAVA_HOME variable.

    Common User
    We need to let one node access another. This access goes from a user account on one node to a user account on the target machine. For Hadoop, the accounts should have the same username on all of the nodes (I use amirsedighi in this tutorial).

    SSH Communication
    Hadoop uses SSH for communication between machines. Here I used amirsedighi with the same password on all machines.
    SSH uses standard public key cryptography to create a pair of keys for user verification: one public, one private. The public key is stored on every node in the cluster, while the private key stays on the master node, which uses it to prove its identity when attempting to access a remote machine. The target machine validates the login attempt against the stored public key. As all machines should be able to communicate in a trusted way with the master node (Alpha) over SSH, check whether SSH is installed on your nodes:
    Checking whether SSH is installed.

    If you get an error message, just install OpenSSH (www.openssh.com). It is probably better to do this with the default package manager of the OS (I’ve used Synaptic).

    Use the master node (Alpha) to generate an SSH key with the following command:
    Generating an SSH key on the master node (Alpha).

    Now you need to copy the generated SSH key from Alpha to the other nodes (Beta, Delta, Gamma).
    Copying the SSH key from the master to the other nodes.

    You need a .ssh folder in the home folder of the common user on all nodes. Create it if it does not exist. Then go to the nodes and run the following command:
    Copying the SSH key to the .ssh folder of all nodes except the master.

    Now you need to run two other commands in the terminal: “chmod 700 .ssh” and “chmod 600 .ssh/authorized_keys”.

    You should now be able to connect to Beta, Delta and Gamma without entering a password, using the following commands:
    Checking that SSH works correctly.

    Setting Hadoop Up
    Just download the latest version of Hadoop from the Apache download page. Follow the instructions and install it wherever you want. I installed it in the home folder of the common Hadoop user (amirsedighi). I use the following version on my cloud:
    The Hadoop version used in this tutorial.

    You need to export an environment variable that points to the Hadoop folder.
    Hadoop home.

    Find the conf folder under $HADOOP_HOME. It contains the Hadoop configuration files.
    In core-site.xml and mapred-site.xml we specify the hostname and port of the NameNode and the JobTracker, respectively. In hdfs-site.xml we specify the default replication factor for HDFS, which should not be higher than the number of DataNodes (two in this cluster).
    The following snapshot shows the files you should modify to set up your cloud:
    Hadoop configuration files.

    You only need to modify the core-site.xml, mapred-site.xml, masters and slaves files and nothing more. In addition, you can set a replication factor in hdfs-site.xml if you need to change it:
    Hadoop configuration files.

    You should do the same on all the nodes in your cluster. The easiest way is to copy the configured Hadoop folder to the same place on all machines.

    Running Hadoop
    You are almost ready to start Hadoop. But first you’ll need to format your HDFS using the following command:
    Formatting HDFS.

    You can now launch the daemons using the start-all.sh script:
    Starting Hadoop up.

    The Java jps command lists all daemons, so you can verify that the setup was successful:
    JPS on Alpha
    JPS on Beta
    JPS on Delta
    JPS on Gamma

    You have a functioning cluster!

    Web UI
    Hadoop provides a Web UI. The browser interface allows you to access information you desire much faster than digging through logs and directories. The NameNode hosts a general report on port 50070. It gives you an overview of the state of your cluster’s HDFS:
    Hadoop Web UI.

    Next Steps
    What we will see in these tutorials is the processing of huge files using HDFS and Google’s map-reduce algorithm.

    Posted in Cloud Computing, Java, Linux, Networking, Open Source, Software Engineering | 1 Comment

    How Geeks Act

    I just started my weekend by reviewing the personal blog of a famous IT man. I was reading Linus’s blog from the beginning when I noticed that he mentions how he thinks about himself. Have a look at “Tracking the time kids spend online” on Linus’s blog. He explains how he developed a specific project to track how much time his family members spend online. The project started simply: “But I’m a geek, and I’m not at all interested in trying to do any of this manually.“.

    It is so interesting what he expects of himself. This could be the reason behind the Linux and Git projects. What we imagine about ourselves defines how high we can fly.

    Posted in Uncategorized | 1 Comment

    The Story of a Successful Optimization

    Give me six hours to chop down a tree and I will spend the first four sharpening the axe. – Abraham Lincoln

    Introduction
    I feel so lucky, because my boss cares fanatically about performance. So he gave me enough time to optimize a software application.
    This is just an after action review (AAR) of what I did while testing and optimizing a medium-sized web-based Java application. The project I work on is a web-based online database built on a couple of frameworks such as Spring, Hibernate, DWR and Ext JS.

    Load Testing
    Experience tells us that regular coding practices that are perfectly acceptable for small projects should never be used in medium or large scale software. A large number of concurrent requests can affect the regular behavior of the system. A fast method may become the slowest one. So performance tuning is a mandatory step in preparing software that is meant to be used by a large number of online users. Finding the areas that need to be optimized is the first step, and this can be done by load testing.

    There are a number of tools for doing that. I always use JMeter to simulate real-world situations in a laboratory environment. It makes it easy to define test scenarios and scale them up to find breaking points. JConsole also provides invaluable information about the resources in use.

    When I started testing, the first results were awful. We were far from a situation the stakeholder could accept. While I was adding new features I had already been told that this code needed to be optimized, so I preferred to do a wide and deep optimization.

    Finally, after a couple of weeks, the application was ready for final tests. The following shows how much faster it became during optimization.

    The following snapshot shows that we were not able to finish the test with just 20 simultaneous users because the heap memory ran out.
    A 20-user test ran into a memory problem.
    JMeter shows how response times deviated.

    I took the following steps to reduce the runtime memory needed and increase performance at the same time:

    Optimizing DTO Processors
    By optimizing the DTO processors and reducing the size of the DTO objects, the result sets became smaller.
    In one special case I even avoided DTO objects altogether and used a simple hash map to make loading faster. DTO processing is the kind of iterative work that uses a lot of resources under pressure.
    The DTO optimization wasn’t enough for me, so I changed the way model objects are loaded from greedy to lazy.

    Lazy Loading Instead Of Greedy Loading
    As I mentioned, I made object loading lazy. Spring and Hibernate made it easy for me. It was enough to mark an object with the @Lazy annotation.
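
    The idea itself does not depend on any framework. As a minimal sketch in plain Java (the Lazy holder below is a hypothetical helper, not a Spring or Hibernate class), a value can be wrapped so that the expensive load runs only the first time it is actually requested:

```java
import java.util.function.Supplier;

/** Hypothetical helper: defers an expensive load until the value is first needed. */
final class Lazy<T> implements Supplier<T> {
    private Supplier<T> loader; // dropped after the first load so it can be garbage collected
    private T value;
    private boolean loaded;

    Lazy(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Runs the loader on the first call only; later calls return the stored value. */
    @Override
    public synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
            loader = null;
        }
        return value;
    }

    /** True once the loader has run. */
    synchronized boolean isLoaded() {
        return loaded;
    }
}
```

    Until get() is called, the wrapped object costs no heap beyond the small holder itself, which is the effect that matters under load.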

    Replacing HQLs With SQLs
    Hibernate HQL is like an automatic gearbox, in my opinion: you may lose performance to gain easier development. I replaced the HQL statements in the critical method calls with optimized SQL queries. It was really effective, especially for pagination. I really don’t know why Microsoft doesn’t care about SQL Server pagination! Anyway, I implemented a more optimized method manually. Consider getFormRecord in the following snapshot:
    getFormRecord before query optimization.

    And check out how much faster it became after query optimization:
    getFormRecord after query optimization.
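
    The hand-written query itself is not shown above. One common way to page on SQL Server 2005 and later is to number the rows with ROW_NUMBER() and select a window of them. The following is only a sketch of that technique, not the query used in the project; the class name, the method and the table and column names in the usage note are hypothetical:

```java
/** Sketch: builds a ROW_NUMBER()-based page query for SQL Server. */
final class SqlServerPaging {

    /**
     * @param table       table to read from (NOT escaped here; pass trusted names only)
     * @param orderColumn column that defines the row order (NOT escaped either)
     * @param page        1-based page number
     * @param pageSize    number of rows per page
     */
    static String pageQuery(String table, String orderColumn, int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        int first = (page - 1) * pageSize + 1; // first row number of the requested page
        int last = page * pageSize;            // last row number of the requested page
        return "SELECT * FROM ("
                + "SELECT t.*, ROW_NUMBER() OVER (ORDER BY " + orderColumn + ") AS rn"
                + " FROM " + table + " t"
                + ") paged WHERE paged.rn BETWEEN " + first + " AND " + last
                + " ORDER BY paged.rn";
    }
}
```

    For example, pageQuery("form_record", "id", 3, 20) asks for rows 41 to 60 only, so the database does the windowing and just one page of rows reaches the application.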

    Adding Cache
    During testing I noticed that a small number of objects were accessed very frequently. I knew that while the focus of caching is on improving performance, caching also reduces load by cutting processing time. So our objects needed to be cached. I added Ehcache as the Hibernate second-level cache. It is really nice. Terracotta lets us cluster Ehcache instances when we need more servers. That is really cool.
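
    To illustrate why a small set of hot objects benefits so much from caching, here is a tiny least-recently-used cache in plain Java. It is only a stand-in for the concept and has nothing to do with the Ehcache API; the HotObjectCache class and its loader function are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Stand-in LRU cache: keeps the most recently used entries and loads misses on demand. */
final class HotObjectCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader; // the expensive lookup, e.g. a database read
    private int loads;                   // how many times the loader had to run

    HotObjectCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first in iteration order
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // evict once the cache grows past its capacity
            }
        };
    }

    /** Returns the cached value, or loads and caches it on a miss. */
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, value);
        }
        return value;
    }

    synchronized int loadCount() {
        return loads;
    }
}
```

    As long as the frequently requested objects fit into the capacity, repeated requests never reach the loader again, which is where the reduction in load comes from.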

    Dynamic Pagination
    We had some routines for exporting user tables that put a significant load on overall performance. I redeveloped the routine completely. Using a stream instead of a string made exporting very reliable. But the large amount of processing that the export used still had to be managed. So I added a small, intelligent mechanism that calculates the amount of free heap memory and decides how many rows should be fetched and exported to the stream at a time.
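
    A mechanism of this kind could look roughly like the following sketch. It is not the original implementation; the class name, the limits and the per-row size estimate are assumptions chosen for illustration. It uses the standard Runtime methods to estimate how much heap is still available and derives a batch size from that:

```java
/** Sketch: choose an export batch size from the heap that is currently available. */
final class ExportBatchSizer {
    // Hypothetical limits, not taken from the original application.
    static final int MIN_ROWS = 100;
    static final int MAX_ROWS = 50_000;

    /** Heap the JVM can still provide: unallocated headroom plus free space in the allocated heap. */
    static long availableHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.maxMemory() - rt.totalMemory() + rt.freeMemory();
    }

    /**
     * @param availableBytes    heap assumed to be available
     * @param estimatedRowBytes rough in-memory cost of one row (an estimate you have to measure)
     * @param heapFraction      share of the available heap the export may use, in (0, 1]
     */
    static int rowsPerBatch(long availableBytes, long estimatedRowBytes, double heapFraction) {
        if (estimatedRowBytes <= 0 || heapFraction <= 0 || heapFraction > 1) {
            throw new IllegalArgumentException("invalid row size or heap fraction");
        }
        long budget = (long) (availableBytes * heapFraction);
        long rows = budget / estimatedRowBytes;
        // Clamp so the export always makes progress but never takes an oversized batch.
        return (int) Math.max(MIN_ROWS, Math.min(MAX_ROWS, rows));
    }
}
```

    Before fetching each batch, the export routine would call rowsPerBatch(availableHeapBytes(), rowSizeEstimate, 0.25), so the batch shrinks automatically when other requests are already using the heap.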

    Summary
    The results of all this work are as good as I had assumed. There is nothing unachievable in the software world, and this is what makes it fantastic.
    The following are snapshots of the optimized version. It became faster and more reliable compared with the previous charts:
    Optimized version with 10 concurrent threads.

    The following is the most interesting result to me. It comes from running the test with 200 simultaneous users and a 10-second ramp-up, using just a 2 GB heap. In real life, a scenario like this could only be caused by an unmanaged DoS attack. Imagine 200 threads calling a number of methods continuously. Wow, I just love it.

    And the following is a comparison of running test with a range of users from 10 to 200.

    The application, which wasn’t able to serve 20 concurrent users, became ready to host 200 users who all click through the same scenario continuously. Now it looks very stable, with a large safety margin. The tests all ran on my notebook. The DB was a minimum-size VM on an old-fashioned machine.

    The most important lessons I learned from this optimization are the following:

  • Your application needs free heap memory while it processes data, so always keep some heap free. Monitor your resources with JConsole and keep watching how heap usage changes. It is the most important thing: without free heap nothing can be done, even with the fastest CPUs.
  • Although greedy loading looks faster, there must be free heap memory available to serve requests. Lazy loading leaves more heap free, which prepares the application to respond to thousands of requests in a limited period of time.
  • Heavy load affects response times in a nonlinear manner. A fast method may become the slowest one. Consider a simple fetch that returns 20 records. It loads very fast when there is little pressure, but this changes under load: rendering each row puts a lot of pressure on the server when it serves hundreds of identical requests.
  • DTO objects should carry only the minimum required data. They can also be replaced with simpler structures.
  • HQL, and whatever else makes development faster, tends to make the runtime slower. Favor SQL over HQL in critical methods.
  • Test your software under heavy load before end users test it. Whatever you fix during development is just an optimization, but whatever end users report is a malfunction or failure. Try to be the first one who finds the breaking points.

    Posted in Java, Software Engineering | 3 Comments