Creating Tomorrow is the first and only recommender-as-a-service in Iran. I developed it three years ago.

It had been running for around two years, and surprisingly, I was still looking for a website willing to use it! I had to explain what a recommender system is and how it works. During the past three years I've met many business owners. Unfortunately, they were hardly motivated to accept the risk of migrating to a new technology. They just wanted their business to keep running; there was no reason for them to take on new challenges. The comfort zone looks awesome! Around that time, GoodCo, a small innovative startup, became interested in our solution. Parisa, one of GoodCo's co-founders, got it right away when I mentioned it in one of my Big Data workshops at SBU. We made an easy deal soon after.

Then, about eight months ago, something interesting happened. While I was still trying to convince some, it was Aparat, the biggest dotcom in Iran, that came to us with great knowledge and solid experience in this area!

Nima emailed me and mentioned that the CEO of Aparat was interested in giving our solution a try.

M.J. Shakouri Moghadam, the CEO/CTO of Sabavision, is an intellectual leader with a great resume in creating big startups from scratch. He is a target-oriented entrepreneur who believes in tech. His company, Sabavision, leads the content sharing, provision, and streaming market in Iran.

During the first meeting I noticed they already had their own mature, well-working recommender system! Nevertheless, M.J. accepted the risk of replacing something that worked with something new. They are pretty passionate. His company started using our solution in one of their awesome projects, Filimo.

Our first tries on Aparat were doomed to fail. While the solution was working pretty well at Filimo's scale, it wasn't able to serve Aparat. Aparat serves ten million pieces of content to tens of millions of users, and it is growing dramatically as well.

After that, I spent a considerable amount of time improving it into an Aparat-scale, reliable solution. I started by developing an Aparat simulator project that mimics the behavior of thousands of online users, so now I can shape different situations at development time. Hasan, a high-profile technical guy at Aparat, helped me a lot and shared his rare experience generously. Frankly, they are the most knowledgeable web team I've ever seen, and they are masters of agility too. The system now works much faster than ever and delivers more processing power on cheaper machines. I've tried every trick in the book, such as different kinds of hashing, sketching, matrix factorization, Kohonen networks, approximation, and multithreading, to make it a super-fast real-time solution. On a graph containing tens of millions of users and items, it takes just a few microseconds (10⁻⁶ seconds) to predict who is looking for what. We know this is just a start, so we are working on designing and developing some extensions to our solution.

Several people have collaborated on development. Saeid developed a WordPress plugin and also built the project website. Arghavan has helped us apply the matrix factorization technique. Some other talented people have just started helping us as well.

There are hackers who create awesome solutions beyond what the majority demands. And there are rare star companies who stay ahead by accepting the risk of creating tomorrow.


Becoming a Student at 41

After college I went to work for software development companies in Tehran, which were almost the best ones in their business. Originally my plan was to go back to grad school after a few years, but time flies when you are surrounded by cool workmates and are learning awesome stuff. After 14 years I realized I had to go back to school. I felt I needed to strengthen my theoretical knowledge in the area I was working in: Big Data and stream processing. While I was pretty experienced at that point and was getting paid very well, I needed to know more. I didn't want to do night school or online learning; I wanted to find colleagues and have fun as well. I needed to sit on campus and catch everything a university has to offer. Friends and colleagues know me as a very social person with a big passion for talking about technology or anything else. I am really talkative. The best part of college is not just the classes you take or the research you do; there is so much more, like finding friends, attending seminars, joining teams, and learning the theory behind what you have already done many times before.

Sometime in 2013 I was working for HyperOffice. I had been working on the HyperBase project for three years and I was really good at my job. The company was paying me in USD, so everything looked awesome. Right at that sweet moment, I decided to go to grad school. I knew it wouldn't be easy to work and study at the same time, and unfortunately the company soon became unsatisfied. I wanted to be a good student, so I couldn't miss classes; the company wanted me dedicated to the project. There was no way to keep collaborating, so I quit my job. Actually, they gave me a period of time to leave, and I left much sooner.

The first sessions on campus were enough to convince me that I was on my way. During the past two years I've learned a lot. Attending awesome courses such as “Advanced Algorithms”, “Parallel Algorithms”, “System Modeling and Evaluation”, “Distributed Database Systems” and “Advanced Software Engineering” was exactly what I was looking for. I had seen algorithms before; I had been coding for 20 years. But these were different. Applying probability to algorithms, parallelism concepts, and the core theories were really awesome, and I soaked them all up. Now I know much more than I did two years ago, and I have many more friends, each one highly professional in some field. This was all about growing.

Now I am finalizing my thesis and getting ready to defend it. I seriously suggest everyone apply and go to grad school, even after years of working. You will have a lot of fun while learning, and you will find sexy jobs much more easily right after graduation. This is a simple way to grow yourself and your network in the best possible shape. Give it a try. :)


Running HDFS on an Ubuntu Cluster Hosted by VirtualBox


Running Elasticsearch on a Virtualbox Cluster


An Introduction to Apache Spark

I’ve prepared a short presentation on Apache Spark. I presented it yesterday morning. Just wanted to share it here as well.


Trying Big-Data frameworks on your PC

As the first part of a four-session workshop on distributed data processing, I will be speaking about setting up a Linux cluster on your PC.



An Introduction to Elasticsearch

I just prepared a presentation on Elasticsearch.


Elasticsearch Daily Indexing on Old-Fashioned Spinning Disks

I believe that if something takes too long, something is wrong. Following this minimalistic approach, I recently resolved an issue for one of my clients.

There was a single node with an Elasticsearch instance running on it, and I had to keep it as fast as it was in its first days of running. I had no cluster and no option to scale out, because the rack was full! It was a powerful old-fashioned machine with 32 cores, 96 GB of RAM, and 10 TB of spinning-disk capacity. Moreover, there was a 160 GB hourly incoming log stream!

The thing is, when you use Elasticsearch you need to be careful about index size. Very large indexes make Elasticsearch's engine slow. Actually, it is not an Elasticsearch issue; it is the increasing burden of indexing, which is highly I/O-consuming. Inside each index there are many segments. ES creates new segments on each refresh interval (1 second or more), and new segments get merged with the previous ones. All of this is pretty I/O-consuming, so the ingestion process loses performance as the data grows. The river that is supposed to bulk-import incoming data gets slow as well, because rivers are pretty I/O-bound. Finally the disks fill up with raw data, or billions of records get lost.
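For example, one knob that directly affects how fast new segments pile up is the index refresh interval. As a minimal sketch (the index name below is just a placeholder), raising it from the default 1 second looks like this:

# Hypothetical daily index; lengthen the refresh interval so segments are created less often.
curl -XPUT 'localhost:9200/--index-2015-06-01-index--/_settings' -d '
{
  "index" : { "refresh_interval" : "30s" }
}'

This does not remove the merge cost, but it reduces the rate at which tiny segments are created and later have to be merged.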

Sharding and partitioning into different indexes are similar approaches: both slice your data to prepare it for massive volumes. We noticed that over a short period (such as half a week) a single index works like a charm; later it slows down. One index per day was good enough for our case.

The following script is really simple and easy to run and adapt. I wrote it when I needed a way to relieve the I/O read/write burden of multi-terabyte indexes sitting on mechanical spinning disks.

amirsedighi@amirMacBookPro~: cat
#!/bin/bash
## This simply creates and maintains daily
## elasticsearch indexes.
## by: amir sedighi

today=$(date '+%Y-%m-%d')

# Create today's index (re-running this on an existing index does no harm).
curl -XPUT 'localhost:9200/--index-'$today'-index--'

sleep 5

# Put the mapping for today's type (the field definitions are left out here).
curl -XPUT 'localhost:9200/--index-'$today'-index--/--type-'$today'-type--/_mapping' -d '
{
  "--type-'$today'-type--" : {
    "properties" : {
    }
  }
}'

sleep 5

# Register a river that bulk-feeds today's index
# (the river source-specific settings are left out here).
curl -XPUT 'localhost:9200/_river/--river-'$today'-river--/_meta' -d '
{
  "index" : {
    "index" : "--index-'$today'-index--",
    "type" : "--type-'$today'-type--",
    "bulk_size" : 4000,
    "bulk_threshold" : 10
  }
}'

# Drop yesterday's river so only today's index keeps ingesting.
yest=$(date --date="yesterday" +"%Y-%m-%d")

curl -XDELETE 'localhost:9200/_river/--river-'$yest'-river--'

I've set up a simple schedule with crontab that runs the script every 10 minutes. Re-creating an existing object has no effect, and deleting non-existent items doesn't break anything either, so you can just let it run.
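For reference, the crontab entry is just an ordinary 10-minute schedule; the script path and log file below are placeholders, not the actual ones:

# Run the daily-index script every 10 minutes and append its output to a log.
*/10 * * * * /home/es/bin/daily-index.sh >> /var/log/daily-index.log 2>&1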

The front-end decides how many indexes should be queried, so the ES instance now works blazing fast at importing and indexing data, no matter the volume.
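For example, since Elasticsearch accepts a comma-separated list of index names (or a wildcard) in the search URL, the front-end can hit exactly the days it needs; the dates below are only illustrative:

# Search two specific daily indexes in one request.
curl -XGET 'localhost:9200/--index-2015-06-01-index--,--index-2015-06-02-index--/_search' -d '
{
  "query" : { "match_all" : {} }
}'

# Or search across every daily index at once with a wildcard.
curl -XGET 'localhost:9200/--index-*-index--/_search?size=10'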

While it works fine, I would appreciate it if you let me know whether there is an automated way to solve the slowdown problem on single-node, spinning-disk hardware.

I asked @foundsays for their opinion on this post and she answered. The following is the transcript:
@foundsays: We would recommend avoiding using rivers. Also, you could consider using Logstash to achieve your goal…
@amirsedighi: I'm already using Logstash for another client. This one has got its own log collector and insists on using it.


Manually Scaling Ubuntu Cluster Out

I am not sure, but maybe this is the fastest native way to add a new node to a cluster of Hadoop, Redis, Elasticsearch, or similar stuff.
I've tested it on a few clusters hosted in VirtualBox.
Each machine gets two network adapters. The first one is NAT, to provide Internet access; the second one is Host-Only, to keep the machines connected to each other. The Host-Only adapter needs to be updated after cloning.
I assume you have already installed everything the new node should live with. A sample of the edits in steps 2–4 is shown after the list.

0. Make a complete clone of the source node, with its network adapters' MAC addresses reinitialized.
1. Start the new machine.
2. sudo nano /etc/hosts
3. sudo nano /etc/hostname
4. sudo nano /etc/network/interfaces
5. sudo rm /etc/udev/rules.d/70-persistent-net.rules
6. sudo reboot
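
To make steps 2–4 concrete, here is roughly what the edits look like for a hypothetical new node named node3 on the default VirtualBox Host-Only subnet; the hostnames and addresses are placeholders, not from a real cluster:

# /etc/hostname
node3

# /etc/hosts -- every node should be able to resolve every other node
127.0.0.1        localhost
192.168.56.101   node1
192.168.56.102   node2
192.168.56.103   node3

# /etc/network/interfaces
auto eth0
iface eth0 inet dhcp           # NAT adapter: Internet access

auto eth1
iface eth1 inet static         # Host-Only adapter: cluster traffic
    address 192.168.56.103
    netmask 255.255.255.0

After the reboot, the clone comes up with its own hostname and Host-Only address, and the other nodes can reach it.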

I love it. I can add a new node with a few clicks and a little editing, in less than a couple of minutes.
It's best to copy everything the new node will need onto the seed node beforehand.


Migrating From RDBMS to Document Store

A few months ago I joined a team as a Big Data consultant. They had many issues with their log management system, which relied on an RDBMS-centric architecture. They were facing a collection of problems such as run-time errors, memory leaks, and I/O malfunctions. Everything was pretty messy, they were trying every possible way to keep things working, and they had become disappointed. It was an opportunity to familiarize them with Big Data concepts and platforms. I started by setting up a few meetings, each one including a presentation or a demo. During the meetings I explained why an RDBMS struggles to process such a large volume of incoming streams. After a few demonstrations, they finally agreed to try a document store instead.

The new solution consists of an orchestration of Flume instances, a cluster of Elasticsearch nodes, and customized rivers. Increasing the compression ratio, automatic daily indexing, and decoding base64 contents were some of the extra things I applied on top.

Despite the pretty short time since our switch from MySQL to Elasticsearch, we already archive around 500 GB of data per day and provide a super-fast analytical search service. The biggest improvement I made wasn't in the code or the architecture; it was in the team's approach. I just showed them how a commodity cluster computing platform provides huge performance, reliability, and availability.

We had our share of challenges as well. The most important issue we faced was the rapidly growing archive files. With a big amount of input streams, very large bulk reads and writes plus the indexing of ever-growing files were killing the limited I/O, and the nodes had only old spinning disks, so indexing got slower day by day. To fix it, I started breaking the main index into a number of smaller indexes, and it turned out to be a perfect idea.
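One way to keep those smaller, per-period indexes consistent without configuring each one by hand is an Elasticsearch index template, which applies settings (and mappings) to every new index whose name matches a pattern. This is only a sketch with placeholder names and values, not the client's actual configuration:

# Hypothetical template applied to every index named like logs-YYYY-MM-DD.
curl -XPUT 'localhost:9200/_template/daily_logs' -d '
{
  "template" : "logs-*",
  "settings" : {
    "number_of_shards" : 2,
    "number_of_replicas" : 0,
    "refresh_interval" : "30s"
  }
}'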

Later I may migrate it to HDFS for various reasons, but it already rocks.
