A few months ago I joined a team as a Big-Data consultant. They had many issues with their log management system, which relied on an RDBMS-centric architecture. They were facing a collection of problems such as run-time errors, memory leaks and I/O malfunctions. Everything was pretty messy, and they had tried every possible way to make things work; they had become frustrated. It was an opportunity to familiarize them with Big-Data concepts and platforms. I started by setting up a series of meetings, each including a presentation or a demo. During the meetings I explained why RDBMSs struggle to process such a large volume of incoming streams. After a few demonstrations, they finally agreed to try a document store instead.
The new solution consists of an orchestration of Flume agents, a cluster of Elasticsearch nodes and customized rivers. Increasing the compression ratio, automatic daily indexing and decoding Base64-encoded contents were a few of the extras I applied.
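To make the daily-indexing and Base64-decoding steps concrete, here is a minimal sketch of the per-document preprocessing that runs before a record reaches Elasticsearch. The field name `content` and the index prefix `logs` are assumptions for illustration, not the actual names used in the project:

```python
import base64
from datetime import datetime, timezone

def daily_index_name(prefix="logs"):
    # Automatic daily indexing: each day gets its own index,
    # e.g. "logs-2024-05-01". The "logs" prefix is a hypothetical name.
    return f"{prefix}-{datetime.now(timezone.utc).strftime('%Y-%m-%d')}"

def decode_payload(doc):
    # Decode the Base64-encoded "content" field (an assumed field name)
    # in place, so Elasticsearch indexes readable text.
    doc["content"] = base64.b64decode(doc["content"]).decode("utf-8")
    return doc
```

For example, `decode_payload({"content": "aGVsbG8="})` turns the payload back into the plain string `hello` before it is written into the index returned by `daily_index_name()`.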
Despite the fairly short time since our switch from MySQL to Elasticsearch, we already archive around 500 GB of data per day, along with a super-fast analytical search service. The biggest improvement I made wasn't in the code or the architecture; it was in the team's approach. I simply showed them how a commodity cluster-computing platform delivers huge gains in performance, reliability and availability.
We had our share of challenges as well. The most important issue we faced was the rapidly growing archive files. We receive a large number of input streams, and bulk reads and writes on very large files, combined with indexing those growing files, were killing the limited I/O; the nodes had only old spinning disks, so indexing slowed down day by day. To fix it, I broke the main index into a number of smaller indexes, which turned out to be a simple but highly effective idea.
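The splitting idea can be sketched as a routing function: instead of appending every document to one ever-growing index, each document is deterministically assigned to one of N smaller indexes, so no single index file grows without bound. The bucket count and the `logs` prefix below are hypothetical values, not the project's actual configuration:

```python
from hashlib import md5

def bucket_index(doc_id, buckets=8, prefix="logs"):
    # Hash the document id and take it modulo the bucket count,
    # spreading writes evenly across N smaller indexes
    # ("logs-00" .. "logs-07" with the assumed defaults).
    b = int(md5(doc_id.encode("utf-8")).hexdigest(), 16) % buckets
    return f"{prefix}-{b:02d}"
```

Because the mapping is deterministic, the same document id always lands in the same small index, so lookups stay cheap while each individual index (and its on-disk files) remains a fraction of the original size.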
Later I may migrate it to HDFS for a few reasons, but for now it already rocks.