Hadoop - Effective tool for large scale distributed processing.
December 01, 2015

Mrugen Deshmukh | TrustRadius Reviewer
Score 8 out of 10
Vetted Review
Verified User

Modules Used

  • Hadoop Common
  • Hadoop Distributed File System
  • Hadoop MapReduce

Overall Satisfaction with Hadoop

I have used Hadoop for building business feeds for a telecom client. The main reason for using Hadoop was to gain insights from an ever-growing volume of business data. We leveraged the MapReduce programming model to churn through more than 30 gigabytes of data per day and turn it into actionable, aggregated data, which campaign teams then used to design and shape marketing, and product teams used to envision new customer experiences.
  • Hadoop is an excellent framework for building distributed, fault-tolerant data processing systems on top of HDFS, which is optimized for high-throughput access to very large data sets.
  • Hadoop MapReduce is a powerful programming model that can be used directly from the Java programming language or through data flow languages like Apache Pig (a short Java sketch follows this list).
  • Hadoop has a rich ecosystem of companion tools that make it easy to ingest large amounts of data efficiently from various sources. For example, Apache Flume can act as a data bus, using HDFS as a sink and integrating effectively with disparate data sources.
  • Hadoop can also be leveraged to build complex data processing and machine learning workflows, thanks to the availability of Apache Mahout, which uses Hadoop's MapReduce model to run complex algorithms.
  • Hadoop is a batch-oriented processing framework; it lacks real-time or stream processing.
  • Hadoop's HDFS is not a POSIX-compliant file system and does not work well with small files, especially files smaller than the default block size.
  • Hadoop cannot be used for running interactive jobs or analytics.
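To make the MapReduce programming model mentioned above concrete, here is a minimal word-count style job written against the org.apache.hadoop.mapreduce API. It is only a sketch: the WordCount class name and the input/output path arguments are illustrative and are not taken from the telecom feeds described in this review.

// Minimal word-count job illustrating the MapReduce programming model.
// Class name and paths are illustrative only.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner cuts shuffle volume
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same aggregation could be written in a few lines of Apache Pig, which compiles down to MapReduce jobs like the one above.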
Hadoop provides storage for large data sets and a powerful processing model to crunch and transform huge amounts of data. It makes no assumptions about the underlying hardware or infrastructure, and enables users to build data processing infrastructure from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are commonplace and should be handled automatically in software by the framework, relieving developers from handling every edge scenario that can occur in a large distributed system.
Hadoop can be deployed in a traditional onsite datacenter as well as in the cloud. The cloud allows organizations to deploy Hadoop without acquiring hardware or specific setup expertise. Vendors that currently offer Hadoop in the cloud include Microsoft, Amazon, and Google.
Apache Spark, MongoDB, Amazon Elastic Compute Cloud (EC2), MySQL
  • Ease of merging data is a major reason for Hadoop's success, with a growing number of tools available for easily ingesting data from disparate sources; examples include Apache Flume and Apache Kafka (see the short HDFS write sketch after this list).
  • Many major database vendors now provide tools and plugins to pull data from their databases into Hadoop for offline processing.
  • Access control measures: Apache Sentry provides fine-grained, role-based authorization to data and metadata stored on Hadoop clusters.
  • Distributed scheduling and coordination of jobs via use of YARN and Apache Oozie.
  • Availability of improved reporting and dashboards for Hadoop via use of vendor-driven distributions like Cloudera CDH.
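For a sense of what ingestion into HDFS looks like at the lowest level, here is a minimal sketch using the org.apache.hadoop.fs.FileSystem API to copy a local file into the cluster. Tools like Flume and Kafka connectors automate and scale this kind of work; the namenode URI and the file paths below are illustrative assumptions, not values from this review.

// Minimal HDFS write sketch; namenode URI and paths are hypothetical.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed namenode address; normally resolved from core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path localFeed = new Path("/tmp/daily_feed.csv");         // hypothetical local source
    Path hdfsTarget = new Path("/data/feeds/daily_feed.csv"); // hypothetical HDFS target

    fs.mkdirs(hdfsTarget.getParent());        // ensure the target directory exists
    fs.copyFromLocalFile(localFeed, hdfsTarget);
    fs.close();
  }
}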
1. How large are your data sets? If your answer is a few gigabytes, Hadoop may be overkill for your needs.
2. Do you require real-time analytical processing? If yes, Hadoop's MapReduce may not be a great asset in that scenario.
3. Do you want to process data in a batch-processing fashion and scale to terabytes of data across a cluster? Hadoop is definitely a great fit for your use case.