Hadoop - Effective tool for large scale distributed processing.
December 01, 2015

Mrugen Deshmukh | TrustRadius Reviewer
Score 8 out of 10
Vetted Review
Verified User

Modules Used

  • Hadoop Common
  • Hadoop Distributed File System
  • Hadoop MapReduce

Overall Satisfaction with Hadoop

I have used Hadoop for building business feeds for a telecom client. The main reason for using Hadoop was to gain insights from an ever-growing volume of business data. We leveraged the MapReduce programming model to churn through more than 30 gigabytes of data per day and turn it into actionable, aggregated data, which campaign teams then used to design and shape marketing, and product teams used to envision new customer experiences.
  • Hadoop is an excellent framework for building distributed, fault-tolerant data processing systems on top of HDFS, which is optimized for high-throughput access to very large data sets.
  • Hadoop MapReduce is a powerful programming model that can be used directly from the Java programming language or through data flow languages like Apache Pig (a short Java sketch follows this list).
  • Hadoop has a rich ecosystem of companion tools that make it easy to ingest large amounts of data efficiently from various sources. For example, Apache Flume can act as a data bus, using HDFS as a sink and integrating effectively with disparate data sources.
  • Hadoop can also be leveraged to build complex data processing and machine learning workflows, thanks to the availability of Apache Mahout, which uses Hadoop's MapReduce model to run complex algorithms.
  • Hadoop is a batch-oriented processing framework; it lacks real-time or stream processing.
  • Hadoop's HDFS is not a POSIX-compliant file system and does not work well with small files, especially files smaller than the default block size.
  • Hadoop cannot be used for running interactive jobs or analytics.
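To make the MapReduce programming model mentioned above concrete, here is a minimal word-count style job written against the org.apache.hadoop.mapreduce API. It is only a sketch: the WordCount class name and the input/output path arguments are illustrative and are not taken from the telecom feeds described in this review.

// Minimal word-count job illustrating the MapReduce programming model.
// Class name and paths are illustrative only.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner cuts shuffle volume
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same aggregation could be written in a few lines of Apache Pig, which compiles down to MapReduce jobs like the one above.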
Hadoop provides storage for large data sets and a powerful processing model to crunch and transform huge amounts of data. It makes no assumptions about the underlying hardware or infrastructure, and enables users to build data processing infrastructure from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are commonplace and should be handled automatically in software by the framework, relieving developers from handling every edge scenario that can occur in a large distributed system.
Hadoop can be deployed in a traditional onsite datacenter as well as in the cloud. The cloud allows organizations to deploy Hadoop without acquiring hardware or specific setup expertise. Vendors that currently offer Hadoop in the cloud include Microsoft, Amazon, and Google.
Apache Spark, MongoDB, Amazon Elastic Compute Cloud (EC2), MySQL
  • Ease of merging data is a major reason for Hadoop's success, with a growing number of tools available for easily ingesting data from disparate sources; examples include Apache Flume and Apache Kafka (see the short HDFS write sketch after this list).
  • Many major database vendors now provide tools and plugins to pull data from their databases into Hadoop for offline processing.
  • Access control measures: Apache Sentry provides fine-grained, role-based authorization to data and metadata stored on Hadoop clusters.
  • Distributed scheduling and coordination of jobs via use of YARN and Apache Oozie.
  • Availability of improved reporting and dashboards for Hadoop via use of vendor-driven distributions like Cloudera CDH.
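For a sense of what ingestion into HDFS looks like at the lowest level, here is a minimal sketch using the org.apache.hadoop.fs.FileSystem API to copy a local file into the cluster. Tools like Flume and Kafka connectors automate and scale this kind of work; the namenode URI and the file paths below are illustrative assumptions, not values from this review.

// Minimal HDFS write sketch; namenode URI and paths are hypothetical.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed namenode address; normally resolved from core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path localFeed = new Path("/tmp/daily_feed.csv");         // hypothetical local source
    Path hdfsTarget = new Path("/data/feeds/daily_feed.csv"); // hypothetical HDFS target

    fs.mkdirs(hdfsTarget.getParent());        // ensure the target directory exists
    fs.copyFromLocalFile(localFeed, hdfsTarget);
    fs.close();
  }
}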
1. How large are your data sets? If your answer is a few gigabytes, Hadoop may be overkill for your needs.
2. Do you require real-time analytical processing? If yes, Hadoop's MapReduce may not be a great asset in that scenario.
3. Do you want to process data in a batch-processing fashion and scale to terabytes of data across a cluster? Hadoop is definitely a great fit for your use case.