A newbie's look at Hadoop
June 03, 2016

Mark Gargiulo | TrustRadius Reviewer
Score 8 out of 10
Vetted Review
Verified User

Modules Used

  • Hadoop Common
  • Hadoop Distributed File System
  • Hadoop MapReduce
  • YARN, Oozie, Spark, Hive, Hue

Overall Satisfaction with Hadoop

We needed a robust, redundant system to run multiple simultaneous jobs for our ETL pipeline; it needed distributed storage, integration with Windows AD user accounts, and the ability to expand when needed with little to no downtime.
We are using Cloudera 5.6 to orchestrate the install (along with Puppet) and to manage the Hadoop cluster.
  • The distributed, replicated HDFS filesystem allows for fault tolerance and the ability to use low-cost JBOD arrays for data storage.
  • YARN with MapReduce2 gives us a job slot scheduler to fully utilize available compute resources while providing HA and resource management.
  • The Hadoop ecosystem allows many different technologies to share the same compute resources, so your Spark, Samza, Camus, Pig, and Oozie jobs can happily co-exist on the same infrastructure.
  • Without Cloudera as a management interface, the Hadoop components are much harder to manage consistently across a cluster.
  • Mapping hardware resources to job slots and resource-management settings can be quite an exercise in finding that "sweet spot" for your applications; a more transparent way of figuring this out would be welcome.
  • A lot of the roles and management pieces are written in Java, which from an administration perspective can bring its own issues with garbage collection and memory management.
  • With our current platform (and budget), Hadoop is really the only option at this time that gives us access to the capacity and technologies we require.
  • So far the only real investment has been hardware and man-hours, especially in the initial learning and deployment phase.
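To give a sense of the "sweet spot" exercise mentioned above, here is a minimal sketch of a widely cited rule of thumb for sizing YARN containers per worker node. The formula, thresholds, and example node specs are illustrative assumptions, not figures from this deployment:

```python
# Rough YARN container sizing for one worker node, using a common
# rule of thumb: container count is capped by cores, spindles, and
# available memory. All constants here are illustrative defaults.

def size_containers(ram_gb, cores, disks, reserved_gb=8, min_container_gb=2):
    """Estimate (container count, memory per container in GB) for a node."""
    available = ram_gb - reserved_gb  # leave headroom for OS and daemons
    containers = int(min(2 * cores,            # CPU-bound cap
                         1.8 * disks,          # disk-spindle cap
                         available / min_container_gb))  # memory cap
    mem_per_container = max(min_container_gb, available // containers)
    return containers, mem_per_container

# Example: a node with 64 GB RAM, 16 cores, 8 spindles
print(size_containers(64, 16, 8))  # -> (14, 4)
```

These numbers would then feed settings such as `yarn.nodemanager.resource.memory-mb` and the scheduler's minimum allocation; the real tuning, as noted above, still takes trial and error against your actual workloads.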
As I am new to the Hadoop ecosystem, I have not used or evaluated any other similar products at this time. This was handed to me from a previous, much older installation that was very underutilized. Our new platform will work the new cluster much harder, with jobs that run indefinitely. I'm not sure that any of the other "big data" technologies out there have as many certified components or work with such a diverse collection, but as I said, I am pretty new to this and have only secondhand knowledge of competing products.
Hadoop is not for the faint of heart, and it is not a single technology per se but an ecosystem of disparate technologies sitting on top of HDFS. It is certainly powerful, but if, like me, you were handed this with no prior knowledge or experience using or administering this ecosystem, the learning curve can be significant and ongoing. Having said that, I don't think there are currently many other open-source technologies that can provide this flexibility in the "big data" arena, especially for ETL or machine learning.