Item: Apache Hadoop
Rating: 7
Author: Joe Hughes

Overall Satisfaction with Apache Hadoop

Use Cases and Deployment Scope

[Apache Hadoop] is being used (mostly) as intended: for managing large volumes of unstructured data from our data flows, including logging and report extract, transform, and load (ETL). We are using it at a medium scale in an on-prem server deployment with Cloudera as the management platform. While I firmly believe Cloudera makes it a bit easier to manage, it obfuscates issues at times.

Pros and Cons

Pros

Handles large amounts of unstructured data well for business-level purposes.
Is a good catchall because of this design, i.e., what does not fit into our vertical tables fits here.
Decent for large ETL pipelines and logging free-for-alls, for the same reason.

Cons

Many, many modules, and because it is Apache open source, it takes time to learn.
Integration between the disparate pieces is not always seamless, nor are all the pieces required.
Optimization can be challenging (see PSTL design)

Return on Investment

Positive, as we have saved money on hardware (and software costs) as data scale has increased over the last several years.
Positive, as noted earlier: the design of Hadoop allows for a natural split of the dataflows, so less data is "shoved into" the vertical data stack. This saves money and is naturally more efficient.
Negative, in that we need expertise to manage the Hadoop data stacks due to the learning curve.

Alternatives Considered

MariaDB - Better if you are already in the cloud you will use it for. Issues have improved as it has matured over the years.
CockroachDB - Not nearly as performant (even out of the box) as Apache Hadoop. More configuration is required just to make it work. In-memory caching is an issue.

Key Insights

Do you think Apache Hadoop delivers good value for the price?

Yes

Are you happy with Apache Hadoop's feature set?

Yes

Did Apache Hadoop live up to sales and marketing promises?

Yes

Did implementation of Apache Hadoop go as expected?

Would you buy Apache Hadoop again?

Yes

Other Software Used

Adobe Spark, Apache Kafka, Oracle Java SE, Red Hat Ansible Automation Platform, Google Cloud Dataflow

Likelihood to Recommend

Apache Hadoop (and its subsequent add-ons) is well-suited to larger, unstructured data flows, such as aggregation of web traffic or advertising. Geospatial algorithms and their outputs are well-suited for this kind of aggregation: structuring that data is challenging, and leaving it unstructured and performing queries as needed is a better fit for most business models. With the advent of data science, I would expect Hadoop to fit a LOT of data scientists' initial outputs quite well.

Apache Hadoop Can Save on the Headaches

Modules Used