Item: Apache Hive
Rating: 8
Author: Verified User

Use Cases and Deployment Scope

We use Apache Hive for two main use cases: analyzing our ever-growing data volume to produce insights and reports, and as part of our ETL pipeline, where we found that writing in SQL-like syntax allows for more rapid development while adding little complexity to the overall system.

Apache Hive solves a few issues for us, the main one being the ability to analyze large volumes of data on S3 directly with strong overall performance. We have been able to analyze billions of records in a matter of minutes with a relatively small EC2 cluster using Apache Hive. It also allows our Data Analysts to simply write SQL and avoids the ramp-up required for other tools such as Apache Pig.

Pros and Cons

Apache Hive allows us to write expressive solutions to complex problems thanks to its SQL-like syntax.
Relatively easy to set up and start using.
Very little ramp-up to start using the actual product: documentation is very thorough, there is an active community, and the code base is constantly being improved.

Debugging can be messy with ambiguous return codes, and large jobs can fail without much explanation as to why.
Hive is only SQL-like; while more features are being added, we have found that some things do not translate over (for example, outer joins, inserts, and columns only being referenceable once in a select).
For our ETL jobs it does not seem to be the optimal tool, since tuning and performance are difficult; Apache Pig may be better for heavy processing jobs.

Return on Investment

Low resiliency: running as part of a processing pipeline has caused some undesirable bumps in the road at times, and jobs can be difficult to debug.
Plays nicely with our Hadoop-based ecosystem and caused few headaches for DevOps to install and set up on our cluster.
Virtually no ramp-up time for the entire team to start working with.

Alternatives Considered

Apache Pig

Apache Pig is probably the most directly comparable technology to Hive, though it has several different use cases. If you want to simplify processing tasks that run using MapReduce, then Apache Pig may be a better tool for the job. However, if you are going to be running many ad-hoc queries to dig through your data, then Apache Hive really shines, and I would consider it a much more valuable tool for this purpose. Both are great tools; it just comes down to individual use cases and what strengths your team has.

Other Software Used

Apache Pig, Apache Spark, Hadoop

Likelihood to Recommend

Apache Hive shines for ad-hoc analysis and plugging into BI tools. Its SQL-like syntax allows for ease of use not only for engineers but also for data analysts. In our experience, there are probably more desirable tools to use if you are planning on integrating Hive into your processing pipeline.

Apache Hive - Querying Big Data Made Easy!
