Item: Apache Hive
Rating: 6
Author: Praveen Murugesan

Overall Satisfaction with Apache Hive

Use Cases and Deployment Scope

We use Apache Hive across the whole organization. We built our own in-house Hadoop cluster for data warehousing, complementary to the HP Vertica deployment we were already using. Vertica is limited in how far it scales, and to achieve true scalability and process trillions of records we had to invest in a new solution. Enter Apache Hive. We are very data driven as an organization, so to satisfy people's appetite for data while sticking to something familiar for querying it (SQL), we decided to invest in Apache Hive as the starting point of our new data infrastructure.

Pros and Cons

Pros

Hive, which leverages traditional MapReduce at its core, can process large amounts of data without a problem. Any problem that can be solved with MapReduce can now be expressed simply in SQL.
Hive leverages disk when processing large data and is not limited by the physical memory of any one machine (which is a limitation for systems like Presto). Hence it even allows reasonable fact-to-fact cross joins.
Hive is extensible with UDFs. For any common pattern you can quickly write your own set of functions, and everyone can then use them.
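
To make the first point concrete, here is a minimal sketch in plain Python (not Hive or Hadoop code) of the map, shuffle, and reduce steps that a single query such as SELECT country, COUNT(*) FROM visits GROUP BY country stands in for. The table name, column names, and sample rows are hypothetical and only there for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a "visits" table.
visits = [
    {"user": "a", "country": "US"},
    {"user": "b", "country": "IN"},
    {"user": "c", "country": "US"},
]

def map_phase(rows):
    # Emit one (key, 1) pair per row, keyed by the GROUP BY column.
    for row in rows:
        yield row["country"], 1

def shuffle(pairs):
    # Collect all values that share a key, as the framework would do.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the values per key, which is what COUNT(*) computes here.
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(visits)))
print(counts)
```

With Hive, all three functions collapse into the one SQL statement, which is the convenience described above.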

Cons

Compute Speed - Hive is my last option for querying compared with something like Presto, which has a much smarter query engine. Hive is slow, and I'd use it only when we cannot use something like Presto or Impala.
Hive's SQL syntax is unique and does not conform to ANSI SQL. This is quite painful for beginners.
The ability to upsert records would be nice to have. Hive is cumbersome for mutable data, because changing records requires rewriting whole partitions. No one has solved this really well. If it were solved, it could be leveraged by many systems.
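
The last point is easier to see with a small model. The sketch below is plain Python, not Hive code, and it assumes a hypothetical table stored as a dictionary of partitions. It mimics the usual workaround of rebuilding and overwriting an entire partition in order to change a single record.

```python
# Hypothetical table: partition key (a date string) -> rows in that partition.
table = {
    "2024-01-01": [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}],
    "2024-01-02": [{"id": 3, "status": "new"}],
}

def upsert(table, partition, changes):
    """Apply changes by rebuilding the whole partition; return rows written."""
    # Index the existing rows by id so changed rows replace old versions.
    existing = {row["id"]: row for row in table.get(partition, [])}
    for row in changes:
        existing[row["id"]] = row  # update if present, insert otherwise
    rewritten = list(existing.values())
    # The partition is replaced wholesale, even if only one row changed.
    table[partition] = rewritten
    return len(rewritten)

rows_written = upsert(table, "2024-01-01", [{"id": 2, "status": "done"}])
print(rows_written)
```

One changed record causes every row in its partition to be written again, while other partitions are left alone. On partitions holding millions of rows, that cost is what makes mutable data painful.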

Return on Investment

The Hive Metastore is great, as all other query engines plug into it. I'd tell the Hive community to invest more in the metastore, as it's one of Hive's strong points.
Overall, we started with Hadoop, then added Hive, and then Presto. These are all core components of our data infrastructure and are critical to our business.
We use Hive extensively to compute the daily and weekly reports that are essential to running the business.

Alternatives Considered

Presto and Apache Pig

We selected Hive because it supports SQL and schemas and provides structure on top of Hadoop. Having data structured has its benefits, especially when thousands of users process the same data over and over again. Pig provides the ability to process unstructured data. However, it is hard to use and requires learning a new scripting language. On the processing side, Hive can handle any volume of data and any complex query. I'd recommend it for complex queries. However, for simpler daily queries, I'd recommend using Presto.

Other Software Used

Presto, Vertica

Likelihood to Recommend

Hive is well suited to processing large datasets (especially joins of two large datasets, cross joins, etc.). It is not well suited to generic queries on a single table, where it can still be very slow. There are better solutions for that (Presto, Impala).

Hive Away, but not for everything!