One of the first SQL on Hadoop tools. Perhaps not the best.
Overall Satisfaction with Apache Hive
Hive allows us to run SQL queries against data sitting in Hadoop.
Pros
- One of the standard SQL on Hadoop implementations. Comes installed in both HDP and CDH Hadoop distributions.
- Hive LLAP (Live Long and Process) has recently made significant improvements to long-running queries.
- Allows BI tools to run analysis over Hadoop data.
- Supports several relational databases for its metastore, including MySQL, PostgreSQL, Derby, and Oracle.
Cons
- Needs to keep up with execution engine improvements. Hive on Spark or Tez, and then LLAP, are good starts.
- Overall speed of ad-hoc querying could be improved.
- Allows analysts to use their SQL skills against large datasets.
- Slow queries allow for opportunities to discover bottlenecks, parameters to tune, and alternative tools or ways to architect a system.
- Alternatives considered: Apache Impala, Apache Spark, and PostgreSQL
Hive was one of the first SQL on Hadoop technologies, and it comes bundled with the main Hadoop distributions, HDP and CDH. It has improved considerably since its release, but selecting the right SQL on Hadoop technology requires a good understanding of the strengths and weaknesses of the alternatives.