Bringing Structure to your Unstructured Data
October 25, 2017

Bharadwaj (Brad) Chivukula | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User

Overall Satisfaction with Apache Hive

1. In retail, business partners are more comfortable querying their own data than relying on engineers, and Hive solves that problem. The main purpose of using Hive is to build reports and analyze data stored in the Hadoop file system.
2. Events are gathered in HDFS by Flume and need to be processed into Parquet files for fast querying. The input data contains variable attributes in the JSON payload, as each customer can define custom attributes.
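
The second use case can be sketched in plain Python to show the idea, not as Hive code. Each JSON event is split into a few fixed columns plus a map of whatever custom attributes the customer defined, and rows are grouped under a partition key derived from the data itself, which is roughly what dynamic partitioning does. The field names (`event_id`, `customer`, `ts`), the sample events, and the choice of partition keys are all assumptions made for illustration; they are not from the actual pipeline.

```python
import json
from collections import defaultdict

# Hypothetical fixed fields; every other key is treated as a custom attribute.
FIXED_FIELDS = ("event_id", "customer", "ts")


def to_row(line):
    """Parse one JSON event into fixed columns plus a map of custom attributes."""
    event = json.loads(line)
    row = {name: event.get(name) for name in FIXED_FIELDS}
    row["attrs"] = {k: v for k, v in event.items() if k not in FIXED_FIELDS}
    return row


def partition_rows(lines):
    """Group rows under a partition key derived from the data (customer, date)."""
    partitions = defaultdict(list)
    for line in lines:
        row = to_row(line)
        day = row["ts"][:10]  # assumes ISO-8601 timestamps, e.g. 2017-10-25T09:00:00
        partitions[f"customer={row['customer']}/dt={day}"].append(row)
    return dict(partitions)


# Made-up sample events with differing custom attributes.
SAMPLE_EVENTS = [
    '{"event_id": 1, "customer": "acme", "ts": "2017-10-25T09:00:00", "color": "red"}',
    '{"event_id": 2, "customer": "acme", "ts": "2017-10-25T10:30:00", "size": "L"}',
    '{"event_id": 3, "customer": "globex", "ts": "2017-10-26T08:15:00"}',
]

if __name__ == "__main__":
    for path, rows in sorted(partition_rows(SAMPLE_EVENTS).items()):
        print(path, len(rows))
```

The `key=value` directory names mirror the layout Hive uses for partitioned tables. Keeping the custom attributes in a single map column means a new attribute from one customer does not force a schema change for everyone else.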

  • Hive syntax is almost like SQL, so for someone already familiar with SQL it takes almost no effort to pick up Hive.
  • It can run MapReduce jobs that parse JSON and generate dynamic partitions in Parquet file format.
  • Simplifies your experience with Hadoop, especially for non-technical or non-coding partners.
  • Hive doesn't support many features that traditional RDBMS SQL has, so the transition may not be as easy as one would presume.
  • Being open source, it has its share of problems and a lack of support; you need to explore community groups for clarifications if you are not using one of the big distribution providers like Cloudera or Hortonworks.
  • Hive is comparatively slower than its competitors. It's easy to use, but that ease comes at the cost of processing speed. If you are using it only for batch processing, Hive works well.
  • Hive has been instrumental in transforming the technical landscape without putting business partners at risk when moving to the Hadoop ecosystem; it helps you see your unstructured data in a structured format.
  • Primary Querying engine for Data Analytics.
  • Data analytics, making vast amounts of data available for general BI uses.

Hive suits storing bulk data in tabular form where there is no need for a primary key and where receiving redundant data will not cause a problem. Beware that it runs MapReduce even for small amounts of data. If you intend to use it for transactional records, do not go with it. Also explore other tools such as Spark, since many of the features that Hive offers are now supported by Spark.
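
To make the point about primary keys concrete, here is a small plain-Python sketch (again, not Hive code) of an append-only table that accepts duplicate rows without complaint. If a particular report does need one row per key, a common approach is to deduplicate at read time. The `order_id` and `version` columns and the sample rows are hypothetical.

```python
def append(table, row):
    """Append a row with no uniqueness check, as a table without a primary key would."""
    table.append(row)


def latest_per_key(table, key, version):
    """Deduplicate at read time: keep the highest-version row for each key."""
    best = {}
    for row in table:
        current = best.get(row[key])
        if current is None or row[version] > current[version]:
            best[row[key]] = row
    return list(best.values())


# Made-up rows; order 1 is received twice.
orders = []
append(orders, {"order_id": 1, "version": 1, "status": "new"})
append(orders, {"order_id": 2, "version": 1, "status": "new"})
append(orders, {"order_id": 1, "version": 2, "status": "shipped"})

if __name__ == "__main__":
    print(len(orders))                                   # all three rows are stored
    print(latest_per_key(orders, "order_id", "version"))  # one row per order_id
```

The write path stays cheap because nothing is checked on insert; the cost of picking one row per key is paid only by the queries that need it.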

We mine massive data sets for a wide variety of purposes (debugging production issues and creating business metrics, models, and forecasts, among other things). We have been able to do this very easily using our data warehouse and a combination of Hive and Pig. It makes things simpler for your BAs, as they are familiar with SQL and can adapt to Hive without much technical know-how.