Reviews (1-25 of 27)
- Simple query language built on top of the MapReduce paradigm.
- Provides parallel execution over distributed system.
- Tabular format and connectors available for all cloud platforms.
- Complex joins may take time to execute due to shuffling of data.
- Static queries mostly.
- Can be almost 100 times slower than Apache Spark.
- Dependent on external memory and storage to execute.
- Hive queries are very efficient and fast, and produce results in seconds.
- Provides features like ETL, reporting, and analytics on top of Hadoop file systems.
- Supports SQL like syntax to query data from Hive tables.
- Supports multiple data formats.
- Limited updates and support, as it is open source.
- Not suitable for Online Transaction Processing (OLTP) systems.
- No sub-query support.
It's being used for fetching and generating all the product metrics, and for fetching legal data whenever required. All the product history data is stored in it.
It's the one-stop, cheaper solution for storing and fetching all the analytics data.
- It is very easy to set up and start with
- Apache Hive is a cheaper solution for data warehousing and aggregation compared to other products
- One of the cons is the speed, which is slightly lower compared to other enterprise solutions like BigQuery
- Also, it needs to be maintained by the company itself
If our requirement is aggregation within seconds for terabytes of data, then we may have to look for other solutions
- Its ability to integrate with Hadoop
- Multiple users can query data simultaneously
- Conversion of varieties of data formats within Hive
- ETL can be done easily
- It is used only for OLAP and not used for OLTP
- Sub queries not supported
- Flexibility through schema on read
- Familiar SQL like query language
- Functions for complex queries and analysis
- Slower processing than other tools on the market
It was one of those technical sessions, and I was supposed to demonstrate a word-count program on a novel downloaded from Project Gutenberg. I was successfully able to download the novel, load it into the Hadoop platform, and execute a HiveQL (a SQL-like syntax used by Apache Hive) query to show the counts of a few unique words, along with related examples.
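For reference, the word-count demo described above can be sketched in HiveQL with a table, a load, and one query; the table name `novel` and the HDFS path are assumptions for illustration, not the reviewer's actual setup.

```sql
-- Hypothetical staging table: one line of the novel per row.
CREATE TABLE novel (line STRING);
LOAD DATA INPATH '/user/demo/gutenberg/novel.txt' INTO TABLE novel;

-- Split each line on whitespace into words, then count each word.
SELECT word, count(1) AS cnt
FROM (SELECT explode(split(line, '\\s+')) AS word FROM novel) words
GROUP BY word
ORDER BY cnt DESC
LIMIT 20;
```

Hive compiles this into MapReduce jobs behind the scenes, which is exactly the classic word-count program expressed as SQL.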
- The capability to handle large amounts of data and its querying process.
- A syntax similar to SQL is an added advantage.
- An active developer support and community always ready to help.
- Ease of usage.
- Resource-consuming at times; it may be that I was using a larger object file.
- Needs an update or modify functionality added. This has to be the minimal CRUD requirement.
The only underlying problem could be that Apache Hive is designed to run on the Apache Hadoop ecosystem. People who are not comfortable using a Linux-style tree-structured file system, or who are not likely to use a Linux OS, might not like to use Hive.
- Reading databases
- Writing databases
- Storing databases
- Distributed databases
- Improvement techniques for handling Relational Data
- Advanced optimizations
- Transactional memory
- Gives access to files stored in a variety of data storage systems
- Facilitates ETL operations, reporting and data analysis
- Supports queries expressed in a declarative language very similar to SQL
- Not suitable for online transaction processing workloads
- Much more complicated than any typical RDBMS
- Licensing model based on Apache License 2.0
- The SQL-like query interface is the core value and shining feature of Hive.
- It supports storage in various data formats and also allows indexing.
- It is fast.
- No transaction support.
- No sub-query support.
- Can only deal with cold data (non-real-time).
- Monitor query performance
- Manage tables in the data warehouse
- Uses standard SQL
- UI is quite dated and not intuitive
- Open-source, so does not have consistent updates or support
- Not the most optimal for ETL processes
- The SQL-like query language is very familiar to all the CS students. Hence, it's easy to use.
- I used it on a server, so I realized it is very scalable and can be used to process both small and big datasets.
- I particularly liked the UDF functionality where the user could define functions to produce particular output.
- Transactions are not supported
- Lack of subqueries made some tasks achievable only by completing one query and then running a subsequent one
- It is not as fast as Spark.
On the other hand, it's definitely slower than some alternatives such as Spark. Also, it's not recommended for processing small datasets; Pandas and other normal data-loading libraries can be useful for dealing with small datasets.
- Querying in Apache Hive is very simple because it is very similar to SQL.
- Hive produces good ad hoc queries required for data analysis.
- Another advantage of Hive is that it is scalable.
- Apache Hive isn't designed for and doesn't support online processing of data.
- Sub queries not supported.
- Updating the data can be a problematic task.
- One of the standard SQL on Hadoop implementations. Comes installed in both HDP and CDH Hadoop distributions.
- Hive's Live Long and Process (LLAP) has brought significant recent improvements to long-running queries.
- Allows BI tools to run analysis over Hadoop data.
- Allows various relational databases for its metastore. These include MySQL, Postgres, Derby, or Oracle.
- Needs to keep up with execution-engine improvements. Hive on Spark or Tez, and then LLAP, are good starts.
- Overall speed of ad-hoc querying could be improved.
- Can query large sets of data, and is fast compared to an RDBMS
- Can use SQL for data access, with no need to learn a new language
- Can write custom functions (UDFs) in Python and also Java
- Security roles for different users should be implemented
- All the functionalities of SQL should be available
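The custom-function support mentioned above covers both Java UDFs and Python scripts. A minimal sketch follows; the jar path, class name, script, table, and column names are all hypothetical, chosen only to show the registration pattern.

```sql
-- Register a custom Java UDF (jar path and class name are hypothetical).
ADD JAR /tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION mask_email AS 'com.example.hive.MaskEmailUDF';

-- Use the function like any built-in (table and column assumed).
SELECT mask_email(email) FROM customers LIMIT 10;

-- Python can be plugged in through Hive's streaming TRANSFORM interface:
ADD FILE /tmp/clean.py;
SELECT TRANSFORM (name, city) USING 'python clean.py' AS (name, city)
FROM customers;
```

Java UDFs run in-process, while TRANSFORM streams rows through an external script over stdin/stdout, so the Java route is usually faster for hot paths.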
- To query on large sets of data
- Faster access compared to traditional Databases
- OLAP projects
- Data Warehousing project
- To get insights from gigabytes or terabytes of data
- Rule based projects and also to identify the patterns in data
- For applying transformations on large sets of data
- Faster response time than traditional databases
- Also able to connect with Hadoop components
- For complex analytical and different types of data formats
Events are gathered in HDFS by Flume and need to be processed into Parquet files for fast querying. The input data contains variable attributes in the JSON payload, as each customer could define custom attributes.
- Hive syntax is almost like SQL, so for someone already familiar with SQL it takes almost no effort to pick up Hive.
- Able to run MapReduce jobs that parse JSON and generate dynamic partitions in Parquet file format.
- Simplifies your experience with Hadoop, especially for non-technical/non-coding partners.
- Hive doesn't support many features that traditional RDBMS SQL has, so it may not be as easy a transition as one would presume.
- Being open source, it has its share of problems and a lack of support; you need to explore community groups to get clarifications if you are not using one of the big distribution providers like Cloudera or HW.
- Hive is comparatively slower than its competitors. It's easy to use, but that comes at the cost of processing speed. If you are using it just for batch processing, then Hive is well and fine.
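The Flume-to-Parquet flow with dynamic partitions described earlier can be sketched in HiveQL roughly as follows; all table names, paths, and JSON fields here are assumptions for illustration, since the reviewer's schema is not given.

```sql
-- Dynamic partitioning must be enabled when partition values come from the data.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Raw events landed by Flume: one JSON document per row (names assumed).
CREATE EXTERNAL TABLE raw_events (json STRING)
LOCATION '/data/flume/events';

-- Target table stored as Parquet, one partition per customer.
CREATE TABLE events (
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (customer_id STRING)
STORED AS PARQUET;

-- Parse the JSON payload; the last SELECT column feeds the dynamic partition.
INSERT OVERWRITE TABLE events PARTITION (customer_id)
SELECT get_json_object(json, '$.event_id'),
       get_json_object(json, '$.attributes'),
       get_json_object(json, '$.customer_id')
FROM raw_events;
```

Keeping the variable customer attributes as a raw JSON string column (`payload`) is one way to cope with per-customer schemas; they can be unpacked later with `get_json_object` at query time.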
We are trying to mine data from massive data sets for a wide variety of purposes (debugging production issues, creating business metrics, models, and forecasts, among other things). We have been able to do this very easily using our data warehouse and a combination of Hive and Pig. It makes things simpler for your BAs, as they are familiar with SQL and can adapt to Hive without too much technical know-how.
- SQL like query engine, allows easy ramp up from a standard RDBMS
- Scalability is great
- If properly configured, the data retrieval is fantastic
- The way we currently have it implemented is quite slow, but I believe that's more down to our implementation
- Joins tend to be slow
- Hive, which leverages traditional MapReduce at its core, can be used to process a large amount of data without a problem. Any problem that can be solved with MapReduce can now be simply expressed in SQL.
- Hive leverages the disk when processing large data and is not limited by the physical memory of any one machine (which is a limitation for systems like Presto). Hence it even allows reasonable fact-fact cross joins.
- Hive is extensible with UDFs. For any common patterns you can quickly write your own function set and it can be leveraged by everyone.
- Hive's SQL syntax is unique and does not conform to ANSI SQL. This is quite painful for beginners.
- The ability to upsert records would be nice to have. Hive is cumbersome for mutable data where partitions require them to be rewritten. No one has solved this really well. If this is solved - it could be leveraged by many systems.
- Apache Hive works extremely well with large data sets. Analysis over a large data set (example: 1 PB of data) is made easy with Hive.
- User-defined functions give users the flexibility to define frequently used operations as functions.
- The string functions available in Hive have been used extensively for analysis.
- Joins (especially left and right joins) are very complex, space-consuming, and time-consuming. Improvement in this area would be of great help!
- More descriptive errors would help in resolving issues that arise when configuring and running Apache Hive.
Latency that exists when working with small data sets is a situation that needs to be looked at. Apache Hive is less appropriate in that scenario.
- Supports SQL like queries
- Various storage types including RCFile, HBase, ORC, etc.
- Supports indexing for acceleration
- HiveQL does not have all the features of SQL
- No support for transactions
- It's Fast!
- You can store a different kind of data structures here other than the standard ones
- Good scalability
- Good redundancy too
- It's not as ACID-compliant as an RDBMS; ACID support is a recently added feature and still needs work.
- This is not the tool to choose for online data processing.
- It does not support sub-queries.
- It can't process data in real time.
It's good for fast query processing and for storing large amounts of data.
- Partitioning to increase query efficiency.
- SerDes to support different data storage formats.
- Integrates well with Impala, and data can be queried by Impala.
- Support for the Parquet compression format
- Speed is slower compared to Impala, since it uses MapReduce
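The partitioning and SerDe features praised above look roughly like this in HiveQL; table names, columns, and paths are illustrative assumptions.

```sql
-- Partitioned table: each dt value maps to its own HDFS directory.
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- Partition pruning: only the dt='2020-01-01' directory is scanned,
-- not the whole table.
SELECT count(*) FROM page_views WHERE dt = '2020-01-01';

-- A SerDe tells Hive how to read another storage format, here CSV:
CREATE EXTERNAL TABLE raw_csv (user_id STRING, url STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/data/raw/csv';
```

Because Impala shares the Hive metastore, tables defined this way are also visible to Impala, which is what makes the integration mentioned above work.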
- Querying, joining and aggregating data
- Built-in and user-defined functions
- Support for other big data frameworks like Spark
- Need better user interfaces for browsing datastores and querying
- Hive is good for ETL workloads on Hadoop.
- HiveQL translates SQL-like queries into MapReduce jobs. It supports custom MapReduce scripts to be plugged in.
- Hive has two kinds of tables: Hive-managed tables and external tables.
- Use an external table when other applications like Pig, Sqoop, or MapReduce are also using the file in HDFS. Once we delete the external table from Hive, it just deletes the metadata from Hive, and the original file in HDFS stays.
- Use Hive for analytical workloads: write-once, read-many scenarios. Avoid updates and deletes.
- Behind the scenes, Hive creates MapReduce jobs. Hive performance is slow compared to Apache Spark.
- MapReduce writes the intermediate outputs to disk, whereas Spark operates in-memory and uses a DAG.
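The managed-versus-external distinction above comes down to one keyword in the DDL; the table names and path here are assumptions for illustration.

```sql
-- Managed table: Hive owns the data; DROP TABLE removes both the
-- metastore entry and the underlying files.
CREATE TABLE managed_logs (line STRING);

-- External table: Hive only tracks metadata; DROP TABLE leaves the HDFS
-- files in place for other tools (Pig, Sqoop, plain MapReduce) to keep using.
CREATE EXTERNAL TABLE shared_logs (line STRING)
LOCATION '/data/shared/logs';
```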
Apache Hive solves a few issues for us, the main one being the ability to analyze large volumes of data on S3 directly with overall strong performance. We have been able to analyze billions of records in a matter of minutes with a relatively small EC2 cluster using Apache Hive. It also allows our Data Analysts to simply write SQL and avoids the ramp-up needed for other tools such as Apache Pig.
- Apache Hive allows us to write expressive solutions to complex problems thanks to its SQL-like syntax.
- Relatively easy to set up and start using.
- Very little ramp-up to start using the actual product, documentation is very thorough, there is an active community, and the code base is constantly being improved.
- Debugging can be messy with ambiguous return codes and large jobs can fail without much explanation as to why.
- Hive is only SQL-like; while more features are being added, we have found that some things do not translate over (for example outer joins, inserts, columns can only be referenced once in a select, etc.).
- For our ETL jobs it does not seem to be the optimal tool, since tuning and performance are difficult; Apache Pig may be better for heavy processing jobs.
- Faster than writing MapReduce or scalding jobs to access data in Hadoop.
- Syntax is essentially the same as that of SQL, making the barriers for entry to start using data low.
- Apache Hive can be quite slow and is not suitable for interactive querying. Simple queries will take many minutes and more complex queries can take a very long time to finish running.