Easy interface to use Hive
September 12, 2017

Anonymous | TrustRadius Reviewer
Score 7 out of 10
Vetted Review
Verified User

Overall Satisfaction with Treasure Data

We use it mainly as a logging tool and to take snapshots of important collections. It is also our gateway between the production database and Redshift, which is used by less technical users. Any task that requires use of big data goes through the Treasure Data pipeline. Currently, it is only used by the Engineering team.
  • It uses Hive, which allows you to analyze TBs of data in a reasonable amount of time.
  • Since some tables may have "duplicate" records with respect to some columns, TD provides functions that allow you to pick essential data from the different records that represent "the same event".
  • When you need faster queries, there's also Presto. Presto does not have the overhead of Hive.
  • When exploring a table, there should be a faster way to query it, e.g., a "Query Table" button that opens the query page with boilerplate SQL prewritten.
  • When a database has too many tables, the Query page becomes unresponsive while it loads a lot of data from all the tables. There should be a way to opt out of that behavior.
  • Bad error messages when a query fails. (For example, I've received errors about a parenthesis when the real issue was that I hadn't assigned an alias to a subquery.)
  • I often use the same tables. There should be a tab with "My Most Used Tables" or something like that, so I can get faster access to what I need in order to do work.
  • The API throws 404s in some instances where it should return 403s or 401s. This makes debugging hard when new team members haven't been granted the same level of access as older ones.
  • This question might be more appropriate for team leads.
  • There is a data visualization tool we use that can easily pull data from TD via Presto. This speeds up development time, although it is not as performant as pushing the data to Redshift beforehand.
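The deduplication point above can be sketched in standard SQL. This is a generic ROW_NUMBER() approach run against SQLite for illustration, not Treasure Data's own helper functions; the table and column names are hypothetical. It also happens to show the subquery-alias requirement mentioned in the error-message complaint.

```python
import sqlite3

# Hypothetical "events" table where the same logical event appears
# more than once; we keep only the most recent record per event_id.
# A generic ROW_NUMBER() sketch of the idea, not TD's own functions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id INTEGER, payload TEXT, time INTEGER);
INSERT INTO events VALUES
  (1, 'first write',  100),
  (1, 'latest write', 200),
  (2, 'only write',   150);
""")

rows = conn.execute("""
SELECT event_id, payload
FROM (
  SELECT event_id, payload,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY time DESC) AS rn
  FROM events
) t            -- the subquery alias that, when forgotten, triggers
WHERE rn = 1   -- the confusing parenthesis error described above
ORDER BY event_id
""").fetchall()

print(rows)  # [(1, 'latest write'), (2, 'only write')]
```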
We still use all of the above. They are part of an ecosystem of data software products, and each has its own purpose. As I mentioned before, the ease of writes to TD and the ability to query vast amounts of data in a reasonable time are why we will not be letting TD go any time soon.
It is a great solution for storing (and querying) a large amount of data. Its API is mostly good, though I would love more documentation: there's a chunker I still have no idea how to use, and a row handler I basically ended up copying from a colleague's code when I needed my own. Building reports from TD is pretty simple.

It is not a good fit as a production database: response times can be lengthy, which would drive users away.

Using Treasure Data

It took me just a few minutes before I managed to get started using their system. Aside from minor bumps, I think it's pretty straightforward and it's a great tool.
Pros:
  • Like to use
  • Relatively simple
  • Easy to use
  • Well integrated
  • Consistent
  • Quick to learn
  • Convenient
  • Feel confident using
  • Familiar
Cons:
  • None
  • The Query interface is pretty straightforward. It even highlights syntax errors as they happen.
  • You can query from a Python shell via an API call, which lets you build dynamic queries that depend on other data (such as dates or particular IDs you might be interested in).
  • When the results are huge, they are not easy to process via the API. There are row handlers and chunkers, but they are not well documented (or, if they are, the docs are not easily accessible, as I couldn't find them).
  • The Query interface attempts to load (metadata for?) all the tables in the database, which makes the website unresponsive for a bit.
  • Poor error messages when the query was not typed directly in the interface.
  • There are no line numbers in the query editor. Line numbers would make it easier to pinpoint where syntax errors are.
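The "dynamic queries" point above can be illustrated with a minimal sketch. Only the string construction is shown; submitting the query would go through TD's API client, which is omitted here. The function, table, and column names are hypothetical, and the use of TD_TIME_RANGE assumes Treasure Data's time-range UDF.

```python
from datetime import date

def build_query(table, ids, start, end):
    """Build a SQL string from runtime data (hypothetical table and
    column names). In practice the string would then be submitted to
    Treasure Data via its API; that step is not shown here."""
    # int() coerces each id, guarding against SQL injection via the list
    id_list = ", ".join(str(int(i)) for i in ids)
    return (
        f"SELECT id, event, time FROM {table} "
        f"WHERE id IN ({id_list}) "
        f"AND TD_TIME_RANGE(time, '{start.isoformat()}', '{end.isoformat()}')"
    )

q = build_query("events", [101, 202], date(2017, 9, 1), date(2017, 9, 12))
print(q)
```

This is what makes API-driven querying convenient: the IDs and date range can come from another query's results or from application state, rather than being typed into the web console.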