Item: Presto
Rating: 8
Author: Praveen Murugesan

Overall Satisfaction with Presto

Use Cases and Deployment Scope

Presto is used at Uber for ad-hoc querying on datasets and for a few data-driven applications. It is used across the entire organization for data-driven decisions, reporting, experimentation analysis, and so on. Presto provides a scalable, fast distributed query engine, and we run it on top of HDFS.

Pros and Cons

Pros

Fast - Presto is incredibly fast due to its optimized query engine and is well suited for interactive analysis.
Flexible - Presto is highly flexible because it uses a plug-and-play model for data sources. Joining and querying across different data sources (e.g. HDFS, MySQL, Kafka) is very easy with Presto.
ANSI SQL - Presto follows ANSI SQL, the recognized SQL standard, which allows easy query migration without much overhead.
Large Fact + Small Dimension table joins made fast - By design, Presto outperforms most distributed query engines out there on this type of query.
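
To illustrate why the fact + small dimension case tends to be cheap for an in-memory engine, here is a minimal Python sketch of a hash join. It is not Presto's implementation; the function name `hash_join` and the sample tables are made up for illustration, and it assumes each dimension key appears only once. The small dimension table is loaded into a dictionary once, and the large fact table is streamed past it row by row, so memory use grows with the small side only.

```python
from typing import Dict, Iterable, Iterator, Tuple


def hash_join(
    fact_rows: Iterable[Tuple[int, float]],
    dim_rows: Iterable[Tuple[int, str]],
) -> Iterator[Tuple[int, float, str]]:
    """Inner-join fact rows (city_id, fare) to dimension rows (city_id, name)."""
    # Build phase: the small dimension table is held entirely in memory.
    # Assumption: city_id is unique in the dimension table.
    lookup: Dict[int, str] = {city_id: name for city_id, name in dim_rows}
    # Probe phase: the fact table is streamed and never materialized.
    for city_id, fare in fact_rows:
        name = lookup.get(city_id)
        if name is not None:
            yield (city_id, fare, name)


# Hypothetical sample data.
dim = [(1, "San Francisco"), (2, "Chennai")]
fact = [(1, 12.5), (2, 3.0), (1, 7.25), (3, 9.0)]  # city_id 3 has no match

joined = list(hash_join(fact, dim))
```

Fact rows whose key is missing from the dimension table are dropped, which matches inner-join semantics.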

Cons

Presto was not designed for large fact-fact joins. This is by design: Presto does not leverage disk and uses memory for processing, which in turn makes it fast. However, this is a tradeoff. In an ideal world, people would like to use one system for all their use cases, and Presto should become more complete by solving this problem.
Resource allocation is not similar to YARN. Presto uses priority-queue-based query resource allocation, so a query that takes long takes even longer. This might be alleviated by giving some control back to the user to define or override priority.
UDF support is not available in Presto. You have to write your own functions. While this is good for performance, it comes at the huge overhead of building exclusively for Presto, and the functions are not interoperable with other systems like Hive, SparkSQL, etc.
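
The memory tradeoff behind the first point can be made concrete with a back-of-the-envelope check. This is only a rough sketch under stated assumptions: the row counts, row size, and memory budget are hypothetical, the helper names are invented, and a real engine partitions the work across workers in ways this ignores. It simply asks whether the build side of a join fits in memory when spilling to disk is not an option.

```python
def build_side_bytes(row_count: int, bytes_per_row: int) -> int:
    """Rough size of the in-memory hash table for the build side of a join."""
    return row_count * bytes_per_row


def fits_in_memory(row_count: int, bytes_per_row: int, budget_bytes: int) -> bool:
    """True if the build side fits within the budget (no disk spill assumed)."""
    return build_side_bytes(row_count, bytes_per_row) <= budget_bytes


GIB = 1024 ** 3
budget = 64 * GIB            # hypothetical memory available for the join
dim_rows = 1_000_000         # hypothetical small dimension table
fact_rows = 10_000_000_000   # hypothetical large fact table
row_size = 100               # hypothetical average bytes per row

dim_ok = fits_in_memory(dim_rows, row_size, budget)    # about 95 MiB
fact_ok = fits_in_memory(fact_rows, row_size, budget)  # about 1 TB
```

With these made-up numbers the dimension table fits comfortably while a second fact table does not, which is the shape of the fact-fact limitation described above.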

Return on Investment

Presto has helped scale Uber's interactive data needs. We have migrated a lot away from proprietary tech like Vertica.
Presto has helped us build data-driven applications on its stack rather than maintain separate online/offline stacks.
Presto has helped us build data exploration tools by leveraging its power for interactive querying, and it is immensely valuable for data scientists.

Alternatives Considered

Vertica, Apache Spark and Apache Pig

I think Presto is one of the best solutions out there today, at the cutting edge of interactive query analysis. One of the challenges is that Presto is a niche tool for the interactive query use case and doesn't have as many bells and whistles as Spark. If, in the foreseeable future, they are able to make Presto work without the need for Hive and close the remaining gaps, it could be game changing and a direct threat to Spark.

Other Software Used

Vertica, Apache Hive, Apache Pig

Likelihood to Recommend

Presto is for simple interactive queries, whereas Hive is for reliable processing. If you have a fact-dim join, Presto is great. For fact-fact joins, however, Presto is not the solution. Presto is a great replacement for proprietary technology like Vertica.

