Why do we Presto
August 08, 2017
Engineering Manager - Ride Experience
Uber (Internet, 5001-10,000 employees)
Score 8 out of 10
Overall Satisfaction with Presto
Presto is used at Uber for ad hoc querying on datasets and also for a few data-driven applications. It is used across the entire organization for data-driven decisions, reporting, experimentation analysis, and so on. Presto provides a scalable, fast distributed query engine, and we use it on top of HDFS.
- Fast - Presto is incredibly fast due to its optimized query engine and is well suited for interactive analysis.
- Flexible - Presto is highly flexible because it operates on a plug-and-play model for data sources. Querying and joining across different data sources (e.g. HDFS, MySQL, Kafka) is very easy with Presto.
- ANSI SQL - Presto follows ANSI SQL, the recognized SQL standard, which allows easy query migration without much overhead.
- Large Fact + Small Dimension table joins made fast - By design, Presto outperforms most distributed query engines out there on this type of query.
- Presto was not designed for large fact-to-fact joins. This is by design: Presto does not leverage disk and uses memory for processing, which in turn makes it fast. However, this is a tradeoff. In an ideal world, people would like to use one system for all their use cases, and Presto should become more comprehensive by solving this problem.
- Resource allocation is not similar to YARN: Presto has a priority-queue-based query resource allocation, so a query that takes long ends up taking even longer. This might be alleviated by giving some control back to the user to define or override priority.
- UDFs written for other engines are not supported in Presto. You have to write your own functions specifically for Presto. While this is good for performance, it comes at the huge overhead of building exclusively for Presto and not being interoperable with other systems like Hive, SparkSQL, etc.
- Presto has helped scale Uber's interactive data needs. We have migrated a lot away from proprietary tech like Vertica.
- Presto has helped us build data-driven applications on its stack rather than maintain a separate online/offline stack.
- Presto has helped us build data exploration tools by leveraging its strength in interactive querying, which is immensely valuable for data scientists.
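To illustrate why the large fact + small dimension join pattern mentioned above suits an in-memory engine, here is a minimal conceptual sketch of a broadcast-style hash join in plain Python. It is not Presto's implementation; the function name `broadcast_hash_join` and the `trips` and `cities` tables are hypothetical and exist only for this example. The small table is indexed in memory once, and the large table is streamed past it.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


def broadcast_hash_join(
    fact_rows: Iterable[Row],
    dim_rows: Iterable[Row],
    fact_key: str,
    dim_key: str,
) -> Iterator[Row]:
    """Inner-join a large (streamed) table against a small (in-memory) table."""
    # Build phase: index the small dimension table by its join key.
    # This index is the only thing that has to fit in memory.
    index = defaultdict(list)
    for dim in dim_rows:
        index[dim[dim_key]].append(dim)

    # Probe phase: stream the large fact table one row at a time.
    for fact in fact_rows:
        for dim in index.get(fact[fact_key], ()):
            # Fact columns win if both sides share a column name.
            yield {**dim, **fact}


# Hypothetical sample data: a "fact" table of trips and a "dimension" table of cities.
trips = [
    {"trip_id": 1, "city_id": 10},
    {"trip_id": 2, "city_id": 20},
    {"trip_id": 3, "city_id": 10},
    {"trip_id": 4, "city_id": 99},  # no matching city, dropped by the inner join
]
cities = [
    {"city_id": 10, "city_name": "San Francisco"},
    {"city_id": 20, "city_name": "Chicago"},
]

joined = list(broadcast_hash_join(trips, cities, "city_id", "city_id"))
for row in joined:
    print(row["trip_id"], row["city_name"])
```

The same sketch hints at the fact-to-fact limitation listed in the cons: when both inputs are large, neither side's index fits comfortably in memory, and an engine that avoids disk has no cheap fallback.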
I think Presto is one of the best solutions out there today at the cutting edge for interactive query analysis. One challenge is that Presto is a niche tool for the interactive query use case and doesn't have as many knobs and features as Spark. In the foreseeable future, if they are able to make Presto work without the need for Hive and close the remaining gaps, it could be game changing and a direct threat to Spark.