Apache Spark and Presto are open-source distributed data processing engines. Both are designed for ‘big data’ applications and help analysts and data engineers query large amounts of data quickly. Although they have many similarities, Presto is focused on SQL query jobs, while Apache Spark is designed to handle applications that require more computational analysis, such as machine learning.
Both Apache Spark and Presto are used mostly by large enterprises, with a significant mid-sized company user base as well. Since both engines are designed for big data processing, they’re often overkill for smaller businesses.
Although Apache Spark and Presto are used for similar applications, each has distinguishing features that set it apart.
Apache Spark is designed for fast data processing in a variety of contexts, including machine learning, ETL, and ad-hoc querying. It uses an in-memory processing design, meaning it can run with very few disk read/write operations and process enormous datasets quickly. Developers report that its SQL interface and object-oriented design make it easy to understand and write code for. Users also appreciate its wide variety of APIs for ETL procedures and cluster management. Apache Spark has a large support community and wide industry adoption, and the internet has plenty of recommended solutions to common problems.
Presto is optimized specifically for SQL, meaning it can exceed Apache Spark’s speed for SQL queries. It queries data in-place, without copying or moving data. Presto also uses a flexible, plug-and-play architecture that makes it easy to combine and simultaneously query data from multiple sources, including both SQL and NoSQL databases. It’s suitable for ad-hoc querying, batch ETL jobs, and data analysis for A/B testing.
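The idea of querying several sources in place with one SQL statement can be illustrated by analogy. The sketch below does not use Presto. It assumes only Python's standard-library `sqlite3` module, where `ATTACH DATABASE` lets a single query join tables that live in two separate database files, loosely resembling how a Presto query can reference tables from different connectors by their `catalog.schema.table` names. The file names, table names, function names, and sample rows are all hypothetical.

```python
import os
import sqlite3
import tempfile


def build_sources(directory):
    """Create two separate SQLite files standing in for two independent data sources."""
    orders_path = os.path.join(directory, "orders.db")
    users_path = os.path.join(directory, "users.db")

    conn = sqlite3.connect(orders_path)
    try:
        conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
        conn.executemany(
            "INSERT INTO orders VALUES (?, ?)",
            [(1, 20.0), (1, 5.0), (2, 12.5)],
        )
        conn.commit()
    finally:
        conn.close()

    conn = sqlite3.connect(users_path)
    try:
        conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
        conn.executemany(
            "INSERT INTO users VALUES (?, ?)",
            [(1, "ana"), (2, "ben")],
        )
        conn.commit()
    finally:
        conn.close()

    return orders_path, users_path


def federated_totals(orders_path, users_path):
    """Join tables from both files in one query, without copying rows between them."""
    conn = sqlite3.connect(orders_path)
    try:
        # Attach the second file under its own schema name (hypothetical: users_src).
        conn.execute("ATTACH DATABASE ? AS users_src", (users_path,))
        rows = conn.execute(
            """
            SELECT u.name, SUM(o.amount)
            FROM main.orders AS o
            JOIN users_src.users AS u ON u.id = o.user_id
            GROUP BY u.name
            ORDER BY u.name
            """
        ).fetchall()
    finally:
        conn.close()
    return rows


with tempfile.TemporaryDirectory() as workdir:
    orders_db, users_db = build_sources(workdir)
    totals = federated_totals(orders_db, users_db)

print(totals)
```

In a real Presto deployment, the sources would be configured as catalogs on the cluster rather than attached per connection, so this is an illustration of the query pattern only, not of Presto's setup or performance.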
Before adopting Apache Spark or Presto, consider the limitations of each engine.
Apache Spark’s in-memory processing may be fast, but it also requires plenty of memory, which can quickly get expensive. Some users found that Apache Spark isn’t ideal for real-time analytics, while others found its data security capabilities lacking. Some users also report that it lacks the automatic optimization and caching features they needed, so they had to build that functionality themselves. Finally, Apache Spark may be designed intuitively, but it’s still a complicated tool with a steep learning curve.
Presto’s SQL optimization is also its primary limitation. It’s designed primarily to run SQL queries, while Apache Spark is suitable for a wider range of applications. This also means that Presto is at its best when the data it’s querying is already in SQL databases; although Presto can query and join data from multiple database types, you only get the highest speeds with SQL data. Additionally, Presto requires a lot of setup to run properly, with installation and configuration across many different nodes.
Both Apache Spark and Presto are open-source and free.
Provided by the TrustRadius Research Team
Published on December 3, 2020
Likelihood to Recommend
- Rich APIs for data transformation make it very easy to transform and prepare data in a distributed environment without worrying about memory issues
- Faster execution times compared to Hadoop and Pig Latin
- Easy SQL interface to the same data set for people who are comfortable exploring data in a declarative manner
- Interoperability between SQL and Scala / Python styles of munging data
- Linking, embedding links and adding images is easy enough.
- Once you have become familiar with the interface, Presto becomes very quick & easy to use (but you have to practice & repeat to know what you are doing; it is not as intuitive as one would hope).
- Organizing & design is fairly simple with click & drag parameters.
- Memory management is very weak.
- PySpark is not as robust as Scala with Spark.
- Spark master high availability (HA) is needed. It is not as highly available as it should be.
- Data locality should not be a necessity, although it does improve performance. I would prefer not to depend on locality.
- Presto was not designed for large fact-to-fact joins. This is by design: Presto does not leverage disk and uses memory for processing, which in turn makes it fast. However, this is a tradeoff. In an ideal world, people would like to use one system for all their use cases, and Presto should become more comprehensive by solving this problem.
- Resource allocation is not similar to YARN. Presto has a priority-queue-based query resource allocation, so a query that takes long takes even longer. This might be alleviated by giving some control back to the user to define or override priority.
- UDF support is not available in Presto. You have to write your own functions. While this is good for performance, it comes at the huge overhead of building exclusively for Presto and not being interoperable with other systems like Hive, SparkSQL, etc.
Return on Investment
- It has had a very positive impact, as it helps reduce the data processing time and thus helps us achieve our goals much faster.
- Being easy to use, it allows us to adapt to the tool much faster than with others, which in turn allows us to access various data sources such as Hadoop, Apache Mesos, Kubernetes, independently or in the cloud. This makes it very useful.
- It was very easy for me to use Apache Spark and learn it since I come from a background of Java and SQL, and it shares those basic principles and uses a very similar logic.
- Presto has helped scale Uber's interactive data needs. We have migrated a lot out of proprietary tech like Vertica.
- Presto has helped us build data-driven applications on its stack rather than maintain a separate online/offline stack.
- Presto has helped us build data exploration tools by leveraging its power for interactive querying, and it is immensely valuable for data scientists.