TrustRadius Insights for Apache Spark are summaries of user sentiment data from TrustRadius reviews and, when necessary, third-party data sources.
Pros
Great Computing Engine: Apache Spark is praised by many users for its capabilities in handling complex transformation logic and sophisticated data processing tasks. Several reviewers have called it a great computing engine, indicating its effectiveness in solving intricate problems.
Valuable Insights and Analysis: Many reviewers find Apache Spark useful for understanding data and performing data analytical work. They appreciate the valuable insights and analysis capabilities provided by the software, suggesting that it helps them gain a deeper understanding of their data.
Extensive Set of Libraries and APIs: The extensive set of libraries and APIs offered by Apache Spark has been highly appreciated by users. It provides a wide range of tools and functionalities to solve various day-to-day problems, making it a versatile choice for different data processing needs.
We use Apache Spark on a daily basis as the main computation engine for updating our most critical and non-critical data pipelines. We mostly work with batch processing, but there are instances where we use Spark Streaming as well. The scope covers all analysis pipelines, machine learning datasets, and several operational use cases.
Pros
Parallel processing
Configurability
Usage with other tools
Cons
More ready-to-use solutions for tweaking the Apache Spark configs
Reduce the need for PySpark UDFs by implementing more transformations natively
Likelihood to Recommend
Based on my personal experience, Apache Spark is great when you need highly parallelized jobs and have the time and resources to tune the configuration for them. For this reason, I would not recommend it for companies that do not have a strong group of data engineers who can support the other data roles in processing the company's data.
If you are working on large-scale data with analytics, don't go further without Apache Spark! One of the projects I was involved in using Apache Spark was a recommendation systems (RecSys) project; my domain of research expertise is also recommendation systems. Deploying a RecSys with Apache Spark is very easy thanks to functionalities like scalability, flexibility in using various data sources, and fault tolerance. The built-in machine learning library, MLlib, is a boon to work with; we don't require any other libraries.
Pros
Fault-tolerant: if a node fails, processing still continues on the remaining nodes.
Scalable to any extent.
Has a built-in machine learning library, MLlib.
Very flexible: data from various sources can be used, and usage with HDFS is very easy.
Cons
It is not fully backward compatible.
It is memory-consuming for heavy workloads and large datasets.
Support for advanced analytics is limited; MLlib offers only basic analytics.
Deployment is a complex task for beginners.
Likelihood to Recommend
Well suited: for most local runs of datasets and non-prod systems, scalability is not a problem at all. Including data from multiple types of data sources is an added advantage. MLlib is a decent built-in library that can be used for most ML tasks.
Less appropriate: we had to work on a RecSys where the music dataset was around 300+ GB in size. We faced memory issues and occasionally got out-of-memory errors. The MLlib library also lacks support for advanced analytics and deep-learning frameworks. Understanding the internals of how Apache Spark works is very hard for beginners.
Earlier, we were using an RDBMS (Oracle) for retail and eCommerce data. We faced challenges such as cost, performance, and a huge volume of incoming transactions. After a lot of critical issues, we migrated to a delta lake. Now we are using Apache Spark Streaming to deal with all real-time transactions, and for batch data as well we are handling terabytes of data with Apache Spark.
Pros
Real-time data processing
Interactive Analysis of data
Trigger Event Detection
Cons
Machine Learning
GraphX Lib
True real-time streaming
Likelihood to Recommend
Well suited for batch processing, and it provides performance improvements through its optimization techniques. Data streaming is getting better with Apache Spark Structured Streaming. Expect out-of-memory issues and data-skewness problems when data is not properly organized. Integration with BI tools such as Tableau could be better.
We need to calculate risk-weighted assets (RWA) daily and monthly for the different positions the bank holds, on a T+1 basis. The volume of calculations is large: millions of records per day with very complicated formulas and algorithms. In our applications/projects, we used Scala and Apache Spark clusters to load all the data we needed for the calculations and implemented the complicated formulas and algorithms via the DataFrame and Dataset APIs of the Apache Spark platform.
Without adopting the Apache Spark cluster, it would be pretty hard for us to implement such a big system to handle a large volume of data calculations daily. After this system was successfully deployed into PROD, we've been able to provide capital risk control reports to regulation/compliance controllers in different regions worldwide.
Pros
DataFrame as a distributed collection of data: easy for developers to implement algorithms and formulas.
Calculation in-memory.
Clustering to distribute large calculation workloads.
Cons
It would be great if Apache Spark could provide a native database to manage the file metadata of saved Parquet output.
Likelihood to Recommend
For a large volume of data to be calculated, Apache Spark is the go-to; for intermediate or small volumes of data sets, Apache Spark is an option.
We are building a model, and due to the size of the data, we chose to use Apache Spark for the feature generation. Usage of the tool is limited to my department and one other department; we both deal with long datasets, which the other departments do not need.
Pros
quick
utilizes CPU cores
trendy
Cons
lack of support
memory hungry
slow on wide data
Likelihood to Recommend
I would recommend Apache Spark to a colleague if that person is working with a long but narrow dataset. It would be a great tool to help the person fully utilize the CPU cores and speed up the work process. However, I would not recommend this tool if the dataset is wide but not very large.
Verified User
Analyst in Professional Services (10,001+ employees)
We are using Apache Spark in Digital - Data teams to build data products and help business teams make data-driven decisions. We use Apache Spark to source data from different source systems, process it, and store it in the data lake. Once the data is in the data lake, we use Spark for data cleansing and data transformation as per business requirements. Once the data is transformed, we insert it into the final target layer in the data warehouse.
Pros
Spark is very fast compared to other frameworks because it works in cluster mode and uses distributed processing and computation frameworks internally
Robust and fault tolerant
Open source
Can source data from multiple data sources
Cons
No Dataset API support in the Python version of Spark
The Apache Spark job-run UI could show more meaningful information
Spark errors could provide more meaningful information when a job fails
Likelihood to Recommend
Specific scenarios where Apache Spark is well suited:
1. real-time processing of streaming data
2. processing unstructured data, semi-structured data, and structured data from multiple sources
3. avoiding vendor lock-in and cloud-platform lock-in while developing products
Apache Spark is widely used within the company. In the Advanced Analytics department, data engineers and data scientists work closely on machine learning projects to generate value. Spark provides a unified big data analytics engine that helps us easily process huge amounts of data. We are using Spark in projects like churn prediction and network analytics.
Pros
Machine learning on big data
Stream processing
Lakehouse with Delta
Cons
Indexing
Mllib
Streaming
Likelihood to Recommend
Apache Spark is very good for processing large amounts of data, but not as good if you need many joins or low latency. In combination with the Delta engine, performance improved a lot. Especially with ACID support, time-travel features, and a consistent view for simultaneous reads and writes, it's now ready for the next level.
Verified User
Engineer in Information Technology (10,001+ employees)
We use Apache Spark for cluster computing in large-scale data processing, ETL functions, machine learning, as well as analytics. It's primarily used by the Data Engineering department to support the data lake infrastructure. It helps us effectively manage the large amounts of data that come from our clusters, ensuring the capacity, scalability, and performance needed.
Pros
Speed: Apache Spark has great performance for both streaming and batch data.
Easy to use: the object-oriented operators make it easy and intuitive.
Multiple language support
Fault tolerance
Cluster management
Supports DataFrames, Datasets, and RDDs
Cons
Hard to learn, documentation could be more in-depth.
Due to its in-memory processing, it can consume a large amount of memory.
Data visualization is poor and too basic.
Likelihood to Recommend
Well suited for: large datasets, fault tolerance, parallel processing, ETL, batch processing, streaming, analytics, graph processing, or machine learning. Mostly any kind of large-scale processing, since it will save you a lot of time (days of processing). Less appropriate for: smaller datasets, where you are better off using pandas or other libraries.
Verified User
Engineer in Information Technology (11-50 employees)
We were working on one of our products, which required developing an enterprise-level product capable of managing the vast amount of big data involved. We wanted to use a technology that is faster than Hadoop and can process large-scale data while providing a streamlined process for the data scientists. Apache Spark proved to be the powerful unified solution we thought it would be.
The main problem we identified in our existing approach was that it took a large amount of time to process the data, and the statistical analysis of the data was not up to the mark. We wanted a sophisticated analytical solution that was easy and fast to use. Using Apache Spark, processing became 5 times faster than before, giving rise to pretty good analytics. With Spark, data abstraction across a cluster of machines was achieved by using RDDs.
Pros
Spark has a built-in machine learning library which scales and integrates with existing tools.
Cons
The data processing done by Spark comes at the price of memory pressure, as its in-memory processing can lead to high memory consumption.
Caching is not automatic in Spark; we need to set up the caching mechanism manually.
Likelihood to Recommend
1. Suitable where the requirement for advanced analytics is prominent.
2. When you want big data to be processed at a very fast pace.
3. For large datasets, Spark is a viable solution.
4. When you need reliable fault tolerance, go for Spark.
Spark is not suitable:
1. If you want your data to be processed in true real time, then Spark is not a good solution, since it processes streams in micro-batches.
2. When you need automatic optimization, Spark falls short.