TrustRadius Insights for Apache Spark are summaries of user sentiment data from TrustRadius reviews and, when necessary, third-party data sources.
Pros
Great Computing Engine: Apache Spark is praised by many users for its capabilities in handling complex transformation logic and sophisticated data processing tasks. Several reviewers have called it a great computing engine, indicating its effectiveness in solving intricate problems.
Valuable Insights and Analysis: Many reviewers find Apache Spark useful for understanding data and performing data analytical work. They appreciate the valuable insights and analysis capabilities provided by the software, suggesting that it helps them gain a deeper understanding of their data.
Extensive Set of Libraries and APIs: The extensive set of libraries and APIs offered by Apache Spark has been highly appreciated by users. It provides a wide range of tools and functionalities to solve various day-to-day problems, making it a versatile choice for different data processing needs.
We use Apache Spark on a daily basis as the main computation engine for updating our most critical and non-critical data pipelines. We mostly work with batch processing, but there are instances where we use Spark Streaming as well. The scope covers all analysis pipelines, machine learning datasets, and several operational use cases.
Pros
Parallel processing
Configurability
Usage with other tools
Cons
More ready-to-use solutions for tweaking the Apache Spark configs
Reduce the need for PySpark UDFs by implementing more transformations natively
Likelihood to Recommend
Based on my personal experience, Apache Spark is great when you need highly parallelized jobs and have the time and resources to tune the configuration for them. For this reason, I would not recommend it for companies that do not have a strong group of data engineers who can support the other data roles in processing the company's data.
If you are working on large-scale data with analytics, don't go further without Apache Spark! One of the projects I was involved in using Apache Spark was a recommendation systems (RecSys) project; my domain of research expertise is also recommendation systems. Deploying a RecSys with Apache Spark is very easy thanks to functionalities like scalability, flexibility in using various data sources, and fault tolerance. The built-in machine learning library, MLlib, is a boon to work with; we don't require any other libraries.
Pros
Fault-tolerant: if a node fails, processing still continues on the remaining nodes.
Scalable to any extent.
Has a built-in machine learning library, MLlib.
Very flexible: data from various sources can be used, and usage with HDFS is very easy.
Cons
It is not fully backward compatible.
It is memory-consuming for heavy workloads and large datasets.
Support for advanced analytics is limited; MLlib offers only basic analytics.
Deployment is a complex task for beginners.
Likelihood to Recommend
Well suited: for most local runs of datasets and non-prod systems, scalability is not a problem at all. Including data from multiple types of data sources is an added advantage. MLlib is a decent built-in library that can be used for most ML tasks.
Less appropriate: we had to work on a RecSys where the music dataset was around 300+ GB in size. We faced memory issues and occasionally got out-of-memory errors. The MLlib library also lacks support for advanced analytics and deep-learning frameworks. Understanding the internals of how Apache Spark works is very hard for beginners.
Earlier, we were using an RDBMS (Oracle) for retail and eCommerce data. We faced challenges such as cost, performance, and a huge volume of incoming transactions. After a lot of critical issues, we migrated to a delta lake. Now we are using Apache Spark Streaming to deal with all real-time transactions, and for batch data as well we are handling terabytes of data with Apache Spark.
Pros
Real-time data processing
Interactive Analysis of data
Trigger Event Detection
Cons
Machine Learning
GraphX Lib
True real-time streaming
Likelihood to Recommend
Well suited for batch processing, and it provides performance improvements through its optimization techniques. Data streaming is getting better with Apache Spark Structured Streaming. Expect out-of-memory issues and data-skewness problems when data is not properly organized. Integration with BI tools such as Tableau could be better.
We need to calculate risk-weighted assets (RWA) daily and monthly for the different positions the bank holds, on a T+1 basis. The volume of calculations is large: millions of records per day with very complicated formulas and algorithms. In our applications/projects, we used Scala and Apache Spark clusters to load all the data we needed for the calculations and implemented the complicated formulas and algorithms via the DataFrame and Dataset APIs of the Apache Spark platform.
Without adopting the Apache Spark cluster, it would be pretty hard for us to implement such a big system to handle a large volume of data calculations daily. After this system was successfully deployed into PROD, we've been able to provide capital risk control reports to regulation/compliance controllers in different regions worldwide.
Pros
DataFrame as a distributed collection of data: easy for developers to implement algorithms and formulas.
Calculation in-memory.
Clustering to distribute large calculation workloads.
Cons
It would be great if Apache Spark could provide a native database to manage the file metadata of saved Parquet output.
Likelihood to Recommend
For a large volume of data to be calculated, Apache Spark is the go-to; for intermediate or small volumes of data sets, Apache Spark is an option.
We are building a model, and due to the size of the data, we chose to use Apache Spark for the feature generation. Usage of the tool is limited to my department and one other department; we both deal with long datasets, which the other departments do not need.
Pros
quick
utilizes CPU cores
trendy
Cons
lack of support
memory hungry
slow on wide data
Likelihood to Recommend
I would recommend Apache Spark to a colleague if that person is working with a long but narrow dataset. It would be a great tool to help the person fully utilize the CPU cores and speed up the work process. However, I would not recommend this tool if the dataset is wide but not very large.
Verified User
Analyst in Professional Services (10,001+ employees)
We are using Apache Spark in Digital - Data teams to build data products and help business teams make data-driven decisions. We use Apache Spark to source data from different source systems, process it, and store it in the data lake. Once the data is in the data lake, we use Spark for data cleansing and data transformation as per business requirements. Once the data is transformed, we insert it into the final target layer in the data warehouse.
Pros
Spark is very fast compared to other frameworks because it works in cluster mode and uses distributed processing and computation frameworks internally
Robust and fault tolerant
Open source
Can source data from multiple data sources
Cons
No Dataset API support in the Python version of Spark
The Apache Spark job-run UI could show more meaningful information
Spark errors could provide more meaningful information when a job fails
Likelihood to Recommend
Specific scenarios where Apache Spark is well suited:
1. real-time processing of streaming data
2. processing unstructured data, semi-structured data, and structured data from multiple sources
3. avoiding vendor lock-in and cloud-platform lock-in while developing products
Apache Spark is widely used within the company. In the Advanced Analytics department, data engineers and data scientists work closely on machine learning projects to generate value. Spark provides a unified big data analytics engine that helps us easily process huge amounts of data. We are using Spark in projects like churn prediction and network analytics.
Pros
Machine learning on big data
Stream processing
Lakehouse with Delta
Cons
Indexing
Mllib
Streaming
Likelihood to Recommend
Apache Spark is very good for processing large amounts of data, but not as good if you need many joins or low latency. In combination with the Delta engine, performance improved a lot. Especially with ACID support, time-travel features, and a consistent view for simultaneous reads and writes, it's now ready for the next level.
Verified User
Engineer in Information Technology (10,001+ employees)
We use Apache Spark for cluster computing in large-scale data processing, ETL functions, machine learning, as well as analytics. It's primarily used by the Data Engineering department to support the data lake infrastructure. It helps us effectively manage the large amounts of data that come from our clusters, ensuring the capacity, scalability, and performance needed.
Pros
Speed: Apache Spark has great performance for both streaming and batch data.
Easy to use: the object-oriented operators make it easy and intuitive.
Multiple language support
Fault tolerance
Cluster management
Supports DataFrames, Datasets, and RDDs
Cons
Hard to learn, documentation could be more in-depth.
Due to its in-memory processing, it can consume a large amount of memory.
Data visualization is poor and too basic.
Likelihood to Recommend
Well suited for: large datasets, fault tolerance, parallel processing, ETL, batch processing, streaming, analytics, graph processing, or machine learning. Mostly any kind of large-scale processing, since it will save you a lot of time (days of processing). Less appropriate for: smaller datasets, where you are better off using pandas or other libraries.
Verified User
Engineer in Information Technology (11-50 employees)
We were working on one of our products, which required developing an enterprise-level product capable of managing the vast amount of big data involved. We wanted to use a technology that is faster than Hadoop and can process large-scale data while providing a streamlined process for the data scientists. Apache Spark proved to be the powerful unified solution we thought it would be.
The main problem we identified in our existing approach was that it took a large amount of time to process the data, and the statistical analysis of the data was not up to the mark. We wanted a sophisticated analytical solution that was easy and fast to use. Using Apache Spark, processing became 5 times faster than before, giving rise to pretty good analytics. With Spark, data abstraction across a cluster of machines was achieved by using RDDs.
Pros
Spark has a built-in machine learning library which scales and integrates with existing tools.
Cons
The data processing done by Spark comes at the price of memory pressure, as its in-memory processing can lead to high memory consumption.
Caching is not automatic in Spark; we need to set up the caching mechanism manually.
Likelihood to Recommend
1. Suitable where the requirement for advanced analytics is prominent.
2. When you want big data to be processed at a very fast pace.
3. For large datasets, Spark is a viable solution.
4. When you need reliable fault tolerance, go for Spark.
Spark is not suitable:
1. If you want your data to be processed in true real time, then Spark is not a good solution, since it processes streams in micro-batches.
2. When you need automatic optimization, Spark falls short.