Name: Apache Spark
Rating: 8.7 (159 reviews)
Author: Apache

Overview

Recent Reviews

TrustRadius Insights

December 15, 2023

Apache Spark is an incredibly versatile tool that has been widely adopted across various departments for processing very large datasets …

Apache Spark: Lightning-Fast Distributed Computing with a Learning Curve

10 out of 10

August 18, 2023

Incentivized

If you are working on large and big scale data with analytics - don't go further without the use of Apache Spark! One of the projects that …

Lightning Fast In-Memory Cluster Computing Framework

10 out of 10

August 30, 2022

Earlier we were using RDBMS like Oracle for retail and eCommerce data. We faced challenges such as cost, performance, and a huge amount of …

Apache Spark is the next generation of big data computing.

9 out of 10

April 18, 2022

We need to calculate risk-weighted assets (RWA) daily and monthly for different positions the bank holds on a T+1 basis. The volume of …

Apache Spark in Telco

10 out of 10

July 22, 2021

Incentivized

Apache Spark is being widely used within the company. In Advanced Analytics department data engineers and data scientists work closely in …

Spark is useful, but requires lots of very valuable questions to justify the effort, and be prepared for failure in answering posed questions

9 out of 10

July 04, 2021

Incentivized

Apache Spark is used by certain departments to produce summary statistics. The software is used for data sets that are very, very large in …

good solution for long and narrow data

9 out of 10

May 20, 2021

Incentivized

We are building a model and due to the size of the data, we chose to use Apache Spark for the feature generation. The usage of the tool is …

Apache Spark - your go to technology for distributed data processing

9 out of 10

May 03, 2021

Incentivized

We are using Apache Spark in Digital - Data teams to build data products and help business teams to take data-driven decisions.
We use …

Epic Computation Engine Framework

9 out of 10

November 08, 2020

Apache Spark is being used by our organization for writing ETL applications. It enables us to ingest thousands of records of data to …

A powerhouse processing engine.

9 out of 10

September 19, 2020

Incentivized

We use Apache Spark for cluster computing in large-scale data processing, ETL functions, machine learning, as well as for analytics. Its …

Apache Spark -- The best big data solution

8 out of 10

January 12, 2020

We were working for one of our products, which has a requirement for developing an enterprise-level product catering to manage a vast …

Great open source tool for data processing

9 out of 10

December 13, 2019

Incentivized

We do use Apache Spark for cluster computing for our ETL environment, data and analytics as well as machine learning. It is mainly used by …

Want to save dollars, resources and time processing big data, switch to Apache Spark

9 out of 10

March 27, 2019

Incentivized

We sold a data science product to one of the leading US-based e-commerce firms. Suddenly, their data started growing at a very fast rate. …

Apache Spark Review

7 out of 10

March 16, 2019

Incentivized

We used Apache Spark within our department as a Solution Architecture team. It helped make big data processing more efficient since the …

Apache Spark - de facto for big data processing/analytics

9 out of 10

December 14, 2018

Incentivized

Used as the in memory data engine for big data analytics, streaming data and SQL workloads. Also, in the process of trying it out for …

Very useful application for Big Data processing and excellent for large volume production workflows

10 out of 10

August 28, 2018

Incentivized

Apache Spark is being used by the whole organization. It helps us a lot in the transmission of data, as it is 100 times faster than Hadoop …

Reviewer Pros & Cons

Fault-tolerant systems: in most cases no node fails, and if one does, processing still continues.
Support for advanced analytics is limited: MLlib offers only minimal analytics.

Ananth Gouri

Assistant Professor

The National Institute of Engineering, Mysuru (Education Management, 501-1000 employees)

Product Demos

Spark Project | Spark Tutorial | Online Spark Training | Intellipaat
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn
Apache Spark Full Course | Apache Spark Tutorial For Beginners | Learn Spark In 7 Hours | Simplilearn
Apache Spark Architecture | Spark Cluster Architecture Explained | Spark Training | Edureka
Introduction to Databricks [New demo linked in description]
Apache Spark Tutorial | Spark Tutorial for Beginners | Spark Big Data | Intellipaat

Product Details

Apache Spark Technical Details

Operating Systems: Unspecified
Mobile Application: No

Reviews and Ratings

(159)

January 31st 2024

Community Insights

TrustRadius Insights are summaries of user sentiment data from TrustRadius reviews and, when necessary, 3rd-party data sources.

Business Problems Solved

Apache Spark is an incredibly versatile tool that has been widely adopted across various departments for processing very large datasets and generating summary statistics. Users have found it particularly useful for creating simple graphics when working with big data, making it a valuable asset for analytics departments. It is also used extensively in the banking industry to calculate risk-weighted assets on a daily and monthly basis for different positions. The integration of Apache Spark with Scala and Apache Spark clusters enables users to load and process large volumes of data, implementing complex formulas and algorithms. Additionally, Apache Spark is often utilized alongside Kafka and Spark Streams to extract data from Kafka queues into HDFS environments, allowing for streamlined data analysis and processing.

One of the key strengths of Apache Spark lies in its ability to handle large volumes of retail and eCommerce data, providing cost and performance benefits over traditional RDBMS solutions. This makes it a preferred choice for companies in these industries. Furthermore, Apache Spark plays a crucial role in supporting data-driven decision-making by digital data teams. Its capabilities allow these teams to build data products, source data from different systems, process and transform it, and store it in data lakes.

Apache Spark is highly regarded for its ability to perform data cleansing and transformation before inserting it into the final target layer in data warehouses. This makes it a vital tool for ensuring the accuracy and reliability of data. Its faster data processing capabilities compared to Hadoop MapReduce have made Apache Spark a go-to choice for tasks such as machine learning, analytics, batch processing, data ingestion, and report development. Moreover, educational institutions rely on Apache Spark to optimize scheduling by assigning classrooms based on student course enrollment and professor schedules.

Overall, Apache Spark proves itself as an indispensable product that meets the needs of various industries by offering efficient distributed data processing, advanced analytics capabilities, and seamless integration with other technologies. Its versatility allows it to support a wide range of use cases, making it an essential tool for anyone working with big data.

Pros

Great Computing Engine: Apache Spark is praised by many users for its capabilities in handling complex transformative logic and sophisticated data processing tasks. Several reviewers have mentioned that it is a great computing engine, indicating its effectiveness in solving intricate problems.

Valuable Insights and Analysis: Many reviewers find Apache Spark to be useful for understanding data and performing data analytical work. They appreciate the valuable insights and analysis capabilities provided by the software, suggesting that it helps them gain deeper understanding of their data.

Extensive Set of Libraries and APIs: The extensive set of libraries and APIs offered by Apache Spark has been highly appreciated by users. It provides a wide range of tools and functionalities to solve various day-to-day problems, making it a versatile choice for different data processing needs.

Cons

Challenging to Understand and Use: Some users have found Apache Spark to be challenging to understand and use for modeling big data. They struggle with the complexity of the software, leading to a high learning curve.

Lack of User-Friendliness: The software is considered not user-friendly, with a confusing user interface and graphics that are not of high quality. This has resulted in frustration among some users who find it difficult to navigate and work with.

Time-Consuming Processing: Apache Spark can be time-consuming when processing large data sets across multiple nodes. This has been reported by several users who have experienced delays in their data processing tasks, affecting overall efficiency.

Recommendations

When using Spark for big data tasks, users commonly recommend familiarizing yourself with the documentation and gaining experience. They emphasize investing time in reading and understanding the documentation to overcome any initial challenges. As users gain experience, they find working with Spark becomes easier and more efficient.

Users also suggest utilizing Spark specifically for true big data problems, where its capabilities and performance shine. They highlight that Spark is well-suited for tackling large-scale data processing tasks.

Additionally, users find value in leveraging the R and Python APIs in Spark. These APIs allow them to work with Spark using familiar programming languages such as R and Python, making it easier to analyze and process data.

Overall, users advise diving into the documentation, utilizing Spark for big data challenges, and leveraging the R and Python APIs to enhance their experience with Spark.

Reviews

(1-4 of 4)

November 08, 2020

Epic Computation Engine Framework

Chetan Munegowda

Software Engineer

SemanticBits (Information Technology and Services, 201-500 employees)

Score 9 out of 10

Vetted Review

Verified User

Use Cases and Deployment Scope

Apache Spark is being used by our organization for writing ETL applications. It enables us to ingest thousands of records of data into database tables.

Pros and Cons

Great computing engine for solving complex transformative logic
Useful for understanding data and doing data analytical work
Gives us a great set of libraries and APIs to solve day-to-day problems

High learning curve
Complexity
Needs more documentation
Needs more developer support
Needs more educational videos

Likelihood to Recommend

Apache Spark is suited for big data applications when there is a need for performing analysis, streaming data work, and ETL work.

Return on Investment

Saves a lot of time
Very powerful
Automates lots of manual work
Higher depth of knowledge is required to understand and perform analysis

Alternatives Considered

Apache Hadoop

Spark is simply awesome to work with on any data set, and its in-memory processing makes it very flexible.

Usability

I have been using this for my ETL application. It gives me all the necessary APIs to work with and meets my business objectives.

Support Rating

Developer support for Apache Spark can be improved. We need more of a developer community around this considering it's an emerging technology.

Other Software Used

AWS Glue, Amazon Athena

September 19, 2020

A powerhouse processing engine.

Verified User

Engineer in Information Technology

Information Technology & Services Company, 11-50 employees

Score 9 out of 10

Vetted Review

Verified User

Incentivized

Use Cases and Deployment Scope

We use Apache Spark for cluster computing in large-scale data processing, ETL functions, machine learning, and analytics. It is primarily used by the Data Engineering Department to support the data lake infrastructure. It helps us effectively manage the large amounts of data that come from our clusters, ensuring the capacity, scalability, and performance we need.

Pros and Cons

Speed: Apache Spark has great performance for both streaming and batch data
Easy to use: the object-oriented operators make it easy and intuitive.
Multiple language support
Fault tolerance
Cluster management
Supports DataFrames (DF), Datasets (DS), and RDDs

Hard to learn, documentation could be more in-depth.
Due to its in-memory processing, it can consume a large amount of memory.
Poor data visualization, too basic.

Likelihood to Recommend

Well suited for: large datasets, fault tolerance, parallel processing, ETL, batch processing, streaming, analytics, graphing, or machine learning. It fits almost any kind of large-scale processing, since it can save you a lot of time (days of processing).
Less appropriate for: smaller datasets, where you are better off using pandas or other libraries.

Return on Investment

Saved time and resources for the company because of its agility
High performance data processing.

Support Rating

Never had to contact them. However, they offer 24/7 support, and there are a large number of forums about Spark. It is well integrated with Python and supports SQL syntax.

Usability

The only thing I dislike about Spark's usability is the learning curve, since there are many actions and transformations to learn. However, its wide range of uses for ETL processing, its ease of integration, and its multi-language support make this library a powerhouse for your data science solutions. It has especially helped us with its lightning-fast processing times.

Other Software Used

Hadoop, Apache Kafka

January 12, 2020

Apache Spark -- The best big data solution

Yogesh Mhasde

Technical Manager

Rishabh Software Private Limited (Information Technology & Services, 501-1000 employees)

Score 8 out of 10

Vetted Review

Verified User

Use Cases and Deployment Scope

We were working on one of our products, which required developing an enterprise-level solution to manage a vast amount of big data. We wanted a technology that is faster than Hadoop and can process large-scale data while providing a streamlined process for the data scientists. Apache Spark proved to be a powerful unified solution, as we had expected.
The main problem that we identified in our existing approach was that it took a large amount of time to process the data, and the statistical analysis of the data was not up to the mark. We wanted a sophisticated analytical solution that was easy and fast to use. With Apache Spark, processing became 5 times faster than before, which gave us much better analytics. Across the cluster of machines, Spark's data abstraction was provided by RDDs.
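
To make the idea of a dataset abstraction spread across machines a little more concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API and not the reviewer's code: the helper names (`split_into_partitions`, `map_partitions`, `reduce_partitions`) and the sample prices are hypothetical, and each "partition" is just a list standing in for the slice of data one machine would hold.

```python
from functools import reduce
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def split_into_partitions(records: List[T], num_partitions: int) -> List[List[T]]:
    """Distribute records round-robin across a fixed number of partitions."""
    partitions: List[List[T]] = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def map_partitions(partitions: List[List[T]], fn: Callable[[T], U]) -> List[List[U]]:
    """Apply fn to every record, one partition at a time.

    In a real cluster each partition could be processed on a different machine.
    """
    return [[fn(record) for record in part] for part in partitions]


def reduce_partitions(partitions: List[List[T]], combine: Callable[[T, T], T]) -> T:
    """Reduce each non-empty partition locally, then combine the partial results."""
    partials = [reduce(combine, part) for part in partitions if part]
    return reduce(combine, partials)


# Hypothetical sample data: five prices spread over three partitions.
prices = [12.5, 7.0, 3.25, 20.0, 1.25]
parts = split_into_partitions(prices, num_partitions=3)
doubled = map_partitions(parts, lambda price: price * 2)
total = reduce_partitions(doubled, lambda a, b: a + b)
```

Because each partition is reduced on its own before the partial results are merged, the `combine` function should be associative and commutative, otherwise the answer would depend on how the records happened to be split.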

Pros and Cons

DataFrames, Datasets, and RDDs.
Spark has a built-in machine learning library that scales and integrates with existing tools.

Spark's data processing comes at the price of memory bottlenecks, since in-memory processing can consume a large amount of memory.
Caching is not applied automatically in Spark. We need to set up the caching mechanism manually.
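
The point about setting up caching by hand can be illustrated with a small plain-Python sketch. This is a toy model under stated assumptions, not Spark's implementation: `LazyDataset` is a hypothetical class whose method names merely echo familiar dataset operations, and `expensive` is a stand-in for a costly computation. The sketch shows the general pattern in which transformations are only recorded, each action recomputes them, and an explicit `cache()` request is what keeps the first result for reuse.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterable[int]]):
        self._compute = compute
        self._cached: Optional[List[int]] = None
        self._cache_requested = False

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Transformation: only records the step; no work happens here.
        return LazyDataset(lambda: (fn(x) for x in self._materialize()))

    def cache(self) -> "LazyDataset":
        # Marks the dataset for reuse; the result is stored on the first action.
        self._cache_requested = True
        return self

    def _materialize(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        result = list(self._compute())
        if self._cache_requested:
            self._cached = result
        return result

    def count(self) -> int:
        # Action: triggers the computation.
        return len(self._materialize())

    def collect(self) -> List[int]:
        # Action: triggers the computation and returns the records.
        return list(self._materialize())


calls = {"n": 0}


def expensive(x: int) -> int:
    """Stand-in for a costly computation; counts how often it runs."""
    calls["n"] += 1
    return x * x


source = LazyDataset(lambda: range(4))

# Without cache(): both actions recompute the mapped values.
squared = source.map(expensive)
calls_before_action = calls["n"]
squared.count()
squared.collect()
uncached_calls = calls["n"]

# With cache(): the second action reuses the stored result.
calls["n"] = 0
cached = source.map(expensive).cache()
cached.count()
cached.collect()
cached_calls = calls["n"]
```

In this toy run the uncached pipeline evaluates `expensive` once per record for every action, while the cached one evaluates it only during the first action. The trade-off matches the memory concern raised above: the cached result has to be held somewhere until it is no longer needed.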

Likelihood to Recommend

1. Suitable where the requirement for advanced analytics is prominent.
2. When you want big data to be processed at a very fast pace.
3. For large datasets, Spark is a viable solution.
4. When you need reliable fault tolerance, go for Spark.

Spark is not suitable:
1. If you want your data to be processed in real time, Spark is not a good solution.
2. If you need automatic optimization, Spark falls short.

Return on Investment

ROI increased by a considerable percentage after adopting Apache Spark.
Apache Spark provided the agility to support multiple applications.

Alternatives Considered

Hadoop and Amazon EMR (Elastic MapReduce)

1. Apache Spark is almost 100% faster than Hadoop.
2. Apache Spark is more stable than Amazon EMR.
3. The end-to-end distributed machine learning library is more robust in Apache Spark.
4. For very large data sets, Apache Spark is more trustworthy than the other two.
5. For data transformations, Apache Spark provides a very rich set of APIs.
6. The interface provided for SQL in Apache Spark is easy to understand as compared to others.

Support Rating

1. It integrates very well with Scala or Python.
2. SQL interoperability is very easy to understand.
3. Apache Spark is much faster than competing technologies.
4. The support from the Apache community for Spark is very strong.
5. Execution times are faster compared to others.
6. There are a large number of forums available for Apache Spark.
7. The code for Apache Spark is readily available and easy to access.
8. Many organizations use Apache Spark, so many solutions are available for existing applications.

Other Software Used

Apache Camel, Azure Bot Service (Microsoft Bot Framework), Apache Kafka

December 13, 2019

Great open source tool for data processing

Verified User

Executive in Information Technology

Consumer Goods Company, 10,001+ employees

Score 9 out of 10

Vetted Review

Verified User

Incentivized

Use Cases and Deployment Scope

We use Apache Spark for cluster computing for our ETL environment, data and analytics, as well as machine learning. It is mainly used by our data engineering team to support the entire Data Lake foundation. As we have huge amounts of information coming from multiple sources, we needed an effective cluster management system to handle capacity and deliver the performance and throughput we needed.

Pros and Cons

Cluster management for ETL.
Data processing engine for our data lake.

You still need Hive, HDFS, or another storage layer to store information.
Security lags behind MapReduce.

Likelihood to Recommend

Spark is a one-size-fits-all data processing platform. You can run batch and in-motion streams, and you can use it for ETL, machine learning, or even graphs. You do not have multiple tools, so it makes your TCO and management tasks much easier. Like every new platform, it has room to grow: storage and security are the main opportunities we found.

Return on Investment

Simplified our landscape.
Drove great performance for data processing.

Alternatives Considered

Databricks Unified Analytics Platform

Databricks uses Spark as a foundation and is also a great platform. It brings several add-ons, which we did not feel we needed at the time we evaluated it, and we haven't needed them since. One interesting plus in our opinion was the engineering support, which is great depending on the criticality of your platform.

Support Rating

As with every open source tool, you have to rely on forums, consulting companies, and your own engineering power for support and maintenance. There is plenty of documentation available, so you will be in good hands. You can also find small and mid-size consulting companies that can support your environment at a decent cost. Another alternative is going to Databricks, if support is a key criterion for your decision.

Other Software Used

SAP BW/4HANA, SAP HANA, Snowflake
