TrustRadius
Apache Spark

Overview

Recent Reviews


Apache Spark in Telco

10 out of 10
July 22, 2021
Incentivized
Apache Spark is being widely used within the company. In Advanced Analytics department data engineers and data scientists work closely in …

Apache Spark Review

7 out of 10
March 16, 2019
Incentivized
We used Apache Spark within our department as a Solution Architecture team. It helped make big data processing more efficient since the …

Product Demos

  • Spark Project | Spark Tutorial | Online Spark Training | Intellipaat (YouTube)
  • Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn (YouTube)
  • Apache Spark Full Course | Apache Spark Tutorial For Beginners | Learn Spark In 7 Hours | Simplilearn (YouTube)
  • Apache Spark Architecture | Spark Cluster Architecture Explained | Spark Training | Edureka (YouTube)
  • Introduction to Databricks [New demo linked in description] (YouTube)
  • Apache Spark Tutorial | Spark Tutorial for Beginners | Spark Big Data | Intellipaat (YouTube)

Product Details

What is Apache Spark?

Apache Spark Technical Details

Operating Systems: Unspecified
Mobile Application: No


Reviews and Ratings

(159)

Community Insights

TrustRadius Insights are summaries of user sentiment data from TrustRadius reviews and, when necessary, 3rd-party data sources.

Apache Spark is a versatile tool that has been widely adopted across various departments for processing very large datasets and generating summary statistics. Users have found it particularly useful for creating simple graphics when working with big data, making it a valuable asset for analytics departments. It is also used extensively in the banking industry to calculate risk-weighted assets on a daily and monthly basis for different positions. Running Scala code on Apache Spark clusters lets users load and process large volumes of data and implement complex formulas and algorithms. Additionally, Apache Spark is often used alongside Kafka, with Spark Streaming extracting data from Kafka queues into HDFS environments for further analysis and processing.
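
The Kafka-to-HDFS pattern described above can be sketched with Spark's Structured Streaming API in Python. This is a minimal sketch under stated assumptions, not the reviewers' actual pipeline: the helper names, broker address, topic, and paths are hypothetical, `spark` is assumed to be an existing SparkSession, and the Kafka source assumes the Spark Kafka connector package is available to the cluster.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source (the values passed in are hypothetical)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def start_kafka_to_hdfs(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Start a streaming query that copies Kafka messages into Parquet files.

    `spark` is assumed to be an existing SparkSession.
    """
    raw = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # The Kafka source exposes key and value as binary, so cast them to strings.
    messages = raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
    return (
        messages.writeStream
        .format("parquet")
        .option("path", output_path)                    # e.g. an hdfs:// directory
        .option("checkpointLocation", checkpoint_path)  # needed by the file sink
        .outputMode("append")
        .start()
    )
```

A call such as `start_kafka_to_hdfs(spark, "broker:9092", "events", "hdfs:///data/events", "hdfs:///checkpoints/events")` would return a running streaming query; every argument value there is a placeholder.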

One of the key strengths of Apache Spark lies in its ability to handle large volumes of retail and eCommerce data, providing cost and performance benefits over traditional RDBMS solutions. This makes it a preferred choice for companies in these industries. Furthermore, Apache Spark plays a crucial role in supporting data-driven decision-making by digital data teams. Its capabilities allow these teams to build data products, source data from different systems, process and transform it, and store it in data lakes.

Apache Spark is highly regarded for its ability to cleanse and transform data before loading it into the final target layer of a data warehouse, which helps keep that data accurate and reliable. Because it processes data faster than Hadoop MapReduce, Apache Spark has become a go-to choice for tasks such as machine learning, analytics, batch processing, data ingestion, and report development. Moreover, educational institutions rely on Apache Spark to optimize scheduling by assigning classrooms based on student course enrollment and professor schedules.
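
As an illustration of the cleanse-then-load step mentioned above, here is a minimal PySpark sketch. The column names (`order_id`, `customer`, `amount`), the helper names, and the target path are all hypothetical, and `df` is assumed to be an existing Spark DataFrame; a real warehouse load would depend on the actual schema and target system.

```python
# Hypothetical column names; adjust to the real schema.
KEY_COLUMNS = ["order_id"]
REQUIRED_COLUMNS = ["order_id", "amount"]


def cleansing_projection():
    """SQL expressions that normalise each column (illustrative only)."""
    return [
        "order_id",
        "trim(customer) AS customer",
        "CAST(amount AS DOUBLE) AS amount",
    ]


def clean_and_load(df, target_path):
    """Deduplicate, drop incomplete rows, normalise types, then append to the target.

    `df` is assumed to be a Spark DataFrame with the columns named above.
    """
    cleaned = (
        df.dropDuplicates(KEY_COLUMNS)          # one row per order_id
        .na.drop(subset=REQUIRED_COLUMNS)       # remove rows missing required values
        .selectExpr(*cleansing_projection())    # trim strings, cast amount to double
        .filter("amount >= 0")                  # also drops amounts that failed the cast
    )
    cleaned.write.mode("append").parquet(target_path)
    return cleaned
```

Expressing the transformations as SQL strings keeps the sketch free of extra imports; the same steps could equally be written with column functions from `pyspark.sql.functions`.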

Overall, Apache Spark proves itself as an indispensable product that meets the needs of various industries by offering efficient distributed data processing, advanced analytics capabilities, and seamless integration with other technologies. Its versatility allows it to support a wide range of use cases, making it an essential tool for anyone working with big data.

Great Computing Engine: Apache Spark is praised by many users for its capabilities in handling complex transformative logic and sophisticated data processing tasks. Several reviewers have mentioned that it is a great computing engine, indicating its effectiveness in solving intricate problems.

Valuable Insights and Analysis: Many reviewers find Apache Spark to be useful for understanding data and performing data analytical work. They appreciate the valuable insights and analysis capabilities provided by the software, suggesting that it helps them gain deeper understanding of their data.

Extensive Set of Libraries and APIs: The extensive set of libraries and APIs offered by Apache Spark has been highly appreciated by users. It provides a wide range of tools and functionalities to solve various day-to-day problems, making it a versatile choice for different data processing needs.

Challenging to Understand and Use: Some users have found Apache Spark to be challenging to understand and use for modeling big data. They struggle with the complexity of the software, leading to a high learning curve.

Lack of User-Friendliness: The software is considered not user-friendly, with a confusing user interface and graphics that are not of high quality. This has resulted in frustration among some users who find it difficult to navigate and work with.

Time-Consuming Processing: Apache Spark can be time-consuming when processing large data sets across multiple nodes. This has been reported by several users who have experienced delays in their data processing tasks, affecting overall efficiency.

When using Spark for big data tasks, users commonly recommend familiarizing yourself with the documentation and gaining experience. They emphasize investing time in reading and understanding the documentation to overcome any initial challenges. As users gain experience, they find working with Spark becomes easier and more efficient.

Users also suggest utilizing Spark specifically for true big data problems, where its capabilities and performance shine. They highlight that Spark is well-suited for tackling large-scale data processing tasks.

Additionally, users find value in leveraging the R and Python APIs in Spark. These APIs allow them to work with Spark using familiar programming languages such as R and Python, making it easier to analyze and process data.

Overall, users advise diving into the documentation, utilizing Spark for big data challenges, and leveraging the R and Python APIs to enhance their experience with Spark.
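
To give a sense of what working through the Python API looks like, here is a minimal sketch of generating summary statistics with PySpark. The function names are hypothetical, and `df` is assumed to be an existing Spark DataFrame containing the named columns.

```python
def aggregation_spec(value_col):
    """Dict form accepted by GroupedData.agg: column name -> aggregate function."""
    return {value_col: "avg"}


def summarize(df, group_col, value_col):
    """Return overall and per-group summaries of `value_col`.

    `df` is assumed to be a Spark DataFrame.
    """
    # describe() reports count, mean, stddev, min, and max for the column.
    overall = df.describe(value_col)
    # Average of value_col within each group.
    per_group = df.groupBy(group_col).agg(aggregation_spec(value_col))
    return overall, per_group
```

Both results are themselves DataFrames, so nothing is computed until an action such as `show()` or `collect()` is called on them.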

Reviews

(1-3 of 3)
Ananth Gouri | TrustRadius Reviewer
Score 10 out of 10
Vetted Review
Verified User
Incentivized
If you are working on large-scale data and analytics, don't go any further without Apache Spark! One of the projects in which I used Apache Spark was a recommendation system (RecSys), which is also my area of research expertise. Deploying a RecSys with Apache Spark is very easy thanks to features like scalability, the flexibility to use various data sources, and fault tolerance. The built-in machine learning library, MLlib, is a boon to work with. We didn't require any other libraries.
  • Fault-tolerant systems: in most cases, no node fails. If one does fail, processing still continues.
  • Scalable to any extent.
  • Has a built-in machine learning library called MLlib.
  • Very flexible: data from various data sources can be used. Usage with HDFS is very easy.
  • It's not fully backward compatible.
  • It is memory-consuming for heavy and large workloads and datasets.
  • Support for advanced analytics is not available; MLlib has minimalistic analytics.
  • Deployment is a complex task for beginners.
Well suited: For most local runs on datasets and for non-prod systems, scalability is not a problem at all. Being able to include data from multiple types of data sources is an added advantage. MLlib is a decent built-in library that can be used for most ML tasks.

Less appropriate: We had to work on a RecSys where the music dataset we used was over 300 GB in size. We faced memory issues, and a few times we also got memory errors. Also, MLlib does not support advanced analytics or deep-learning frameworks. Understanding the internals of Apache Spark is very difficult for beginners.
  • Scalability
  • We had data across multiple sources. Integration with those data source types was not a problem
  • Generation of recommendations was achievable easily
  • We used Apache Spark for one of our research projects. The ROI cannot be measured here, but the research paper was accepted to a good conference. What else would a project require?!
We used Surprise Kit for one of our other research projects. It is more finely tuned to recommendation systems and their algorithms. Apache Spark has MLlib for the majority of ML problems, whereas software like Surprise Kit is suitable only for the specific task of recommendations.
Once we learn the installation process and procedure, deploying Apache Spark for a prod-based system should not be a difficult task. Unless we want to learn the internals of software like Apache Spark, using it for high-level work through the API should not be a big deal. Also, with the amount of support available, we could get easy configuration-based solutions to a few of the errors. Their overall support is amazing.
  • Usage of libraries
  • Usage of HDFS in particular
  • Basic analysis of data is possible
  • Understanding internals of the product
  • Changing data sources was kind of complex
  • Integration of other ML libraries is not so user friendly
No
Chetan Munegowda | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
Apache Spark is being used by our organization for writing ETL applications. It enables us to ingest thousands of records of data to database tables.
  • Great computing engine for solving complex transformative logic
  • Useful for understanding data and doing data analytical work
  • Gives us a great set of libraries and APIs to solve day-to-day problems
  • High learning curve
  • Complexity
  • More documentation
  • More developer support
  • More educational videos
Apache Spark is suited for big data applications when there is a need for performing analysis, streaming data work, and ETL work.
  • Saves a lot of time
  • Very powerful
  • Automates lots of manual work
  • Higher depth of knowledge is required to understand and perform analysis
Spark is simply awesome to work on with any data sets and also has an in-memory database which makes it very flexible.
I have been using this for my ETL application. It gives me all the necessary APIs to work with and meets my business objective.
Developer support for Apache Spark can be improved. We need more of a developer community around this considering it's an emerging technology.
Score 9 out of 10
Vetted Review
Verified User
Incentivized
We use Apache Spark for cluster computing in large-scale data processing, ETL functions, machine learning, and analytics. It is primarily used by the Data Engineering Department to support the data lake infrastructure. It helps us effectively manage the large amounts of data that come from our clusters, ensuring the capacity, scalability, and performance needed.
  • Speed: Apache Spark has great performance for both streaming and batch data
  • Easy to use: the object oriented operators make it easy and intuitive.
  • Multiple language support
  • Fault tolerance
  • Cluster management
  • Supports DF, DS, and RDDs
  • Hard to learn, documentation could be more in-depth.
  • Due to its in-memory processing, it can consume a large amount of memory.
  • Poor data visualization, too basic.
Well suited for: large datasets, fault tolerance, parallel processing, ETL, batch processing, streaming, analytics, graphing, or machine learning. Almost any kind of large-scale processing, since it will save you a lot of time (days of processing). Less appropriate for: smaller datasets, where you are better off using pandas or other libraries.
  • Saved time and resources for the company because of its agility
  • High performance data processing.
I never had to contact them; however, they offer 24/7 support and there are a large number of forums about Spark. It is well integrated with Python and supports SQL syntax.
The only thing I dislike about Spark's usability is the learning curve, since there are many actions and transformations. However, its wide range of uses for ETL processing, its ease of integration, and its multi-language support make this library a powerhouse for your data science solutions. It has especially aided us with its lightning-fast processing times.