Apache Spark Reviews

117 Ratings
<a href='https://www.trustradius.com/static/about-trustradius-scoring' target='_blank' rel='nofollow noopener noreferrer'>trScore algorithm: Learn more.</a>
Score 8.2 out of 100

Do you work for this company? Manage this listing

Overall Rating

Reviewer's Company Size

Last Updated

By Topic




Job Type


Reviews (1-17 of 17)

Yogesh Mhasde | TrustRadius Reviewer
January 11, 2020

Apache Spark -- The best big data solution

Score 8 out of 10
Vetted Review
Verified User
Review Source
We were working for one of our products, which has a requirement for developing an enterprise-level product catering to manage a vast amount of Big data involved. We wanted to use a technology that is faster than Hadoop and can process large scale data by providing a streamlined process for the data scientists. Apache Spark is a powerful unified solution as we thought to be.
The main problem that we identified in our existing approach was that it was taking a large amount of time to process the data, and also the statistical analysis of the data was not up to the mark. We wanted a sophisticated analytical solution that was easy and fast to use. With using Apache Spark, the processing was made 5 times faster than earlier, giving rise to pretty good analytics. With Spark, across a cluster of machines, the data abstraction was achieved by using RDDs.
  • DataFrames, DataSets, and RDDs.
  • Spark has in-built Machine Learning library which scales and integrates with existing tools.
  • The data processing done by Spark comes at a price of memory blockages, as in-memory capabilities of processing can lead to large consumption of memory.
  • The caching algorithm is not in-built in Spark. We need to manually set up the caching mechanism.
1. Suitable where the requirement for advanced analytics is prominent.
2. When you want big data to be processed at a very fast pace.
3. For large datasets, Spark is a viable solution.
4. When you need fault tolerance to be at a precision, go for Spark.

Spark is not suitable:
1. If you want your data to be processed in real-time, then Spark is not a good solution.
2. When you need automatic optimization, then Spark fails at that point.
Read Yogesh Mhasde's full review
Anonymous | TrustRadius Reviewer
December 13, 2019

Great open source tool for data processing

Score 9 out of 10
Vetted Review
Verified User
Review Source
We do use Apache Spark for cluster computing for our ETL environment, data and analytics as well as machine learning. It is mainly used by our data engineering team to support the entire Data Lake foundation. As we have huge amounts of information coming from multiple sources, we needed an effective cluster management system to handle capacity and deliver the performance and throughput we needed.
  • Cluster management for ETL.
  • Data processing engine for our data lake.
  • You still need Hive or other HDFS to store information.
  • Security is behind compared to MapReduce.
Spark is a one-size-fits-all data processing platform. You can run batch and in-motion streams, you can use for ETL, machine learning or even graphs. You do not have multiple tools, so it makes your TCO and management tasks way easier. As every new platform, has room to grow: storage and security are the main opportunities we found.
Read this authenticated review
Anonymous | TrustRadius Reviewer
March 27, 2019

Want to save dollars, resources and time processing big data, switch to Apache Spark

Score 9 out of 10
Vetted Review
Verified User
Review Source
We sold a data science product to one of the leading US-based e-commerce firms. Suddenly, their data started growing at a very fast rate. The product, at this stage, was based on R programming. With such huge data, the product started taking a lot of time. We then started thinking of an alternative to R, to process multiplying big data such as this client has. We eventually came across Apache Spark. With the permission of the client, we started switching the codes from R to Apache Spark. It took a very long time to learn and code in Spark, but it was worth the effort. The R codes, which were taking days of time to run, came down to a few hours.
  • Very good tool to process big datasets.
  • Inbuilt fault tolerance.
  • Supports multiple languages.
  • Supports advanced analytics.
  • A large number of libraries available -- GraphX, Spark SQL, Spark Streaming, etc.
  • Very slow with smaller amounts of data.
  • Expensive, as it stores data in memory.
If your data is very huge, I recommend converting the underlying technology into Apache Spark. This will save you a lot of time and effort in the near future due to your growing data. The Apache Spark scalability feature also means it handles all the future data related processing.
Read this authenticated review
Anonymous | TrustRadius Reviewer
March 16, 2019

Apache Spark Review

Score 7 out of 10
Vetted Review
Verified User
Review Source
We used Apache Spark within our department as a Solution Architecture team. It helped make big data processing more efficient since the same framework can be used for batch and stream processing.
  • Customizable, it integrates with Jupyter notebooks which was really helpful for our team.
  • Easy to use and implement.
  • It allows us to quickly build microservices.
  • Release cycles can be faster.
  • Sometimes it kicked some of the users out due to inactivity.
It is beneficial to use Apache Spark if:
  • You are working with big data, preprocessing data before machine learning
  • Building simple microservices and creating PoC. It makes it easier to create REST and simple web APIs.
  • If you need great customer service, Apache Spark would be a great choice since they provide it 24/7.
Read this authenticated review
Anonymous | TrustRadius Reviewer
March 06, 2019

Sparking the future

Score 8 out of 10
Vetted Review
Verified User
Review Source
Only one of our departments is using Apache Spark to work on very large datasets. We are thinking of implementing it to other departments as well.
  • It is very fast.
  • It is gaining usability now that the PySpark community is growing and more functions are being developed.
  • Programmers can run different languages on the servers.
  • PySpark does not have the same ease of use and functionality that Pandas does yet.
It is well suited for very large datasets that would run slowly on Hadoop servers or if you want to do real-time analytics on streaming data. It is overkill and not worth the loss of programming functionality for smaller datasets.
Read this authenticated review
Thomas Young | TrustRadius Reviewer
January 25, 2019

Spark is useful, but requires lots of very valuable questions to justify the effort, and be prepared for failure in answering posed questions

Score 7 out of 10
Vetted Review
Verified User
Review Source
Apache Spark is used by certain departments to produce summary statistics. The software is used for data sets that are very, very large in size and require immense processing power. The software is also used for simple graphics. When the data are small enough, Apache Spark is not the preferred analytical tool. It's the big data that makes Spark useful.
  • Apache Spark makes processing very large data sets possible. It handles these data sets in a fairly quick manner.
  • Apache Spark does a fairly good job implementing machine learning models for larger data sets.
  • Apache Spark seems to be a rapidly advancing software, with the new features making the software ever more straight-forward to use.
  • Apache Spark requires some advanced ability to understand and structure the modeling of big data. The software is not user-friendly.
  • The graphics produced by Apache Spark are by no means world-class. They sometimes appear high-schoolish.
  • Apache Spark takes an enormous amount of time to crunch through multiple nodes across very large data sets. Apache Spark could improve this by offering the software in a more interactive programming environment.
The software appears to run more efficiently than other big data tools, such as Hadoop. Given that, Apache Spark is well-suited for querying and trying to make sense of very, very large data sets. The software offers many advanced machine learning and econometrics tools, although these tools are used only partially because very large data sets require too much time when the data sets get too large. The software is not well-suited for projects that are not big data in size. The graphics and analytical output are subpar compared to other tools.

Read Thomas Young's full review
Shiv Shivakumar | TrustRadius Reviewer
December 14, 2018

Apache Spark - defacto for big data processing/analytics

Score 9 out of 10
Vetted Review
Verified User
Review Source
Used as the in memory data engine for big data analytics, streaming data and SQL workloads. Also, in the process of trying it out for certain machine learning algorithms. It basically processes data for analytical needs of the business and is a great tool to co-exist with the hadoop file systems.
  • in memory data engine and hence faster processing
  • does well to lay on top of hadoop file system for big data analytics
  • very good tool for streaming data
  • could do a better job for analytics dashboards to provide insights on a data stream and hence not have to rely on data visualization tools along with spark
  • also there is room for improvement in the area of data discovery
Apache Spark is very well suited for big data analytics in conjunction with the hadoop file system and also does a good job of providing fast access to data in SQL workloads since it has an in memory data processing engine that can very quickly process data. In addition, it can also be used for streaming data processing.
Read Shiv Shivakumar's full review
Carla Borges | TrustRadius Reviewer
August 28, 2018

Very useful application for Big Data processing and excellent for large volume production workflows

Score 10 out of 10
Vetted Review
Verified User
Review Source
Apache Spark is being used by the whole organization. It helps us a lot in the transmission of data, as it is 100 times faster than Hadoop MapReduce in memory and 10 times faster in disk, as we work with Java this application. It allows native links for Java programming languages, ​​and as it is compatible with SQL, is completely adapted to the needs of our organization, because of the large amount of information that we use. We highly prefer Apache Spark since it supports in-memory processing to increase performance of big data analysis applications.
  • It performs a conventional disk-based process when the data sets are too large to fit into memory, which is very useful because, regardless of the size of the data, it is always possible to store them.
  • It has great speed and ability to join multiple types of databases and run different types of analysis applications. This functionality is super useful as it reduces work times
  • Apache Spark uses the data storage model of Hadoop and can be integrated with other big data frameworks such as HBase, MongoDB, and Cassandra. This is very useful because it is compatible with multiple frameworks that the company has, and thus allows us to unify all the processes.
  • Increase the information and trainings that come with the application, especially for debugging since the process is difficult to understand.
  • It should be more attentive to users and make tutorials, to reduce the learning curve.
  • There should be more grouping algorithms.
It is suitable for processing large amounts of data, as it is very easy to use and its syntax is simple and understandable. I also find it useful to use in a variety of applications without the need to integrate many other processing technologies, and it is very fast and has many machine learning algorithms that can be used for data problems. I find it less appropriate for data that is not so large, as it uses too many resources.
Read Carla Borges's full review
Nitin Pasumarthy | TrustRadius Reviewer
July 21, 2018

Apache Spark: One stop shop for distributed data processing, machine learning and graph processing

Score 10 out of 10
Vetted Review
Verified User
Review Source
We use Apache Spark across all analytics departments in the company. We primarily use it for distributed data processing and data preparation for machine learning models. We also use it while running distributed CRON jobs for various analytical workloads. I am familiar with a story where we contributed an algorithm to Spark open source which is on Random Walks in Large Graphs - https://databricks.com/session/random-walks-on-large-scale-graphs-with-apache-spark
  • Rich APIs for data transformation making for very each to transform and prepare data in a distributed environment without worrying about memory issues
  • Faster in execution times compare to Hadoop and PIG Latin
  • Easy SQL interface to the same data set for people who are comfortable to explore data in a declarative manner
  • Interoperability between SQL and Scala / Python style of munging data
  • Documentation could be better as I usually end up going to other sites / blogs to understand the concepts better
  • More APIs are to be ported to MLlib as only very few algorithms are available at least in clustering segment
Apache Spark has rich APIs for regular data transformations or for ML workloads or for graph workloads, whereas other systems may not such a wide range of support. Choose it when you need to perform data transformations for big data as offline jobs, whereas use MongoDB-like distributed database systems for more realtime queries.

Read Nitin Pasumarthy's full review
Anson Abraham | TrustRadius Reviewer
March 27, 2018

Apache Spark, the be all End All.

Score 9 out of 10
Vetted Review
Verified User
Review Source
Spark was/is being used in myriad of ways. With Kafka, using Spark Streams to grab data from kafka queue into our hdfs environment. SparkSQL used for analysis of data for those not familiar with spark. Using Spark for data analysis as well and for main workflow process. Using spark over mapreduce. Using Spark for some machine learning algo's with the data.
  • Machine Learning.
  • Data Analysis
  • WorkFlow process (faster than MapReduce).
  • SQL connector to multiple data sources
  • Memory management. Very weak on that.
  • PySpark not as robust as scala with spark.
  • spark master HA is needed. Not as HA as it should be.
  • Locality should not be a necessity, but does help improvement. But would prefer no locality
Spark is great as a workflow process and extract transform layer process tool. Is really good for machine learning especially for large datasets that can be processed in split file paralallelization.
Spark streaming is scalable for close to real-time data workflow process.
what it's not good for, is smaller subset of data processing.
Read Anson Abraham's full review
Kartik Chavan | TrustRadius Reviewer
June 07, 2018

My Apache Spark Review

Score 9 out of 10
Vetted Review
Verified User
Review Source
My company uses Apache Spark in various ways including machine learning, analytics and batch processing. [We] Grab the data from other sources and put it into a Hadoop environment. [We] Build data lakes. SparkSQL is also used for analysis of data and to develop reports. We have deployed the clusters in Cloudera. Because of Apache Spark, it has become very easy to apply data science in a big data field.
  • Easy ELT Process
  • Easy clustering on cloud
  • Amazing speed
  • Batch & real time processing
  • Debugging is difficult as it is new for most people
  • There are fewer learning resources
When the data is very big, and you cannot afford a lot of computational timing such as in a real-time environment, it is advisable to use Apache Spark. There are alternatives to Apache Spark, but it is the most common and robust tool to work with. It is great at batch processing.
Read Kartik Chavan's full review
Kamesh Emani | TrustRadius Reviewer
October 26, 2017

Apache Spark - Simple Syntax, Huge Data Handling, Best Optimization, Parallel processing

Score 10 out of 10
Vetted Review
Verified User
Review Source
We previously used the database and Pentaho ETL tool to perform data transformation as per project requirements but as the time passed our data is building day by day and we suffered a lot of optimization problems working this way. Then we thought of implementing Hadoop cluster with 8 nodes in our company. We deployed an 8 node cluster with Cloudera distribution. Then we started using Apache Spark to create applications for Student Course Enrollment data and run them parallelly on multiprocessors.

It is used by a department but the data consists of information about students and professors of the whole organization.

It addresses the problem of assigning classrooms for a specific time in a week based on student course enrollment and professors teaching the course schedules.
This is just one aspect of the application. There are various other data transformation requirement scenarios for different departments across the organization
  • Spark uses Scala which is a functional programming language and easy to use language. Syntax is simpler and human readable.
  • It can be used to run transformations on huge data on different cluster parallelly. It automatically optimizes the process to get output efficiently in less time.
  • It also provides machine learning API for data science applications and also Spark SQL to query fast for data analysis.
  • I also use Zeppelin online tool which is used to fast query and very helpful for BI guys to visualize query outputs.
  • Data visualization.
  • Waiting for Web Development for small apps to be started with Spark as backbone middleware and HDFS as data retrieval file system.
  • Transformations and actions available are limited so must modify API to work for more features.
For large data
For best optimization
For parallel processing
For machine learning on huge data because presently available machine learning software like RapidMiner, are are limited to data size whereas Spark is not
Read Kamesh Emani's full review
Sunil Dhage | TrustRadius Reviewer
June 26, 2017

Sparkling Spark

Score 10 out of 10
Vetted Review
Verified User
Review Source
It's being replaced as the traditional ETL tool and we are using Apache Spark for data science solutions.
  • It makes the ETL process very simple when compared to SQL SERVER and MYSQL ETL tools.
  • It's very fast and has many machine learning algorithms which can be used for data science problems.
  • It is easily implemented on a cloud cluster.
  • The initialization and spark context procedures.
  • Running applications on a cluster is not well documented anywhere, some applications are hard to debug.
  • Debugging and Testing are sometimes time-consuming.
It's well suited for ETL, data Integration, and data science problems of large data sets. It's not at all suitable for small data sets which can be done on desktops and laptops using the Python tool.
Read Sunil Dhage's full review
Jordan Moore | TrustRadius Reviewer
September 12, 2016

A useful replacement for MapReduce for Big Data processing

Score 8 out of 10
Vetted Review
Verified User
Review Source
We are learning core Apache Spark + SparkSQL and MLLib, while creating proof-of-concepts as well as providing solutions for clients. It addresses the needs of quickly processing large amounts of data, typically located in Hadoop.
  • Scale from local machine to full cluster. You can run a standalone, single cluster simply by starting up a Spark Shell or submitting an application to test an algorithm, then it quickly can be transferred and configured to run in a distributed environment.
  • Provides multiple APIs. Most people I know use Python and/or Java as their main programming language. Data scientists who are familiar with NumPy and SciPy can quickly become comfortable with Spark, while Java developers would best served using Java 8 and the new features that it provides. Scala, on the other hand, is a mix between the Java and Python styles of writing Spark code, in my opinion.
  • Plentiful learning resources. The Learning Spark book is a good introduction to the mechanics of Spark although written for Spark 1.3, and the current version is 2.0. The GitHub repository for the book contains all the code examples that are discussed, plus the Spark website is also filled with useful information that is simple to navigate.
  • For data that isn't truly that large, Spark may be overkill when the problem could likely be solved on a computer with reasonable hardware resources. There doesn't seem to be a lot of examples for how a Spark task would otherwise be implemented in a different library; for instance scikit-learn and NumPy rather than Spark MLlib.
On the plus side, Spark is a good tool to learn to apply to various data processing problems.

As described in the Cons - Spark may not be needed unless there is truly a large amount of data to operate on. Other libraries may be better suited for the same task.
Read Jordan Moore's full review
Anonymous | TrustRadius Reviewer
December 13, 2017

Apache Spark Should Spark Your Interest

Score 9 out of 10
Vetted Review
Verified User
Review Source
At my current company, we are using Spark in a variety of ways ranging from batch processing to data analysis to machine learning techniques. It has become our main driver for any distributed processing applications. It has gained quick adoption across the organization for its ease of use, integration into the Hadoop stack, and for its support in a variety of languages.
  • Ease of use, the Spark API allows for minimal boilerplate and can be written in a variety of languages including Python, Scala, and Java.
  • Performance, for most applications we have found that jobs are more performant running via Spark than other distributed processing technologies like Map-Reduce, Hive, and Pig.
  • Flexibility, the frameworks comes with support for streaming, batch processing, sql queries, machine learning, etc. It can be used in a variety of applications without needing to integrate a lot of other distributed processing technologies.
  • Resource heavy, jobs, in general, can be very memory intensive and you will want the nodes in your cluster to reflect that.
  • Debugging, it has gotten better with every release but sometimes it can be difficult to debug an error due to ambiguous or misleading exceptions and stack traces.
If you are running a distributed environment and are running applications that make use of batch processing, analytics, streaming, machine learning, or graphing then I cannot recommend Spark enough. It is easy to get going, simple to learn (relative to similar technologies), and can be used in a variety of use cases. All while giving you great performance.
Read this authenticated review
Anonymous | TrustRadius Reviewer
January 23, 2018

Use Apache Spark to Speed Up Cluster Computing

Score 7 out of 10
Vetted Review
Verified User
Review Source
In our company, we used Spark for a healthcare analytical project, where we need to do large-scale data processing in a Hadoop environment. The project is about building an enterprise data lake where we bring data from multiple products and consolidate. Further, in the downstream, we will develop some business reports.
  • We used to make our batch processing faster. Spark is faster in batch processing than MapReduce with it in memory computing
  • Spark will run along with other tools in the Hadoop ecosystem including Hive and Pig
  • Spark supports both batch and real-time processing
  • Apache Spark has Machine Learning Algorithms support
  • Consumes more memory
  • Difficult to address issues around memory utilization
  • Expensive - In-memory processing is expensive when we look for a cost-efficient processing of big data
Well suited:
1. Data can be integrated from several sources including click stream, logs, transactional systems
2. Real-time ingestion through Kafka, Kinesis, and other streaming platforms

Read this authenticated review
Anonymous | TrustRadius Reviewer
August 02, 2017

Apache Spark if great for high volume production workflows

Score 10 out of 10
Vetted Review
Verified User
Review Source
We use it primarily in our department as part of a machine learning and data processing platform to build enterprise scale predictive applications.
  • Great APIs and tools.
  • Scale.
  • Speed for iterative algorithms.
  • No true streaming.
  • Lack of strongly typed yet convenient APIs.
Well suited for batch and near-real time data processing tasks as well as production deployments of machine learning, especially at large scale. Not well suited for general analytics workflows for small and medium sized data sets; SQL based data warehouses like Redshift, Vertica, and etc. are better for those use cases.
Read this authenticated review

About Apache Spark

Categories:  Hadoop-Related

Apache Spark Technical Details

Operating Systems: Unspecified
Mobile Application:No