TrustRadius
Apache Spark
95 Ratings
Score 8.6 out of 10
TRScore

Apache Spark Reviews

Reviews (1-10 of 10)
July 21, 2018

Review: "Apache Spark: One stop shop for distributed data processing, machine learning and graph processing"

Score 10 out of 10
Vetted Review
Verified User
Review Source
We use Apache Spark across all analytics departments in the company. We primarily use it for distributed data processing and data preparation for machine learning models. We also use it to run distributed CRON jobs for various analytical workloads. We even contributed an algorithm on random walks in large graphs to the Spark open-source project - https://databricks.com/session/random-walks-on-large-scale-graphs-with-apache-spark
  • Rich APIs for data transformation, making it very easy to transform and prepare data in a distributed environment without worrying about memory issues
  • Faster execution times compared to Hadoop and Pig Latin
  • Easy SQL interface to the same data set for people who are comfortable exploring data in a declarative manner
  • Interoperability between SQL and Scala / Python style of munging data
  • Documentation could be better, as I usually end up going to other sites/blogs to understand the concepts
  • More algorithms should be ported to MLlib, as only very few are available, at least in the clustering segment
Apache Spark has rich APIs for regular data transformations, ML workloads, and graph workloads, whereas other systems may not offer such a wide range of support. Choose it when you need to perform data transformations on big data as offline jobs; use MongoDB-like distributed database systems for more real-time queries.

Read Nitin Pasumarthy's full review
August 28, 2018

Apache Spark Review: "Very useful application for Big Data processing and excellent for large volume production workflows"

Score 10 out of 10
Vetted Review
Verified User
Review Source
Apache Spark is being used by the whole organization. It helps us a lot in the transmission of data, as it is 100 times faster than Hadoop MapReduce in memory and 10 times faster on disk. We work with Java, and Spark provides native bindings for the Java programming language; since it is also compatible with SQL, it is completely adapted to the needs of our organization and the large amount of information that we use. We highly prefer Apache Spark since it supports in-memory processing to increase the performance of big data analysis applications.
  • It falls back to a conventional disk-based process when the data sets are too large to fit into memory, which is very useful because, regardless of the size of the data, it is always possible to process them.
  • It has great speed and the ability to join multiple types of databases and run different types of analysis applications. This functionality is super useful as it reduces work times.
  • Apache Spark uses the data storage model of Hadoop and can be integrated with other big data frameworks such as HBase, MongoDB, and Cassandra. This is very useful because it is compatible with multiple frameworks that the company has, and thus allows us to unify all the processes.
  • Increase the information and training that come with the application, especially for debugging, since the process is difficult to understand.
  • It should be more attentive to users and provide tutorials to reduce the learning curve.
  • There should be more grouping algorithms.
It is suitable for processing large amounts of data, as it is very easy to use and its syntax is simple and understandable. I also find it useful to use in a variety of applications without the need to integrate many other processing technologies, and it is very fast and has many machine learning algorithms that can be used for data problems. I find it less appropriate for data that is not so large, as it uses too many resources.
Read Carla Borges's full review
March 27, 2018

User Review: "Apache Spark, the be all End All."

Score 9 out of 10
Vetted Review
Verified User
Review Source
Spark was/is being used in a myriad of ways. With Kafka, we use Spark Streaming to grab data from the Kafka queue into our HDFS environment. SparkSQL is used for analysis of data by those not familiar with Spark. We also use Spark for data analysis and for our main workflow process, preferring Spark over MapReduce, and for some machine learning algorithms on the data.
  • Machine Learning.
  • Data Analysis
  • WorkFlow process (faster than MapReduce).
  • SQL connector to multiple data sources
  • Memory management. Very weak on that.
  • PySpark is not as robust as Scala with Spark.
  • Spark master HA is needed. It's not as highly available as it should be.
  • Locality should not be a necessity, though it does help performance. I would prefer no locality requirement.
Spark is great as a workflow processing and extract-transform-load (ETL) tool. It is really good for machine learning, especially for large datasets that can be processed with split-file parallelization.
Spark Streaming is scalable for close to real-time data workflow processing.
What it's not good for is smaller subsets of data processing.
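The Kafka-to-HDFS pattern this reviewer describes maps onto Spark's Structured Streaming API. The sketch below is illustrative only: the broker address, topic name, and HDFS paths are assumptions, and running it requires the spark-sql-kafka connector plus a live Kafka broker and HDFS cluster:

```python
# Hedged sketch of a Kafka -> HDFS streaming pipeline with Structured Streaming.
# Hostnames, topic, and paths are illustrative; needs the spark-sql-kafka package.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker
          .option("subscribe", "events-topic")                # assumed topic
          .load()
          # Kafka delivers key/value as bytes; cast to strings for downstream use.
          .select(col("key").cast("string"), col("value").cast("string")))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")               # assumed landing path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())

query.awaitTermination()
```

The checkpoint location is what gives the job restart/recovery semantics, which is a common reason teams pick Structured Streaming over hand-rolled Kafka consumers.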
Read Anson Abraham's full review
June 07, 2018

"My Apache Spark Review"

Score 9 out of 10
Vetted Review
Verified User
Review Source
My company uses Apache Spark in various ways including machine learning, analytics and batch processing. [We] Grab the data from other sources and put it into a Hadoop environment. [We] Build data lakes. SparkSQL is also used for analysis of data and to develop reports. We have deployed the clusters in Cloudera. Because of Apache Spark, it has become very easy to apply data science in a big data field.
  • Easy ELT Process
  • Easy clustering on cloud
  • Amazing speed
  • Batch & real time processing
  • Debugging is difficult as it is new for most people
  • There are fewer learning resources
When the data is very big, and you cannot afford a lot of computational timing such as in a real-time environment, it is advisable to use Apache Spark. There are alternatives to Apache Spark, but it is the most common and robust tool to work with. It is great at batch processing.
Read Kartik Chavan's full review
October 26, 2017

Review: "Apache Spark - Simple Syntax, Huge Data Handling, Best Optimization, Parallel processing"

Score 10 out of 10
Vetted Review
Verified User
Review Source
We previously used the database and the Pentaho ETL tool to perform data transformation as per project requirements, but as time passed our data grew day by day and we suffered a lot of optimization problems working this way. We therefore deployed an 8-node Hadoop cluster with the Cloudera distribution, and started using Apache Spark to create applications for student course enrollment data and run them in parallel on multiple processors.

It is used by a department but the data consists of information about students and professors of the whole organization.

It addresses the problem of assigning classrooms for a specific time in a week based on student course enrollment and professors teaching the course schedules.
This is just one aspect of the application. There are various other data transformation requirements for different departments across the organization.
  • Spark uses Scala, which is a functional and easy-to-use programming language. The syntax is simple and human-readable.
  • It can be used to run transformations on huge data across a cluster in parallel. It automatically optimizes the process to produce output efficiently in less time.
  • It also provides a machine learning API for data science applications, as well as Spark SQL for fast queries in data analysis.
  • I also use the Zeppelin notebook tool, which enables fast querying and is very helpful for BI folks to visualize query outputs.
  • Data visualization.
  • Waiting for web development of small apps to be started with Spark as backbone middleware and HDFS as the data retrieval file system.
  • The transformations and actions available are limited, so the API must be extended to support more features.
For large data
For best optimization
For parallel processing
For machine learning on huge data, because presently available machine learning software, like RapidMiner, is limited by data size, whereas Spark is not
Read Kamesh Emani's full review
December 13, 2017

Review: "Apache Spark Should Spark Your Interest"

Score 9 out of 10
Vetted Review
Verified User
Review Source
At my current company, we are using Spark in a variety of ways ranging from batch processing to data analysis to machine learning techniques. It has become our main driver for any distributed processing applications. It has gained quick adoption across the organization for its ease of use, integration into the Hadoop stack, and for its support in a variety of languages.
  • Ease of use, the Spark API allows for minimal boilerplate and can be written in a variety of languages including Python, Scala, and Java.
  • Performance, for most applications we have found that jobs are more performant running via Spark than other distributed processing technologies like Map-Reduce, Hive, and Pig.
  • Flexibility, the framework comes with support for streaming, batch processing, SQL queries, machine learning, etc. It can be used in a variety of applications without needing to integrate a lot of other distributed processing technologies.
  • Resource heavy, jobs, in general, can be very memory intensive and you will want the nodes in your cluster to reflect that.
  • Debugging, it has gotten better with every release but sometimes it can be difficult to debug an error due to ambiguous or misleading exceptions and stack traces.
If you are running a distributed environment and are running applications that make use of batch processing, analytics, streaming, machine learning, or graphing then I cannot recommend Spark enough. It is easy to get going, simple to learn (relative to similar technologies), and can be used in a variety of use cases. All while giving you great performance.
Read this authenticated review
January 23, 2018

Review: "Use Apache Spark to Speed Up Cluster Computing"

Score 7 out of 10
Vetted Review
Verified User
Review Source
In our company, we used Spark for a healthcare analytical project, where we need to do large-scale data processing in a Hadoop environment. The project is about building an enterprise data lake where we bring data from multiple products and consolidate. Further, in the downstream, we will develop some business reports.
  • We use it to make our batch processing faster. Spark is faster at batch processing than MapReduce thanks to its in-memory computing.
  • Spark runs alongside other tools in the Hadoop ecosystem, including Hive and Pig.
  • Spark supports both batch and real-time processing
  • Apache Spark has Machine Learning Algorithms support
  • Consumes more memory
  • Difficult to address issues around memory utilization
  • Expensive - In-memory processing is expensive when we look for a cost-efficient processing of big data
Well suited:
1. Data can be integrated from several sources including click stream, logs, transactional systems
2. Real-time ingestion through Kafka, Kinesis, and other streaming platforms

Read this authenticated review
June 26, 2017

Apache Spark Review: "Sparkling Spark"

Score 10 out of 10
Vetted Review
Verified User
Review Source
Apache Spark is replacing our traditional ETL tool, and we are using it for data science solutions.
  • It makes the ETL process very simple when compared to SQL Server and MySQL ETL tools.
  • It's very fast and has many machine learning algorithms which can be used for data science problems.
  • It is easily implemented on a cloud cluster.
  • The initialization and SparkContext setup procedures could be simpler.
  • Running applications on a cluster is not well documented anywhere, some applications are hard to debug.
  • Debugging and Testing are sometimes time-consuming.
It's well suited for ETL, data Integration, and data science problems of large data sets. It's not at all suitable for small data sets which can be done on desktops and laptops using the Python tool.
Read Sunil Dhage's full review
August 02, 2017

Review: "Apache Spark is great for high volume production workflows"

Score 10 out of 10
Vetted Review
Verified User
Review Source
We use it primarily in our department as part of a machine learning and data processing platform to build enterprise scale predictive applications.
  • Great APIs and tools.
  • Scale.
  • Speed for iterative algorithms.
  • No true streaming.
  • Lack of strongly typed yet convenient APIs.
Well suited for batch and near-real-time data processing tasks as well as production deployments of machine learning, especially at large scale. Not well suited for general analytics workflows on small and medium sized data sets; SQL-based data warehouses like Redshift, Vertica, etc. are better for those use cases.
Read this authenticated review
September 12, 2016

Apache Spark Review: "A useful replacement for MapReduce for Big Data processing"

Score 8 out of 10
Vetted Review
Verified User
Review Source
We are learning core Apache Spark + SparkSQL and MLlib, while creating proof-of-concepts as well as providing solutions for clients. It addresses the need to quickly process large amounts of data, typically located in Hadoop.
  • Scale from local machine to full cluster. You can run a standalone, single cluster simply by starting up a Spark Shell or submitting an application to test an algorithm, then it quickly can be transferred and configured to run in a distributed environment.
  • Provides multiple APIs. Most people I know use Python and/or Java as their main programming language. Data scientists who are familiar with NumPy and SciPy can quickly become comfortable with Spark, while Java developers would be best served using Java 8 and the new features that it provides. Scala, on the other hand, is a mix between the Java and Python styles of writing Spark code, in my opinion.
  • Plentiful learning resources. The Learning Spark book is a good introduction to the mechanics of Spark, although it was written for Spark 1.3 and the current version is 2.0. The GitHub repository for the book contains all the code examples that are discussed, plus the Spark website is also filled with useful information that is simple to navigate.
  • For data that isn't truly that large, Spark may be overkill when the problem could likely be solved on a computer with reasonable hardware resources. There doesn't seem to be a lot of examples for how a Spark task would otherwise be implemented in a different library; for instance scikit-learn and NumPy rather than Spark MLlib.
On the plus side, Spark is a good tool to learn to apply to various data processing problems.

As described in the Cons - Spark may not be needed unless there is truly a large amount of data to operate on. Other libraries may be better suited for the same task.
Read Jordan Moore's full review

Apache Spark Scorecard Summary

About Apache Spark

Categories:  Hadoop-Related

Apache Spark Technical Details

Operating Systems: Unspecified
Mobile Application: No