TrustRadius
Results and other tools reported by reviewers

"Apache Spark, the be all End All." (score 9)
  • Workflow process using Spark went from 1 day to 2 hours.
  • Spark Streaming allowed for quick determination of data validity.
  • Spark on YARN was good for management, but Spark with Kubernetes was easier to use.
  • Other tools mentioned: MapReduce and Apache Storm; HBase, Cassandra, Apache Drill.

"My Apache Spark Review" (score 9)
  • Apache Spark has faster performance compared to MapReduce.
  • The combination of Python and Spark is the best: shorter code, and faster and more efficient performance.
  • Can replace an RDBMS.
  • Other tools mentioned: Amazon Elastic MapReduce; Apache Hive, Amazon Elastic MapReduce, Apache Pig.

"Use Apache Spark to Speed Up Cluster Computing" (score 7)
  • We were able to make batch jobs 20 times faster compared to MapReduce.
  • With language support for Scala, Java, and Python, it is easily manageable.
  • Other tools mentioned: EMC Greenplum HD, Amazon Relational Database Service, AWS Lambda.

"Apache Spark - Simple Syntax, Huge Data Handling, Best Optimization, Parallel processing" (score 10)
  • Optimization at its best (super fast).
  • Handles huge data with simple syntax, whereas other programming languages take a lot of coding.
  • Best for parallel computing applications.
  • Other tools mentioned: Python, Apache Pig and Apache Hive; Apache Hive.

"Apache Spark Should Spark Your Interest" (score 9)
  • Faster turnaround on feature development: we have seen a noticeable improvement in our agile development since using Spark.
  • Easy adoption: having multiple departments use the same underlying technology, even if the use cases are very different, allows for more commonality amongst applications, which definitely makes the operations team happy.
  • Performance: we have been able to make some applications run over 20x faster since switching to Spark. This has saved us time, headaches, and operating costs.
  • Other tools mentioned: Apache Pig, Apache Hive and Apache Flume; Apache Hive, Hadoop, Apache Pig.
Apache Spark
93 Ratings
Score 8.6 out of 10

Apache Spark Reviews

Reviews (1-8 of 8)
March 27, 2018

User Review: "Apache Spark, the be all End All."

Score 9 out of 10
Vetted Review
Verified User
Spark was, and still is, used in a myriad of ways. With Kafka, we use Spark Streaming to pull data from a Kafka queue into our HDFS environment. Spark SQL is used for data analysis by those not familiar with Spark. We also use Spark for data analysis and for our main workflow process, in place of MapReduce, and for some machine learning algorithms on the data.
Pros:
  • Machine learning.
  • Data analysis.
  • Workflow processing (faster than MapReduce).
  • SQL connector to multiple data sources.
Cons:
  • Memory management is very weak.
  • PySpark is not as robust as Spark with Scala.
  • Spark master HA is needed; it is not as highly available as it should be.
  • Data locality should not be a necessity. It does help, but I would prefer not to depend on it.
Spark is great as a workflow processing tool and as an extract-transform-load (ETL) tool. It is really good for machine learning, especially on large datasets that can be split into files and processed in parallel.
Spark Streaming scales well for close to real-time data workflows.
What it is not good for is processing smaller subsets of data.
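The idea of splitting a dataset and processing the pieces in parallel can be sketched in plain Python. This is only an illustration of the pattern, not Spark code: the function names and sample lines are hypothetical, and a thread pool stands in for cluster executors.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Process one partition independently of all the others."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, num_partitions=4):
    partitions = split_into_partitions(records, num_partitions)
    # Threads stand in for cluster executors (an assumption of this sketch);
    # in CPython they show the structure rather than a real speedup.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words, partitions))
    # Merge the per-partition results into one total.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical sample input.
lines = ["spark streams data", "spark sql", "kafka feeds spark"]
result = parallel_word_count(lines, num_partitions=2)
```

Each partition is counted without knowledge of the others, and only the small per-partition summaries are merged at the end, which is what makes this kind of workload easy to spread across machines.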
Read Anson Abraham's full review
June 07, 2018

"My Apache Spark Review"

Score 9 out of 10
Vetted Review
Verified User
My company uses Apache Spark in various ways, including machine learning, analytics, and batch processing. We grab data from other sources and put it into a Hadoop environment, and we build data lakes. Spark SQL is also used for data analysis and to develop reports. We have deployed the clusters on Cloudera. Apache Spark has made it very easy to apply data science to big data.
Pros:
  • Easy ELT process
  • Easy clustering on the cloud
  • Amazing speed
  • Batch and real-time processing
Cons:
  • Debugging is difficult, as Spark is new to most people
  • There are few learning resources
When the data is very big and you cannot afford long computation times, such as in a real-time environment, it is advisable to use Apache Spark. There are alternatives to Apache Spark, but it is the most common and robust tool to work with. It is great at batch processing.
Read Kartik Chavan's full review
January 23, 2018

Review: "Use Apache Spark to Speed Up Cluster Computing"

Score 7 out of 10
Vetted Review
Verified User
In our company, we used Spark for a healthcare analytics project in which we needed to do large-scale data processing in a Hadoop environment. The project involves building an enterprise data lake where we bring in and consolidate data from multiple products. Downstream, we will develop business reports.
Pros:
  • We used Spark to make our batch processing faster. Spark is faster at batch processing than MapReduce thanks to its in-memory computing
  • Spark runs alongside other tools in the Hadoop ecosystem, including Hive and Pig
  • Spark supports both batch and real-time processing
  • Apache Spark supports machine learning algorithms
Cons:
  • Consumes more memory
  • Difficult to address issues around memory utilization
  • Expensive: in-memory processing is costly when we are looking for cost-efficient processing of big data
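The in-memory advantage over MapReduce noted in this review matters most for jobs that pass over the same data many times. Below is a minimal plain-Python sketch of that difference, assuming a toy iterative computation; it is not Spark or MapReduce code, and the file contents, function names, and step size are all hypothetical.

```python
import os
import tempfile


def load_values(path, stats):
    """Read every value from disk and record that a full read happened."""
    stats["disk_reads"] += 1
    with open(path) as handle:
        return [float(line) for line in handle]


def refine_mean(values, estimate, step=0.5):
    """One iteration: move the estimate toward the mean of the data."""
    gradient = sum(estimate - v for v in values) / len(values)
    return estimate - step * gradient


def iterate_rereading(path, iterations):
    """Disk-bound style: every pass goes back to storage for its input."""
    stats = {"disk_reads": 0}
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine_mean(load_values(path, stats), estimate)
    return estimate, stats["disk_reads"]


def iterate_cached(path, iterations):
    """In-memory style: read once, then reuse the cached data."""
    stats = {"disk_reads": 0}
    cached = load_values(path, stats)
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine_mean(cached, estimate)
    return estimate, stats["disk_reads"]


# Hypothetical toy dataset written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(str(v) for v in [2.0, 4.0, 6.0, 8.0]))
    data_path = tmp.name

try:
    reread_estimate, reread_reads = iterate_rereading(data_path, iterations=10)
    cached_estimate, cached_reads = iterate_cached(data_path, iterations=10)
finally:
    os.remove(data_path)
```

Both variants arrive at the same estimate, but the cached one touches storage once instead of once per iteration. The trade-off is the one listed in the cons: the cached data has to fit in memory, and memory is the more expensive resource.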
Well suited:
1. Data can be integrated from several sources including click stream, logs, transactional systems
2. Real-time ingestion through Kafka, Kinesis, and other streaming platforms

October 26, 2017

Review: "Apache Spark - Simple Syntax, Huge Data Handling, Best Optimization, Parallel processing"

Score 10 out of 10
Vetted Review
Verified User
We previously used a database and the Pentaho ETL tool to perform data transformations as project requirements dictated, but our data grew day by day and we ran into many optimization problems working this way. We then decided to set up a Hadoop cluster in our company and deployed an 8-node cluster with the Cloudera distribution. After that, we started using Apache Spark to create applications for student course enrollment data and to run them in parallel on multiple processors.

It is used by one department, but the data covers students and professors across the whole organization.

It addresses the problem of assigning classrooms for specific times in the week based on student course enrollment and professors' teaching schedules.
This is just one aspect of the application. There are various other data transformation requirements for different departments across the organization.
Pros:
  • Spark uses Scala, a functional programming language that is easy to use. Its syntax is simple and human-readable.
  • It can run transformations on huge data sets in parallel across a cluster. It automatically optimizes the process to produce output efficiently and in less time.
  • It also provides a machine learning API for data science applications, and Spark SQL for fast queries in data analysis.
  • I also use the Zeppelin online tool, which allows fast querying and is very helpful for BI users who want to visualize query outputs.
Cons:
  • Data visualization.
  • Still waiting for web development support, so that small apps could use Spark as backbone middleware and HDFS as the data retrieval file system.
  • The available transformations and actions are limited, so the API must be modified to support more features.
For large data
For best optimization
For parallel processing
For machine learning on huge data, because currently available machine learning software such as RapidMiner is limited in the data size it can handle, whereas Spark is not
Read Kamesh Emani's full review
December 13, 2017

Review: "Apache Spark Should Spark Your Interest"

Score 9 out of 10
Vetted Review
Verified User
At my current company, we are using Spark in a variety of ways, ranging from batch processing to data analysis to machine learning techniques. It has become our main driver for any distributed processing application. It has gained quick adoption across the organization for its ease of use, its integration into the Hadoop stack, and its support for a variety of languages.
Pros:
  • Ease of use: the Spark API allows for minimal boilerplate and can be used from a variety of languages, including Python, Scala, and Java.
  • Performance: for most applications, we have found that jobs are more performant running via Spark than via other distributed processing technologies like MapReduce, Hive, and Pig.
  • Flexibility: the framework comes with support for streaming, batch processing, SQL queries, machine learning, etc. It can be used in a variety of applications without needing to integrate a lot of other distributed processing technologies.
Cons:
  • Resource heavy: jobs, in general, can be very memory intensive, and you will want the nodes in your cluster to reflect that.
  • Debugging: it has gotten better with every release, but it can sometimes be difficult to debug an error due to ambiguous or misleading exceptions and stack traces.
If you are running a distributed environment and are running applications that make use of batch processing, analytics, streaming, machine learning, or graph processing, then I cannot recommend Spark enough. It is easy to get going, simple to learn (relative to similar technologies), and can be used in a variety of use cases, all while giving you great performance.
August 02, 2017

Review: "Apache Spark is great for high volume production workflows"

Score 10 out of 10
Vetted Review
Verified User
We use it primarily in our department as part of a machine learning and data processing platform to build enterprise scale predictive applications.
Pros:
  • Great APIs and tools.
  • Scale.
  • Speed for iterative algorithms.
Cons:
  • No true streaming.
  • Lack of strongly typed yet convenient APIs.
Well suited for batch and near-real-time data processing tasks, as well as production deployments of machine learning, especially at large scale. Not well suited for general analytics workflows on small and medium-sized data sets; SQL-based data warehouses like Redshift, Vertica, etc. are better for those use cases.
June 26, 2017

Apache Spark Review: "Sparkling Spark"

Score 10 out of 10
Vetted Review
Verified User
It is replacing our traditional ETL tool, and we are also using Apache Spark for data science solutions.
Pros:
  • It makes the ETL process very simple compared to the SQL Server and MySQL ETL tools.
  • It's very fast and has many machine learning algorithms that can be used for data science problems.
  • It is easily implemented on a cloud cluster.
Cons:
  • The initialization and SparkContext procedures.
  • Running applications on a cluster is not well documented anywhere, and some applications are hard to debug.
  • Debugging and testing are sometimes time-consuming.
It's well suited for ETL, data integration, and data science problems on large data sets. It's not at all suitable for small data sets, which can be handled on desktops and laptops using Python.
Read Sunil Dhage's full review
September 12, 2016

Apache Spark Review: "A useful replacement for MapReduce for Big Data processing"

Score 8 out of 10
Vetted Review
Verified User
We are learning core Apache Spark, Spark SQL, and MLlib while creating proofs of concept as well as providing solutions for clients. It addresses the need to quickly process large amounts of data, typically located in Hadoop.
Pros:
  • Scale from local machine to full cluster. You can run a standalone, single-node cluster simply by starting up a Spark shell or submitting an application to test an algorithm; it can then quickly be transferred and configured to run in a distributed environment.
  • Provides multiple APIs. Most people I know use Python and/or Java as their main programming language. Data scientists who are familiar with NumPy and SciPy can quickly become comfortable with Spark, while Java developers would be best served using Java 8 and the new features that it provides. Scala, on the other hand, is a mix between the Java and Python styles of writing Spark code, in my opinion.
  • Plentiful learning resources. The Learning Spark book is a good introduction to the mechanics of Spark, although it was written for Spark 1.3 and the current version is 2.0. The GitHub repository for the book contains all the code examples that are discussed, and the Spark website is also filled with useful information that is simple to navigate.
Cons:
  • For data that isn't truly that large, Spark may be overkill when the problem could likely be solved on a computer with reasonable hardware resources. There don't seem to be many examples of how a Spark task would otherwise be implemented in a different library; for instance, scikit-learn and NumPy rather than Spark MLlib.
On the plus side, Spark is a good tool to learn to apply to various data processing problems.

As described in the Cons, Spark may not be needed unless there is truly a large amount of data to operate on. Other libraries may be better suited for the same task.
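To illustrate the point about small data, here is a minimal sketch of a model fit that needs no cluster at all: ordinary least squares for a single feature, using only the Python standard library. The data and the function name `fit_line` are hypothetical, and this is not how Spark MLlib, scikit-learn, or NumPy implement regression.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_line(xs, ys)

# Predict the value at an unseen point.
prediction = slope * 6.0 + intercept
```

When the whole data set fits comfortably in one machine's memory like this, the overhead of starting a Spark application is likely to outweigh any benefit from distributing the work.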
Read Jordan Moore's full review

About Apache Spark

Categories:  Hadoop-Related

Apache Spark Technical Details

Operating Systems: Unspecified
Mobile Application: No