TrustRadius

Apache Spark Reviews

Apache Spark
109 Ratings
Score 8.4 out of 10 (trScore)

Reviews (1-15 of 15)

March 27, 2019

Want to save dollars, resources, and time processing big data? Switch to Apache Spark

Score 9 out of 10
Vetted Review
Verified User
Review Source
We sold a data science product to one of the leading US-based e-commerce firms. Suddenly, their data started growing at a very fast rate. At that stage, the product was based on R. With such huge data, the product started taking a lot of time to run. We then started looking for an alternative to R that could process rapidly multiplying big data like this client's, and we eventually came across Apache Spark. With the client's permission, we started porting the code from R to Apache Spark. It took a long time to learn and code in Spark, but it was worth the effort: the R jobs that had taken days to run now finish in a few hours.
  • Very good tool to process big datasets.
  • Inbuilt fault tolerance.
  • Supports multiple languages.
  • Supports advanced analytics.
  • A large number of libraries available -- GraphX, Spark SQL, Spark Streaming, etc.
  • Very slow with smaller amounts of data.
  • Expensive, as it stores data in memory.
If your data is growing very large, I recommend converting the underlying technology to Apache Spark. This will save you a lot of time and effort as your data continues to grow, and Spark's scalability means it can handle all of your future data processing as well.
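Below is a minimal sketch of the kind of R-to-Spark port this review describes. The file path and column names (region, amount, order_id) are hypothetical, since the review does not show the client's schema; the point is that a dplyr-style group-and-summarise moves to Spark almost one-to-one while running distributed.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("r-to-spark-port").getOrCreate()

# Equivalent of reading a large CSV into an R data frame, but the data
# is partitioned across the cluster instead of held on one machine.
orders = spark.read.csv("hdfs:///data/orders/*.csv",
                        header=True, inferSchema=True)

# A typical aggregation that would take days in single-threaded R on
# huge data, expressed in Spark's DataFrame API.
summary = (orders
           .groupBy("region")
           .agg(F.sum("amount").alias("total_sales"),
                F.countDistinct("order_id").alias("order_count")))

summary.write.mode("overwrite").parquet("hdfs:///data/sales_by_region")
```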
March 16, 2019

Apache Spark Review

Score 7 out of 10
Vetted Review
Verified User
Review Source
We used Apache Spark within our department as a Solution Architecture team. It helped make big data processing more efficient since the same framework can be used for batch and stream processing.
  • Customizable; it integrates with Jupyter notebooks, which was really helpful for our team.
  • Easy to use and implement.
  • It allows us to quickly build microservices.
  • Release cycles can be faster.
  • Sometimes it kicked some of the users out due to inactivity.
It is beneficial to use Apache Spark if:
  • You are working with big data, preprocessing data before machine learning
  • Building simple microservices and creating PoCs. It makes it easier to create REST and simple web APIs.
  • If you need great customer service, Apache Spark would be a great choice since they provide it 24/7.
March 06, 2019

Sparking the future

Score 8 out of 10
Vetted Review
Verified User
Review Source
Only one of our departments is using Apache Spark to work on very large datasets. We are considering rolling it out to other departments as well.
  • It is very fast.
  • It is gaining usability now that the PySpark community is growing and more functions are being developed.
  • Programmers can run different languages on the servers.
  • PySpark does not yet have the same ease of use and functionality that Pandas does (see the sketch below).
It is well suited for very large datasets that would run slowly on Hadoop servers or if you want to do real-time analytics on streaming data. It is overkill and not worth the loss of programming functionality for smaller datasets.
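As a brief illustration of the PySpark/Pandas gap mentioned above, here is the same aggregation in both APIs, along with the common pattern of handing a reduced Spark result to Pandas. The data path and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pyspark-vs-pandas").getOrCreate()
events = spark.read.parquet("hdfs:///data/events")

# Pandas would express this as: events.groupby("user")["clicks"].mean()
# The PySpark equivalent runs distributed across the cluster:
per_user = events.groupBy("user").agg(F.avg("clicks").alias("mean_clicks"))

# Once aggregated, the result is small enough to collect into Pandas
# for the conveniences PySpark does not yet match.
pdf = per_user.toPandas()
print(pdf.describe())
```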
Thomas Young
January 25, 2019

Spark is useful, but requires lots of very valuable questions to justify the effort, and be prepared for failure in answering posed questions

Score 7 out of 10
Vetted Review
Verified User
Review Source
Apache Spark is used by certain departments to produce summary statistics. The software is used for data sets that are very, very large in size and require immense processing power. The software is also used for simple graphics. When the data are small enough, Apache Spark is not the preferred analytical tool. It's the big data that makes Spark useful.
  • Apache Spark makes processing very large data sets possible. It handles these data sets in a fairly quick manner.
  • Apache Spark does a fairly good job implementing machine learning models for larger data sets.
  • Apache Spark seems to be rapidly advancing software, with new features making it ever more straightforward to use.
  • Apache Spark requires some advanced ability to understand and structure the modeling of big data. The software is not user-friendly.
  • The graphics produced by Apache Spark are by no means world-class. They sometimes appear high-schoolish.
  • Apache Spark takes an enormous amount of time to crunch through multiple nodes across very large data sets. Apache Spark could improve this by offering the software in a more interactive programming environment.
The software appears to run more efficiently than other big data tools, such as Hadoop. Given that, Apache Spark is well-suited for querying and trying to make sense of very, very large data sets. The software offers many advanced machine learning and econometrics tools, although these tools see only partial use because processing takes too much time once the data sets grow very large. The software is not well-suited for projects that are not big data in size. The graphics and analytical output are subpar compared to other tools.
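The summary-statistics workload this review describes maps onto a few built-in DataFrame calls. A minimal sketch follows, assuming a Parquet data set with numeric columns (the path and column names are hypothetical).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("summary-stats").getOrCreate()
claims = spark.read.parquet("hdfs:///data/claims")

# describe() computes count, mean, stddev, min, and max per column,
# scanning the full distributed data set.
claims.describe("amount", "duration").show()

# summary() adds approximate quartiles, at extra cost on big data.
claims.select("amount").summary("25%", "50%", "75%").show()
```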

Shiv Shivakumar
December 14, 2018

Apache Spark - the de facto choice for big data processing/analytics

Score 9 out of 10
Vetted Review
Verified User
Review Source
Used as the in-memory data engine for big data analytics, streaming data, and SQL workloads. We are also in the process of trying it out for certain machine learning algorithms. It basically processes data for the analytical needs of the business and is a great tool to co-exist with the Hadoop file system.
  • In-memory data engine, and hence faster processing
  • Does well layered on top of the Hadoop file system for big data analytics
  • Very good tool for streaming data
  • Could do a better job with analytics dashboards that provide insights on a data stream, so you would not have to rely on separate data visualization tools alongside Spark
  • There is also room for improvement in the area of data discovery
Apache Spark is very well suited for big data analytics in conjunction with the Hadoop file system, and it also does a good job of providing fast access to data in SQL workloads since its in-memory data processing engine can process data very quickly. In addition, it can also be used for streaming data processing.
Carla Borges
August 28, 2018

Very useful application for Big Data processing and excellent for large volume production workflows

Score 10 out of 10
Vetted Review
Verified User
Review Source
Apache Spark is being used by the whole organization. It helps us a lot in the transmission of data, as it is up to 100 times faster than Hadoop MapReduce in memory and 10 times faster on disk. Since we work with Java, this application suits us: it provides native bindings for the Java programming language, and as it is compatible with SQL, it is completely adapted to the needs of our organization, given the large amount of information that we use. We highly prefer Apache Spark since it supports in-memory processing to increase the performance of big data analysis applications.
  • It falls back to a conventional disk-based process when the data sets are too large to fit into memory, which is very useful because, regardless of the size of the data, it is always possible to process them (see the sketch after this list).
  • It has great speed and the ability to join multiple types of databases and run different types of analysis applications. This functionality is super useful, as it reduces work times.
  • Apache Spark uses the data storage model of Hadoop and can be integrated with other big data frameworks such as HBase, MongoDB, and Cassandra. This is very useful because it is compatible with multiple frameworks that the company has, and thus allows us to unify all the processes.
  • The information and training that come with the application should be expanded, especially for debugging, since the process is difficult to understand.
  • It should be more attentive to users and provide tutorials to reduce the learning curve.
  • There should be more clustering algorithms.
It is suitable for processing large amounts of data, as it is very easy to use and its syntax is simple and understandable. I also find it useful to use in a variety of applications without the need to integrate many other processing technologies, and it is very fast and has many machine learning algorithms that can be used for data problems. I find it less appropriate for data that is not so large, as it uses too many resources.
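The disk-based fallback praised above corresponds to Spark's storage levels. Here is a small sketch with a hypothetical data path: persisting with MEMORY_AND_DISK lets partitions that do not fit in RAM spill to local disk instead of failing.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spill-to-disk").getOrCreate()
big = spark.read.parquet("hdfs:///data/transactions")

# Keep what fits in memory; spill the remaining partitions to disk.
big.persist(StorageLevel.MEMORY_AND_DISK)

# The persisted data is reused across several analyses without
# re-reading the source.
print(big.count())
print(big.filter("amount > 1000").count())
```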
Nitin Pasumarthy
July 21, 2018

Apache Spark: One stop shop for distributed data processing, machine learning and graph processing

Score 10 out of 10
Vetted Review
Verified User
Review Source
We use Apache Spark across all analytics departments in the company. We primarily use it for distributed data processing and data preparation for machine learning models. We also use it when running distributed CRON jobs for various analytical workloads. I am familiar with a story where we contributed an algorithm on Random Walks in Large Graphs to Spark open source - https://databricks.com/session/random-walks-on-large-scale-graphs-with-apache-spark
  • Rich APIs for data transformation, making it very easy to transform and prepare data in a distributed environment without worrying about memory issues
  • Faster execution times compared to Hadoop and Pig Latin
  • Easy SQL interface to the same data set for people who are more comfortable exploring data in a declarative manner
  • Interoperability between SQL and the Scala / Python style of munging data (illustrated in the sketch below)
  • Documentation could be better, as I usually end up going to other sites / blogs to understand the concepts
  • More APIs need to be ported to MLlib, as only very few algorithms are available, at least in the clustering segment
Apache Spark has rich APIs for regular data transformations, ML workloads, and graph workloads, whereas other systems may not offer such a wide range of support. Choose it when you need to perform data transformations on big data as offline jobs; use MongoDB-like distributed database systems for more realtime queries.
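The SQL/DataFrame interoperability this review highlights fits in a few lines. In this sketch (table and column names are hypothetical), the same data set is queried declaratively in SQL and functionally in Python, and the two results are interchangeable.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-interop").getOrCreate()
visits = spark.read.parquet("hdfs:///data/visits")
visits.createOrReplaceTempView("visits")

# Declarative: for people who prefer exploring data in SQL.
top_sql = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM visits
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")

# Functional: the identical query in the DataFrame API.
top_df = (visits.groupBy("page")
                .agg(F.count("*").alias("hits"))
                .orderBy(F.desc("hits"))
                .limit(10))

# Both compile to the same execution plan and can be mixed freely.
top_sql.show()
top_df.show()
```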

Anson Abraham
March 27, 2018

Apache Spark, the be-all, end-all

Score 9 out of 10
Vetted Review
Verified User
Review Source
Spark was, and is, being used in a myriad of ways. With Kafka, we use Spark Streaming to pull data from Kafka queues into our HDFS environment. SparkSQL is used for data analysis by those not familiar with Spark. We also use Spark for data analysis and for our main workflow process, choosing Spark over MapReduce, and for some machine learning algorithms on the data.
  • Machine Learning.
  • Data analysis.
  • Workflow processing (faster than MapReduce).
  • SQL connector to multiple data sources.
  • Memory management. Very weak on that.
  • PySpark is not as robust as Scala with Spark.
  • Spark master HA is needed. It is not as highly available as it should be.
  • Data locality should not be a necessity, though it does help performance. We would prefer no locality requirement.
Spark is great as a workflow-processing and extract-transform-load tool. It is really good for machine learning, especially for large datasets that can be processed with split-file parallelization.
Spark Streaming is scalable for close to real-time data workflows.
What it's not good for is processing smaller subsets of data.
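A compressed sketch of the Kafka-to-HDFS flow this review describes, written against the newer Structured Streaming API (the review likely used the older DStream API). The broker address, topic, and paths are hypothetical, and the spark-sql-kafka connector package must be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers values as bytes; cast to string before landing them.
events = stream.selectExpr("CAST(value AS STRING) AS raw_event")

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/landing/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())

query.awaitTermination()
```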
Kartik Chavan
June 07, 2018

My Apache Spark Review

Score 9 out of 10
Vetted Review
Verified User
Review Source
My company uses Apache Spark in various ways, including machine learning, analytics, and batch processing. We grab data from other sources and put it into a Hadoop environment, and we build data lakes. SparkSQL is also used to analyze data and develop reports. We have deployed the clusters in Cloudera. Because of Apache Spark, it has become very easy to apply data science in a big data field.
  • Easy ELT Process
  • Easy clustering on cloud
  • Amazing speed
  • Batch & real time processing
  • Debugging is difficult, as the tool is new to most people
  • There are few learning resources
When the data is very big, and you cannot afford a lot of computational timing such as in a real-time environment, it is advisable to use Apache Spark. There are alternatives to Apache Spark, but it is the most common and robust tool to work with. It is great at batch processing.
Kamesh Emani
October 26, 2017

Apache Spark - Simple Syntax, Huge Data Handling, Best Optimization, Parallel processing

Score 10 out of 10
Vetted Review
Verified User
Review Source
We previously used a database and the Pentaho ETL tool to perform data transformations per project requirements, but as time passed our data kept growing day by day and we ran into a lot of optimization problems working this way. We then decided to implement a Hadoop cluster in our company, and we deployed an 8-node cluster with the Cloudera distribution. We then started using Apache Spark to create applications for student course enrollment data and run them in parallel on multiple processors.

It is used by one department, but the data consists of information about students and professors across the whole organization.

It addresses the problem of assigning classrooms for specific times in the week based on student course enrollments and the schedules of the professors teaching the courses.
This is just one aspect of the application. There are various other data transformation scenarios for different departments across the organization.
  • Spark uses Scala, which is a functional and easy-to-use programming language. The syntax is simpler and human-readable.
  • It can be used to run transformations on huge data across cluster nodes in parallel. It automatically optimizes the process to produce output efficiently in less time.
  • It also provides a machine learning API for data science applications, as well as Spark SQL for fast querying and data analysis.
  • I also use the Zeppelin notebook tool, which enables fast querying and is very helpful for BI people who want to visualize query outputs.
  • Data visualization.
  • Still waiting for web development of small apps with Spark as the backbone middleware and HDFS as the data retrieval file system.
  • The available transformations and actions are limited, so the API must be modified to support more features.
For large data
For best optimization
For parallel processing
For machine learning on huge data, because presently available machine learning software like RapidMiner is limited by data size, whereas Spark is not
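Since the review does not show the actual schema, the following is a hypothetical sketch of the classroom-assignment data preparation it describes: joining course enrollments with professor teaching schedules and counting demand per time slot, all executed in parallel across the cluster.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enrollment-prep").getOrCreate()

# Hypothetical inputs: (student_id, course_id) and
# (course_id, professor_id, time_slot).
enrollments = spark.read.parquet("hdfs:///data/enrollments")
schedules = spark.read.parquet("hdfs:///data/schedules")

# Demand per course and time slot; classrooms can then be assigned to
# the slots with the highest head counts.
demand = (enrollments.join(schedules, "course_id")
          .groupBy("course_id", "time_slot")
          .agg(F.countDistinct("student_id").alias("students")))

demand.orderBy(F.desc("students")).show(20)
```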
December 13, 2017

Apache Spark Should Spark Your Interest

Score 9 out of 10
Vetted Review
Verified User
Review Source
At my current company, we are using Spark in a variety of ways ranging from batch processing to data analysis to machine learning techniques. It has become our main driver for any distributed processing applications. It has gained quick adoption across the organization for its ease of use, integration into the Hadoop stack, and for its support in a variety of languages.
  • Ease of use: the Spark API allows for minimal boilerplate and can be written in a variety of languages, including Python, Scala, and Java (see the sketch below).
  • Performance: for most applications, we have found that jobs are more performant running via Spark than on other distributed processing technologies like MapReduce, Hive, and Pig.
  • Flexibility: the framework comes with support for streaming, batch processing, SQL queries, machine learning, etc. It can be used in a variety of applications without needing to integrate a lot of other distributed processing technologies.
  • Resource heavy: jobs, in general, can be very memory intensive, and you will want the nodes in your cluster to reflect that.
  • Debugging: it has gotten better with every release, but sometimes it can be difficult to debug an error due to ambiguous or misleading exceptions and stack traces.
If you are running a distributed environment and are running applications that make use of batch processing, analytics, streaming, machine learning, or graphing then I cannot recommend Spark enough. It is easy to get going, simple to learn (relative to similar technologies), and can be used in a variety of use cases. All while giving you great performance.
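To make the minimal-boilerplate point concrete, here is a complete (if hypothetical) Spark batch job in a handful of lines; the input path, column names, and output path are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("minimal-job").getOrCreate()

# Read JSON logs, keep server errors, count them per endpoint, and
# write the small result out as CSV.
(spark.read.json("hdfs:///data/logs")
      .filter(F.col("status") >= 500)
      .groupBy("endpoint")
      .count()
      .write.mode("overwrite")
      .csv("hdfs:///reports/server-errors"))
```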
January 23, 2018

Use Apache Spark to Speed Up Cluster Computing

Score 7 out of 10
Vetted Review
Verified User
Review Source
In our company, we used Spark for a healthcare analytics project where we needed to do large-scale data processing in a Hadoop environment. The project is about building an enterprise data lake where we bring in data from multiple products and consolidate it. Further downstream, we will develop some business reports.
  • We use it to make our batch processing faster. Spark is faster at batch processing than MapReduce thanks to its in-memory computing
  • Spark runs alongside other tools in the Hadoop ecosystem, including Hive and Pig
  • Spark supports both batch and real-time processing
  • Apache Spark has machine learning algorithm support
  • Consumes more memory
  • Difficult to address issues around memory utilization
  • Expensive - in-memory processing is expensive when you are looking for cost-efficient processing of big data (see the tuning sketch below)
Well suited:
1. Data can be integrated from several sources including click stream, logs, transactional systems
2. Real-time ingestion through Kafka, Kinesis, and other streaming platforms
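The memory-utilization complaints above usually lead to tuning a few well-known Spark settings. This is an illustrative sketch only; the values are placeholders, not recommendations from the review.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-tuned-batch")
         .config("spark.executor.memory", "8g")          # heap per executor
         .config("spark.executor.memoryOverhead", "2g")  # off-heap headroom
         .config("spark.memory.fraction", "0.6")         # execution/storage share of heap
         .config("spark.sql.shuffle.partitions", "400")  # smaller shuffle blocks
         .getOrCreate())
```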

Sunil Dhage
June 26, 2017

Sparkling Spark

Score 10 out of 10
Vetted Review
Verified User
Review Source
It is replacing our traditional ETL tool, and we are using Apache Spark for data science solutions.
  • It makes the ETL process very simple when compared to SQL Server and MySQL ETL tools.
  • It's very fast and has many machine learning algorithms which can be used for data science problems.
  • It is easily implemented on a cloud cluster.
  • The initialization and SparkContext setup procedures are cumbersome.
  • Running applications on a cluster is not well documented anywhere, and some applications are hard to debug.
  • Debugging and testing are sometimes time-consuming.
It's well suited for ETL, data integration, and data science problems on large data sets. It's not at all suitable for small data sets, which can be handled on desktops and laptops with Python alone.
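A compact sketch of the Spark-as-ETL pattern this review describes: read from a relational source over JDBC, transform, and write out. The connection URL, credentials, and table names are hypothetical, and the appropriate JDBC driver must be available to the cluster.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-etl").getOrCreate()

# Extract: pull a table from SQL Server over JDBC.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=crm")
             .option("dbtable", "dbo.customers")
             .option("user", "etl_user")
             .option("password", "...")
             .load())

# Transform: deduplicate and normalize.
cleaned = (customers
           .dropDuplicates(["customer_id"])
           .withColumn("email", F.lower(F.col("email"))))

# Load: land the result in the data lake.
cleaned.write.mode("overwrite").parquet("hdfs:///warehouse/customers")
```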
Jordan Moore
September 12, 2016

A useful replacement for MapReduce for Big Data processing

Score 8 out of 10
Vetted Review
Verified User
Review Source
We are learning core Apache Spark + SparkSQL and MLlib while creating proof-of-concepts as well as providing solutions for clients. It addresses the need to quickly process large amounts of data, typically located in Hadoop.
  • Scales from a local machine to a full cluster. You can run a standalone, single-node cluster simply by starting up a Spark shell or submitting an application to test an algorithm; then it can quickly be transferred and configured to run in a distributed environment (see the sketch at the end of this review).
  • Provides multiple APIs. Most people I know use Python and/or Java as their main programming language. Data scientists who are familiar with NumPy and SciPy can quickly become comfortable with Spark, while Java developers would be best served using Java 8 and the new features that it provides. Scala, on the other hand, is a mix between the Java and Python styles of writing Spark code, in my opinion.
  • Plentiful learning resources. The Learning Spark book is a good introduction to the mechanics of Spark, although it was written for Spark 1.3 and the current version is 2.0. The GitHub repository for the book contains all the code examples that are discussed, plus the Spark website is also filled with useful information that is simple to navigate.
  • For data that isn't truly that large, Spark may be overkill when the problem could likely be solved on a computer with reasonable hardware resources. There don't seem to be a lot of examples of how a Spark task would otherwise be implemented in a different library; for instance, scikit-learn and NumPy rather than Spark MLlib.
On the plus side, Spark is a good tool to learn to apply to various data processing problems.

As described in the Cons - Spark may not be needed unless there is truly a large amount of data to operate on. Other libraries may be better suited for the same task.
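A small sketch of the local-to-cluster workflow praised in this review: the same application code runs on a laptop for development and, unchanged, on a cluster, with the master chosen at submit time. The script name is hypothetical.

```python
from pyspark.sql import SparkSession

# For local testing, pin the master in code: all cores on this machine.
spark = (SparkSession.builder
         .appName("poc")
         .master("local[*]")   # drop this line when submitting to a cluster
         .getOrCreate())

print(spark.range(1000000).count())

# The identical script can then be deployed with, for example:
#   spark-submit --master yarn --deploy-mode cluster poc.py
```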
August 02, 2017

Apache Spark is great for high-volume production workflows

Score 10 out of 10
Vetted Review
Verified User
Review Source
We use it primarily in our department as part of a machine learning and data processing platform to build enterprise scale predictive applications.
  • Great APIs and tools.
  • Scale.
  • Speed for iterative algorithms.
  • No true streaming.
  • Lack of strongly typed yet convenient APIs.
Well suited for batch and near-real-time data processing tasks as well as production deployments of machine learning, especially at large scale. Not well suited for general analytics workflows on small and medium-sized data sets; SQL-based data warehouses like Redshift, Vertica, etc. are better for those use cases.
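To ground the machine learning use case, here is a condensed, hypothetical sketch of a production-style MLlib pipeline; the feature columns and label are invented for illustration.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("predictive-app").getOrCreate()
train = spark.read.parquet("hdfs:///data/training")

# Assemble raw columns into the single vector column MLlib expects.
assembler = VectorAssembler(
    inputCols=["age", "tenure", "monthly_spend"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="churned")

# Fit the whole pipeline; it scales with the size of the cluster.
model = Pipeline(stages=[assembler, lr]).fit(train)

# The fitted pipeline scores new data with the same API.
model.transform(train).select("churned", "prediction").show(5)
```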

About Apache Spark

Categories: Hadoop-Related

Apache Spark Technical Details

Operating Systems: Unspecified
Mobile Application: No