Apache Spark Reviews

133 Ratings
Score 9.0 out of 10

Reviews (1-23 of 23)

May 02, 2021
Surendranatha Reddy Chappidi | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
  • We use Apache Spark in the Digital - Data teams to build data products and help business teams make data-driven decisions.
  • We use Apache Spark to source data from different source systems, process it, and store it in the data lake.
  • Once the data is in the data lake, we use Spark for data cleansing and data transformation as per business requirements.
  • Once the data is transformed, we insert it into the final target layer in the data warehouse.
  • Spark is very fast compared to other frameworks because it works in cluster mode and uses distributed processing and computation frameworks internally.
  • Robust and fault tolerant
  • Open source
  • Can source data from multiple data sources
  • No Dataset API support in the Python version of Spark
  • The Apache Spark job run UI could show more meaningful information
  • Spark errors could provide more meaningful information when a job fails
Specific scenarios where Apache Spark is well suited:
1. real-time processing of streaming data
2. processing unstructured data, semi-structured data, and structured data from multiple sources
3. avoiding vendor lock-in and cloud platform lock-in while developing products
May 20, 2021
Anonymous | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
We are building a model and, due to the size of the data, we chose to use Apache Spark for the feature generation. Use of the tool is limited to my department and one other department. These two departments need to deal with long datasets; the other departments do not.
  • quick
  • utilizes CPU cores
  • trendy
  • lack of support
  • memory hungry
  • slow on wide data
I would recommend Apache Spark to a colleague who is working with a long but narrow dataset. It would be a great tool to help that person fully utilize the CPU cores and speed up the work. However, I would not recommend this tool if the dataset is wide but not very large.
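One way to read the long-versus-wide advice: Spark divides a dataset into partitions of rows and works on the partitions in parallel, so more rows means more independent chunks to spread across cores, while more columns only makes each row heavier. The sketch below illustrates that row-partitioning idea in plain Python. It is not Spark code, and `split_into_partitions`, `partial_sum`, and `merge` are hypothetical names chosen for this example.

```python
from functools import reduce


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions lists (hypothetical helper)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def partial_sum(partition):
    """Work done independently on one partition: (sum, count) of 'value'."""
    total = sum(row["value"] for row in partition)
    return total, len(partition)


def merge(a, b):
    """Combine two partial results into one."""
    return a[0] + b[0], a[1] + b[1]


# A "long but narrow" dataset: many rows, two columns.
rows = [{"id": i, "value": i * 2} for i in range(1000)]

partitions = split_into_partitions(rows, 4)
# Each partial_sum call touches only its own partition, so in a real
# engine these calls could run on separate cores or machines.
partials = [partial_sum(p) for p in partitions]
total, count = reduce(merge, partials)
mean_value = total / count
```

Here the four partial results are computed one after another for simplicity; the point is only that nothing in `partial_sum` depends on another partition, which is what lets a row-heavy dataset keep many cores busy.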
April 30, 2021
Anonymous | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
Apache Spark is widely used within the company. In the Advanced Analytics department, data engineers and data scientists work closely on machine learning projects to generate value. Spark provides a unified big data analytics engine that helps us easily process huge amounts of data. We use Spark in projects such as churn prediction and network analytics.
  • Machine learning on big data
  • Stream processing
  • Lakehouse with Delta
  • Indexing
  • MLlib
  • Streaming
Apache Spark is very good for processing large amounts of data, but not that good if you need many joins or low latency. In combination with Delta Engine, performance improved a lot. With ACID support, time travel features, and a consistent view for simultaneous reads and writes, it’s now ready for the next level.
September 19, 2020
Partha Protim Pegu | TrustRadius Reviewer
Score 8 out of 10
Vetted Review
Verified User
Our organization currently uses Apache Spark for processing large chunks of data. It is being used for machine learning and large scale SQL queries. We are using high-level APIs for performing complex tasks. Our team of developers and data scientists incorporate Spark into their applications to transform large chunks of data. It is also being used for LOT, ETL, etc.
  • It has APIs for working with big data.
  • Reduces the number of read and write actions to disk.
  • Data is held primarily in memory and is not written to hard disk unless required.
  • Easy to program.
  • Runs complex jobs in a fraction of the time.
  • Automation is missing from Spark, i.e. there is no automatic optimization process.
  • It needs to have its own file management system.
  • Inability to support more concurrent users.
Apache Spark is well suited for the below scenarios:
Processing large chunks of data. Spark supports multiple frameworks for big data. It is good when we need high scalability.

Apache Spark is not well suited for the below scenarios:
Real-time analytics where results are needed quickly. Replacing existing infrastructure; it is better used as a parallel framework. Working with small datasets.
We have been using Spark for a very long time and we are very happy with its service and support. It has a very good and interactive community, which is enough to solve any problem we encounter. The tool itself is very easy to use, and combined with the support that makes it a very useful tool.
November 07, 2020
Chetan Munegowda | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
Apache Spark is being used by our organization for writing ETL applications. It enables us to ingest thousands of records of data to database tables.
  • Great computing engine for solving complex transformative logic
  • Useful for understanding data and doing data analytical work
  • Gives us a great set of libraries and APIs to solve day-to-day problems
  • High learning curve
  • Complexity
  • More documentation
  • More developer support
  • More educational videos
Apache Spark is suited for big data applications when there is a need for performing analysis, streaming data work, and ETL work.
Developer support for Apache Spark can be improved. We need more of a developer community around this considering it's an emerging technology.
September 18, 2020
Anonymous | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
We use Apache Spark for cluster computing in large-scale data processing, ETL functions, machine learning, as well as for analytics. It's primarily used by the Data Engineering Department to support the data lake infrastructure. It helps us to effectively manage the great amounts of data that come from our clusters, ensuring the capacity, scalability, and performance needed.
  • Speed: Apache Spark has great performance for both streaming and batch data
  • Easy to use: the object oriented operators make it easy and intuitive.
  • Multiple language support
  • Fault tolerance
  • Cluster management
  • Supports DF, DS, and RDDs
  • Hard to learn, documentation could be more in-depth.
  • Due to its in-memory processing, it can consume a large amount of memory.
  • Poor data visualization, too basic.
Well suited for: large datasets, fault tolerance, parallel processing, ETL, batch processing, streaming, analytics, graphing, or machine learning. Almost any kind of large-scale processing, since it will save you a lot of time (days of processing). Less appropriate for: smaller datasets, where you are better off using pandas or other libraries.
I never had to contact support; however, they offer 24/7 support and there are a large number of forums about Spark. It is well integrated with Python and supports SQL syntax.
January 11, 2020
Yogesh Mhasde | TrustRadius Reviewer
Score 8 out of 10
Vetted Review
Verified User
We were working on one of our products, an enterprise-level product that has to manage a vast amount of big data. We wanted to use a technology that is faster than Hadoop and can process large-scale data while providing a streamlined process for the data scientists. Apache Spark turned out to be the powerful unified solution we expected it to be.
The main problem that we identified in our existing approach was that it was taking a large amount of time to process the data, and the statistical analysis of the data was not up to the mark. We wanted a sophisticated analytical solution that was easy and fast to use. With Apache Spark, processing became 5 times faster than before, which gave us pretty good analytics. With Spark, data abstraction across a cluster of machines was achieved by using RDDs.
  • DataFrames, DataSets, and RDDs.
  • Spark has a built-in machine learning library that scales and integrates with existing tools.
  • The data processing done by Spark comes at the price of memory blockages, as its in-memory processing can lead to large memory consumption.
  • Caching is not automatic in Spark. We need to set up the caching mechanism manually.
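The manual-caching point reflects how Spark evaluates data lazily: a derived dataset is recomputed each time a result is requested unless the job explicitly asks for it to be kept, which in Spark is done with `cache()` or `persist()`. The toy model below shows that behaviour in plain Python. `LazyDataset` is a hypothetical class written for this illustration and is not part of Spark.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute        # function that rebuilds the data
        self._cached = None
        self._cache_enabled = False

    def map(self, fn):
        # Nothing runs here; we only record how to derive the new dataset.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        # Opt in to keeping the result after the first computation.
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result


calls = {"count": 0}


def expensive(x):
    """Stand-in for a costly transformation; counts how often it runs."""
    calls["count"] += 1
    return x * x


source = LazyDataset(lambda: [1, 2, 3])

# Without caching, every collect() replays the transformation.
uncached = source.map(expensive)
uncached.collect()
uncached.collect()
calls_without_cache = calls["count"]

# With an explicit cache(), the second collect() reuses the stored result.
calls["count"] = 0
cached = source.map(expensive).cache()
cached.collect()
cached.collect()
calls_with_cache = calls["count"]
```

The trade-off the reviewer describes follows from this design: keeping results in memory avoids recomputation but raises memory use, so the choice of what to cache is left to the developer.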
1. Suitable where the requirement for advanced analytics is prominent.
2. When you want big data to be processed at a very fast pace.
3. For large datasets, Spark is a viable solution.
4. When you need dependable fault tolerance, go for Spark.

Spark is not suitable:
1. If you want your data to be processed in real-time, then Spark is not a good solution.
2. When you need automatic optimization, then Spark fails at that point.
1. It integrates very well with scala or python.
2. It's very easy to understand SQL interoperability.
3. Apache Spark is way faster than other competing technologies.
4. The support from the Apache community for Spark is huge.
5. Execution times are faster as compared to others.
6. There are a large number of forums available for Apache Spark.
7. Code for Apache Spark is readily available and easy to access.
8. Many organizations use Apache Spark, so many solutions are available for existing applications.
December 13, 2019
Anonymous | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
We use Apache Spark for cluster computing in our ETL environment, for data and analytics, and for machine learning. It is mainly used by our data engineering team to support the entire Data Lake foundation. As we have huge amounts of information coming from multiple sources, we needed an effective cluster management system to handle capacity and deliver the performance and throughput we needed.
  • Cluster management for ETL.
  • Data processing engine for our data lake.
  • You still need Hive, HDFS, or another storage layer to store information.
  • Security is behind compared to MapReduce.
Spark is a one-size-fits-all data processing platform. You can run batch and in-motion streams, and you can use it for ETL, machine learning, or even graphs. You do not need multiple tools, which makes your TCO and management tasks way easier. Like every new platform, it has room to grow: storage and security are the main opportunities we found.
As with every open source tool, you have to rely on forums, consulting companies, and engineering power for support and maintenance. There is plenty of documentation available, so you will be in good hands. You can also find small and mid-size consulting companies that can support your environment at a decent cost. Another alternative is going to Databricks, if support is a key criterion for your decision.
January 25, 2019
Thomas Young | TrustRadius Reviewer
Score 7 out of 10
Vetted Review
Verified User
Apache Spark is used by certain departments to produce summary statistics. The software is used for data sets that are very, very large in size and require immense processing power. The software is also used for simple graphics. When the data are small enough, Apache Spark is not the preferred analytical tool. It's the big data that makes Spark useful.
  • Apache Spark makes processing very large data sets possible. It handles these data sets in a fairly quick manner.
  • Apache Spark does a fairly good job implementing machine learning models for larger data sets.
  • Apache Spark seems to be a rapidly advancing software, with the new features making the software ever more straight-forward to use.
  • Apache Spark requires some advanced ability to understand and structure the modeling of big data. The software is not user-friendly.
  • The graphics produced by Apache Spark are by no means world-class. They sometimes appear high-schoolish.
  • Apache Spark takes an enormous amount of time to crunch through multiple nodes across very large data sets. Apache Spark could improve this by offering the software in a more interactive programming environment.
The software appears to run more efficiently than other big data tools, such as Hadoop. Given that, Apache Spark is well-suited for querying and trying to make sense of very, very large data sets. The software offers many advanced machine learning and econometrics tools, although these tools are used only partially because they require too much time once the data sets get very large. The software is not well-suited for projects that are not big data in size. The graphics and analytical output are subpar compared to other tools.
December 14, 2018
Shiv Shivakumar | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
Used as the in-memory data engine for big data analytics, streaming data, and SQL workloads. We are also in the process of trying it out for certain machine learning algorithms. It basically processes data for the analytical needs of the business and is a great tool to co-exist with the Hadoop file system.
  • in-memory data engine and hence faster processing
  • works well on top of the Hadoop file system for big data analytics
  • very good tool for streaming data
  • could do a better job with analytics dashboards that provide insights on a data stream, so that we would not have to rely on data visualization tools alongside Spark
  • also there is room for improvement in the area of data discovery
Apache Spark is very well suited for big data analytics in conjunction with the Hadoop file system and also does a good job of providing fast access to data in SQL workloads, since it has an in-memory data processing engine that can process data very quickly. In addition, it can also be used for streaming data processing.
July 21, 2018
Nitin Pasumarthy | TrustRadius Reviewer
Score 10 out of 10
Vetted Review
Verified User
We use Apache Spark across all analytics departments in the company. We primarily use it for distributed data processing and data preparation for machine learning models. We also use it while running distributed CRON jobs for various analytical workloads. I am familiar with a story where we contributed an algorithm to Spark open source which is on Random Walks in Large Graphs - https://databricks.com/session/random-walks-on-large-scale-graphs-with-apache-spark
  • Rich APIs for data transformation, making it very easy to transform and prepare data in a distributed environment without worrying about memory issues
  • Faster execution times compared to Hadoop and Pig Latin
  • Easy SQL interface to the same data set for people who are comfortable to explore data in a declarative manner
  • Interoperability between SQL and Scala / Python style of munging data
  • Documentation could be better as I usually end up going to other sites / blogs to understand the concepts better
  • More APIs are to be ported to MLlib as only very few algorithms are available at least in clustering segment
Apache Spark has rich APIs for regular data transformations, ML workloads, and graph workloads, whereas other systems may not have such a wide range of support. Choose it when you need to perform data transformations for big data as offline jobs, and use MongoDB-like distributed database systems for more real-time queries.
August 28, 2018
Carla Borges | TrustRadius Reviewer
Score 10 out of 10
Vetted Review
Verified User
Apache Spark is being used by the whole organization. It helps us a lot in the transmission of data, as it is 100 times faster than Hadoop MapReduce in memory and 10 times faster on disk. We work with this application in Java: it provides native bindings for the Java programming language, and as it is compatible with SQL, it is completely adapted to the needs of our organization, given the large amount of information that we use. We highly prefer Apache Spark since it supports in-memory processing to increase the performance of big data analysis applications.
  • It performs a conventional disk-based process when the data sets are too large to fit into memory, which is very useful because, regardless of the size of the data, it is always possible to store them.
  • It has great speed and ability to join multiple types of databases and run different types of analysis applications. This functionality is super useful as it reduces work times
  • Apache Spark uses the data storage model of Hadoop and can be integrated with other big data frameworks such as HBase, MongoDB, and Cassandra. This is very useful because it is compatible with multiple frameworks that the company has, and thus allows us to unify all the processes.
  • Increase the information and trainings that come with the application, especially for debugging since the process is difficult to understand.
  • It should be more attentive to users and make tutorials, to reduce the learning curve.
  • There should be more grouping algorithms.
It is suitable for processing large amounts of data, as it is very easy to use and its syntax is simple and understandable. I also find it useful to use in a variety of applications without the need to integrate many other processing technologies, and it is very fast and has many machine learning algorithms that can be used for data problems. I find it less appropriate for data that is not so large, as it uses too many resources.
March 27, 2018
Anson Abraham | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
Spark was, and still is, used in a myriad of ways. With Kafka, we use Spark Streaming to move data from Kafka queues into our HDFS environment. Spark SQL is used for data analysis by those not familiar with Spark. We also use Spark for data analysis and for our main workflow process, in place of MapReduce, and for some machine learning algorithms on the data.
  • Machine Learning.
  • Data Analysis
  • WorkFlow process (faster than MapReduce).
  • SQL connector to multiple data sources
  • Memory management. Very weak on that.
  • PySpark is not as robust as Scala with Spark.
  • Spark master HA is needed. It is not as highly available as it should be.
  • Data locality should not be a necessity, though it does help performance. I would prefer no locality requirement.
Spark is great as a workflow process and extract-transform-load (ETL) tool. It is really good for machine learning, especially for large datasets that can be processed with split-file parallelization.
Spark Streaming is scalable for close to real-time data workflow processes.
What it's not good for is processing smaller subsets of data.
June 07, 2018
Kartik Chavan | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
My company uses Apache Spark in various ways including machine learning, analytics and batch processing. [We] Grab the data from other sources and put it into a Hadoop environment. [We] Build data lakes. SparkSQL is also used for analysis of data and to develop reports. We have deployed the clusters in Cloudera. Because of Apache Spark, it has become very easy to apply data science in a big data field.
  • Easy ELT Process
  • Easy clustering on cloud
  • Amazing speed
  • Batch & real time processing
  • Debugging is difficult as it is new for most people
  • There are fewer learning resources
When the data is very big, and you cannot afford a lot of computational timing such as in a real-time environment, it is advisable to use Apache Spark. There are alternatives to Apache Spark, but it is the most common and robust tool to work with. It is great at batch processing.
October 26, 2017
Kamesh Emani | TrustRadius Reviewer
Score 10 out of 10
Vetted Review
Verified User
We previously used a database and the Pentaho ETL tool to perform data transformation as per project requirements, but as time passed our data grew day by day and we suffered a lot of optimization problems working this way. We then decided to implement a Hadoop cluster in our company and deployed an 8-node cluster with the Cloudera distribution. Then we started using Apache Spark to create applications for Student Course Enrollment data and run them in parallel on multiple processors.

It is used by a department but the data consists of information about students and professors of the whole organization.

It addresses the problem of assigning classrooms for a specific time in a week based on student course enrollment and professors teaching the course schedules.
This is just one aspect of the application. There are various other data transformation requirement scenarios for different departments across the organization.
  • Spark uses Scala, which is a functional and easy-to-use programming language. Its syntax is simple and human-readable.
  • It can be used to run transformations on huge data in parallel across a cluster. It automatically optimizes the process to get output efficiently in less time.
  • It also provides a machine learning API for data science applications, as well as Spark SQL for fast queries in data analysis.
  • I also use the Zeppelin online tool, which allows fast querying and is very helpful for BI guys to visualize query outputs.
  • Data visualization.
  • Waiting for Web Development for small apps to be started with Spark as backbone middleware and HDFS as data retrieval file system.
  • Transformations and actions available are limited so must modify API to work for more features.
For large data
For best optimization
For parallel processing
For machine learning on huge data, because presently available machine learning software like RapidMiner is limited in data size whereas Spark is not
June 26, 2017
Sunil Dhage | TrustRadius Reviewer
Score 10 out of 10
Vetted Review
Verified User
Apache Spark is replacing our traditional ETL tool, and we are also using it for data science solutions.
  • It makes the ETL process very simple when compared to SQL Server and MySQL ETL tools.
  • It's very fast and has many machine learning algorithms which can be used for data science problems.
  • It is easily implemented on a cloud cluster.
  • The initialization and Spark context procedures.
  • Running applications on a cluster is not well documented anywhere, some applications are hard to debug.
  • Debugging and Testing are sometimes time-consuming.
It's well suited for ETL, data integration, and data science problems on large data sets. It's not at all suitable for small data sets, which can be handled on desktops and laptops using Python.
September 12, 2016
Jordan Moore | TrustRadius Reviewer
Score 8 out of 10
Vetted Review
Verified User
We are learning core Apache Spark + SparkSQL and MLlib, while creating proof-of-concepts as well as providing solutions for clients. It addresses the need to quickly process large amounts of data, typically located in Hadoop.
  • Scale from local machine to full cluster. You can run a standalone, single cluster simply by starting up a Spark Shell or submitting an application to test an algorithm, then it quickly can be transferred and configured to run in a distributed environment.
  • Provides multiple APIs. Most people I know use Python and/or Java as their main programming language. Data scientists who are familiar with NumPy and SciPy can quickly become comfortable with Spark, while Java developers would be best served using Java 8 and the new features that it provides. Scala, on the other hand, is a mix between the Java and Python styles of writing Spark code, in my opinion.
  • Plentiful learning resources. The Learning Spark book is a good introduction to the mechanics of Spark although written for Spark 1.3, and the current version is 2.0. The GitHub repository for the book contains all the code examples that are discussed, plus the Spark website is also filled with useful information that is simple to navigate.
  • For data that isn't truly that large, Spark may be overkill when the problem could likely be solved on a computer with reasonable hardware resources. There doesn't seem to be a lot of examples for how a Spark task would otherwise be implemented in a different library; for instance scikit-learn and NumPy rather than Spark MLlib.
On the plus side, Spark is a good tool to learn to apply to various data processing problems.

As described in the Cons - Spark may not be needed unless there is truly a large amount of data to operate on. Other libraries may be better suited for the same task.
March 27, 2019
Anonymous | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
We sold a data science product to one of the leading US-based e-commerce firms. Suddenly, their data started growing at a very fast rate. The product, at this stage, was based on R programming. With such huge data, the product started taking a lot of time to run. We then started looking for an alternative to R that could process rapidly multiplying data like this client's. We eventually came across Apache Spark. With the permission of the client, we started switching the code from R to Apache Spark. It took a very long time to learn and code in Spark, but it was worth the effort. Jobs that took days to run in R came down to a few hours.
  • Very good tool to process big datasets.
  • Inbuilt fault tolerance.
  • Supports multiple languages.
  • Supports advanced analytics.
  • A large number of libraries available -- GraphX, Spark SQL, Spark Streaming, etc.
  • Very slow with smaller amounts of data.
  • Expensive, as it stores data in memory.
If your data is huge, I recommend converting the underlying technology to Apache Spark. This will save you a lot of time and effort in the near future as your data grows. Apache Spark's scalability also means it can handle future data-related processing.
March 16, 2019
Anonymous | TrustRadius Reviewer
Score 7 out of 10
Vetted Review
Verified User
We used Apache Spark within our department as a Solution Architecture team. It helped make big data processing more efficient since the same framework can be used for batch and stream processing.
  • Customizable, it integrates with Jupyter notebooks which was really helpful for our team.
  • Easy to use and implement.
  • It allows us to quickly build microservices.
  • Release cycles can be faster.
  • Sometimes it kicked some of the users out due to inactivity.
It is beneficial to use Apache Spark if:
  • You are working with big data, preprocessing data before machine learning
  • Building simple microservices and creating PoC. It makes it easier to create REST and simple web APIs.
  • If you need great customer service, Apache Spark would be a great choice since they provide it 24/7.
March 06, 2019
Anonymous | TrustRadius Reviewer
Score 8 out of 10
Vetted Review
Verified User
Only one of our departments is using Apache Spark to work on very large datasets. We are thinking of implementing it in other departments as well.
  • It is very fast.
  • It is gaining usability now that the PySpark community is growing and more functions are being developed.
  • Programmers can run different languages on the servers.
  • PySpark does not have the same ease of use and functionality that Pandas does yet.
It is well suited for very large datasets that would run slowly on Hadoop servers or if you want to do real-time analytics on streaming data. It is overkill and not worth the loss of programming functionality for smaller datasets.
December 13, 2017
Anonymous | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User
At my current company, we are using Spark in a variety of ways ranging from batch processing to data analysis to machine learning techniques. It has become our main driver for any distributed processing applications. It has gained quick adoption across the organization for its ease of use, integration into the Hadoop stack, and for its support in a variety of languages.
  • Ease of use, the Spark API allows for minimal boilerplate and can be written in a variety of languages including Python, Scala, and Java.
  • Performance, for most applications we have found that jobs are more performant running via Spark than other distributed processing technologies like Map-Reduce, Hive, and Pig.
  • Flexibility, the framework comes with support for streaming, batch processing, SQL queries, machine learning, etc. It can be used in a variety of applications without needing to integrate a lot of other distributed processing technologies.
  • Resource heavy, jobs, in general, can be very memory intensive and you will want the nodes in your cluster to reflect that.
  • Debugging, it has gotten better with every release but sometimes it can be difficult to debug an error due to ambiguous or misleading exceptions and stack traces.
If you are running a distributed environment and are running applications that make use of batch processing, analytics, streaming, machine learning, or graphing then I cannot recommend Spark enough. It is easy to get going, simple to learn (relative to similar technologies), and can be used in a variety of use cases. All while giving you great performance.
January 23, 2018
Anonymous | TrustRadius Reviewer
Score 7 out of 10
Vetted Review
Verified User
In our company, we used Spark for a healthcare analytical project, where we need to do large-scale data processing in a Hadoop environment. The project is about building an enterprise data lake where we bring data from multiple products and consolidate. Further, in the downstream, we will develop some business reports.
  • We used it to make our batch processing faster. Spark is faster at batch processing than MapReduce thanks to its in-memory computing
  • Spark will run along with other tools in the Hadoop ecosystem including Hive and Pig
  • Spark supports both batch and real-time processing
  • Apache Spark has Machine Learning Algorithms support
  • Consumes more memory
  • Difficult to address issues around memory utilization
  • Expensive - In-memory processing is expensive when we look for a cost-efficient processing of big data
Well suited:
1. Data can be integrated from several sources including click stream, logs, transactional systems
2. Real-time ingestion through Kafka, Kinesis, and other streaming platforms

August 02, 2017
Anonymous | TrustRadius Reviewer
Score 10 out of 10
Vetted Review
Verified User
We use it primarily in our department as part of a machine learning and data processing platform to build enterprise scale predictive applications.
  • Great APIs and tools.
  • Scale.
  • Speed for iterative algorithms.
  • No true streaming.
  • Lack of strongly typed yet convenient APIs.
Well suited for batch and near-real-time data processing tasks as well as production deployments of machine learning, especially at large scale. Not well suited for general analytics workflows on small and medium-sized data sets; SQL-based data warehouses like Redshift, Vertica, etc. are better for those use cases.
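The "no true streaming" and "near-real time" remarks are consistent with how Spark Streaming handles a stream: incoming records are grouped into small batches, and each batch is run as a short batch job. The sketch below mimics that micro-batch pattern in plain Python. It is a simplified illustration, not Spark code: `micro_batches` and `process` are hypothetical names, and batches here are cut by record count, whereas Spark cuts them by time interval.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield lists of up to batch_size events from an iterable stream."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process(batch):
    """Placeholder batch job: count events per key within one batch."""
    counts = {}
    for key in batch:
        counts[key] = counts.get(key, 0) + 1
    return counts


stream = ["click", "view", "click", "view", "view", "click", "buy"]

# Each batch is processed only after it has been fully collected,
# so results lag the newest event by up to one batch.
results = [process(batch) for batch in micro_batches(stream, 3)]
```

Latency in this model is bounded below by the batch size, which is why it suits near-real-time workloads better than ones that need a response per event.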

What is Apache Spark?

Apache Spark Technical Details

Operating Systems: Unspecified
Mobile Application: No

Frequently Asked Questions

What is Apache Spark's best feature?

Reviewers rate Usability highest, with a score of 8.7.

Who uses Apache Spark?

The most common users of Apache Spark are from Enterprises and the Computer Software industry.