TrustRadius
Apache Spark

Product Demos

  • Spark Project | Spark Tutorial | Online Spark Training | Intellipaat (YouTube)
  • Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn (YouTube)
  • Apache Spark Full Course | Apache Spark Tutorial For Beginners | Learn Spark In 7 Hours | Simplilearn (YouTube)
  • Apache Spark Architecture | Spark Cluster Architecture Explained | Spark Training | Edureka (YouTube)
  • Introduction to Databricks [New demo linked in description] (YouTube)
  • Apache Spark Tutorial | Spark Tutorial for Beginners | Spark Big Data | Intellipaat (YouTube)

Product Details

Apache Spark Technical Details

Operating Systems: Unspecified
Mobile Application: No

Reviews From Top Reviewers

(1-5 of 23)

A useful replacement for MapReduce for Big Data processing

Rating: 8 out of 10
September 12, 2016
JM
Vetted Review
Verified User
Apache Spark
1 year of experience
We are learning core Apache Spark, Spark SQL, and MLlib while creating proofs of concept and providing solutions for clients. It addresses the need to quickly process large amounts of data, typically located in Hadoop.
Pros
  • Scale from local machine to full cluster. You can run a standalone, single-node instance simply by starting a Spark shell or submitting an application to test an algorithm; it can then quickly be transferred and configured to run in a distributed environment.
  • Provides multiple APIs. Most people I know use Python and/or Java as their main programming language. Data scientists who are familiar with NumPy and SciPy can quickly become comfortable with Spark, while Java developers would be best served by using Java 8 and the new features it provides. Scala, on the other hand, is a mix between the Java and Python styles of writing Spark code, in my opinion.
  • Plentiful learning resources. The Learning Spark book is a good introduction to the mechanics of Spark, although it was written for Spark 1.3 and the current version is 2.0. The GitHub repository for the book contains all the code examples that are discussed, and the Spark website is also filled with useful information and is simple to navigate.
Cons
  • For data that isn't truly that large, Spark may be overkill when the problem could likely be solved on a single computer with reasonable hardware resources. There don't seem to be many examples of how a Spark task would otherwise be implemented with a different library, for instance scikit-learn and NumPy rather than Spark MLlib.
On the plus side, Spark is a good tool to learn to apply to various data processing problems.

As described in the Cons, Spark may not be needed unless there is truly a large amount of data to operate on. Other libraries may be better suited for the same task.
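
To illustrate the point about smaller data, here is a minimal sketch of an aggregation that would typically be written in Spark as a `flatMap` followed by `reduceByKey`, done instead on a single machine with only Python's standard library. The function name `word_counts` and the sample lines are illustrative assumptions, not something taken from the review, and the sketch assumes the input fits comfortably in memory.

```python
from collections import Counter


def word_counts(lines):
    """Count words across an iterable of text lines on a single machine."""
    counts = Counter()
    for line in lines:
        # Similar in spirit to flatMap(split) followed by reduceByKey(add),
        # but everything happens in one process and in memory.
        counts.update(line.lower().split())
    return counts


# Hypothetical sample input, used only for illustration.
sample = [
    "spark may be overkill",
    "spark is a good tool",
]
counts = word_counts(sample)
print(counts.most_common(1))  # [('spark', 2)]
```

When the input no longer fits on one machine, the same logic maps naturally onto a distributed engine, which is where Spark starts to pay for its overhead.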

Apache Spark is the next generation of big data computing.

Rating: 9 out of 10
April 18, 2022
SL
Vetted Review
Verified User
Apache Spark
1 year of experience
We need to calculate risk-weighted assets (RWA) daily and monthly, on a T+1 basis, for the different positions the bank holds. The volume of calculations is large: millions of records per day with very complicated formulas and algorithms. In our applications and projects, we used Scala and Apache Spark clusters to load all the data needed for the calculation and implemented the formulas and algorithms with Spark's DataFrame and Dataset APIs.

Without adopting the Apache Spark cluster, it would be pretty hard for us to implement such a big system to handle a large volume of data calculations daily. After this system was successfully deployed into PROD, we've been able to provide capital risk control reports to regulation/compliance controllers in different regions in this global financial world.
Pros
  • DataFrame as a distributed collection of data: easy for developers to implement algorithms and formulas.
  • In-memory calculation.
  • Clusters to distribute large volumes of calculation.
Cons
  • It would be great if Apache Spark provided a native database to manage the file metadata of saved Parquet files.
For a large volume of data to be calculated, Apache Spark is the go-to; for intermediate or small volumes of data sets, Apache Spark is an option.

Very useful application for Big Data processing and excellent for large volume production workflows

Rating: 10 out of 10
August 28, 2018
CB
Vetted Review
Verified User
Apache Spark
2 years of experience
Apache Spark is used across the whole organization. It helps us a lot with processing data, as it can be up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk. We work with Java, and Spark offers native bindings for the Java programming language; because it is also compatible with SQL, it is well suited to the needs of our organization, given the large amount of information we handle. We strongly prefer Apache Spark because it supports in-memory processing, which increases the performance of big data analysis applications.
Pros
  • It falls back to conventional disk-based processing when data sets are too large to fit into memory, which is very useful because the data can always be handled regardless of its size.
  • It has great speed and the ability to join multiple types of databases and run different types of analysis applications. This functionality is very useful, as it reduces work times.
  • Apache Spark uses the data storage model of Hadoop and can be integrated with other big data frameworks such as HBase, MongoDB, and Cassandra. This is very useful because it is compatible with multiple frameworks that the company has, and thus allows us to unify all the processes.
Cons
  • More documentation and training should come with the application, especially for debugging, since the process is difficult to understand.
  • The project should be more attentive to users and provide tutorials to reduce the learning curve.
  • There should be more grouping algorithms.
It is suitable for processing large amounts of data, as it is very easy to use and its syntax is simple and understandable. I also find it useful to use in a variety of applications without the need to integrate many other processing technologies, and it is very fast and has many machine learning algorithms that can be used for data problems. I find it less appropriate for data that is not so large, as it uses too many resources.
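
The disk fallback mentioned in the pros above can be pictured with a small toy example. This is only a sketch of the general "spill to disk" idea using Python's standard library, not how Spark implements it; the function `aggregate_with_spill`, the `max_keys_in_memory` parameter, and the sample records are all hypothetical.

```python
import json
import os
import tempfile
from collections import defaultdict


def aggregate_with_spill(records, max_keys_in_memory=2):
    """Sum values per key, spilling partial sums to disk whenever the
    in-memory table holds more than max_keys_in_memory distinct keys."""
    partial = defaultdict(int)
    spill_paths = []
    for key, value in records:
        partial[key] += value
        if len(partial) > max_keys_in_memory:
            # Memory budget exceeded: write partial sums to a temp file
            # and start over with an empty in-memory table.
            fd, path = tempfile.mkstemp(suffix=".json")
            with os.fdopen(fd, "w") as f:
                json.dump(partial, f)
            spill_paths.append(path)
            partial = defaultdict(int)
    # Merge what is left in memory with every spilled partial result.
    # (A toy simplification: a real engine would merge sorted runs so
    # that the final result does not have to fit in memory either.)
    totals = defaultdict(int, partial)
    for path in spill_paths:
        with open(path) as f:
            for key, value in json.load(f).items():
                totals[key] += value
        os.remove(path)
    return dict(totals)


# Hypothetical records, used only for illustration.
records = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("d", 5), ("b", 6)]
totals = aggregate_with_spill(records)
print(totals)
```

The result is the same whether or not any spilling happens; only the speed changes, which matches the reviewer's observation that large data can still be handled, just more slowly than in memory.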

Good solution for long and narrow data

Rating: 9 out of 10
May 20, 2021
Vetted Review
Verified User
Apache Spark
3 years of experience
We are building a model and, due to the size of the data, we chose Apache Spark for feature generation. Use of the tool is limited to my department and one other department. These two departments need to deal with long datasets; the other departments do not.
Pros
  • quick
  • utilizes CPU cores
  • trendy
Cons
  • lack of support
  • memory hungry
  • slow on wide data
I would recommend Apache Spark to a colleague who is working with long but narrow datasets. It would be a great tool to help that person fully utilize the CPU cores and speed up the work. However, I would not recommend this tool if the dataset is wide but not very large.

Great open source tool for data processing

Rating: 9 out of 10
December 13, 2019
Vetted Review
Verified User
Apache Spark
5 years of experience
We use Apache Spark for cluster computing in our ETL environment, for data and analytics, and for machine learning. It is mainly used by our data engineering team to support the entire Data Lake foundation. As we have huge amounts of information coming from multiple sources, we needed an effective cluster management system to handle capacity and deliver the performance and throughput we needed.
Pros
  • Cluster management for ETL.
  • Data processing engine for our data lake.
Cons
  • You still need Hive, HDFS, or another storage layer to store information.
  • Security is behind compared to MapReduce.
Spark is a one-size-fits-all data processing platform. You can run batch and in-motion streams, and you can use it for ETL, machine learning, or even graphs. You do not need multiple tools, which makes your TCO and management tasks much easier. Like every new platform, it has room to grow: storage and security are the main opportunities we found.