Item: Apache Spark
Rating: 9
Author: Verified User

Overall Satisfaction with Apache Spark

Use Cases and Deployment Scope

We use Apache Spark for cluster computing in large-scale data processing, ETL, machine learning, and analytics. It's primarily used by the Data Engineering Department to support the data lake infrastructure. It helps us manage the large volumes of data that come from our clusters effectively, with the capacity, scalability, and performance we need.

Pros and Cons

Pros

Speed: Apache Spark has great performance for both streaming and batch data.
Easy to use: the object-oriented operators make it intuitive.
Multiple language support
Fault tolerance
Cluster management
Supports DataFrames (DF), Datasets (DS), and RDDs

Cons

Hard to learn; documentation could be more in-depth.
Due to its in-memory processing, it can consume a large amount of memory.
Poor data visualization; too basic.

Return on Investment

Saved time and resources for the company because of its agility.
High-performance data processing.

Support Rating

Never had to contact them; however, they offer 24/7 support and there are a large number of forums about Spark. It is well integrated with Python and supports SQL syntax.

Usability

The only thing I dislike about Spark's usability is the learning curve: there are many actions and transformations to learn. However, its wide range of uses for ETL processing, ease of integration, and multi-language support make this library a powerhouse for data science solutions. Its lightning-fast processing times have been especially helpful to us.
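For readers new to the actions-versus-transformations distinction mentioned above, here is a rough sketch in plain Python. It is a toy model only: the `LazyDataset` class and its methods are hypothetical names made up for illustration, not part of Spark's API. It assumes the usual description of Spark's model, where transformations only record work and an action triggers the computation.

```python
# Toy model of lazy evaluation; LazyDataset is hypothetical, NOT a Spark class.
from typing import Callable, Iterable, Iterator, List, Tuple


class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source: Iterable, steps: Tuple = ()):
        self._source = source
        self._steps = steps  # pending transformations, in order

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self) -> Iterator:
        # Chain the recorded steps over the source as lazy iterators.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: these trigger the actual computation.
    def collect(self) -> List:
        return list(self._run())

    def count(self) -> int:
        return sum(1 for _ in self._run())


calls = []  # tracks when the mapped function really runs


def double(x: int) -> int:
    calls.append(x)
    return x * 2


pipeline = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
ran_before_action = list(calls)  # still empty: nothing has executed yet
result = pipeline.collect()      # the action runs the whole chain
```

In this sketch, `double` is not called until `collect()` runs, which mirrors why a chain of transformations appears to return instantly and the cost only shows up at the first action.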

Key Insights

Do you think Apache Spark delivers good value for the price?

Yes

Are you happy with Apache Spark's feature set?

Yes

Did Apache Spark live up to sales and marketing promises?

Yes

Did implementation of Apache Spark go as expected?

Yes

Would you buy Apache Spark again?

Yes

Other Software Used

Hadoop, Apache Kafka

Likelihood to Recommend

Well suited for: large datasets, fault tolerance, parallel processing, ETL, batch processing, streaming, analytics, graphing, or machine learning. It fits almost any kind of large-scale processing, since it can save you a lot of time (days of processing). Less appropriate for: smaller datasets, where you are better off using pandas or other libraries.

A powerhouse processing engine.