Item: Apache Spark
Rating: 9
Author: Verified User

Overall Satisfaction with Apache Spark

Use Cases and Deployment Scope

We are building a model and, because of the size of the data, we chose Apache Spark for feature generation. Use of the tool is limited to my department and one other department. Both departments work with long datasets; the other departments do not, so they have no need for it.

Pros and Cons

Pros

quick
utilizes CPU cores
trendy

Cons

lack of support
memory hungry
slow on wide data

Most Important Features

parallelization
compatibility
speed

Return on Investment

reduce time
need tuning
hard to debug

Alternatives Considered

Python IDLE

There are a few alternatives that can do the same transformations and aggregations as Apache Spark, but most of them cannot perform parallel computation. For example, pandas is a really good tool for this work but is not parallelized. However, some tools offer the pandas interface and syntax with Dask or Ray on the backend.
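
To make the parallelization point concrete, here is a rough sketch of the partition-then-merge pattern that parallel engines apply to an aggregation. It is not Spark, pandas, Dask, or Ray code: it uses only the Python standard library, and the names `partial_sums` and `parallel_group_sum` and the sample rows are made up for illustration. Threads are used only to keep the sketch portable; for CPU-bound work in CPython a real speedup would need separate processes or a distributed engine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_sums(rows):
    """Aggregate one partition: sum the values for each key."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    return totals


def parallel_group_sum(rows, n_partitions=4):
    """Split rows into partitions, aggregate each one, then merge the results."""
    # Round-robin split of a long, narrow list of (key, value) rows.
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]

    # Each partition is aggregated independently by a worker.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sums, partitions))

    # Merge step: Counter.update adds the per-partition sums together.
    merged = Counter()
    for part in partials:
        merged.update(part)
    return dict(merged)


# Hypothetical sample data: (key, value) rows.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
result = parallel_group_sum(rows, n_partitions=2)
```

The per-partition step touches only its own rows, which is why this kind of work spreads well across cores when the data is long; the merge step is small because it only combines one total per key from each partition.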

Key Insights

Do you think Apache Spark delivers good value for the price?

Yes

Are you happy with Apache Spark's feature set?

Yes

Did Apache Spark live up to sales and marketing promises?

Yes

Did implementation of Apache Spark go as expected?

Yes

Would you buy Apache Spark again?

Yes

Other Software Used

Docker, Visual Studio IDE, Microsoft Visual Studio Code

Likelihood to Recommend

I would recommend Apache Spark to a colleague who is working with a long but narrow dataset. It would be a great tool to help that person fully utilize the CPU cores and speed up the work. However, I would not recommend it if the dataset is wide but not very large.

Comments

good solution for long and narrow data