Item: Apache Spark
Rating: 10
Author: Ananth Gouri

Use Cases and Deployment Scope

If you are working on large-scale data analytics, don't go any further without Apache Spark! One of the projects in which I used Apache Spark was a recommendation system (RecSys), which is also my area of research expertise. Deploying a RecSys with Apache Spark is very easy thanks to its scalability, its flexibility in using various data sources, and its fault tolerance. The built-in machine learning library, MLlib, is a boon to work with. We did not require any other libraries.
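To give a sense of what generating recommendations with MLlib can look like, here is a minimal PySpark sketch. It assumes ALS (alternating least squares) collaborative filtering, which is one of the recommenders MLlib ships with; the review does not say which algorithm our project used. The toy ratings, the column names, and the helper name `recommend_top_n` are all hypothetical.

```python
# Hypothetical toy data: (userId, itemId, rating). Not from the actual project.
SAMPLE_RATINGS = [
    (0, 10, 5.0), (0, 11, 3.0), (0, 12, 1.0),
    (1, 10, 4.0), (1, 12, 2.0),
    (2, 11, 5.0), (2, 12, 4.0), (2, 13, 5.0),
]


def recommend_top_n(ratings, n=2):
    """Fit an ALS model on (user, item, rating) tuples and return the
    top-n recommended (itemId, predicted rating) pairs for each user.

    Assumes PySpark is installed; it is imported here so that the module
    itself loads without it.
    """
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    # Local mode is enough for a toy run; a real cluster would use another master.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("als-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(ratings, ["userId", "itemId", "rating"])
        als = ALS(
            userCol="userId",
            itemCol="itemId",
            ratingCol="rating",
            rank=5,                    # number of latent factors (illustrative)
            maxIter=5,
            regParam=0.1,
            coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
            seed=42,
        )
        model = als.fit(df)
        rows = model.recommendForAllUsers(n).collect()
        return {
            row["userId"]: [(rec["itemId"], rec["rating"]) for rec in row["recommendations"]]
            for row in rows
        }
    finally:
        spark.stop()
```

Calling `recommend_top_n(SAMPLE_RATINGS)` on a machine with PySpark installed returns a dictionary keyed by user ID. The parameter values here are placeholders and would need tuning on real data.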

Pros and Cons

Fault tolerant: nodes rarely fail, and if one does, processing still continues.
Highly scalable.
Has a built-in machine learning library called MLlib.
Very flexible: data from various data sources can be used. Usage with HDFS is very easy.

It is not fully backward compatible.
It is memory-consuming for heavy workloads and large datasets.
Support for advanced analytics is not available: MLlib offers only minimal analytics.
Deployment is a complex task for beginners.

Most Important Features

Scalability
We had data across multiple sources. Integration with those data source types was not a problem.
Generating recommendations was easy.

Return on Investment

We used Apache Spark for one of our research projects. The ROI cannot be measured here, but the research paper was accepted to a good conference. What else would a project require?!

Alternatives Considered

We used Surprise Kit for one of our other research works. It is more fine-tuned to recommendation systems and their algorithms. Apache Spark has MLlib for the majority of ML problems, whereas software like Surprise Kit is suitable only for the specific task of recommendations.

Key Insights

Do you think Apache Spark delivers good value for the price?

Yes

Are you happy with Apache Spark's feature set?

Yes

Did Apache Spark live up to sales and marketing promises?

I wasn't involved with the selection/purchase process

Did implementation of Apache Spark go as expected?

Yes

Would you buy Apache Spark again?

Yes

Other Software Used

ChatGPT, Python IDLE, IntelliJ IDEA

Likelihood to Recommend

Well suited: For most local runs on datasets and for non-prod systems, scalability is not a problem at all. Including data from multiple types of data sources is an added advantage. MLlib is a decent built-in library that can be used for most ML tasks.

Less appropriate: We worked on a RecSys where the music dataset was over 300 GB in size, and we faced memory issues. A few times we also got memory errors. Also, MLlib does not support advanced analytics or deep-learning frameworks. For beginners, understanding the internals of Apache Spark is very difficult.
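For a dataset of this size, a quick back-of-the-envelope calculation helps when thinking about memory pressure. The sketch below estimates how many input partitions a 300 GB dataset splits into, assuming Spark's default `spark.sql.files.maxPartitionBytes` of 128 MB. The helper name `estimate_partitions` is hypothetical, and this is not a record of the configuration we actually ran with.

```python
# Spark's default for spark.sql.files.maxPartitionBytes is 128 MB.
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024**2


def estimate_partitions(dataset_bytes, target_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Return the number of partitions needed so that no partition
    exceeds target_partition_bytes (integer ceiling division)."""
    if dataset_bytes <= 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-dataset_bytes // target_partition_bytes)


# The 300 GB figure is the dataset size mentioned in the review.
dataset_bytes = 300 * 1024**3
print(estimate_partitions(dataset_bytes))  # 2400
```

This only counts input splits on disk; data usually takes more space once deserialized in memory, so the number is a starting point rather than a guarantee against memory errors.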

Usability

Once we learn the installation process, deploying Apache Spark for a prod system should not be a difficult task. Unless we want to learn the internals of Apache Spark, using it for high-level work through its API should not be a big deal. Also, with the amount of support available, we could get easy configuration-based solutions to a few of the errors. The overall support is amazing.

Usability Pros and Cons

Pros: Like to use, Easy to use, Technical support not required, Well integrated, Consistent, Quick to learn, Convenient, Feel confident using
Cons: Lots to learn

Apache Spark: Lightning-Fast Distributed Computing with a Learning Curve
