Apache Spark: Lightning-Fast Distributed Computing with a Learning Curve
August 18, 2023

Ananth Gouri | TrustRadius Reviewer
Score 10 out of 10
Vetted Review
Verified User

Overall Satisfaction with Apache Spark

If you are working with large-scale data and analytics, don't go any further without Apache Spark! One of the projects in which I used Apache Spark was a recommendation system (RecSys), which is also my area of research expertise. Deploying a RecSys on Apache Spark is very easy, thanks to its scalability, its flexibility in reading from various data sources, and its fault tolerance. The built-in machine learning library, MLlib, is a boon to work with. We did not require any other libraries.
Pros
  • Fault tolerant: node failures are rare, and if a node does fail, processing still continues.
  • Scalable to any extent.
  • Has a built-in machine learning library, MLlib.
  • Very flexible: data from various data sources can be used, and usage with HDFS is very easy.
Cons
  • It is not fully backward compatible.
  • It is memory-intensive for heavy workloads and large datasets.
  • Support for advanced analytics is not available; MLlib offers only minimal analytics.
  • Deployment is a complex task for beginners.
Return on Investment
  • Scalability.
  • We had data across multiple sources, and integration with those data source types was not a problem.
  • Generating recommendations was easy.
  • We used Apache Spark for one of our research projects. The ROI cannot be measured here, but the research paper was accepted to a good conference. What else would a project require?!
We used Surprise Kit for another research project. It is more fine-tuned to recommendation systems and their algorithms. Apache Spark has MLlib for the majority of ML problems, whereas software like Surprise Kit is suitable only for the specific task of recommendations.
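
As a rough illustration of the kind of MLlib-based recommender described above, here is a minimal PySpark sketch using MLlib's ALS collaborative-filtering estimator. It is not the reviewer's actual pipeline: the sample ratings, the column names, the function name `recommend_top_n`, and the hyperparameter values are all assumptions made for illustration, and running it requires a local PySpark installation.

```python
# Hypothetical toy ratings: (user_id, item_id, rating). Not data from the review.
SAMPLE_RATINGS = [
    (0, 10, 4.0), (0, 11, 2.0), (0, 12, 5.0),
    (1, 10, 5.0), (1, 12, 3.0),
    (2, 11, 4.0), (2, 12, 1.0),
]


def recommend_top_n(ratings, n=2):
    """Fit an ALS model on (user, item, rating) tuples; return top-n items per user.

    PySpark is imported lazily so this module can be loaded without Spark installed.
    """
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = (
        SparkSession.builder
        .master("local[*]")      # assumption: single-machine local mode
        .appName("als-sketch")   # hypothetical application name
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(ratings, ["user", "item", "rating"])
        als = ALS(
            userCol="user",
            itemCol="item",
            ratingCol="rating",
            rank=5,        # illustrative values, not tuned
            maxIter=5,
            regParam=0.1,
            coldStartStrategy="drop",  # only affects transform(): drops NaN predictions
            seed=42,
        )
        model = als.fit(df)
        # One row per user, holding an array of (item, predicted rating) structs.
        rows = model.recommendForAllUsers(n).collect()
        return {
            row["user"]: [(rec["item"], rec["rating"]) for rec in row["recommendations"]]
            for row in rows
        }
    finally:
        spark.stop()
```

On a machine with PySpark available, `recommend_top_n(SAMPLE_RATINGS, n=2)` would return a dictionary mapping each user to up to two suggested items. A dedicated recommender library may expose more algorithms, which matches the trade-off described above.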

Do you think Apache Spark delivers good value for the price?

Yes

Are you happy with Apache Spark's feature set?

Yes

Did Apache Spark live up to sales and marketing promises?

I wasn't involved with the selection/purchase process

Did implementation of Apache Spark go as expected?

Yes

Would you buy Apache Spark again?

Yes

Well suited: For most local runs on datasets and for non-prod systems, scalability is not a problem at all. Being able to include data from multiple types of data sources is an added advantage. MLlib is a decent built-in library that can be used for most ML tasks.

Less appropriate: We had to work on a RecSys where the music dataset we used was over 300 GB in size. We faced memory-related issues, and a few times we also got memory errors. Also, MLlib does not support advanced analytics or deep-learning frameworks. Understanding the internals of Apache Spark is very difficult for beginners.
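
Memory errors on a dataset of this size are often related to how the data is split into partitions, since each task processes one partition at a time and oversized partitions put pressure on executor memory. As a back-of-the-envelope sketch (not something the reviewer describes doing), the arithmetic below estimates how many partitions a 300 GB input would need at a given target partition size. The 128 MB target mirrors the default of Spark's `spark.sql.files.maxPartitionBytes` setting. The function name and the choice of target are assumptions for illustration.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def estimate_partitions(dataset_bytes, target_partition_bytes=128 * MIB):
    """Return the number of partitions needed so none exceeds the target size."""
    if dataset_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative, and the target must be positive")
    # At least one partition, even for an empty dataset.
    return max(1, math.ceil(dataset_bytes / target_partition_bytes))


# Assumption: treating the reviewer's "300+ GB" as exactly 300 GiB.
print(estimate_partitions(300 * GIB))  # 2400
```

A figure like this could be used as a starting point for `DataFrame.repartition`, though the right value depends on the cluster and the workload.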

Using Apache Spark

Once we learn the installation procedure, deploying Apache Spark for a prod-based system should not be a difficult task. Unless we want to learn the internals of Apache Spark, using it for high-level work through its API should not be a big deal. Also, with the amount of support available, we could find easy configuration-based solutions to a few of the errors. The overall support is amazing.
Pros
Like to use
Easy to use
Technical support not required
Well integrated
Consistent
Quick to learn
Convenient
Feel confident using
Cons
Lots to learn
  • Usage of libraries
  • Usage of HDFS in particular
  • Basic analysis of data is possible
  • Understanding internals of the product
  • Changing data sources was somewhat complex
  • Integration of other ML libraries is not very user-friendly