Item: Apache Spark
Rating: 8
Author: Yogesh Mhasde

Use Cases and Deployment Scope

We were building an enterprise-level product that had to manage a vast amount of big data. We wanted a technology that is faster than Hadoop and can process large-scale data while giving our data scientists a streamlined workflow. Apache Spark turned out to be the powerful, unified solution we expected it to be.
The main problem with our existing approach was that processing the data took a long time, and the statistical analysis of the data was not up to the mark. We wanted a sophisticated analytical solution that was fast and easy to use. With Apache Spark, processing became 5 times faster than before, which in turn gave us noticeably better analytics. Spark's RDDs (Resilient Distributed Datasets) provided the data abstraction across a cluster of machines.
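To make the idea of a data abstraction spread across machines more concrete, here is a plain-Python toy model. It is not Spark code and not our production setup: the helpers `split_into_partitions`, `map_partitions`, and `reduce_partitions` are hypothetical names invented for this sketch. In Spark, the equivalent work is done by RDD operations such as `map` and `reduce`, with the partitions living on different worker machines.

```python
from functools import reduce
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def split_into_partitions(data: List[T], num_partitions: int) -> List[List[T]]:
    """Split data into chunks; each chunk stands in for one cluster partition."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def map_partitions(partitions: List[List[T]], fn: Callable[[T], U]) -> List[List[U]]:
    """Apply fn to every element, one partition at a time (independent work)."""
    return [[fn(x) for x in part] for part in partitions]


def reduce_partitions(partitions: List[List[T]], combine: Callable[[T, T], T], zero: T) -> T:
    """Reduce each partition locally, then merge the partial results."""
    partials = [reduce(combine, part, zero) for part in partitions]
    return reduce(combine, partials, zero)


readings = list(range(1, 11))                      # sample data: 1..10
partitions = split_into_partitions(readings, 3)    # pretend we have 3 workers
squared = map_partitions(partitions, lambda x: x * x)
total = reduce_partitions(squared, lambda a, b: a + b, 0)
print(total)  # prints 385 (sum of squares of 1..10)
```

The point of the sketch is that each partition can be processed without looking at the others, which is what lets the work be spread over a cluster.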

Pros and Cons

DataFrames, Datasets, and RDDs.
Spark has a built-in machine learning library (MLlib) that scales and integrates with existing tools.

Spark's data processing comes at the price of memory bottlenecks, as its in-memory processing can consume large amounts of memory.
Caching is not automatic in Spark. We need to set up the caching mechanism manually.
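The caching point can be illustrated with a small plain-Python toy model. This is a sketch under stated assumptions, not Spark's implementation: `LazyDataset` and `expensive_load` are hypothetical names made up for the example. In Spark itself, the explicit opt-in is a call to `cache()` or `persist()` on an RDD or DataFrame; without it, each action recomputes the data from its source.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute
        self._cached: Optional[List[int]] = None
        self._cache_requested = False
        self.compute_calls = 0  # how many times the data was recomputed

    def cache(self) -> "LazyDataset":
        # Only marks the dataset; nothing is stored until the first action runs.
        self._cache_requested = True
        return self

    def _materialize(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        result = self._compute()
        if self._cache_requested:
            self._cached = result  # kept in memory from now on
        return result

    def count(self) -> int:
        return len(self._materialize())

    def total(self) -> int:
        return sum(self._materialize())


def expensive_load() -> List[int]:
    # Placeholder for a costly read-and-transform step.
    return [x * 2 for x in range(1000)]


uncached = LazyDataset(expensive_load)
uncached.count()
uncached.total()

cached = LazyDataset(expensive_load).cache()
cached.count()
cached.total()

print(uncached.compute_calls, cached.compute_calls)  # prints: 2 1
```

The cached variant does the expensive work once but holds the result in memory, which is the same trade-off behind the memory consumption mentioned above.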

Return on Investment

ROI increased by a considerable percentage after we adopted Apache Spark.
Apache Spark gave us the agility to support multiple applications.

Alternatives Considered

Hadoop and Amazon EMR (Elastic MapReduce)

1. Apache Spark is almost 100% faster than Hadoop.
2. Apache Spark is more stable than Amazon EMR.
3. The end-to-end distributed machine learning library is more robust in Apache Spark.
4. For very large data sets, Apache Spark is more trustworthy than the other two.
5. For data transformations, Apache Spark provides a very rich set of APIs.
6. The SQL interface in Apache Spark is easier to understand than those of the others.

Support Rating

1. It integrates very well with Scala and Python.
2. Its SQL interoperability is very easy to understand.
3. Apache Spark is much faster than competing technologies.
4. The Apache community provides very strong support for Spark.
5. Execution times are faster than with the alternatives.
6. There are a large number of forums available for Apache Spark.
7. Code for Apache Spark is readily available and easy to access.
8. Many organizations use Apache Spark, so many solutions are available for existing applications.

Key Insights

Do you think Apache Spark delivers good value for the price?

Yes

Are you happy with Apache Spark's feature set?

Yes

Did Apache Spark live up to sales and marketing promises?

Yes

Did implementation of Apache Spark go as expected?

Yes

Would you buy Apache Spark again?

Yes

Other Software Used

Apache Camel, Azure Bot Service (Microsoft Bot Framework), Apache Kafka

Likelihood to Recommend

1. Suitable where the requirement for advanced analytics is prominent.
2. When you want big data to be processed at a very fast pace.
3. For large datasets, Spark is a viable solution.
4. When you need dependable fault tolerance, go for Spark.

Spark is not suitable:
1. If you need your data processed in real time, Spark is not a good solution.
2. If you need automatic optimization, Spark falls short.

Apache Spark -- The best big data solution

Overall Satisfaction with Apache Spark