Item: Apache Spark
Rating: 9
Author: Thomas Young

Use Cases and Deployment Scope

Certain departments use Apache Spark to produce summary statistics on data sets that are very large and require immense processing power. The software is also used for simple graphics. When the data are small enough, Apache Spark is not the preferred analytical tool; it is the big data that makes Spark useful.

Pros and Cons

Apache Spark makes processing very large data sets possible, and it handles them fairly quickly.
Apache Spark does a fairly good job implementing machine learning models for larger data sets.
Apache Spark seems to be advancing rapidly, with new features making the software ever more straightforward to use.

Apache Spark requires fairly advanced skills to understand and structure the modeling of big data. The software is not user-friendly.
The graphics produced by Apache Spark are by no means world-class. They sometimes appear high-schoolish.
Apache Spark takes an enormous amount of time to crunch through very large data sets across multiple nodes. Apache Spark could improve this by offering the software in a more interactive programming environment.

Return on Investment

In one sense, Apache Spark has delivered a positive ROI because it helps us work out the details in vast amounts of data. Sometimes the software leads to surprising answers. Small data software tools probably would have failed to discover some of the insights Spark makes possible.
Spark has delivered a negative ROI in the sense that it takes a great deal of time to produce simple answers to simple questions, and often the answers are what was expected. Because the results tend to be confirmatory rather than insightful, it seems like a lot of effort for what is gained.
Apache Spark delivers a positive ROI in the instances when it yields a well-performing machine learning model, one whose predictions actually get used.

Alternatives Considered

Hadoop, Apache Flink, Amazon Kinesis, Amazon Kinesis Analytics and Amazon Elastic MapReduce

How does Apache Spark perform against competing tools? I think Apache Spark does well in processing large volumes of data. The machine learning models also seem to be easier to program and interpret. With that said, implementing good models in Apache Spark seems to take more programming effort than in Kinesis or other tools. You really have to have lots of data and very valuable questions to answer to justify the investment in Apache Spark.

Other Software Used

Amazon Elastic MapReduce, Amazon Kinesis, Amazon Kinesis Analytics, Hadoop, MySQL, Tableau Server, Sisense, Microsoft Azure

Likelihood to Recommend

The software appears to run more efficiently than other big data tools, such as Hadoop. Given that, Apache Spark is well-suited for querying and trying to make sense of very large data sets. The software offers many advanced machine learning and econometrics tools, although these tools are used only partially because they take too much time once the data sets get very large. The software is not well-suited for projects that are not big data in size. The graphics and analytical output are subpar compared to other tools.

Spark is useful, but it requires lots of very valuable questions to justify the effort. Be prepared for failure in answering the questions posed.

Overall Satisfaction with Apache Spark