Overall Satisfaction with Apache Spark
We use Apache Spark across all analytics departments in the company. We primarily use it for distributed data processing and for preparing data for machine learning models, and we also use it to run distributed cron jobs for various analytical workloads. Our team has also contributed back to the Spark open-source ecosystem, including an algorithm for random walks on large-scale graphs - https://databricks.com/session/random-walks-on-large-scale-graphs-with-apache-spark
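To give a flavor of what those scheduled workloads look like: each one is a small standalone Spark application whose main class is launched by cron via spark-submit. A minimal sketch, with hypothetical job, column, and path names:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical nightly rollup job, launched by a cron entry such as:
//   spark-submit --class NightlyClickstreamRollup nightly-jobs.jar
object NightlyClickstreamRollup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nightly-clickstream-rollup")
      .getOrCreate()

    // Read the raw events and write out a per-user daily rollup.
    spark.read.parquet("hdfs:///data/raw/clickstream/")
      .groupBy("user_id", "event_date")
      .count()
      .write.mode("overwrite")
      .parquet("hdfs:///data/rollups/clickstream_daily/")

    spark.stop()
  }
}
```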
Pros:
- Rich APIs for data transformation, making it very easy to transform and prepare data in a distributed environment without worrying about memory issues
- Faster execution times compared to Hadoop MapReduce and Pig Latin
- An easy SQL interface to the same data sets for people who are more comfortable exploring data in a declarative manner
- Interoperability between SQL and the Scala/Python style of data munging (a minimal sketch follows this list)
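As a rough sketch of that SQL/DataFrame interoperability (the dataset, table, and column names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("interop-demo").getOrCreate()

// Hypothetical dataset; register it so SQL users can query it directly.
val sales = spark.read.parquet("hdfs:///data/sales/")
sales.createOrReplaceTempView("sales")

// Declarative: explore the data in plain SQL...
val bigOrders = spark.sql("SELECT region, amount FROM sales WHERE amount > 1000")

// ...and since the result is an ordinary DataFrame, keep munging it
// functionally in Scala without leaving the session.
bigOrders
  .groupBy("region")
  .agg(avg("amount").alias("avg_amount"))
  .show()
```

The same round trip works from PySpark, which is what lets SQL-leaning and code-leaning colleagues share one data set.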
Cons:
- Documentation could be better; I usually end up going to other sites and blogs to understand the concepts
- More algorithms need to be ported to MLlib, as only a few are currently available, at least in the clustering category
Return on Investment:
- Switching from Pig Latin to Apache Spark sped up overall development time, and resource utilization has improved
- Our offline jobs also run faster than they did on traditional MapReduce-style systems
- Integrating with Jupyter-style notebook environments makes the development experience more pleasant and lets us iterate much faster (see the sketch after this list)
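With the sparkmagic project linked in the next paragraph, that notebook workflow looks roughly like this; the session name, Livy endpoint, and path are placeholders, and the exact magic flags depend on the sparkmagic version:

```
# Notebook cell 1: load the extension and attach a session to a Livy endpoint.
%load_ext sparkmagic.magics
%spark add -s demo -l scala -u http://livy-host:8998
```

```
%%spark -s demo
// Notebook cell 2: runs as Scala on the remote cluster.
val events = spark.read.parquet("hdfs:///warehouse/events/")
events.groupBy("event_type").count().show()
```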
Systems like Hadoop MapReduce, Pig, and Presto work quite well on big data transformations, but Spark really shines with its broader API support and its ability to read from and write to multiple data sources. With Spark, one can easily switch between declarative, imperative, and functional styles of programming depending on the situation. It also doesn't require special data ingestion or indexing pre-processing the way Presto does. Combined with Jupyter notebooks (https://github.com/jupyter-incubator/sparkmagic), one can develop Spark code interactively in Scala or Python.
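For instance, reading from one source and writing to another is just a matter of picking a different reader or writer format, with no loading or indexing step first; the paths and join key below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-source-demo").getOrCreate()

// Two different sources, read directly in place...
val events = spark.read.option("header", "true").csv("hdfs:///landing/events.csv")
val users  = spark.read.json("hdfs:///landing/users.json")

// ...joined and persisted in a third format.
events.join(users, Seq("user_id"))
  .write.mode("overwrite")
  .parquet("hdfs:///warehouse/enriched_events/")
```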