Overall Satisfaction with Apache Spark
We are learning core Apache Spark + Spark SQL and MLlib, while creating proof-of-concepts as well as providing solutions for clients. It addresses the need to quickly process large amounts of data, typically stored in Hadoop.
- Scale from a local machine to a full cluster. You can run standalone on a single machine simply by starting up the Spark Shell or submitting an application to test an algorithm, and the same code can then quickly be configured to run in a distributed environment (see the sketch after this list).
- Provides multiple APIs. Most people I know use Python and/or Java as their main programming language. Data scientists who are familiar with NumPy and SciPy can quickly become comfortable with Spark, while Java developers are best served using Java 8 and the new features it provides. Scala, on the other hand, is a mix between the Java and Python styles of writing Spark code, in my opinion.
- Plentiful learning resources. The Learning Spark book is a good introduction to the mechanics of Spark, although it was written for Spark 1.3 and the current version is 2.0. The GitHub repository for the book contains all the code examples that are discussed, and the Spark website is also filled with useful information that is simple to navigate.
- For data that isn't truly that large, Spark may be overkill when the problem could likely be solved on a single computer with reasonable hardware resources. There don't seem to be many examples of how a Spark task would otherwise be implemented with a different library; for instance, scikit-learn and NumPy rather than Spark MLlib.
- By learning Spark, we can become certified and/or provide proper recommendations for, and implementations of, Spark solutions.
- With a background in Hadoop distributed processing, it has been easy to understand and diagnose how Spark handles the transfer of data within a cluster, especially when using YARN as the resource manager and HDFS as the data source (the sketch below uses that setup).
- Staying up to date with the latest changes to Spark has become a recurring task. Most Hadoop distributions only support Spark 1.6 at the moment, while Spark 2.0 has introduced some useful features, but adopting them requires rewriting existing applications (a small before/after example follows below).
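To make the local-to-cluster point concrete (along with the YARN/HDFS combination mentioned above), here is a minimal PySpark sketch. The file name, paths, and spark-submit commands in the comments are illustrative assumptions, not details from a specific project.

```python
# wordcount.py - minimal sketch; file name and paths are made up.
# Develop and test locally:
#   spark-submit --master "local[*]" wordcount.py
# Submit the same script to a YARN cluster:
#   spark-submit --master yarn --deploy-mode cluster wordcount.py
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# A local file works while prototyping; point at HDFS when running on the
# cluster, e.g. "hdfs:///data/input.txt" (example path).
lines = sc.textFile("input.txt")

# Classic word count: split lines into words, pair each with 1, sum per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

sc.stop()
```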
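As one example of the kind of rewrite Spark 2.0 can ask for, the SQLContext entry point used in 1.x code is superseded by SparkSession in 2.0. The snippet below is a rough before/after sketch with an invented input file; the two halves are alternatives rather than one script.

```python
# Spark 1.6 style: a SparkContext plus a separate SQLContext.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="OldEntryPoint")
sqlContext = SQLContext(sc)
old_df = sqlContext.read.json("events.json")  # illustrative path

# Spark 2.0 style: a single SparkSession entry point.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NewEntryPoint").getOrCreate()
new_df = spark.read.json("events.json")
```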
- Hadoop MapReduce and Apache Pig
Spark has primarily replaced my use of writing pure Hadoop MapReduce or Apache Pig jobs for processing data. I like the fact that I can alternate between the main programming languages that I know - Java and Python - and use those to learn the Scala API. Spark can also be installed individually on any computer, and one can quickly get started writing applications using just the Spark Shell. I also like that you can easily add community-built packages to a Spark application, such as connectors to different database sources or data processing libraries that aren't included in the language you are using (a quick example follows below).
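As a quick illustration of the package point, the --packages flag on spark-shell and spark-submit pulls a library by its Maven coordinates at launch time. The sketch below uses the spark-csv package as an example; the coordinates, app name, and file path are illustrative and should be matched to your Spark and Scala versions.

```python
# Launch with the add-on package pulled from Maven Central, e.g.:
#   spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
#   spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 read_csv.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="CsvConnectorExample")
sqlContext = SQLContext(sc)

# The "com.databricks.spark.csv" data source comes from the add-on package,
# not core Spark 1.x; the file path is just an example.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("people.csv"))
df.show()
```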