Item: Apache Spark
Rating: 9
Author: Kartik Chavan

Use Cases and Deployment Scope

My company uses Apache Spark for machine learning, analytics, and batch processing. We pull data from other sources into a Hadoop environment and build data lakes. We also use Spark SQL to analyze data and develop reports. Our clusters are deployed on Cloudera. Apache Spark has made it much easier to apply data science to big data.

Pros and Cons

Easy ELT Process
Easy clustering on cloud
Amazing speed
Batch and real-time processing

Debugging is difficult because Spark is new to most people
Learning resources are limited

Return on Investment

Apache Spark performs faster than MapReduce.
The combination of Python and Spark works best: shorter code and faster, more efficient performance.
Can replace an RDBMS

Alternatives Considered

Amazon Elastic MapReduce

Even with Python, MapReduce code is lengthy. Combining Python with Apache Spark not only shortens the code but also speeds up the algorithms. I still use MapReduce occasionally, but I expect Apache Spark to replace it very soon. Spark has many built-in features and is faster.
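To give a rough idea of the difference in code length, here is a minimal sketch of a word count. It is plain Python that writes the map, shuffle, and reduce phases out by hand. It imitates the MapReduce style only and does not run on a cluster. The sample lines and function names are made up for illustration. The commented chain at the end shows how the same job is usually written in PySpark, assuming an existing SparkContext named `sc`.

```python
from collections import defaultdict

# Hypothetical sample input, made up for this illustration.
lines = ["spark is fast", "mapreduce is verbose", "spark is concise"]


# MapReduce style: every phase is written out by hand.
def mapper(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count_mapreduce(lines):
    """Run the three phases in order and return a word -> count dict."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


counts = word_count_mapreduce(lines)
print(counts)

# Equivalent PySpark chain (not executed here; assumes a SparkContext `sc`):
# counts = (sc.parallelize(lines)
#             .flatMap(lambda line: line.split())
#             .map(lambda word: (word, 1))
#             .reduceByKey(lambda a, b: a + b)
#             .collectAsMap())
```

In the Spark version the grouping step disappears from the code, because `reduceByKey` handles the shuffle internally.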

Other Software Used

Apache Hive, Amazon Elastic MapReduce, Apache Pig

Likelihood to Recommend

When the data is very large and you cannot afford long computation times, such as in a real-time environment, Apache Spark is advisable. There are alternatives, but Spark is the most common and robust tool to work with. It is great at batch processing.

My Apache Spark Review

Overall Satisfaction with Apache Spark