Item: Apache Spark
Rating: 9
Author: Anson Abraham

Overall Satisfaction with Apache Spark

Use Cases and Deployment Scope

Spark was, and still is, used in a myriad of ways. With Kafka, we use Spark Streaming to pull data from Kafka queues into our HDFS environment. Spark SQL is used for data analysis by those not familiar with Spark. We also use Spark for data analysis in general and for our main workflow process, in place of MapReduce, and for some machine learning algorithms on the data.

Pros and Cons

Pros

Machine learning
Data analysis
Workflow process (faster than MapReduce)
SQL connector to multiple data sources

Cons

Memory management is very weak.
PySpark is not as robust as Scala with Spark.
Spark master HA is needed; it is not as highly available as it should be.
Locality should not be a necessity, though it does help performance. I would prefer not to depend on locality.

Return on Investment

Workflow processing with Spark went from 1 day to 2 hours.
Spark Streaming allowed for quick determination of data validity.
Spark on YARN was good for management, but Spark on Kubernetes was easier to use.

Alternatives Considered

MapReduce and Apache Storm

Compared with MapReduce, Spark was faster and easier to manage, especially for machine learning, where MapReduce is lacking. Apache Storm was also slower and didn't scale as well as Spark does. Spark's elasticity was easier to apply than with Storm or MapReduce.
Managing resources for Spark was also easier than for Storm.

Other Software Used

HBase, Cassandra, Apache Drill

Likelihood to Recommend

Spark is great as a workflow process and extract-transform layer processing tool. It is really good for machine learning, especially on large datasets that can be split into files and processed in parallel.
Spark Streaming scales well for near-real-time data workflows.
What it's not good for is processing smaller subsets of data.
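
To illustrate what "processed in split file parallelization" means, here is a minimal sketch of the idea in plain Python. It is not Spark code and does not use Spark's API: the thread pool stands in for a cluster of executors, and the function names (split_records, count_words, parallel_word_count) are made up for this example. Each split is processed without needing anything from the other splits, and the partial results are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_records(records, num_splits):
    """Cut the input into roughly equal, independent chunks (the "splits")."""
    size = max(1, -(-len(records) // num_splits))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(chunk):
    """Per-split work: it needs nothing from any other split."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def parallel_word_count(records, num_splits=4):
    """Process every split concurrently, then merge the partial results."""
    splits = split_records(records, num_splits)
    # The thread pool is only a stand-in for distributed workers (assumption).
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        partials = list(pool.map(count_words, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["spark kafka hdfs", "spark sql", "kafka spark"]
result = parallel_word_count(lines, num_splits=2)
```

With only three input lines, the cost of splitting and merging is larger than the work itself, which is roughly why a framework built around this pattern tends to be a poor fit for small datasets.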

Apache Spark, the Be-All and End-All.