Use Apache Spark to Speed Up Cluster Computing
January 23, 2018

Anonymous | TrustRadius Reviewer
Score 7 out of 10
Vetted Review
Verified User

Overall Satisfaction with Apache Spark

In our company, we used Spark for a healthcare analytics project that required large-scale data processing in a Hadoop environment. The project is about building an enterprise data lake: we bring in data from multiple products and consolidate it, and downstream we develop business reports on top of it.
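
A minimal PySpark sketch of what that consolidation step might look like; the paths, feed names, and columns here are hypothetical illustrations, not the project's actual layout (assumes Spark 2.3+ for unionByName):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("enterprise-data-lake").getOrCreate()

    # Read raw extracts from two hypothetical product feeds landed in HDFS.
    product_a = spark.read.json("hdfs:///raw/product_a/")
    product_b = spark.read.json("hdfs:///raw/product_b/")

    # Tag each record with its source, then consolidate into one DataFrame
    # (assumes both feeds share the same column names).
    consolidated = (
        product_a.withColumn("source", F.lit("product_a"))
        .unionByName(product_b.withColumn("source", F.lit("product_b")))
    )

    # Persist the consolidated layer as Parquet for the downstream reports.
    consolidated.write.mode("overwrite").parquet("hdfs:///lake/consolidated/")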

Pros

  • We used Spark to make our batch processing faster; Spark is faster than MapReduce at batch processing thanks to its in-memory computing (a sketch follows this list)
  • Spark runs alongside other tools in the Hadoop ecosystem, including Hive and Pig
  • Spark supports both batch and real-time processing
  • Apache Spark ships with machine learning algorithm support (MLlib)
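
To illustrate the in-memory speedup and the Hive interoperability mentioned above, here is a minimal PySpark batch-job sketch; the table and column names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # enableHiveSupport() lets Spark query tables registered in the Hive
    # metastore, which is how it sits alongside Hive in the same cluster.
    spark = (
        SparkSession.builder.appName("claims-batch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Hypothetical Hive table of healthcare claims.
    claims = spark.table("lake.claims")

    # cache() pins the dataset in executor memory, so the two aggregations
    # below reuse it instead of re-reading from disk -- the key difference
    # from MapReduce, which writes intermediate results back to HDFS.
    claims.cache()

    totals_by_provider = (
        claims.groupBy("provider_id")
        .agg(F.sum("amount").alias("total_billed"))
    )
    counts_by_month = (
        claims.groupBy(F.month("service_date").alias("month"))
        .agg(F.count("*").alias("claim_count"))
    )

    totals_by_provider.write.mode("overwrite").parquet("hdfs:///lake/reports/by_provider/")
    counts_by_month.write.mode("overwrite").parquet("hdfs:///lake/reports/by_month/")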

Cons

  • Consumes more memory
  • Difficult to address issues around memory utilization
  • Expensive - in-memory processing is costly when what you need is cost-efficient processing of big data

We were able to make batch jobs 20 times faster than with MapReduce, and with language support for Scala, Java, and Python, Spark is easy to manage. We specifically chose Spark over MapReduce to make cluster processing faster.

Well suited:
1. Integrating data from several sources, including clickstream, logs, and transactional systems
2. Real-time ingestion through Kafka, Kinesis, and other streaming platforms (see the streaming sketch after this list)
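
As a rough illustration of that second use case, here is a Spark Structured Streaming sketch that reads from Kafka and lands events in the lake; the topic, broker address, and event schema are assumptions, and the spark-sql-kafka connector package must be on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

    # Hypothetical clickstream event schema.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("page", StringType()),
        StructField("user_id", StringType()),
    ])

    # Subscribe to a Kafka topic; Kafka hands Spark the payload as bytes.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "clickstream")
        .load()
    )

    # Decode and parse the JSON payload into columns.
    events = (
        raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    # Continuously land the parsed events in the data lake; the checkpoint
    # directory lets the query resume where it left off after a restart.
    query = (
        events.writeStream.format("parquet")
        .option("path", "hdfs:///lake/clickstream/")
        .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
        .start()
    )
    query.awaitTermination()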
