Item: Amazon EMR (Elastic MapReduce)
Rating: 8
Author: Verified User

Use Cases and Deployment Scope

EMR is used by our department, not the whole organization. It is the infrastructure on which we run Spark jobs, mainly for data I/O, data processing, and machine learning applications.

Pros and Cons

Ease of use and ease of setup
Autoscaling functionality
Integrated into the AWS environment

Cost overhead is a bit high
Limited versions of frameworks that can be used

Return on Investment

It was easy to set up initial versions of Spark on this
Still used as our compute platform, as it's easy to manage
At times we forgot to shut down clusters and were overcharged
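
One way to guard against forgotten clusters is a periodic check along the lines below. This is only a minimal sketch, not something we actually ran: the record fields (`name`, `state`, `started`), the 12-hour threshold, and the per-node hourly rate are all hypothetical, and in practice the records would have to be pulled from the EMR API or console.

```python
from datetime import datetime, timedelta, timezone

# Cluster states in which EMR is still billing for running instances.
ACTIVE_STATES = ("RUNNING", "WAITING")


def find_stale_clusters(clusters, now, max_age_hours=12.0):
    """Return names of active clusters older than max_age_hours.

    `clusters` is a list of dicts with hypothetical keys:
    "name" (str), "state" (str), "started" (timezone-aware datetime).
    """
    limit = timedelta(hours=max_age_hours)
    return [
        c["name"]
        for c in clusters
        if c["state"] in ACTIVE_STATES and now - c["started"] > limit
    ]


def estimated_cost(node_count, hours, hourly_rate_per_node):
    """Rough cost of leaving a cluster up; the rate is a placeholder, not a real price."""
    return node_count * hours * hourly_rate_per_node


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Made-up example records for illustration only.
    clusters = [
        {"name": "nightly-etl", "state": "WAITING", "started": now - timedelta(hours=30)},
        {"name": "adhoc-ml", "state": "RUNNING", "started": now - timedelta(hours=2)},
    ]
    for name in find_stale_clusters(clusters, now):
        print(f"{name} has been up for more than 12 hours")
```

Running a check like this on a schedule, and alerting on anything it returns, would likely have caught the cases where we were overcharged.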

Alternatives Considered

Databricks, Cloudera Enterprise and Hortonworks Data Platform

The alternatives to EMR are mainly Hadoop distributions owned by the three companies above. I have not used the other distributions, so it is difficult to comment, but the general tradeoff is that, at the cost of a longer setup time and more infrastructure management, you get more flexible versioning and potentially faster access to newer versions of some frameworks, such as Spark.

Other Software Used

Amazon S3 (Simple Storage Service), Amazon Relational Database Service, Apache Spark, Cassandra, Apache Kafka

Likelihood to Recommend

Well suited if you want to quickly set up a distributed compute platform, such as Spark. But you have to be advanced enough that you really want to separate compute from data storage. For example, for certain applications a packaged solution such as an MPP database (e.g. Redshift) is much easier to set up than Spark on EMR and S3 with the appropriate file formats.

EMR review

Overall Satisfaction with Amazon Elastic MapReduce