Item: Amazon EMR (Elastic MapReduce)
Rating: 7
Author: Verified User

Use Cases and Deployment Scope

We used AWS EMR before moving to Databricks on EC2 instances. EMR solved the problem, but we needed a better solution (an enterprise edition) to manage our workbooks and a better scheduler for running our jobs. EMR worked fine, but we did not find it user-friendly for adding data nodes on demand. We used EMR primarily to process data on AWS S3 using the Hadoop and Spark frameworks. We also used AWS SWF to orchestrate our job flow by adding steps. It was used widely by the data processing team but not by the entire organization, as most of the data was on local servers. It suits data that does not need to be processed live, since the cluster can be spun up for a job and shut down once the job is completed. It is cost-efficient (especially if you do not need data nodes and only task nodes), scalable, and reliable.
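To illustrate the spin-up, run, shut-down pattern described above, here is a minimal sketch of the request one could pass to boto3's EMR `run_job_flow` call. It is not our actual setup: the cluster name, release label, instance types and counts, the S3 script path, and the default role names are all assumptions for illustration.

```python
def build_transient_cluster_request(script_s3_path, task_node_count=2):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    The cluster runs one Spark step and then terminates on its own.
    All names, sizes, and the release label are illustrative assumptions.
    """
    return {
        "Name": "transient-spark-job",      # hypothetical cluster name
        "ReleaseLabel": "emr-5.30.0",       # assumed EMR release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # One small core node; the data itself stays on S3.
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # Task nodes do the bulk of the processing, on cheaper spot capacity.
                {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": task_node_count},
            ],
            # Shut the cluster down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "process-s3-data",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             script_s3_path],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    # Hypothetical bucket and script path.
    request = build_transient_cluster_request("s3://example-bucket/jobs/process.py")
    print(request["Steps"][0]["HadoopJarStep"]["Args"])
    # To launch for real (needs AWS credentials and permissions):
    #   import boto3
    #   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as a plain dictionary makes it easy to review or reuse before anything is actually launched.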

Pros and Cons

EMR manages cost well: task nodes handle the processing, and these instances are cheaper when the data is stored on S3. It is really cost-efficient. There is no need to maintain any libraries to connect to AWS resources.
EMR is highly available, secure and easy to launch. Not much hassle in launching the cluster (simple and easy).
EMR manages the big data frameworks, so the developer need not worry about the memory and framework settings. It's all set up at launch time. The bootstrapping feature is great.

Sometimes bootstrapping certain tools comes with debugging costs. The tools provided by some of the enterprise editions are great compared to EMR.
Unlike some of the enterprise editions, EMR does not provide on-premises options.
No UI client for saving workbooks or code snippets. Everything has to go through the submission process. Not really convenient for tracking jobs either.

Return on Investment

It was obviously cheaper and more convenient to use, as most of our data processing and pipelines are on AWS. It was fast and readily available with a click, and that saved a ton of time compared to having to figure out the downtime of the cluster if it were on premises.
It saved time on processing chunks of big data which had to be processed in a short period with minimal costs. EMR solved this, as cluster setup and processing were simple, easy, cheap and fast.
It had a negative impact in that submitting test jobs was very difficult, as it lacks a UI for submitting Spark code snippets.

Alternatives Considered

Databricks and Hortonworks Data Platform

Having one of these enterprise edition licenses comes at its own cost. But the flexibility to have the cluster spin up with the workbenches and code snippets on the same cluster is really beneficial. Especially if one had to move out of EMR and consider an option that reduces the debugging time in establishing connections to AWS resources, I would love to use the options mentioned above on EC2. This would definitely reduce the processing time, as there is the flexibility to test in real time, execute a code snippet, and monitor its performance as it runs.

Other Software Used

Databricks, Amazon Elastic Compute Cloud (EC2), Amazon DynamoDB, Amazon S3 (Simple Storage Service), Amazon Aurora, Amazon Redshift, Amazon CloudFront, Amazon CloudWatch

Likelihood to Recommend

EMR is well suited if the jobs are long-running and don't really need much monitoring. EMR is really flexible in processing data on S3, as a developer doesn't need to spend time debugging the connections to S3 from a big data framework, since most of the configuration is taken care of by Amazon. It is very cheap compared to most of the solutions on the market, and the ready-to-go configuration at launch time reduces the amount of time required for admin tasks. So, considering the low cost, the processing options on S3, and scalability via adding task nodes, EMR is a good fit for startups considering open source and cost-efficient options.
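As a rough illustration of scaling by adding task nodes, the sketch below builds the request for boto3's EMR `add_instance_groups` call on a running cluster. The cluster ID, group name, instance type and use of spot capacity are assumptions for illustration, not details from our deployment.

```python
def build_add_task_nodes_request(cluster_id, instance_count,
                                 instance_type="m5.xlarge", use_spot=True):
    """Build keyword arguments for boto3's EMR add_instance_groups call.

    Adds one extra task instance group to an existing cluster.
    The group name and defaults are illustrative assumptions.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "JobFlowId": cluster_id,
        "InstanceGroups": [
            {
                "Name": "extra-task-nodes",   # hypothetical group name
                "InstanceRole": "TASK",
                "Market": "SPOT" if use_spot else "ON_DEMAND",
                "InstanceType": instance_type,
                "InstanceCount": instance_count,
            }
        ],
    }


if __name__ == "__main__":
    # "j-EXAMPLE12345" is a placeholder, not a real cluster ID.
    request = build_add_task_nodes_request("j-EXAMPLE12345", instance_count=4)
    print(request["InstanceGroups"][0]["InstanceCount"])
    # To apply for real (needs AWS credentials and a running cluster):
    #   import boto3
    #   boto3.client("emr").add_instance_groups(**request)
```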

However, EMR comes with its own disadvantages. There is no proper UI to track jobs in real time, which is possible with enterprise editions like Cloudera, Hortonworks, etc. EMR could provide an interface to add workbooks and code snippets in the cluster, as it would reduce the time needed to submit tasks. EMR also lacks the ability to automatically replace unhealthy nodes.

AWS EMR at a glance!!

Overall Satisfaction with Amazon Elastic MapReduce