AWS EMR at a glance!!
October 25, 2017
AWS EMR at a glance!!

Score 7 out of 10
Vetted Review
Verified User
Overall Satisfaction with Amazon Elastic MapReduce
We have used AWS EMR before starting to use Databricks on EC2 instances. EMR was solving the problem but we needed a better solution (Enterprise edition) to manage our Workbooks and better scheduler for running or jobs. EMR was working fine but we did not find it user friendly to add the data nodes on demand. We used EMR primarily to process the data on AWS S3 using Hadoop and Spark frameworks. We have also used AWS SWF to orchestrate our job flow by adding steps. It was used widely by the data processing team and not by the entire organization as most of the data was on local servers. It addresses problems like processing data which might not need to be processed live as the cluster can be spun up and shut down once the job is completed. It is cost efficient (especially if you do not need data nodes and only task nodes), scalable and reliable.
Pros
- EMR does well in managing the cost as it uses the task node cores to process the data and these instances are cheaper when the data is stored on s3. It is really cost efficient. No need to maintain any libraries to connect to AWS resources.
- EMR is highly available, secure and easy to launch. No much hassle in launching the cluster (Simple and easy).
- EMR manages the big data frameworks which the developer need not worry (no need to maintain the memory and framework settings) about the framework settings. It's all setup on launch time. The bootstrapping feature is great.
Cons
- Sometimes bootstrapping certain tools comes with debugging costs. The tools provided by some of the enterprise editions are great compared to EMR.
- Like some of the enterprise editions EMR does not provide on premises options.
- No UI client for saving the workbooks or code snippets. Everything has to go through submitting process. Not really convenient for tracking the job as well.
- It was obviously cheaper and convenient to use as most of our data processing and pipelines are on AWS. It was fast and readily available with a click and that saved a ton of time rather than having to figure out the down time of the cluster if its on premises.
- It saved time on processing chunks of big data which had to be processed in short period with minimal costs. EMR solved this as the cluster setup time and processing was simple, easy, cheap and fast.
- It had a negative impact as it was very difficult in submitting the test jobs as it lags a UI to submit spark code snippets.
Having one of these enterprise edition license comes at its own costs. But, the flexibility to have the cluster spin up with the workbenches and code snippets on the same is really beneficial. Especially, if one had to move out of EMR and consider an option which reduces the debugging time in establishing connections to AWS resources, I would love to used the mentioned three resources on EC2. This would definitely make the processing time to reduce as there is a flexibility to test real time and execute the code snippet and look at the performance and monitor the snippet in real time.
Comments
Please log in to join the conversation