We have used AWS EMR before starting to use Databricks
on EC2 instances. EMR was solving the problem but we needed a better solution (Enterprise edition) to manage our Workbooks and better scheduler for running or jobs. EMR was working fine but we did not find it user friendly to add the data nodes on demand. We used EMR primarily to process the data on AWS S3 using Hadoop and Spark frameworks. We have also used AWS SWF to orchestrate our job flow by adding steps. It was used widely by the data processing team and not by the entire organization as most of the data was on local servers. It addresses problems like processing data which might not need to be processed live as the cluster can be spun up and shut down once the job is completed. It is cost efficient (especially if you do not need data nodes and only task nodes), scalable and reliable.