AWS EC2 for Data Science
August 30, 2016

David Choi | TrustRadius Reviewer
Score 10 out of 10
Vetted Review
Verified User

Overall Satisfaction with Amazon Elastic Compute Cloud (EC2)

The organization uses EC2 for any computing we don't want to do locally. Specifically, my team uses EC2 to run large data processing jobs. We maintain Docker images whose environments contain exactly the language versions and dependencies needed for a specific task or set of tasks. From there, the EC2 instance reads data from the data source and writes results to a database or S3.
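A minimal sketch of what such a containerized job might look like in Python, to make the read/process/write pattern concrete. Everything here is an assumption for illustration: the `process` filter, the `id`/`value` columns, and the `run_job` name are hypothetical, and the S3 client is passed in so the function works with anything exposing a boto3-style `put_object(Bucket=..., Key=..., Body=...)` method.

```python
import csv
import io


def process(rows):
    """Hypothetical transformation: keep rows whose 'value' field is positive."""
    return [row for row in rows if float(row["value"]) > 0]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def run_job(rows, s3_client, bucket, key):
    """Process the input rows and upload the result to S3 as a CSV object.

    s3_client is assumed to behave like boto3.client("s3").
    Returns the number of rows written.
    """
    kept = process(rows)
    body = rows_to_csv(kept, fieldnames=["id", "value"])
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return len(kept)
```

Inside the Docker image, the entry point would build a real client (for example `boto3.client("s3")`) and pass it to `run_job`. Injecting the client also makes the job easy to test locally without touching AWS.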
Pros:
  • Flexible: Can get exactly the specs you need, on demand.
  • AWS CLI: The EC2 API via the AWS CLI is great for debugging, monitoring, etc.
  • Reliable: Rarely have problems or unexpected behavior related to EC2 itself.

Cons:
  • Logging: Sometimes getting the correct logs is difficult.
  • Speed: Spinning up a cluster isn't always fast.
  • Pricing: The documentation isn't very clear on how billable hours accrue.

  • Positive: Easy to set up, very effective
  • Positive: Easy to maintain, doesn't require many engineering hours for maintenance
  • Positive: Very flexible, fits well with our current stack

Alternatives considered: AWS EMR
For Hadoop/Spark jobs, we use AWS EMR. We evaluated it against using plain EC2 instances and installing the necessary software ourselves. We went with EMR to ensure consistent builds, although it is slightly more expensive.

EC2 is appropriate for:
  • Long running tasks
  • Tasks that require additional computing power
  • Tasks that require variable amounts of computing power
  • Scheduled tasks
  • Tasks that require a specific build of a language

It is not as appropriate for:
  • Doing scheduling itself
  • Very on-demand tasks (other AWS options are better)
  • Companies on an extreme budget
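
Both the pricing concern and the budget caveat come down to how instance hours are counted. A rough sketch of the arithmetic, assuming per-instance hourly billing with partial hours rounded up (how on-demand EC2 was generally billed around the time of this review) and a made-up hourly rate. The function names and the rate are hypothetical, so check current AWS pricing before relying on this.

```python
import math


def billed_hours(runtime_minutes):
    """Hours billed for one instance, assuming partial hours round up."""
    if runtime_minutes <= 0:
        return 0
    return math.ceil(runtime_minutes / 60)


def cluster_cost(runtime_minutes, instance_count, hourly_rate):
    """Estimated cost of running instance_count identical instances."""
    return billed_hours(runtime_minutes) * instance_count * hourly_rate


# Hypothetical example: 10 instances running 61 minutes at $0.50/hour.
# Each instance is billed 2 full hours under the round-up assumption.
print(cluster_cost(runtime_minutes=61, instance_count=10, hourly_rate=0.50))
```

Under this assumption, a job that runs one minute past the hour on a large cluster costs twice as much as one that finishes just under it, which is worth keeping in mind when sizing clusters.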