Item: AWS Glue
Rating: 9
Author: Apurv Doshi

Overall Satisfaction with AWS Glue

Use Cases and Deployment Scope

We use AWS Glue for ETL of healthcare data. The input data comes from different source systems and therefore arrives in different formats. AWS Glue jobs translate it into a common format. Using the scheduled job feature, the data is fetched periodically, processed by Python scripts, converted to Parquet, and stored in an S3 bucket. The Glue Data Catalog records the schema of the stored data, which allows Amazon Athena to query it for analytics.
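To make the "common format" step concrete, here is a minimal sketch in plain Python. It is not our actual Glue job: the source names, field names, and timestamp formats are all hypothetical, and treating every timestamp as UTC is an assumption made for the example.

```python
from datetime import datetime, timezone

# Hypothetical per-source mappings: source field name -> common field name.
FIELD_MAPS = {
    "clinic_a": {"pid": "patient_id", "visit_ts": "visit_time", "dx": "diagnosis_code"},
    "lab_b": {"PatientID": "patient_id", "Timestamp": "visit_time", "Code": "diagnosis_code"},
}

# Hypothetical timestamp format used by each source system.
TIME_FORMATS = {
    "clinic_a": "%Y-%m-%d %H:%M:%S",
    "lab_b": "%d/%m/%Y %H:%M",
}


def to_common_format(source, record):
    """Rename fields and normalize the timestamp to ISO 8601 (UTC assumed)."""
    mapping = FIELD_MAPS[source]
    # Keep only the fields we know how to map, under their common names.
    common = {mapping[key]: value for key, value in record.items() if key in mapping}
    parsed = datetime.strptime(common["visit_time"], TIME_FORMATS[source])
    common["visit_time"] = parsed.replace(tzinfo=timezone.utc).isoformat()
    # Record where the row came from so it can be traced back later.
    common["source_system"] = source
    return common


rows = [
    to_common_format("clinic_a", {"pid": "P-1", "visit_ts": "2021-03-04 10:15:00", "dx": "E11"}),
    to_common_format("lab_b", {"PatientID": "P-2", "Timestamp": "05/03/2021 08:30", "Code": "I10"}),
]
```

After this step both rows share the same keys, which is what makes a single columnar output such as Parquet straightforward. In a real Glue job the same idea would typically be expressed with Spark transformations rather than a Python loop.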

Pros and Cons

Pros

It is extremely fast, easy, and intuitive. Though it is a suite of services, it takes very little time to get comfortable with it.
As it is a managed service, one need not take care of many underlying details. Schema identification, code generation, customization, and orchestration of the different job components let developers focus on the core business problem without worrying about infrastructure.
It is a pay-as-you-go service, so there is no need to provision capacity in advance. This also makes scheduling much easier.

Cons

The sample code is quite basic and should cover more scenarios. However, you can find good pointers on the internet, in the AWS community, and in support tickets.
AWS Glue runs on Apache Spark, so to get the most out of the service, the developer should have a good understanding of Apache Spark.

Most Important Features

AWS Glue Data Catalog, for writing efficient queries.
AWS Glue Crawler, for automatic schema recognition.
AWS Glue scheduled jobs, for running ETL tasks at a defined interval.
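
As an illustration of the scheduled job feature, the sketch below builds the request for a time-based Glue trigger and sends it with boto3's `create_trigger`. The trigger name, job name, and schedule are hypothetical. The send step needs the boto3 package and valid AWS credentials, so it is kept in a separate function that is not called here.

```python
def scheduled_trigger_request(trigger_name, job_name, cron_expression):
    """Build keyword arguments for the Glue CreateTrigger API call."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",              # time-based trigger
        "Schedule": cron_expression,      # Glue cron syntax, evaluated in UTC
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,          # activate the trigger immediately
    }


def create_trigger(request):
    """Send the request with boto3 (needs AWS credentials and the boto3 package)."""
    import boto3  # imported lazily so the request builder works without boto3

    return boto3.client("glue").create_trigger(**request)


request = scheduled_trigger_request(
    trigger_name="nightly-healthcare-etl",  # hypothetical name
    job_name="normalize-to-parquet",        # hypothetical job
    cron_expression="cron(0 2 * * ? *)",    # every day at 02:00 UTC
)
```

Keeping the request builder separate from the API call makes the schedule easy to review and test without touching an AWS account.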

Return on Investment

We used to transform the data with a simple Python script and faced many orchestration issues. The script failed frequently because the data was fairly dynamic. With AWS Glue, we fixed about 80% of the orchestration issues, and automatic schema generation handles the dynamic nature of the data well. As a result, we started realising ROI from day one.
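
To show why automatic schema generation helps with dynamic data, here is a toy sketch of schema inference. It is only an illustration of the idea, not how the Glue crawler is implemented, and the record fields are hypothetical. Instead of failing when a field is new or changes type, it records every type seen for each field.

```python
def infer_schema(records):
    """Return {field: sorted list of type names seen} across all records."""
    seen = {}
    for record in records:
        for field, value in record.items():
            # Collect every type observed for this field instead of assuming one.
            seen.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


batch = [
    {"patient_id": "P-1", "heart_rate": 72},
    {"patient_id": "P-2", "heart_rate": 68.5, "notes": "follow-up"},
]
schema = infer_schema(batch)
# schema == {"patient_id": ["str"], "heart_rate": ["float", "int"], "notes": ["str"]}
```

A hand-written script that assumed `heart_rate` was always an integer, or that `notes` never appeared, would break on the second record. Glue's DynamicFrame has a comparable notion of a choice type for fields whose type varies.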

Alternatives Considered

AWS Data Pipeline

Glue is a managed service, whereas AWS Data Pipeline puts additional responsibility on the user to manage the infrastructure. We did not need the fine-grained control of the hardware that AWS Data Pipeline provides. We also wanted to store our data in DynamoDB. AWS Glue supports writing data to DynamoDB, which in our evaluation AWS Data Pipeline did not. So, we decided to move ahead with AWS Glue.

Key Insights

Do you think AWS Glue delivers good value for the price?

Yes

Are you happy with AWS Glue's feature set?

Yes

Did AWS Glue live up to sales and marketing promises?

Yes

Did implementation of AWS Glue go as expected?

Yes

Would you buy AWS Glue again?

Yes

Other Software Used

Amazon SageMaker, Alexa, Amazon Lex

Likelihood to Recommend

AWS Glue is best suited to data whose format, schema, and volume vary. When the volume is inconsistent (typical of healthcare and online shopping), AWS Glue can be the prime choice. However, when the data arrives in both batch and streaming modes, the developer has to maintain a separate codebase for each, which increases source code management effort. So, prefer Glue when the data arrives in a single mode (either batch or streaming).

AWS Glue - The managed ETL service for your data