Dataflow Eliminating ETL Infrastructure Overhead
Use Cases and Deployment Scope
Pros
- We require exactly once processing for our invoices where accuracy is very important.
- The native connectors for Bigquery and Storage and BQtoStorage templates made our job easy as we didn't have to write custom templates.
- We chose Google Cloud Dataflow because of the unified stream and batch processing capabilities. As we are working on stream processing for data we get from Google in Billing Exports.
Cons
- More templates for Bigquery and App Engine. There is only limited options for templates so the things we use can limit.
- I would like native connectors for Excel (XLSX) to reduce the need for custom wrappers in financial pipelines.
- Debugging Google Cloud Dataflow using only logs in Cloud Logging can be overwhelming sometimes, and it’s not always obvious which specific element in the flow caused a failure. IT uses a lot of time.
Return on Investment
- IT has automated our workflow and data enrichment steps which were very resource and time hungry steps.
- Unlike traditional ETL tools that require a 24/7 server, Dataflow scales to zero when there are no files are in GCS which is very important for us.
- With the Apache Beam SDK you can write a pipeline once and handle the entire GCS-to-BigQuery flow.
