Dataflow Eliminating ETL Infrastructure Overhead
Use Cases and Deployment Scope
We use Google Cloud Dataflow as the primary ETL engine for our billing application. Our architecture ingests raw financial data stored in Cloud Storage (Excel format), which Dataflow pipelines then process for data cleansing, schema mapping, and validation. We use Dataflow's batch processing to transform this raw data into structured datasets in BigQuery, which automatically triggers generation of a new invoice and makes it available for download.
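The cleansing and schema-mapping step can be sketched as a plain Python function of the kind that would run inside a Beam `DoFn` or `beam.Map`. The field names and validation rules below are hypothetical illustrations, not our production schema.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

def clean_billing_row(raw: dict) -> Optional[dict]:
    """Map one raw spreadsheet row onto a BigQuery-friendly record.

    Returns None for rows that fail validation, so the pipeline can
    route them to a dead-letter output instead of into an invoice.
    """
    try:
        # Spreadsheet exports often carry thousands separators and whitespace.
        amount = Decimal(str(raw["amount"]).replace(",", "").strip())
        billed_on = datetime.strptime(raw["date"].strip(), "%d/%m/%Y").date()
    except (KeyError, InvalidOperation, ValueError):
        return None
    if amount < 0:  # assumption: negative charges are invalid in this sketch
        return None
    return {
        "customer_id": str(raw["customer"]).strip().upper(),
        "amount_usd": float(amount),
        "billed_on": billed_on.isoformat(),
    }
```

In a Beam pipeline this would sit behind something like `rows | beam.Map(clean_billing_row)`, followed by a filter that splits out the `None` results.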
Pros
- Dataflow guarantees exactly-once processing, which we require for our invoices, where accuracy is critical.
- The native connectors for BigQuery and Cloud Storage, along with the Google-provided BigQuery-to-Storage templates, made our job easy; we didn't have to write custom pipelines.
- We chose Google Cloud Dataflow for its unified stream and batch processing model, since we are also building stream processing for the data we receive from Google via Billing exports.
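As an illustration of the template workflow, a Google-provided template can be launched straight from `gcloud` with no custom code. The job, bucket, dataset, and UDF names below are hypothetical, and the parameter names should be checked against the documentation for the specific template version you run.

```shell
# Launch the Google-provided GCS-Text-to-BigQuery template.
# Everything after --gcs-location is configuration, not custom pipeline code.
gcloud dataflow jobs run billing-load-20240201 \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --parameters=\
inputFilePattern=gs://my-billing-bucket/exports/*.csv,\
JSONPath=gs://my-billing-bucket/schema/billing_schema.json,\
outputTable=my-project:billing.raw_charges,\
javascriptTextTransformGcsPath=gs://my-billing-bucket/udf/transform.js,\
javascriptTextTransformFunctionName=transform,\
bigQueryLoadingTemporaryDirectory=gs://my-billing-bucket/tmp
```

Note that this template consumes delimited text, not spreadsheets, which is part of why a native XLSX connector would save us a conversion step.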
Cons
- We would like more templates for BigQuery and App Engine. The current selection is limited, which restricts what we can do without writing custom code.
- I would like native connectors for Excel (XLSX) to reduce the need for custom wrappers in financial pipelines.
- Debugging Google Cloud Dataflow using only the logs in Cloud Logging can be overwhelming; it's not always obvious which specific element in the flow caused a failure, and tracking one down takes a lot of time.
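Until a native connector exists, the custom wrapper can stay fairly small, since XLSX is a zip archive of XML parts. Below is a minimal stdlib-only sketch of a reader that handles shared strings and plain cell values only (no styles, dates, or formulas); a production pipeline would more likely wrap a library such as openpyxl inside a `DoFn`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the SpreadsheetML worksheet and shared-strings parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def read_xlsx_rows(data: bytes, sheet_path: str = "xl/worksheets/sheet1.xml"):
    """Yield each row of one worksheet as a list of cell strings.

    Minimal by design: supports shared strings (t="s") and plain <v>
    values only -- styles, dates, and formula results are ignored.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            sst = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            shared = ["".join(t.text or "" for t in si.iter(NS + "t"))
                      for si in sst.iter(NS + "si")]
        ws = ET.fromstring(zf.read(sheet_path))
        for row in ws.iter(NS + "row"):
            cells = []
            for cell in row.iter(NS + "c"):
                v = cell.find(NS + "v")
                raw = v.text if v is not None and v.text else ""
                if cell.get("t") == "s" and raw:
                    cells.append(shared[int(raw)])  # shared-string index
                else:
                    cells.append(raw)
            yield cells
```

Each yielded row could then be fed into the cleansing step before loading into BigQuery.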
Likelihood to Recommend
Dataflow is best when you have both batch and streaming data, or when you have batch data today and expect streaming data in the future. It is also a strong fit when most of your infrastructure is already on GCP.
It is less suitable if you are already on AWS or Azure, or if you want in-depth control over security and infrastructure management. In those cases you can run Apache Beam directly on a runner you manage instead of using Dataflow.
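That portability is concrete in Beam: the pipeline code stays the same and only the runner options change. The flags shown are standard Beam pipeline options, but the script, project, and bucket names are hypothetical, and exact options vary by runner and Beam version.

```shell
# Same pipeline.py, different runners -- only the options change.

# Local development and testing:
python pipeline.py --runner=DirectRunner

# Fully managed execution on GCP:
python pipeline.py --runner=DataflowRunner \
  --project=my-project --region=us-central1 \
  --temp_location=gs://my-bucket/tmp

# Self-managed cluster, e.g. when you need full control or are off GCP
# (exact connection flags depend on the runner):
python pipeline.py --runner=FlinkRunner --flink_master=host:8081
```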