Dataflow Eliminating ETL Infrastructure Overhead
Use Cases and Deployment Scope
We use Google Cloud Dataflow as the primary ETL engine for our billing application. Our architecture ingests raw financial data stored in Cloud Storage (Excel format), which is then processed via Dataflow pipelines to handle data cleansing, schema mapping, and validation. We use Google Cloud Dataflow's batch processing to transform this semi-structured data into structured datasets within BigQuery. Each completed load automatically triggers generation of a new invoice, which is then ready for download.
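A minimal sketch of the cleansing and schema-mapping step described above, written as a plain Python function of the kind that could run inside a Dataflow transform. The field names (`Account ID`, `Amount`) and the target record shape are invented for illustration, not taken from our actual schema.

```python
# Hypothetical sketch of the cleansing/schema-mapping step; field names
# ("Account ID", "Amount") and the output record shape are invented.

def map_billing_row(raw):
    """Map one raw spreadsheet row (dict of strings) to a structured
    record ready for loading into BigQuery, with basic validation."""
    errors = []
    account_id = (raw.get("Account ID") or "").strip()
    if not account_id:
        errors.append("missing Account ID")
    try:
        # Spreadsheets often carry amounts as text with currency noise.
        amount = float(str(raw.get("Amount", "")).replace(",", "").replace("$", ""))
    except ValueError:
        amount = None
        errors.append("unparseable Amount")
    record = {"account_id": account_id, "amount": amount}
    return record, errors

row, errs = map_billing_row({"Account ID": " A-42 ", "Amount": "$1,250.00"})
# row == {"account_id": "A-42", "amount": 1250.0}; errs == []
```

Collecting validation errors per row, instead of raising, lets the pipeline load clean rows while reporting bad ones.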
Pros
- Dataflow's exactly-once processing guarantee is essential for us, since invoice accuracy is critical.
- The native connectors for BigQuery and Cloud Storage, along with the prebuilt BigQuery-to-Cloud-Storage templates, made our job easy; we didn't have to write custom templates.
- We chose Google Cloud Dataflow for its unified stream and batch processing capabilities, as we are also working on stream processing for the data we receive from Google via Billing Export.
Cons
- We would like more templates for BigQuery and App Engine. The current template options are limited, which constrains what we can build with them.
- I would like native connectors for Excel (XLSX) to reduce the need for custom wrappers in financial pipelines.
- Debugging Google Cloud Dataflow using only logs in Cloud Logging can be overwhelming, and it's not always obvious which specific element in the flow caused a failure. It consumes a lot of our time.
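One way to soften the "which element failed?" problem is a dead-letter pattern: failing elements are captured alongside their errors instead of crashing the pipeline. In a real Beam pipeline this would be a DoFn emitting to a tagged side output; the sketch below shows the idea in plain Python.

```python
# Sketch of a dead-letter pattern for tracing failures: each failing
# element is recorded together with its error, so logs point directly
# at the offending input instead of a generic stack trace.

def process_with_dead_letter(elements, transform):
    """Apply `transform` to each element; route failures, with the
    offending element attached, to a dead-letter list."""
    good, dead = [], []
    for element in elements:
        try:
            good.append(transform(element))
        except Exception as exc:
            # Keeping the element itself answers "which input failed?"
            dead.append({"element": element, "error": repr(exc)})
    return good, dead

good, dead = process_with_dead_letter(["10", "x", "3"], int)
# good == [10, 3]; dead holds the element "x" with its ValueError
```

In Dataflow, the dead-letter output can be written to its own BigQuery table for later inspection.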
Return on Investment
- It has automated our workflow and data-enrichment steps, which were very resource- and time-intensive.
- Unlike traditional ETL tools that require a 24/7 server, Dataflow scales to zero when there are no files in GCS, which is very important for us.
- With the Apache Beam SDK, you can write a pipeline once and have it handle the entire GCS-to-BigQuery flow.
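A toy illustration of the write-once idea from the bullets above: the same chain of steps handles any source of rows, whether a batch dump from GCS or a streamed iterator of billing-export events. This is plain Python standing in for the Beam SDK, and the row format (`account,amount` CSV lines) is invented.

```python
# Plain-Python stand-in for a Beam pipeline: the same parse/filter/
# aggregate chain runs unchanged over batch or streamed input. The
# "account,amount" row format is a made-up example.

def run_pipeline(rows):
    """Parse, filter, and aggregate rows regardless of their origin."""
    parsed = (dict(zip(("account", "amount"), r.split(","))) for r in rows)
    valid = (r for r in parsed if r["amount"].replace(".", "", 1).isdigit())
    totals = {}
    for r in valid:
        totals[r["account"]] = totals.get(r["account"], 0.0) + float(r["amount"])
    return totals

# The same code serves a batch file dump or a streamed iterator:
print(run_pipeline(["a,10.5", "a,2", "b,oops", "b,3"]))
# → {'a': 12.5, 'b': 3.0}
```

With Beam, swapping the source (e.g. a text read from GCS versus a Pub/Sub subscription) is what distinguishes batch from streaming; the transform logic stays the same.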
Usability
Other Software Used
Google Cloud Datastore, Google BigQuery, Google App Engine
