Dataflow Eliminating ETL Infrastructure Overhead
Use Cases and Deployment Scope
We use Google Cloud Dataflow as the primary ETL engine for our billing application. Our architecture ingests raw financial data stored in Cloud Storage (Excel format), which is then processed via Dataflow pipelines to handle data cleansing, schema mapping, and validation. We use Google Cloud Dataflow's batch processing to transform this semi-structured data into structured datasets within BigQuery. Each completed load automatically triggers generation of a new invoice, which is then ready for download.
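A minimal sketch of the cleansing and schema-mapping step described above, written as a plain Python function of the kind that could run inside a Dataflow transform. The field names (`Account ID`, `Amount`) and the target record shape are invented for illustration, not taken from our actual schema.

```python
# Hypothetical sketch of the cleansing/schema-mapping step; field names
# ("Account ID", "Amount") and the output record shape are invented.

def map_billing_row(raw):
    """Map one raw spreadsheet row (dict of strings) to a structured
    record ready for loading into BigQuery, with basic validation."""
    errors = []
    account_id = (raw.get("Account ID") or "").strip()
    if not account_id:
        errors.append("missing Account ID")
    try:
        # Spreadsheets often carry amounts as text with currency noise.
        amount = float(str(raw.get("Amount", "")).replace(",", "").replace("$", ""))
    except ValueError:
        amount = None
        errors.append("unparseable Amount")
    record = {"account_id": account_id, "amount": amount}
    return record, errors

row, errs = map_billing_row({"Account ID": " A-42 ", "Amount": "$1,250.00"})
# row == {"account_id": "A-42", "amount": 1250.0}; errs == []
```

Collecting validation errors per row, instead of raising, lets the pipeline load clean rows while reporting bad ones.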
Pros
- Dataflow's exactly-once processing guarantee is essential for us, since invoice accuracy is critical.
- The native connectors for BigQuery and Cloud Storage, along with the prebuilt BigQuery-to-Cloud-Storage templates, made our job easy; we didn't have to write custom templates.
- We chose Google Cloud Dataflow for its unified stream and batch processing capabilities, as we are also working on stream processing for the data we receive from Google via Billing Export.
Cons
- We would like more templates for BigQuery and App Engine. The current template options are limited, which constrains what we can build with them.
- I would like native connectors for Excel (XLSX) to reduce the need for custom wrappers in financial pipelines.
- Debugging Google Cloud Dataflow using only logs in Cloud Logging can be overwhelming, and it's not always obvious which specific element in the flow caused a failure. It consumes a lot of our time.
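One way to soften the "which element failed?" problem is a dead-letter pattern: failing elements are captured alongside their errors instead of crashing the pipeline. In a real Beam pipeline this would be a DoFn emitting to a tagged side output; the sketch below shows the idea in plain Python.

```python
# Sketch of a dead-letter pattern for tracing failures: each failing
# element is recorded together with its error, so logs point directly
# at the offending input instead of a generic stack trace.

def process_with_dead_letter(elements, transform):
    """Apply `transform` to each element; route failures, with the
    offending element attached, to a dead-letter list."""
    good, dead = [], []
    for element in elements:
        try:
            good.append(transform(element))
        except Exception as exc:
            # Keeping the element itself answers "which input failed?"
            dead.append({"element": element, "error": repr(exc)})
    return good, dead

good, dead = process_with_dead_letter(["10", "x", "3"], int)
# good == [10, 3]; dead holds the element "x" with its ValueError
```

In Dataflow, the dead-letter output can be written to its own BigQuery table for later inspection.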
Return on Investment
- It has automated our workflow and data-enrichment steps, which were very resource- and time-intensive.
- Unlike traditional ETL tools that require a 24/7 server, Dataflow scales to zero when there are no files in GCS, which is very important for us.
- With the Apache Beam SDK, you can write a pipeline once and have it handle the entire GCS-to-BigQuery flow.
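A toy illustration of the write-once idea from the bullets above: the same chain of steps handles any source of rows, whether a batch dump from GCS or a streamed iterator of billing-export events. This is plain Python standing in for the Beam SDK, and the row format (`account,amount` CSV lines) is invented.

```python
# Plain-Python stand-in for a Beam pipeline: the same parse/filter/
# aggregate chain runs unchanged over batch or streamed input. The
# "account,amount" row format is a made-up example.

def run_pipeline(rows):
    """Parse, filter, and aggregate rows regardless of their origin."""
    parsed = (dict(zip(("account", "amount"), r.split(","))) for r in rows)
    valid = (r for r in parsed if r["amount"].replace(".", "", 1).isdigit())
    totals = {}
    for r in valid:
        totals[r["account"]] = totals.get(r["account"], 0.0) + float(r["amount"])
    return totals

# The same code serves a batch file dump or a streamed iterator:
print(run_pipeline(["a,10.5", "a,2", "b,oops", "b,3"]))
# → {'a': 12.5, 'b': 3.0}
```

With Beam, swapping the source (e.g. a text read from GCS versus a Pub/Sub subscription) is what distinguishes batch from streaming; the transform logic stays the same.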
Usability
Other Software Used
Google Cloud Datastore, Google BigQuery, Google App Engine
