TrustRadius: an HG Insights company

Google Cloud Dataflow

Score: 9.1 out of 10

36 Reviews and Ratings

What is Google Cloud Dataflow?

Google offers Cloud Dataflow, a managed streaming analytics platform for real-time data insights, fraud detection, and other purposes.

Top Performing Features

  • Data Ingestion from Multiple Data Sources

    Ability to ingest data from many sources, including Internet of Things (IoT) endpoint data, stock trading data, etc., as well as static data

    Category average: 8.8

  • Low Latency

    How many milliseconds or seconds it takes to ingest, analyze, and respond to an incoming event or data point

    Category average: 8.7

  • Real-Time Data Analysis

    Ability to analyze data in motion

    Category average: 8.6

Areas for Improvement

  • Integrated Development Tools

    Tools to allow developers to rapidly create streaming applications via a graphical user interface and selection of predefined functions and operators

    Category average: 7.4

  • Machine Learning Automation

    Machine learning helps automate predictive scoring on streaming data

    Category average: 8.2

  • Visualization Dashboards

    Easy-to-understand pictorial illustration of data with graphs, charts, and dashboards

    Category average: 8.6

Dataflow Eliminating ETL Infrastructure Overhead

Use Cases and Deployment Scope

We use Google Cloud Dataflow as the primary ETL engine for our billing application. Our architecture ingests raw financial data stored in Cloud Storage (Excel format), which is then processed via Dataflow pipelines that handle data cleansing, schema mapping, and validation. We use Google Cloud Dataflow's batch processing to transform this unstructured data into structured datasets within BigQuery. This automatically triggers the generation of a new invoice, which is kept ready for download.
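The cleansing, schema mapping, and validation steps mentioned above might look roughly like the sketch below. It is written as plain Python functions so it runs without any cloud setup; in a real pipeline each function would presumably be wrapped in an Apache Beam transform. The column names, the `COLUMN_MAP` dictionary, and the validation rules are all hypothetical and are not taken from the reviewer's pipeline.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical mapping from raw spreadsheet headers to target column names.
COLUMN_MAP = {"Customer ID": "customer_id", "Amount": "amount"}


def cleanse(row: dict) -> dict:
    """Strip surrounding whitespace from headers and string values."""
    return {
        str(key).strip(): value.strip() if isinstance(value, str) else value
        for key, value in row.items()
    }


def map_schema(row: dict) -> dict:
    """Rename known columns to the target schema and drop unknown ones."""
    return {COLUMN_MAP[key]: value for key, value in row.items() if key in COLUMN_MAP}


def validate(row: dict) -> Optional[dict]:
    """Return a typed row, or None when the row fails the (assumed) rules."""
    if not row.get("customer_id"):
        return None
    try:
        amount = Decimal(str(row.get("amount")))
    except InvalidOperation:
        return None
    # Reject NaN, infinities, and negative amounts.
    if not amount.is_finite() or amount < 0:
        return None
    return {"customer_id": str(row["customer_id"]), "amount": amount}


def transform(rows: list) -> list:
    """Run cleanse -> map_schema -> validate and keep only valid rows."""
    results = []
    for row in rows:
        valid = validate(map_schema(cleanse(row)))
        if valid is not None:
            results.append(valid)
    return results
```

Keeping each step as a separate pure function makes it easy to test outside Dataflow before deploying the pipeline.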

Pros

  • Dataflow supports exactly-once processing, which we require for our invoices, where accuracy is very important.
  • The native connectors for BigQuery and Cloud Storage and the BQtoStorage templates made our job easy, as we didn't have to write custom templates.
  • We chose Google Cloud Dataflow because of its unified stream and batch processing capabilities, as we are working on stream processing for the data we get from Google in Billing Exports.

Cons

  • We would like more templates for BigQuery and App Engine. There are only limited template options, which can restrict what we are able to do.
  • I would like native connectors for Excel (XLSX) to reduce the need for custom wrappers in financial pipelines.
  • Debugging Google Cloud Dataflow using only logs in Cloud Logging can be overwhelming sometimes, and it’s not always obvious which specific element in the flow caused a failure. It takes a lot of time.

Return on Investment

  • It has automated our workflow and data enrichment steps, which were very resource- and time-hungry.
  • Unlike traditional ETL tools that require a 24/7 server, Dataflow scales to zero when there are no files in GCS, which is very important for us.
  • With the Apache Beam SDK you can write a pipeline once and handle the entire GCS-to-BigQuery flow.

Other Software Used

Google Cloud Datastore, Google BigQuery, Google App Engine

Google Managed data processing service

Use Cases and Deployment Scope

In our company, we use Google Cloud Dataflow to create data pipelines for data transformation and ingestion use cases before loading data into the database. The flexibility to create our own Flex Templates for special-case handling and the capability to handle both streaming and batch data loads are some of the benefits. We have some real-time loads, where Dataflow helps a lot.

Pros

  • Streaming, real-time workloads
  • Batch processing
  • Autoscaling
  • Flexible pricing

Cons

  • Built-in template options could be expanded
  • More data connector options
  • Ease of use

Return on Investment

  • Cost savings from not having to manage our own data center for ETL servers
  • Consumption-based pricing
  • With the autoscaling feature, we were able to expand components to support the workload

Other Software Used

Google BigQuery, erwin Data Modeler, Microsoft Teams