Google Cloud Pub/Sub, the jewel of streaming data
Use Cases and Deployment Scope
We used Google Cloud Pub/Sub to solve ETL/streaming and real-time processing problems for high volumes of data: filling data lakes, processing and storing data in warehouses or data marts, and processing events serialized as either JSON or protobuf.
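Since Pub/Sub message data is just bytes, the JSON case boils down to encoding on the producer side and decoding on the consumer side. A minimal sketch in Python, with a hypothetical event shape (the `encode_event`/`decode_event` helpers and the `order_id` fields are illustrative, not from any library):

```python
import json

def encode_event(event: dict) -> bytes:
    """Serialize an event dict to the UTF-8 JSON bytes Pub/Sub carries as message data."""
    return json.dumps(event).encode("utf-8")

def decode_event(data: bytes) -> dict:
    """Deserialize message data back into an event dict on the consumer side."""
    return json.loads(data.decode("utf-8"))

# With the google-cloud-pubsub client you would pass these bytes as the
# data argument when publishing, e.g. publisher.publish(topic_path, data=...).
event = {"order_id": "1234", "status": "shipped"}
payload = encode_event(event)
```

The protobuf case is the same idea, with the generated message's `SerializeToString()`/`ParseFromString()` in place of the JSON calls.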
This was integrated across many languages, including Python, Java, Go, and Kotlin. We had configured Kubernetes autoscaling based on Google Cloud Pub/Sub metrics, which worked very well. The main metrics we observed for alerts, and as overall health indicators of our systems, were the number of undelivered messages in each subscription and the age of the oldest unacknowledged message, indicating a high-volume jam or a specific error on a single message, respectively.
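The autoscaling setup can be sketched as a HorizontalPodAutoscaler driven by the Pub/Sub backlog metric. This is a config sketch, assuming an external-metrics adapter for Google Cloud Monitoring (such as the Custom Metrics Stackdriver Adapter) is installed in the cluster; the deployment and subscription names are placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: event-consumer
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: event-consumer        # placeholder consumer deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          # backlog size of the subscription, exported by Cloud Monitoring
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              resource.labels.subscription_id: my-subscription  # placeholder
        target:
          type: AverageValue
          averageValue: "100"   # scale up when backlog per pod exceeds ~100
```

The oldest-unacked-message-age metric was used for alerting rather than scaling, since one stuck message ages the queue without growing it.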
We had to handle idempotency, since duplicate message delivery is a possibility; this was usually paired with a Redis cache to guarantee idempotency within a reasonable time window.
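The dedupe pattern is simple: atomically claim the message id with a TTL, and skip processing if it was already claimed. A minimal sketch in Python; in production this would be Redis, where `r.set(key, 1, nx=True, ex=ttl_seconds)` performs the atomic claim, but here an in-memory dict with expiry timestamps stands in so the sketch is self-contained (all names are illustrative):

```python
import time

# Stand-in for Redis: message id -> expiry timestamp.
_seen: dict[str, float] = {}

def claim_message(message_id: str, ttl_seconds: int = 3600) -> bool:
    """Return True the first time a message id is seen within the TTL window.

    With redis-py this whole function collapses to:
        r.set(f"msg:{message_id}", 1, nx=True, ex=ttl_seconds)
    """
    now = time.monotonic()
    expires = _seen.get(message_id)
    if expires is not None and expires > now:
        return False  # duplicate delivery: ack without reprocessing
    _seen[message_id] = now + ttl_seconds
    return True

def handle(message_id: str, payload: bytes) -> str:
    if not claim_message(message_id):
        return "skipped-duplicate"
    # ... process the payload here, then ack ...
    return "processed"
```

The TTL only needs to cover Pub/Sub's realistic redelivery window, which keeps the Redis keyspace bounded.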
Pros
- Data Streaming
- Event Sourcing
- Protobuf message format
- Scalability
- Easy to Use
- Observability
- Integrated Dead Letter Queue (DLQ) functionality
Cons
- Exactly-once delivery (idempotency) - currently in preview
- Vendor locked to Google
Likelihood to Recommend
If you want to stream high volumes of data, be it for ETL streaming or event sourcing, Google Cloud Pub/Sub is your go-to tool. It's easy to learn, its metrics are easy to observe, and it scales with ease without additional configuration: if you need more producers or consumers, all you have to do is deploy your solutions on Kubernetes and autoscale your pods to match the data volume. The DLQ is also very transparent and easy to configure. Your code needs no logic whatsoever for orchestrating Pub/Sub; you just plug and play.
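Wiring up the DLQ really is a matter of configuration rather than code. A sketch with the `gcloud` CLI, assuming placeholder topic and subscription names; Pub/Sub redirects a message to the dead-letter topic once it exceeds the delivery-attempt limit:

```shell
# Create a topic to receive messages that repeatedly fail processing.
gcloud pubsub topics create orders-dlq

# Attach it to the subscription; after 5 failed deliveries the
# message is forwarded to orders-dlq instead of being redelivered.
gcloud pubsub subscriptions create orders-sub \
  --topic=orders \
  --dead-letter-topic=orders-dlq \
  --max-delivery-attempts=5
```

The consumer code stays unchanged; only the subscription's configuration decides what happens to poison messages.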
However, if you are not in the Google Cloud environment, you will most likely be unable to use it, since it is a Google Cloud product.
