Item: Prometheus
Rating: 7
Author: Animesh Kumar

Overall Satisfaction with Prometheus

Use Cases and Deployment Scope

We primarily use Prometheus for metrics and alerting.
We use Prometheus to monitor the HTTP endpoints of our services and provide real-time metrics, such as how many of the total responses for a particular endpoint in the last 5 minutes were 2xx and how many were 5xx.
This helps us keep watch on error rates. We also have alert rules defined in Prometheus, for example: fire an alert if 5xx responses exceed 80% of responses in the last 5 minutes.
It solves the problem of knowing when something goes wrong in production so that we can respond and fix it immediately.
It also provides us with useful metrics regarding the performance of our endpoints.
This ensures higher uptime of the services and fewer issues reported by the clients.
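
The check described above can be sketched in plain Python. This is only an illustration of the logic, not how Prometheus evaluates rules internally. It assumes "beyond 80%" means the 5xx share of all responses in the window; the `Response` type, the function names, and the `/orders` endpoint are hypothetical.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 5 * 60   # "last 5 minutes" from the review
ERROR_THRESHOLD = 0.80    # "beyond 80%" from the review (assumed to be a share of total)


@dataclass(frozen=True)
class Response:
    timestamp: float  # seconds since epoch
    endpoint: str     # e.g. "/orders" (hypothetical)
    status: int       # HTTP status code


def window_counts(responses, endpoint, now, window=WINDOW_SECONDS):
    """Return (total, count_2xx, count_5xx) for one endpoint inside the window."""
    recent = [r for r in responses
              if r.endpoint == endpoint and now - window < r.timestamp <= now]
    ok = sum(1 for r in recent if 200 <= r.status < 300)
    errors = sum(1 for r in recent if 500 <= r.status < 600)
    return len(recent), ok, errors


def should_alert(responses, endpoint, now, threshold=ERROR_THRESHOLD):
    """True when the 5xx share of responses in the window exceeds the threshold."""
    total, _, errors = window_counts(responses, endpoint, now)
    if total == 0:
        return False  # no traffic in the window, so there is no ratio to compare
    return errors / total > threshold
```

Returning `False` on an empty window is a design choice in this sketch; a real setup may prefer a separate alert for endpoints that receive no traffic at all.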

Pros and Cons

Pros

Providing real-time metrics for HTTP endpoints
Setting up rules that fire alerts when things go wrong
Firing alerts when a certain threshold of 5xx responses is reached on a particular endpoint

Cons

Currently, the user interface of Prometheus is not very intuitive and has room for improvement
Prometheus could provide more detail about the errors that cause 5xx responses. Currently it only reports the metric, that is, that a particular endpoint returned a given number of 5xx responses in the past several minutes.
It tells us that something is wrong in the system and that we need to find and fix it, but it does not provide more context on what exactly is wrong.
Creating rules in Prometheus and then validating that they are correct and work as expected can be a daunting task. It would be easier if Prometheus provided more testing support.
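
One way to picture the kind of validation meant in the last point is a small table of synthetic inputs with the expected outcome for each. The sketch below is plain Python, not a Prometheus feature; `error_ratio_exceeds` is a hypothetical stand-in for an alert condition, and the 80% threshold is taken from the rule described earlier in this review.

```python
def error_ratio_exceeds(total, errors, threshold=0.80):
    """Hypothetical stand-in for an alert condition: 5xx share above threshold."""
    if total == 0:
        return False
    return errors / total > threshold


# (description, total responses, 5xx responses, expected to fire?)
CASES = [
    ("no traffic", 0, 0, False),
    ("all healthy", 100, 0, False),
    ("exactly at threshold", 100, 80, False),
    ("just above threshold", 100, 81, True),
    ("everything failing", 50, 50, True),
]


def run_cases(cases=CASES):
    """Return the descriptions of cases whose outcome differs from the expectation."""
    return [name for name, total, errors, expected in cases
            if error_ratio_exceeds(total, errors) != expected]


if __name__ == "__main__":
    failures = run_cases()
    print("all cases pass" if not failures else f"failing cases: {failures}")
```

The boundary cases (no traffic, exactly at the threshold) are the ones most likely to expose a rule that does not behave as intended.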

Return on Investment

Prometheus helps us monitor our error volumes and alerts us when they go beyond a threshold, so that we can detect issues early and fix them before users report them.
It helps us reduce our time to detect and adhere to our uptime commitment of 99.9%.
Prometheus lets us set up different rules for different endpoints, so that we can make sure critical functionality is not affected and can take action if an anomaly is detected in, say, the order submission endpoint.
Prometheus would fire an alert in such a scenario.
We've been able to achieve a Mean Time To Detect (MTTD) of 15 minutes because of Prometheus-based alerts.
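
To put the two figures above side by side, here is a small Python sketch of the arithmetic. The 30-day measurement window and the incident timestamps are assumptions made up for illustration; they are not data from our setup.

```python
from statistics import mean


def downtime_budget_minutes(availability, days=30):
    """Minutes of downtime allowed over `days` at the given availability (0..1)."""
    return (1 - availability) * days * 24 * 60


def mean_time_to_detect(incidents):
    """Average detection delay in minutes over (started, detected) pairs."""
    return mean([detected - started for started, detected in incidents])


if __name__ == "__main__":
    # Hypothetical incidents as (started, detected), in minutes on a shared clock.
    incidents = [(0, 10), (100, 120), (300, 315)]
    print("MTTD (minutes):", mean_time_to_detect(incidents))
    print("Budget (minutes):", round(downtime_budget_minutes(0.999), 1))
```

Under the 30-day assumption, a 99.9% commitment leaves roughly 43.2 minutes of downtime, so a 15-minute detection time uses up a large part of that budget before any fix has started.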

Usability

It is usable, and one can learn it if a few people on the team are already using it.
It can be difficult to understand at the beginning because of the non-intuitive UI and the syntax of the rules. So I've given it 7 points, as there is some room for improvement in the user interface and rule syntax.

Alternatives Considered

We considered the TICK stack as an alternative to the Prometheus/Grafana setup that we use for capturing, storing, and visualizing time series data.
But it seemed more complicated to learn and required a separate database, InfluxDB, to be set up.
So, after these considerations, we decided to go with the Prometheus/Grafana setup.

Key Insights

Do you think Prometheus delivers good value for the price?

Yes

Are you happy with Prometheus's feature set?

Yes

Did Prometheus live up to sales and marketing promises?

Yes

Did implementation of Prometheus go as expected?

Yes

Would you buy Prometheus again?

Yes

Other Software Used

Grafana

Likelihood to Recommend

Prometheus is well suited for use cases where we need real-time data points while monitoring events such as HTTP responses. It is good for setting up alerts on such data streams.
It is not appropriate where full-fledged error monitoring is needed, with proper context and stack traces.
Also, it cannot distinguish between different issues occurring on the same HTTP endpoint and cannot track further progress on an issue. For such cases we need an error monitoring tool like Raygun or Sentry.

A closer look at monitoring and alerting using Prometheus