A closer look at monitoring and alerting using Prometheus
Overall Satisfaction with Prometheus
We primarily use Prometheus for metrics and alerting.
We use it to monitor the HTTP endpoints of our services and to provide real-time metrics, such as how many 2xx responses a particular endpoint returned in the last 5 minutes out of its total responses, and how many 5xx responses it returned over the same period.
This helps us keep watch on error rates. We also have alert rules defined in Prometheus, for example: fire an alert if 5xx responses rise above 80% of total responses in the last 5 minutes.
It solves the problem of knowing when something goes wrong in production, so that we can respond and fix it immediately.
It also gives us useful metrics on the performance of our endpoints.
This leads to higher uptime for our services and fewer issues reported by clients.
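To make the arithmetic behind such a rule concrete, here is a minimal Python sketch of the 5xx ratio over a 5-minute window. It is not how Prometheus evaluates rules (that is done with PromQL expressions in rule files), and the `Sample` shape and function names are hypothetical, chosen only for illustration. It assumes cumulative request counters and ignores counter resets.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 5 * 60      # "last 5 minutes" from the review
ERROR_RATIO_THRESHOLD = 0.8  # "above 80%" from the review


@dataclass
class Sample:
    """One scrape of a cumulative request counter (hypothetical shape)."""
    timestamp: float   # seconds
    status_class: str  # "2xx", "5xx", ...
    value: float       # cumulative request count at that time


def increase(samples, status_class, now, window=WINDOW_SECONDS):
    """Counter growth for one status class inside the window.

    Simplification: counter resets are not handled.
    """
    points = sorted(
        (s.timestamp, s.value)
        for s in samples
        if s.status_class == status_class and now - window <= s.timestamp <= now
    )
    if len(points) < 2:
        return 0.0
    return points[-1][1] - points[0][1]


def error_ratio(samples, now):
    """Share of 5xx responses among all responses in the window."""
    classes = {s.status_class for s in samples}
    total = sum(increase(samples, c, now) for c in classes)
    if total == 0:
        return 0.0  # no traffic in the window, nothing to report
    return increase(samples, "5xx", now) / total


def should_alert(samples, now):
    """True when the 5xx share exceeds the threshold."""
    return error_ratio(samples, now) > ERROR_RATIO_THRESHOLD
```

For example, if an endpoint served 10 new 2xx responses and 90 new 5xx responses within the window, the ratio is 90 / 100 and `should_alert` returns `True`.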
Pros
- Providing real-time metrics for HTTP endpoints
- Setting up rules that fire alerts when things go wrong
- Firing alerts when a certain threshold of 5xx responses is reached on a particular endpoint
Cons
- The Prometheus user interface is currently not very intuitive and has room for improvement.
- Prometheus could provide more detail about the errors that cause 5xx responses. Currently it only reports that a particular endpoint returned a given number of 5xx responses over a given number of minutes.
- It tells us that something is wrong in the system, but not what exactly is wrong. We still have to find the cause and fix it ourselves.
- Creating rules in Prometheus and then validating that they are correct and work as expected can be a daunting task. More testing support from Prometheus would make this easier.
- Prometheus helps us monitor our error volumes and alerts us when they go beyond a threshold, so that we can detect issues early and fix them before users report them.
- It helps us reduce our time to detect and keep to our uptime commitment of 99.9%.
- Prometheus lets us set up different rules for different endpoints, so that we can make sure critical functionality is not affected and act quickly if an anomaly is detected in, say, the order submission endpoint. Prometheus would fire an alert in such a scenario.
- We've been able to achieve a Mean Time To Detect (MTTD) of 15 minutes because of Prometheus-based alerts.
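The two points above about per-endpoint rules and about validating rules can be illustrated together. The following is a rough Python sketch, not Prometheus configuration: the endpoint paths, the threshold values and the function names are all assumptions made for the example. It shows a stricter threshold for a critical endpoint and a small set of synthetic cases that state what each rule is expected to do. For real rule files, Prometheus ships `promtool test rules`, which serves a similar purpose.

```python
# Hypothetical thresholds; the paths and values are illustrative only.
DEFAULT_THRESHOLD = 0.8
ENDPOINT_THRESHOLDS = {
    "/orders/submit": 0.05,  # critical endpoint: alert much earlier (assumed value)
}


def threshold_for(endpoint):
    """Per-endpoint 5xx ratio threshold, falling back to the default."""
    return ENDPOINT_THRESHOLDS.get(endpoint, DEFAULT_THRESHOLD)


def firing_alerts(window_counts):
    """Return the endpoints whose 5xx ratio exceeds their threshold.

    window_counts maps endpoint -> (errors_5xx, total_requests) for one window.
    """
    alerts = []
    for endpoint, (errors, total) in sorted(window_counts.items()):
        if total == 0:
            continue  # no traffic, so there is no ratio to evaluate
        if errors / total > threshold_for(endpoint):
            alerts.append(endpoint)
    return alerts


def check_rules():
    """Run synthetic cases against the rules and return how many passed."""
    cases = [
        ({"/orders/submit": (10, 100)}, ["/orders/submit"]),  # 10% > 5%
        ({"/orders/submit": (1, 100)}, []),                   # 1% is tolerated
        ({"/health": (50, 100)}, []),                         # 50% < default 80%
        ({"/health": (90, 100)}, ["/health"]),                # 90% > default 80%
        ({"/health": (0, 0)}, []),                            # no traffic
    ]
    for counts, expected in cases:
        assert firing_alerts(counts) == expected, (counts, expected)
    return len(cases)
```

Writing the expected outcome next to each synthetic input makes it easier to see whether a threshold change silently alters which endpoints would alert.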
We considered the TICK stack as an alternative to our Prometheus/Grafana setup for capturing, storing and visualizing time series data.
But it seemed more complicated to learn and required a separate database, InfluxDB, to be set up.
So, after weighing these considerations, we decided to go with the Prometheus/Grafana setup.
Do you think Prometheus delivers good value for the price?
Yes
Are you happy with Prometheus's feature set?
Yes
Did Prometheus live up to sales and marketing promises?
Yes
Did implementation of Prometheus go as expected?
Yes
Would you buy Prometheus again?
Yes
