Item: Zabbix
Rating: 6
Author: Thomas Higgins

Overall Satisfaction with Zabbix

Use Cases and Deployment Scope

Zabbix is primarily being used to monitor servers and the services running on them, though it is starting to be used to monitor network components as well. Secondarily, it is being used as a Synthetic User Monitor for web applications.

Pros and Cons

Pros

Collecting hardware data - CPU, Memory, Network, and Disk Metrics are collected and reported on.
Flexible design - It is very easy to build out even very large environments via the templating system. You can also start where you are - network monitoring, server monitoring, etc. and then build it out from there as time and resources permit.
Provides a "plugin architecture" (via XML templates) to allow end users to extend it to monitor all kinds of equipment, software, or other metrics that are not already built into the software.
Very complete documentation. Almost every aspect of Zabbix has been documented and reported on.
Cost - Zabbix is FOSS and always free. Support is reasonably priced and readily available.

Cons

Zabbix is very complex and the documentation, while complete, is not particularly well organized. In particular, I would like to see step-by-step instructions (similar to the synthetic user monitoring example) for installation and setup, more explanation of what some of the numbers mean, and so on.
Zabbix system requirements are artificially high to cover every possibility, yet rarely are those resources used. Would like to see segmented resource requirements based on the size of monitoring to more efficiently size an environment.
Zabbix has some nasty "gotchas" that are not really addressed in the documentation. For example, when first setting up an environment, there is nothing discussing the order of setup (host group, then users, then host, for example), but doing it in the wrong order will make it much more difficult to use later on. A tutorial (or series of tutorials) setting up the first several devices would go a long way here.
Not so much a con as an UGLY that is common to most of this class of software - Zabbix requires a great deal of detailed understanding across several different IT disciplines: DBA knowledge for maintaining the database, System Administration for setting up and maintaining the server(s) and its software, Networking for setting up monitoring of the network, knowledge of each software package you will have synthetic monitors for, etc. In most larger organizations, that means a lot of collaboration, but in smaller organizations, where it may only be a single person or team doing all the work, it means someone must be deeply knowledgeable about each aspect being monitored. It is no longer enough to just know the OS it is running on and leave it to the user to know the software, or to the network team to deal with the network issues.

Return on Investment

Zabbix allowed us to see where the problems were with a new software implementation that was having issues at one site but not the other. With the synthetic monitoring piece in play, we were able to isolate and quantify the issue and see who and what was actually affected (as compared to the typical user response of "slow").
It has taken over 9 man-months to fully implement across a 1600 server global environment. Some of that time was due to the poor design of the environment (mostly due to M&A processes that were never fully integrated), but part of it was due to there being no easy way to distribute the agents. Now, with the very recent release of 4.2, there is an MSI to allow for GPO deployments to Windows machines, which would help tremendously for Windows-based environments. (Linux and Mac environments will still require extensive scripting or manual installations.)
Zabbix alerting allowed us to start making L2 application and server teams aware of disk space issues so they can resolve them before an outage occurs.
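
As a rough illustration of the disk space alerting idea above, the following is a minimal sketch in plain Python, not Zabbix trigger syntax and not how we configured it. The function names, the sample data, and the 14-day warning threshold are all assumptions made up for the example. It fits a straight line through recent usage samples and estimates how many days remain before the disk is full.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until a disk fills, from (day, used_gb) samples.

    Fits a least-squares line through the samples and extrapolates from
    the most recent sample. Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    # Slope of the fitted line: growth in GB per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return (capacity_gb - last_used) / slope


def should_alert(samples, capacity_gb, warn_days=14):
    """True if the projected time to full is within the warning window."""
    remaining = days_until_full(samples, capacity_gb)
    return remaining is not None and remaining <= warn_days


# Hypothetical data: a 500 GB volume growing 10 GB per day.
usage = [(0, 400.0), (1, 410.0), (2, 420.0), (3, 430.0)]
print(days_until_full(usage, 500.0))  # 7.0
print(should_alert(usage, 500.0))     # True
```

The point of projecting forward rather than alerting on a fixed percentage is that the responsible team gets notice while there is still time to act.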

Alternatives Considered

SolarWinds Network Bandwidth Analyzer, SolarWinds Database Performance Analyzer, SolarWinds Log & Event Manager, SolarWinds N-central, SolarWinds Netflow Traffic Analyzer, SolarWinds Network Device Monitor, SolarWinds Network Performance Monitor, SolarWinds Remote Monitoring & Management, SolarWinds Server & Application Monitor, SolarWinds Virtualization Manager, SolarWinds VoIP and Network Quality Manager, SolarWinds Storage Resource Monitor, SolarWinds Web Performance Monitor, Nagios, Zenoss Cloud, New Relic, Datadog, WhatsUp Gold, Dynatrace, Dynatrace Synthetic Monitoring and PRTG Network Monitor

Most of the SolarWinds capabilities are split across separate products, whereas Zabbix includes templates and capabilities for all of them out of the box. The other solutions listed include most or all of them to varying degrees as well.

New Relic is more for Application Monitoring, but New Relic Infrastructure is a direct competitor. Datadog and Zenoss Cloud are similar. In all cases, infrastructure monitoring is both stronger and cheaper using Zabbix. It is also available on-prem, whereas these other options are not. However, the application side is better for New Relic and Datadog. I have not used Zenoss Cloud enough to determine its strength in Application monitoring.

For SolarWinds (all components), WhatsUp Gold, and PRTG Network Monitor, Zabbix is an equal competitor. It is cheaper up front, with no recurring subscription or renewal costs (though there are support costs if you choose to purchase support). It can be more difficult to use, especially compared to WhatsUp Gold, which is the easiest to use of all of them (and the least flexible). Overall, it is a question of where your money is spent more so than how much with any of these options.

Nagios is FOSS as well, but much more difficult to get to a usable state. Once in place, it is probably equal to Zabbix for maintaining.

Finally, Dynatrace is by far the best solution, but it comes at a significant price over and above the other options available. It was just too expensive for us to consider for our needs.

Other Software Used

Remote Desktop Manager, Microsoft Visual Studio Code, Microsoft Office 365, CentOS

Likelihood to Recommend

Zabbix is probably the best classical monitoring software out there that is also FOSS. It is superior to Nagios and other similar software from implementation to utilization, and equal in capabilities. It is as capable as SolarWinds (and competitors), and more expandable (thanks to the support of user-generated XML templates), but at the cost of time, knowledge, and effort. It serves a different market than pure cloud monitoring solutions, though they do overlap heavily, so it probably is not as well suited to cloud-only monitoring (though it can be set up to work effectively in this role as well). However, given the flexibility of on-prem monitoring as well, it can be an option in conjunction with, or in place of, cloud-only monitoring if that is a need.

Overall, I would put Zabbix on par with SolarWinds, and the main differentiator is where the costs are going to be paid - in end-user training and support of Zabbix, or in the commercial price of the ease of use provided by SolarWinds (and competitors).

Using Zabbix

Users and Roles

30 - We currently have 30 logins in the application but it is difficult to quantify the number who use it. It is a growing product and it has been expanding in use for the last 9 months. Currently, Infrastructure teams use it for server monitoring, and some application teams are starting to use it for application monitoring and synthetic user monitoring. We are trying to tie in Network teams as well to be able to start correlating server and network outages to get a better picture of the root cause. Ultimately, besides using it for these purposes, management is looking to use it to provide reporting on a variety of things, from server utilization to cost of an outage at various levels, to whatever.

Support Headcount Required

1 - I am currently the only in-house person supporting Zabbix and that is only part-time. That is part of why it is taking 9 months to roll out basic monitoring services. In reality, it would help to have 2 or 3 people who know the Zabbix application, at least 1 DBA (part time is fine), and someone who knows each application, and/or members of the networking team. This would spread out the implementation work, especially with regard to agents and SNMP trap collection, while at the same time allowing the people who know the applications best to set up monitoring in a sane and useful way for them.

That said, once implemented, it only takes 1 person who is knowledgeable in Linux to maintain the application, and a DBA to maintain the database(s) (if it isn't the same person). This is independent of scale, in no small part due to the template system it uses.

Business Processes Supported

Network monitoring.
Server monitoring.
Application monitoring.
Synthetic user monitoring.

Innovative Uses

Don't know if it is unexpected or innovative, but we use it primarily to cut down on known recurring issues before they cause outages.
TBD - have not had it long enough to identify new ways to use yet.

Future Planned Uses

We want to use it to map our IT network.
We want to use it to correlate issues to speed problem identification and time to resolution.

Likelihood to Renew

It is free. It didn't cost anything to implement (other than my time and the cost incurred for it) and it is filling a significant gap in our IT infrastructure. Support is available if we have issues and can be purchased annually or paid for on a per-incident basis as needed. Expansion, updates, and all other future lifecycle activities are likewise free of cost, so as long as someone is able to implement/maintain the software (and the OSS project is maintained) then I imagine the company will never leave it.

Evaluating Zabbix and Competitors

Products Replaced

Yes - SolarWinds - due to cost of SolarWinds products

Key Differentiators

Price

There was no reason other than price.

Evaluation Lessons Learned

I was not part of the evaluation or selection process; however, I do know that it was chosen due to price. They no longer wanted to pay for SolarWinds products. Zabbix was chosen over Nagios (another free option) simply because it was easier to implement a trial with the prebuilt VM offered.

Zabbix Implementation

Implementation Rating

We are a mainly Windows environment, so it would be useful if we could have used Active Directory to deploy agents. As of version 4.2, Zabbix has announced a new agent MSI file to allow exactly that. Unfortunately, we didn't have that option.

Also, for Linux and Mac deployments, there is no simple way to deploy agents. Using remote scripts you may be able to create something, but most places will opt for either SNMP (agentless) or manual installation of agents to add to Zabbix. A way of deploying agents via discovery would go a long way toward helping adoption of the tool.

Implementation Details / Implementation Partner

Implemented in-house

Implementation Phases

Yes - First rolled out the server/database. Connected a few servers to ensure it was working. Then deleted the servers from the application and added proxy servers. Pointed the servers to the proxy and verified they started collecting data. Once setup was verified, started a staged rollout of agents to all servers for monitoring. Added proxy servers as needed to spread out the load and to provide resiliency against network outages between sites. After the server agent rollout, started implementing website monitoring. Now we are starting to set up SNMP monitoring of network devices.

Change Management Lessons

Change management was minimal - Because we owned the servers the agents were being deployed to, we just needed to identify which servers were being added, and add them. No change management hassles were encountered. We did have issues due to our current design, where many servers needed to be touched manually rather than handled through a network deployment. This had nothing to do with change management, however, and everything to do with several acquisitions that were not fully integrated with our network yet.

Implementation Issues

Scaling the environment. If you don't know what you are monitoring, then it is likely you will not set up the templates correctly to scale the system efficiently. For us, that meant I have had to go back and restart many agents just to get them to pick up new or changed templates.
Defining how you will monitor - How will you break down your host groups? By department? By application? By location? All of the above? Each of these works better if set up in advance, but for each one implemented, you have exponential growth in the number and combination of templates to set up (at least if you are going to set up auto-registration, which is HIGHLY recommended for scaled deployments).
Avoid automatic discovery or use with caution. Unless it is a very small or well-segregated environment, automatic discovery tends to discover more fluff than useful information. Further, after discovery, there is a lot of cleanup that typically must be done (naming each device, adding appropriate templates, etc.).
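
To put a number on the host group point above, here is a minimal sketch in plain Python. It does not call Zabbix at all, and the department, application, and location names are hypothetical values invented for the example. It simply counts how many distinct combinations you would have to plan for if every host is classified along each dimension you choose to implement.

```python
from itertools import product

# Hypothetical grouping dimensions; the values are illustrative only.
departments = ["finance", "engineering", "sales"]
applications = ["erp", "web", "database", "mail"]
locations = ["site-a", "site-b"]


def grouping_combinations(*dimensions):
    """Return every combination that picks one value from each dimension."""
    return list(product(*dimensions))


by_department = grouping_combinations(departments)
by_dept_and_app = grouping_combinations(departments, applications)
all_three = grouping_combinations(departments, applications, locations)

print(len(by_department))    # 3
print(len(by_dept_and_app))  # 12
print(len(all_three))        # 24
```

Each added dimension multiplies the count by its number of values, which is why it pays to decide on the breakdown before rolling out auto-registration rather than after.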


Best FOSS software for monitoring