Powerful troubleshooting if you have the $$$
June 30, 2019
Score 8 out of 10
Overall Satisfaction with ThousandEyes
Our NOC and network department use ThousandEyes to proactively monitor critical resources for outages so that we can start diagnosing and addressing problems before tickets start pouring in.
- Alerting on outages. ThousandEyes provides a few different options to receive alerts: you can have alerts emailed to a subset of (or all) users, there is a basic Slack integration, and if more flexibility is required (or your preferred method of being alerted isn't built-in) webhooks can be used to hit another API.
- Speeding up mean time to resolution (or mean time to innocence if you're a more siloed and blame-happy organization). Failure alerts can be configured to include the cause of the failure instead of just "resource x is down." For example, the alerts can come out and say that a website was down due to an HTTP 500, which will help prevent staff from spinning their wheels trying to diagnose the network from the client to the web server.
- Post mortems and root cause analyses. After an outage has been resolved, it is possible to go back for up to 30 days without losing any level of detail for the test in question, and to view information like the DNS response received, the network path taken by the traffic, and any added latency incurred by an individual link. It can also be used to view Internet routing changes surrounding the incident.
- Support. Every ticket or chat I have opened has been met by a friendly staff member who was able to provide useful insight into what was causing a particular issue, and to explain either what steps they would take on their side to resolve it or what steps we should take on ours.
- The elephant in the room is going to be cost. ThousandEyes is a great tool, but you will pay for it. There are other services that do a good job at providing a smaller subset of features compared to ThousandEyes. If all you need is that particular subset of features, ThousandEyes may not make fiscal sense for your organization.
- As a subset of the cost issue, within the last 18 months or so the pricing on enterprise (local) agents has been modified in a way that does not seem to benefit the customer. Previously, each enterprise agent had a flat monthly cost with unlimited test usage (the only limit was the number of tests running concurrently at any given time). This meant that instead of using a Cloud Agent and paying per test, you had the option of spinning up a cheap DigitalOcean droplet and creating your own "cloud agent" for external testing without paying for Cloud Agents. When the change was made, they eliminated the flat per-agent cost and instead priced enterprise agents the same way as cloud agents, but with the number of "cloud units" per test cut in half for tests run from enterprise agents. For organizations with under-utilized enterprise agents, this may be helpful financially, but for organizations that push their local agents to the limit, the cost skyrocketed.
- BGP monitor peering sessions have been less than reliable. The data doesn't seem to be an issue, but the sessions bounce or fail altogether on a fairly regular basis. The ThousandEyes-side routers or servers that your routers peer with sit behind firewalls that have caused issues in the past.
- ThousandEyes has helped us quickly isolate issues during some incidents that were high-profile within the organization, and determine whether the network (internal or on the Internet) was at fault. If it is, it becomes easier to see the "where" of the issue sooner so we can move on to the "what" faster. In the case of non-network-related issues, it helps us get the appropriate teams or individuals involved sooner.
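
To make the enterprise agent pricing point from the list above concrete, here is a rough sketch of the arithmetic. Only the structure comes from my description (a flat per-agent fee before, per-test units at half the cloud rate after). The function names and every number are made-up placeholders for illustration, not ThousandEyes list prices.

```python
def flat_model_cost(flat_fee_per_agent: float, agents: int) -> float:
    """Old model: a fixed monthly fee per enterprise agent, unlimited tests."""
    return flat_fee_per_agent * agents


def unit_model_cost(tests_per_month: int,
                    cloud_units_per_test: float,
                    price_per_unit: float) -> float:
    """New model: each enterprise-agent test is billed at half the
    units that the same test would consume on a cloud agent."""
    enterprise_units_per_test = cloud_units_per_test / 2
    return tests_per_month * enterprise_units_per_test * price_per_unit


def break_even_tests(flat_fee_per_agent: float,
                     agents: int,
                     cloud_units_per_test: float,
                     price_per_unit: float) -> float:
    """Monthly test count at which both models cost the same.
    Below it the new model is cheaper; above it the old one was."""
    cost_per_test = (cloud_units_per_test / 2) * price_per_unit
    return flat_model_cost(flat_fee_per_agent, agents) / cost_per_test


if __name__ == "__main__":
    # Hypothetical inputs: 2 agents at 100/month each, 4 cloud units per
    # test, 0.25 per unit. None of these are real prices.
    print(flat_model_cost(100, 2))
    print(unit_model_cost(1000, 4, 0.25))
    print(break_even_tests(100, 2, 4, 0.25))
```

If your agents run far more tests per month than the break-even figure, you are in the "cost skyrocketed" camp; if they mostly sit idle, the change probably worked in your favor.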
We looked into RIPE Atlas in conjunction with BGPmon. Atlas credits can be generated by nodes hosted on your network for other Atlas users to use to run tests, so testing is essentially free if your local nodes are used a lot. The problem with Atlas is that visualizing that data is entirely on you to implement. BGPmon is a great tool as well, but ThousandEyes allowed us to kill two birds with one stone with less cost in man hours to implement and maintain.
ThousandEyes works well for BGP monitoring (e.g. watching for hijacks, monitoring external reachability, etc.). If you are already using a tool like BGPmon to perform these functions, ThousandEyes can absorb a lot of those features and save on OpEx compared to running two different tools. ThousandEyes also does a great job with reports. A great example of this is tracking availability/uptime. You can build reports that answer questions like "what percentage of time was I not down," work out how many nines you had last month, and pat yourself on the back. If you didn't have a great month, it's possible to isolate the incidents that may have led to that reduced uptime and use them as leverage for upgrades or enhancements. As with any tool, ThousandEyes is not going to be a good fit for your organization if only one or two individuals are ever going to use it. At its core, it really shines as a collaborative troubleshooting tool; if you're the only one troubleshooting through ThousandEyes and other people or departments are using other tools, its usefulness starts to fade a bit.
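
For anyone unfamiliar with the "how many nines" shorthand used above, here is a minimal sketch of the math behind an uptime report. This is not how ThousandEyes computes its reports; the function names and the 30-day reporting period are my own assumptions for illustration.

```python
import math

SECONDS_PER_DAY = 86_400


def availability_percent(downtime_seconds: float, period_days: int = 30) -> float:
    """Percentage of the period during which the resource was not down."""
    period_seconds = period_days * SECONDS_PER_DAY
    if not 0 <= downtime_seconds <= period_seconds:
        raise ValueError("downtime must be between 0 and the period length")
    return 100 * (1 - downtime_seconds / period_seconds)


def nines(downtime_seconds: float, period_days: int = 30) -> float:
    """Number of nines, e.g. roughly 3 for 99.9% availability.
    Perfect uptime has no finite count, so it returns infinity."""
    period_seconds = period_days * SECONDS_PER_DAY
    if not 0 <= downtime_seconds <= period_seconds:
        raise ValueError("downtime must be between 0 and the period length")
    if downtime_seconds == 0:
        return math.inf
    return -math.log10(downtime_seconds / period_seconds)


if __name__ == "__main__":
    # 2,592 seconds (about 43 minutes) down in an assumed 30-day month.
    downtime = 2_592
    print(f"{availability_percent(downtime):.3f}% available")
    print(f"{nines(downtime):.2f} nines")
```

The takeaway is how little downtime each extra nine allows: in a 30-day month, three nines leaves you about 43 minutes, and four nines leaves you a bit over 4.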