ScienceLogic - Enterprise Considerations
John "Frotz" Fa'atuai
Updated October 25, 2019

Score 10 out of 10
Vetted Review
Verified User
Review Source

Overall Satisfaction with ScienceLogic SL1

ScienceLogic brings significant abilities to the table (and on their roadmap) which enable critical business operations (such as maintaining cisco.com health and availability) to track against our ServiceNow (CMDB) data and our ScienceLogic (IT Infrastructure Monitoring) deployment. We have replaced our legacy homegrown system with ScienceLogic, giving our on-prem, private cloud, and multi-cloud deployments (sometimes using all three) the ability to ensure health and availability.

We are looking to move to ScienceLogic's AIOps and data lakes to further simplify the transformation of our event-storms to meaningful business information.
  • Deep monitoring across infrastructure components and, with 8.12, across application layers
  • Device discovery which builds infrastructure components
  • ServiceNow Integration for CMDB driven device discovery as well as event-to-incident integration and automation
  • Runbook Automation controlling notification as well as leading to remediation capabilities
  • Event richness enabling detection of leading events which occur prior to a failure
  • 8.12 introduces #multitenancy support, which enables an Enterprise with many Organizations to share (read-only) properties and performance data across Organizations without compromising credentials with inappropriate access. This is the first release of this feature and we are only now evaluating the well-communicated delivery against our requirements. This has been communicated as only being applicable via the new UI.
  • Monitor thyself. We have found data-gappiness issues which stem from incomprehensible / non-actionable system messages. The system cannot distinguish a SIGTERM used as a timeout (implying capacity issues) from a SIGTERM raised as an actual fault (implying something non-actionable). The advisory services team is going to help, but this needs to be productized and shipped rather than made available only through customer success manager engagement. Things that SHOULD have results but do not should throw an under-collected event for the type of collection that is under-reported. This event should have a dynamic part giving specific numbers, not bland / generic statements that have to be interpreted; the platform team should immediately recognize the fault because the numbers are relevant. These events should be actionable, either referencing KB articles or some other specific remediation plan.
  • Data collector load vs. capacity planning, both vertical (CPU, memory, disk) and horizontal (more collectors). The data collector specs are very stale. 4x24G is recommended for 1000 devices, but customers frequently read that as 1000 individual devices, not the DCM trees found during discovery. Those tend to be my expected N (<= 1000) devices + M component records, which are barely understood and typically treated as zero, when in fact these devices are what blow through the assumed 4x24G capacity spec. A horizontal-infra-scaling event as well as a vertical-capacity-limit event need to be thrown when more collectors are needed.
  • Actionable events. My end users barely understand the events. Referencing a KB article by URL might help users and admins in remediation. If you already understand the events they are obvious. If you don't, such as timeouts, having an article which helps people identify standard remediation steps will help close outages faster. Most events are contextual. Pointing users at that context will help.
  • For us, the ROI has been positive, but the specific ROI isn't the money spent on ScienceLogic itself; it is the money invested in the skills of the engineers who now leverage the platform to maintain our high standards of availability and performance monitoring while trending over years (5-10+). Those resource costs and investments far outweigh the cost of ScienceLogic, and not only the retention but the valuation of those skills more than covers the specific ROI on ScienceLogic itself.
  • Coming from a history of two decades of homegrown build (e.g. free), the cost of ScienceLogic feels like a questionable value, but when measured across the resources educated to use the ScienceLogic platform, the ability to quickly ramp up those same resources on technologies they may have scant knowledge of and be successful is incalculable.
Yes - EMAN, a homegrown, 20-year-old platform providing availability monitoring, performance monitoring, CMDB, Change Management, Alerting, and many other unrelated enterprise functions.

The scale and complexity of EMAN includes access control, self-service mailing list management, cell phone ordering, home internet service registration and funding, pager service management, paging + email notification capability, DNS, DHCP, Telephony Number Management and probably another dozen things that I can't remember.

That scale and complexity could not be leveraged into the new world order of rapidly instantiated but short-lived containers and monitoring across a host of new world technologies and APIs with an ossifying skillset and minimal support / development team.

ScienceLogic allowed the vast majority of those resources supporting EMAN to migrate their skillset to ScienceLogic and expand our monitoring coverage with a minimal drop in functionality during the transition.

The relationship with ScienceLogic gives Cisco IT a higher leverage point for our development skills than we had with the EMAN platform.
As Cisco IT, we forced a common view to be rendered in ScienceLogic so that the vast majority of our 2200 users could see where they were supposed to go based on droplets of Cisco IT culture which we sprinkled across the ScienceLogic legacy user interface.

80% of our customers will use ScienceLogic without really using ScienceLogic.
20% of our customers will deeply use ScienceLogic PowerPacks, etc.

Everything has changed, but our core structure is visible throughout. If anything, the new structure is now more inline with our recently updated CMDB (ServiceNow) structure as opposed to our legacy CMDB (EMAN) structure. (These have been multi-year change cycles.)

ScienceLogic IT Services are how Cisco IT currently renders Host Clusters, Application Clusters and Application endpoints (regardless of lifecycle).

Cisco IT will be working with ScienceLogic on the 8.12 view of Application Management (which as configured won't match the Cisco IT model or requirements). We expect that sometime in the 8.13 or 8.15 release cycle we may have something usable, at which point we will move our Application Monitoring and Service Health monitoring onto Business Services and Application Health. For now, we will continue to use IT Services, though Host Clusters will slide down into Device Services. It is unclear whether Application Clusters will move up to Business Services. We need more discussions to determine how Cisco IT will interact with Business Services since Cisco IT Management wants Application Monitoring, but Cisco IT can't consume it because our view of an application isn't supported by ScienceLogic's definition that it is rooted by a process in memory on a given box. (We run many applications through the same nginx / apache process... Exactly which application is up and which is down?)
ScienceLogic has supported our ability to integrate with AppDynamics even before they really had support for it.
ScienceLogic is helping ensure that their ACI support matches as the Cisco engineering groups continue to rev it.
ScienceLogic is helping us support and monitor our multi-cloud strategy even as we struggle to comply with InfoSec, CASPR and other compliance processes governing what can / should be allowed to be hosted external to the Cisco network.
ScienceLogic's Cisco Partner status allows them to match Cisco's Customer-Zero program where we try to influence product operability and supportability from a Cisco IT perspective faster than Cisco IT could do it ourselves.
Only a few of our Cisco IT domain teams have started to look at this.
The vast majority of integration and automation is limited by what SyncServer provides and by our own Cisco IT integrations with ServiceNow for Host Composites, AppMon and AppMon Composites.

The teams doing this work are currently limited to event-to-incident automation, though Cisco IT is looking forward to our migration to Integration Server.
  • This question is ambiguous. As stated, it suggests that Cisco IT might have found an unexpected or innovative way in which ScienceLogic (the platform) is used which surprises Cisco IT. The reality is that Cisco IT is driving how ScienceLogic is used and we are surprising ScienceLogic (the company) with our scale and Enterprise Use Case specifications and articulations. The remainder of the answers will be from a company perspective, not a product perspective.
  • #multitenancy - The Enterprise Use Case wants to share many data elements in a read-only way with the IT world. As originally deployed, 8.3 through 8.11 suffered from the Managed Service Provider Use Case, where Organizations are expected to be 100% independent from each other with no read rights and no world implications. 8.12 delivers our first glimpse of how Cisco IT will allow a more seamless Enterprise Use Case experience by allowing configuration of a WORLD rights permission that grants read (but not write) permissions to device configuration and performance data, which was not possible in 8.9 and prior.
  • We have 467 Organizations with some 2200 users, many in multiple Organizations.
  • IT Services have been how we folded our host clusters onto ScienceLogic stacks. We've been involved in early discussions about Device Services and Business Services, and as those become tangible in the platform we will provide our feedback based on issues we find due to our scale.
  • Given 467 Organizations and some 2200 users, pre-generating Critical Ping Failure and Availability runbook automations, actions and external contacts (so that all of our teams had a starting point in the platform) turned out, to our surprise, to be a mechanism whereby we provide base training to our end-users in the RBA space. Our statement is: these are yours. Play with them. Learn from them. Delete them if you need clean copies and we will regenerate those for you.
  • Our use of ServiceNow as our upstream CMDB, with monitoring configuration paint being applied downstream, has been a great boon to us. This allows our operations teams to force a repaint to clean up problem configurations without concern for multi-master issues. We consistently find that the SyncServer and Integration Server use cases are inverted for us, in that they assume ScienceLogic pushes to ServiceNow, with a hand-wave on whether or not ServiceNow does its own discovery (thereby creating a multi-master problem).
We held a bake-off between Zenoss 5.0 and ScienceLogic 8.3, and while our UCV team found the support for their devices to be better in Zenoss 5.0, the Cisco IT platform team found that ScienceLogic 8.3 was more sustainable and easier to deploy / maintain / grow.

We were unhappy with the failure rate of Zenoss during the non-prod test deployment (full loss of the VM and rebuild required multiple times).

No such problems occurred with ScienceLogic.

The Cisco IT team found that ScienceLogic IT Services and PowerPacks would get us farther down the road than the Zenoss 5.0 release would, as we heard that the 5.0 release was a fundamental change for Zenoss at the time.
Regardless of the product or solution, all monitoring systems encourage coding. ScienceLogic enables those coding skills to be leveraged across an ever-growing list of technologies if one codes inside the ScienceLogic platform. Monitoring solutions coded outside of the ScienceLogic platform will tend to collect more technical debt and decay faster than they otherwise would if the code were embedded inside the ScienceLogic platform. For large enterprise use cases (6-digit device volumes), ScienceLogic alone is not as appropriate. Here, their integration with ServiceNow turns that gap into a benefit.

Using ScienceLogic SL1

All are IT users, mostly from the operational response teams.
Networking
Hosting
Storage
Unified Communications and Video
Engineering IT (the IT group which supports Cisco engineering business units)
Application teams (such as those supporting cisco.com), but ranging to anything Cisco IT calls an Application.
15 - Python developers (or those developers migrating from Perl)
Able to create RPMs
Able to create Ansible scripts
Able to create web interfaces
Able to interact with ServiceNow APIs at speed.
Able to create CLI infrastructure.
Able to create ELK stacks and facilitate deep product operational state analysis

Scrum Master
Product Owner
Functional (People) Manager
  • Infrastructure Monitoring (20% of our user base is in this category)
  • Application Monitoring via synthetic test transaction (80% of our user base is in this category)
  • Ability to feed all priority (P1 and P2) events through an Elasticsearch engine and render only P1 / P2 events (outages) on a single pane of glass given that we have 8 independent stacks due to our scale.
  • Minimal disruption to the vast majority of our users as we effect a sea-change in our monitoring capability (dump our old, homegrown system) and replace it with 8 independent ScienceLogic stacks around the world.
  • Educate our users to accept event-to-incident for non-priority (P3-P6) use cases
  • Drive AIOps and data lake construction so that we can be agnostic of the 8 independent stacks and treat the data as holistic and global rather than regional and constrained.
  • Ensure that ScienceLogic product capabilities fit more and more Cisco IT use cases over time. Use our relationship as a large Enterprise deployment to educate ScienceLogic on the Enterprise Use Case and the challenges that come from a large deployment so that the product improves over time. Fulfill our internal goal to support Cisco Partners (which ScienceLogic is) and grow their potential, even though they are also our vendor providing this specific capability.
  • Maximize the viability of our on staff engineering resources to support / grow the capability consumption for internal Cisco IT clients in spaces that we currently can't get to by ourselves, such as multi-cloud monitoring configurations.
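One of the goals above is feeding all P1 / P2 events from our 8 independent stacks through Elasticsearch onto a single pane of glass. The merge-and-filter idea can be sketched as follows; this is an illustration only, and the event shape (the `priority` and `timestamp` fields) is our invention, not SL1's actual event schema:

```python
# Sketch: collapse events from several independent SL1 stacks into one feed,
# keeping only P1 / P2 (outage) events for a single-pane-of-glass view.
# Hypothetical event dicts; real SL1 events would need mapping first.

def single_pane_feed(stacks):
    """Merge per-stack event lists, keep only P1/P2, newest first."""
    merged = []
    for stack_name, events in stacks.items():
        for event in events:
            if event.get("priority") in ("P1", "P2"):
                # Tag each event with its originating stack so the
                # single pane stays stack-agnostic but traceable.
                merged.append({**event, "stack": stack_name})
    return sorted(merged, key=lambda e: e["timestamp"], reverse=True)
```

In a real deployment the merged documents would then be indexed into Elasticsearch; the filtering step is what keeps the pane limited to outages.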
We migrated away from our 20-year-old homegrown solution and have no back-tracking capability.
ScienceLogic is a Cisco Partner and we are supported in ensuring their success.
ScienceLogic is demonstrating new capabilities that we would not have been able to do on our own using our legacy system.
We understand the capabilities of competitors based on our bake-off selection where ScienceLogic won on capabilities and future near-term potential (expandability, platform growth). We know that those competitors are not really close to where we have been able to push ScienceLogic (as a partner).

Evaluating ScienceLogic SL1 and Competitors

  • Product Features
  • Product Usability
Our starting position was the retirement of our homegrown / legacy monitoring platform which had survived some 20 years.
Our key requirement was not whether ScienceLogic could meet all of our custom demands (we understood that moving from custom-fit to commodity-fit was going to be painful), but rather could ScienceLogic give us the ability to meet all of the new monitoring capabilities our internal customers were demanding as we re-oriented towards Cloud offerings and still maintain at least some level of translation from our existing expectation set without too much drop in functionality.
It won't change.

Our requirements will be:
[1] Do we have the talent on the team to overcome any product deficiencies?
[2] Which product offering has the fewest product deficiencies (from our expectation-set) which we then have to back-fill?
[3] Which product offering provides the most leverage to perform said back-fill?

ScienceLogic SL1 Implementation

Know your use case.
Know which inherent use case ScienceLogic approaches you with. (They still have a long history of Managed Service Provider that influences their thinking and they are learning to be self-aware and thoughtful about you rather than your deployment being just a templated deploy that they've done many times before.)

They are making progress, but your clear and persistent insistence on deployment according to your use case will help minimize problems.
Yes - NPRD and PROD.
Subphases included per-domain-team capabilities.
This included migration from the legacy system (which included spinning down the legacy monitoring system).
Networking wanted large router verification and modeling.
Hosting wanted large vCenter verification and modeling.
Unified Communications and Video wanted CMDB assistance (which wasn't a ScienceLogic concern, but was a migration concern).
Change management was a big part of the implementation and was well-handled - There are ways to force change and there are ways to force change.
Engaging the key stakeholders (Compute/Hosting, Networking, Storage, Unified Communications and Video, Engineering IT) was critical for the success.
We gave those infrastructure teams (20% of our end-users) nearly a year to understand how things might work, and worked with them across all of the challenges each group had with the migration.
For all other users (80%) we forced the change in less than 90 days (since their use of the platform is paper-thin).
  • Undersizing the central database file system. 1.2TB was consumed surprisingly quickly. Default data retentions surprised us significantly and we had to scramble to ratchet down the data retentions until we could rebuild to 4TB (internationally) and 26TB (domestically). The engineering recommendation was 3-5TB "for someone our size". We no longer believe that at our scale and we feel much more comfortable with 10TB+ than we do with 4TB. (We're not concerned with one of our 4TB stacks, but we're keeping the data retention limited on our other 4TB stack because it is heavily used; a surprise based on our user consumption patterns.)
  • Disparity between the way we used to count and the way that ScienceLogic counts. They are not the same. You won't know until you start discovering devices how off your estimates are because the DCM trees balloon quickly and surprisingly.
  • This has implications on data collector sizing. Are you using large vCenters? Large routers? Large storage arrays?
  • Your resiliency strategy should inform how you should view your RAID configurations. We started at RAID-6 and rebuilt to RAID-10, then dropped to RAID-0 because we use a DR configuration. The starting point of RAID-6 was found to not be healthy for a heavy-write MySQL database. A RAID-10 did moderately well. The RAID-0 (with a DR config) actually performs very nicely, with all that this configuration means.

ScienceLogic SL1 Training

  • In-person training
  • Self-taught
On the Cisco IT side (students), we had a number of teams who were provided the deep developer training.
Of those students, the customized training provided a complete, 5 day training which enabled the Cisco IT platform team to successfully deploy and mitigate user-experience issues for the vast majority of our end-users, including some of the teams who attended the developer training.

The instruction kept pace with the class and sped up / slowed down (within the time constraints) as needed throughout the course.

This was developer to developer training and for those students who were developers the training worked well. For those who were just coders it probably worked less well as some of the topics still do not apply (a function of our course outline specification based on our knowing nothing).

Due to problems in sequencing we did the developer course BEFORE the admin course and realized that our requested ORDER was wrong.

The onsite admin course was much better received and led to deeper understanding of the developer course held a few weeks prior.
I specialize in reverse-engineering systems.
I would not recommend that developers try to reverse-engineer the system.

I have yet to take the online courses, but I will, so that I can honestly evaluate recommending them to new users in my end-user set for their ramp-up on platform usage.

Configuring ScienceLogic SL1

SL1 mostly scales to our needs. We had to break our monitoring into 8 regional stacks, but within that space, the configuration seems to work.

We introduced Custom Device Attributes to contain the standardized metadata from our ServiceNow CMDB. We have 22 attributes in play, which includes CI priority, ServiceNow SYS_ID (we have a few distinct types, so multiple attributes) and last sync-type (very crucial to know that the monitoring configuration data is current with our CMDB data).
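The mapping from a CMDB CI record onto custom device attributes can be sketched as a simple flattening step. This is a hypothetical illustration, not Cisco IT's actual schema; the field names (`sys_id`, `priority`, `sync_type`) and the `c-` attribute prefix are assumptions for the example:

```python
# Sketch: flatten a ServiceNow CI record into SL1 custom-attribute
# key/value pairs. All names here are illustrative assumptions.

def ci_to_custom_attributes(ci):
    """Build the custom-attribute payload for one device from its CI."""
    return {
        "c-snow_sys_id": ci["sys_id"],          # upstream identity link
        "c-ci_priority": str(ci["priority"]),   # drives event triage
        "c-last_sync_type": ci.get("sync_type", "full"),  # sync currency
    }
```

The last-sync attribute is the one that tells operators whether the monitoring configuration is current with the CMDB, which is why it is carried on every device.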

Being disciplined about what elements of configuration we encourage our end-users to consume is a key to managing the complex offering from ScienceLogic.
  1. Have your CMDB be upstream from your monitoring system(s). This allows you the benefit of repainting the configuration when you feel it is bad, including deletion and recreation (at the cost of losing historical data, which you should be pulling out and centralizing somewhere else if you're on multiple stacks).
  2. Use CI Priority to allow you to distinguish between Critical (Severity) events and know which one to focus on first.
  3. For on-premise deployments, deploy half of your capacity to two(2) regionally close data centers, if you have them. Plan to be the last-to-fail and the first-to-recover.
  4. Use ActiveDirectory groups to authorize access. Use your access control systems to populate those AD groups.
  5. Limit the number of distinct rights (Access Hooks) which you deploy. (We use 3: Operators (default rights); Leads (slightly elevated from Operators); and Admins.)
  6. If you have multiple teams with differing responsibilities, pregenerate their default runbook automations and runbook actions so that they can just "turn their stuff on" (because it already exists).
  7. Look at Business Services, IT Services and Device Services as you model your "behind-the-load-balancer" cluster availability monitoring and capacity. Make this as simple as possible for your end-users to consume from their CMDB definitions. Do the math for your customers in the IT Service configuration.
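Recommendations 4 and 5 above (AD-group-driven access, few distinct rights levels) can be sketched as a small resolution function. The group names below are illustrative assumptions, not our actual AD groups:

```python
# Sketch: derive a user's SL1 rights level from Active Directory group
# membership, keeping the number of distinct levels small (here: three).
# Group names are placeholders for illustration.

ROLE_BY_GROUP = {            # ordered most-privileged first (3.7+ dicts)
    "sl1-admins": "Admin",
    "sl1-leads": "Lead",
    "sl1-operators": "Operator",
}

def resolve_role(ad_groups):
    """Return the highest rights level granted by the user's AD groups."""
    for group, role in ROLE_BY_GROUP.items():
        if group in ad_groups:
            return role
    return None  # no monitoring access
```

Populating the AD groups from an upstream access-control system keeps the monitoring platform out of the business of managing entitlements.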
Some - we have done small customizations to the interface - 
  1. In the old UI, we have added Custom Navigation to allow users to get from a Device > Registry > (wrench) page to the following page types:
    1. IT Services when the current CI is a member of an IT Service.
    2. The "default" Run Test (it varies by device class for us) but the canonical button allows our users to ignore those differences and have a single place in the UI to go trigger that monitor.
    3. Links to our independent test transaction verification UI (is ScienceLogic broken? Or is your CI really down?)
  2. We have not yet seen the new UI, so we don't know how these limited navigational aids will be rendered there.
  3. Several of our larger teams have created dashboards to help them understand their fleets.
Yes - we have added extensive custom code - We have done limited amounts of in-product coding (dynamic applications). We usually leverage Professional Services to build those as needed for our end-users.

We have done extensive off-box integrations and audits which run the following gamut:
  • User account injection and Secondary Organization Alignment based on external access group definitions.
  • SyncServer Discovery Session Template generation based on end-user Device Template existence.
  • CMDB to SL1 custom monitoring configuration (we lifted and shifted our entire application monitoring semantic from our legacy, homegrown system onto SL1 in about a quarter).
  • Custom code deployment on our Central Databases / Data Collectors / Messaging Servers via our own RPMs and yum repos.
  • Ansible playbooks to handle all configuration of each server type; includes command-line access and named account creation / removal as people enter / leave the platform team.
  • ServiceNow Change Management injection (not all Changes are facilitated by SyncServer).
  • Replacement of SyncServer functionality due to our scale (includes physical device Change Requests which hit the 2000-CI limit in the CR and then broke).
While we prefer to use the available APIs, we find that we frequently have to touch the database directly.

Much of our outside-in integrations and coding has been to support our multiple stacks (consistency enforcement).
Try to avoid doing configurations as much as possible. (You'll likely get dragged in due to your own business rules.)
Once you've developed the skillset needed to support doing integrations (or in the simple case, dashboarding), you'll feel more comfortable doing these types of things.

Our #1 requirement is being able to rebuild any machine in as little time as possible; that means that all of our central database configurations (under the hood and inside the UI) needed configuration scripts which could be run against a newly built machine.

These scripts allowed us to leverage them in Vagrants, which we use heavily to plan out our upgrade process before we touch any non-production environment where end-users might be testing. We upgrade the majority of our non-production environments before attempting a production environment.

Using Ansible and yum / RPMs has been a boon to our ability to satisfy "last-down / first-up" when the world is burning.

Avoid making hand-tweaks at all costs. Make it a script, so that it is repeatable if you have to rebuild a server.
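The "no hand-tweaks" discipline above amounts to expressing configuration as desired state and applying it idempotently, so the same script works against a freshly rebuilt machine or an already-configured one. This is a hedged sketch of the principle, not our actual tooling; the MySQL-style keys are placeholders:

```python
# Sketch: apply desired-state configuration idempotently. Running it a
# second time against the result produces no further changes, which is
# what makes rebuilds repeatable. Keys are illustrative placeholders.

def apply_desired_state(current, desired):
    """Return (new_config, changes); re-runnable with no drift."""
    new_config = dict(current)
    changes = {}
    for key, value in desired.items():
        if current.get(key) != value:
            new_config[key] = value
            changes[key] = value  # record only what actually changed
    return new_config, changes
```

In practice we express this through Ansible playbooks and RPMs, but the convergence property is the same: the script, not a human, is the source of truth.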

ScienceLogic SL1 Support

We are working through the reproducibility cost-of-entry issues.
There is room for improvement, but I also clearly want to commend the support team for their responsiveness and remediations when our stacks have significant problems.
Pros
  • Quick Resolution
  • Knowledgeable team
  • Kept well informed
  • Support cares about my success
  • Quick Initial Response
Cons
  • Poor followup
  • Problems left unsolved
  • Escalation required
  • Need to explain problems multiple times
Yes - Cisco IT is big, we're Enterprise, and ScienceLogic is diving into the Enterprise Use Case in a noticeable, deliberate and methodical way.
Cisco IT expects premium support and as such it is inherently part of our contract.

Beyond that, Cisco IT's relationship with ScienceLogic is both vendor-customer but also Cisco-on-Cisco-Partner, so we have many avenues of bi-directional interaction and support.
Yes - My bugs are long-range bugs, so reporting them to Support is a poor fit (all Support organizations are quick-fix / break-fix / remediation organizations; they are not long-term bug reporting organizations).

For issues where Support is the appropriate venue, ScienceLogic Support does a better than average job. They are more attentive than I can be responsive.

Our Customer Success Manager is extremely helpful in allowing us to have long-range feature influence with the product team.
Our EAST1 stack had been suffering from repeated Medium Rows Behind scenarios with an MTBF of around 10 days.
One night I needed to report the issue because it didn't look like it could wait for morning.
They understood the severity of the issue for Cisco IT.
They moved the issue around their follow-the-sun support model.
Even though I had to hand the issue off to my follow-the-sun support team offshore, they were still able to get to a real root cause and remediate the symptom within a few hours of my falling off of the call.

The complexity of the product sometimes forces the invocation of the support team with an impacting severity and the priority-response portion of the support team was awesome!

The normal priority support team supports us in very key ways, which, while not "emergency" in nature, should still be recognized as above average support.

Using ScienceLogic SL1

The core functions are there.
The complexity is due to the complexity of the space.
The score is based on comfort (I no longer notice the legacy UI) and the promise that I see in the 8.12 Unified UI (a vast improvement).
It is also based on the fact that with 8.12, you can now do everything in the new UI but you still have the legacy UI as a fallback (which should now be unnecessary for new installations).
Further, the score is influenced by ScienceLogic being a Cisco Partner and while Cisco IT is a customer, we are also executing on internal expectations to make our partners successful by constructive (and sometimes pressured) criticism to better the platform.
Pros
  • Like to use
  • Technical support not required
  • Well integrated
  • Feel confident using
Cons
  • Unnecessarily complex
  • Slow to learn
  • Cumbersome
  • Once configured, the runbook automations can be magical. Doing the configurations requires technical skills. The Cisco IT team pre-populates the obvious runbook automations and actions and all configurations so that most of our users don't have to touch them. They just have to enable them. The Cisco IT team pre-filtered event types down from nearly 5000 to just about 25 interesting event types.
  • Once designed, the authentication model works moderately well. Cisco IT has operators, leads and (platform) admins. The configuration design was interesting because our first swing was to grant rights per Organization. We found that a two-dimensional rights / orgs matrix made this simpler. We almost have just two rights levels, but the leads have a very few extra rights which the operators don't have.
  • Cisco IT takes ServiceNow data for clusters and paints IT Service configuration onto each of our stacks so that our end-users don't have to deal with the configuration of an IT Service. (Doing that integration took a little time.) We are looking for ways for ScienceLogic to consume our implementation semantics so that the behaviors make it into the product, presumably via Integration Server.
  • Cisco IT goes to great lengths to allow the vast majority of our users to ignore most of the legacy user interface by focusing on a few local / cultural words to trigger behaviors which those users need to execute (RunTest, AppMon Bypass). The 8.12 Unified UI solves these issues. We are currently evaluating when we can move our end-users to that Unified UI later this summer.
  • Designing whether ScienceLogic should participate with ServiceNow in a multi-master scenario or if ScienceLogic should be the single-master or if ServiceNow should be the single-master requires assistance. Cisco IT prefers ServiceNow as our single-master allowing us the flexibility of repainting a configuration downstream if things go sideways. This is an Enterprise Use Case concern.
  • Cisco IT generates Availability and Critical Ping runbook automations for each of our end-user teams (by Organization). The design and implementation work was left as an exercise to Cisco IT. We will be looking for ways to encourage ScienceLogic to make this Enterprise Use Case concern simpler for future Enterprise adoptions. Our implementation was complicated by our desire to NOT use ServiceNow group data. We expect this to be migrated to the new Integration Server as we continue to morph this integration over time.
  • Multiple stacks (an Enterprise Use Case concern) is a challenge for configuration, especially if you design for rebuild repeatability. We've lost disks on our secondaries on one stack and have been forced to rebuild because we failed to estimate our disk capacity correctly. We originally estimated 1.2TB, but due to our scale, we rebuilt (several times) against 4TB internationally and 26TB domestically using RAID-0 with a dual-data center DR configuration. We found that RAID-6 cost us database speed on UCS 220C bare metals. We found that RAID-10 was better, but that RAID-0 was the fastest. We accept the DR fail-over model as our resiliency strategy rather than a full backup (since we're downstream from the CMDB and almost everything is disposable and repaintable from the single-master). We didn't find out these problems until after we had started deploying to the production environment which meant full rebuilds on each of our 8 stacks. The reconfiguration after each rebuild was made simple only because we reverse-engineered the authentication configuration and all other central database configurations and pushed those into our Ansible playbooks.
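The per-Organization pre-generation described above (default Availability and Critical Ping runbook automations stamped out for every team) can be sketched as a simple generator. The automation structure shown is an illustrative assumption, not SL1's actual RBA schema:

```python
# Sketch: stamp out default runbook automations for every Organization
# so teams only have to enable them ("these are yours; play with them").
# The dict shape is a placeholder, not the real SL1 automation record.

DEFAULT_AUTOMATIONS = ("Availability Failure", "Critical Ping Failure")

def pregenerate_rbas(organizations):
    """One disabled-by-default automation per (org, template) pair."""
    return [
        {"org": org, "name": f"{org}: {template}", "enabled": False}
        for org in organizations
        for template in DEFAULT_AUTOMATIONS
    ]
```

Generating them disabled means nothing fires until a team opts in, and deleted copies can simply be regenerated from the same templates.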
Not Sure - The legacy and new user interfaces seem to work well from an iPad.
While I've seen reference to the mobile UI, I've not played with it as Cisco IT has deliberately designed around the need for it.

ScienceLogic SL1 Reliability

Our deployment model is vastly different from product expectations.

Our global / internal monitoring footprint is 8 production stacks in dual data centers, with 50% collection capacity allocated to each data center and a minimal number of collection groups.

General Collection is our default collection group.
Special Collection is for monitoring our ASAs and other hardware that cannot be polled by a large number of IP addresses, so this collection group is usually 2 collectors.

Because most of our stacks are in different physical data centers, we cannot use the provided HA solution. We have to use the DR solution (DRBD + CNAMEs).

We routinely test power in our data centers (yearly).

Because we have to use DR, roughly half of the time there is an outage we require a hand-touch to flip nodes and change the DNS CNAME (by design). When the outage is planned, we do this ahead of time so that we don't care that the Secondary has dropped away from the Primary.

Hopefully, we'll be able to find a way to meet our constraints and improve our resiliency and reduce our hand-touch in future releases. For now, this works for us and our complexity. (I hear that the HA option is sweet. I just can't consume that.)
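The hand-touch flip above is simple but easy to get out of order under pressure, so we treat it as a scripted checklist. A minimal sketch of that decision (the stack names, CNAME targets, and DRBD resource name are hypothetical, not Cisco or SL1 tooling):

```python
# Sketch of the manual DR flip: if a stack's primary is unhealthy,
# promote the DRBD secondary and repoint the stack's CNAME.
# Hostnames, the "r0" resource, and example.com are all assumptions.

def plan_failover(stack, primary_healthy):
    """Return the ordered operator actions for one stack.
    Empty list means no flip is needed."""
    if primary_healthy:
        return []
    return [
        f"drbdadm primary r0  # run on {stack}-secondary after demoting the old primary",
        f"repoint CNAME {stack}.example.com -> {stack}-secondary.example.com",
    ]

actions = plan_failover("stack3", primary_healthy=False)
```

For a planned outage we run the same steps ahead of time, which is why the Secondary dropping away from the Primary during the window doesn't concern us.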
For us, monitoring needs to degrade gracefully. That means that, by design, you allow your monitoring system to begin giving you less and less information until an ultimate failure point. Our motto is Last-Down / First-Up and our deployment architecture reflects specific design choices to minimize being part of someone else's failure-mode.

ScienceLogic SL1 at 8.12 provides a wealth of monitoring data. This comes at a cost as you scale up.

Can you consume the wealth of monitoring data?
Can you consume the wealth of warnings and errors that begin to be generated as you scale up?

There are challenges, most of which are due to our deployment design choices and scale.

We are working with ScienceLogic to introduce Priority as a first-class data element. We've added this as a Custom Attribute for all of our CIs. We use this to distinguish between the failure of a development host vs the failure of our production website.
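As an illustration of how we use that attribute (the rank and weight tables below are hypothetical, not SL1 code), the idea is that CI priority multiplies event severity, so a critical event on a development host still ranks below a major event on the production website:

```python
# Illustrative sketch: combine event severity with a per-CI
# "Priority" custom attribute. The scales here are made up.

SEVERITY_RANK = {"healthy": 0, "notice": 1, "minor": 2,
                 "major": 3, "critical": 4}
PRIORITY_WEIGHT = {"P1": 100, "P2": 10, "P3": 1}  # hypothetical scale

def effective_priority(event_severity, ci_priority):
    """Weight an event's severity by the CI's business priority."""
    return SEVERITY_RANK[event_severity] * PRIORITY_WEIGHT.get(ci_priority, 1)

prod = effective_priority("major", "P1")      # production website
dev = effective_priority("critical", "P3")    # development host
```

With first-class Priority support in the product, this kind of weighting could drive notification and incident routing directly instead of living in our custom attributes.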

The score represents the known opportunity for improvement that we see on our joint roadmaps.
Care should be taken when creating dashboards which present all of the collected data available to you.
We have many routers at work, and almost all of them have 10K interfaces each.

Creating dashboards which present many different views of many different large routers can create undue database load. (We actively discourage their use as a long-running process because each dashboard widget gets its own database connection, not a shared one.) We will kill any query in the database that runs longer than 30 minutes.
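Our 30-minute kill policy amounts to sweeping the database's process list and killing anything over the limit. A sketch of that logic against a simulated process list (in production this would read MySQL's `information_schema.PROCESSLIST` and issue real `KILL` statements; the policy is ours, not an SL1 feature):

```python
# Sketch of our long-query reaper, shown against a simulated
# processlist rather than a live database connection.

MAX_QUERY_SECONDS = 30 * 60  # our 30-minute policy

def queries_to_kill(processlist):
    """Return KILL statements for active queries over the time limit.
    Idle (Sleep) connections are left alone."""
    return [f"KILL {p['id']};" for p in processlist
            if p["command"] == "Query" and p["time"] > MAX_QUERY_SECONDS]

sample = [
    {"id": 101, "command": "Query", "time": 42},     # fine
    {"id": 102, "command": "Query", "time": 5_400},  # 90 minutes: kill
    {"id": 103, "command": "Sleep", "time": 9_999},  # idle, leave alone
]
kills = queries_to_kill(sample)
```

This is a blunt instrument, but it keeps runaway dashboard widgets from starving the collectors of database capacity.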

The database schema design makes interesting trade-off decisions (optimizing for ease of write at the expense of complex / cross-device reads).

The new GraphQL schema model looks promising, but we're still waiting for that to begin to appear in the SL1 core functionality.

Given enough hardware and budget, you can go a long way before you start pushing the boundaries of the platform architecture.

Being conservative in what you monitor for the vast majority of your CIs allows you to retain compute capacity for dashboards, runbook engines and the like.

Having large routers or vCenters will require an upscaling of your data collector sizings just to complete the Device Component Map data collection.

Relationship with ScienceLogic

Our sales process was unique as we were bolting onto an existing and large contract with Cisco Managed Services (CMS).

As a Cisco Partner, ScienceLogic recognized the role that Cisco IT plays in the success of Cisco engineering business units.

They provided a very competent technical account manager on site with us for a limited amount of time. He answered design, deployment, and configuration questions as we had them, allowing us to assimilate the product capabilities at our own pace as we worked through the bake-off comparison with other vendors.
The post-sales engagement was periodic onsite presence by the Technical Account Manager who helped design our deployment and facilitated our early conversations with product management and product engineering teams about some of our long-term concerns and desires.

The success of our deployment, regardless of the hiccups, is a testament to this Technical Account Manager.
I was not involved in the negotiation process.
Once you sign, figure out how you want to deploy. This is far easier as an Enterprise Use Case deployment as you already have specific must-do capabilities that you have to fold into the product somehow.

Be precise. Write clear and unambiguous PROBLEM STATEMENTS. Draw diagrams. Articulate your onboarding requirements. Articulate your end-user communities. Challenge the initial spreadsheet which you'll get when you're being helped with the initial deployment planning.

Identify your MUST-DO and WOULD-BE-NICE criteria.

Know your own skillset contributions to the deployment. Expecting them to deploy for you with zero skin in the game on your side won't be pleasant or successful. You can get them to do the deployment for you, but they still come from a Managed Service Provider semantic, and their default reaction is based on that use case. If you are an Enterprise Use Case, your concerns are at times inverted from their default mode, and you have to make sure that what they give you matches your use cases. If your use cases are not crystal clear and written down, you will be unhappily surprised by being deployed with the wrong deployment use case.

Know when you want simplicity for your end-users.
Know when you need to apply your own elbow grease to maintain that simplicity for your end-users.

Work with the vendor. Don't just expect them to read your mind.

Upgrading ScienceLogic SL1

Yes - I stepped in to take over the platform upgrades (8 stacks of 2 baremetals in dual data centers). We have a huge deployment. Your times will be far smaller.
Previously, the two minor upgrades seemed to take a long time and seemed to have lots of problems.

At our 8.5.1.2 to 8.9.0 upgrade, we quantified for ScienceLogic the number of pain points required to upgrade and regain a measure of stability.
We suggested that use of RPM under the hood would ease their upgrade problems.
These were painful at 8-10 hour days per stack with a fair amount of cascading problems after the upgrade.

We worked with ScienceLogic to prepare for the 8.12 upgrade and had them verify our upgrade procedures.

At our 8.9.0 to 8.12.0.1 upgrade, we still had problems, but the overall stress of 8-10 hour days was replaced by leisurely 4-6 hour days where one could take an hour out for lunch away from the desk.

Our upgrade process is now:
1. Install and upgrade on a Vagrant instance; write the procedure.
2. Upgrade our non-prod environments.
3. Upgrade our prod environments.
4. Consult with ScienceLogic on upgrade challenges when complete.

  • Getting over the 8.10 update process change. Future upgrades are expected to be simpler.
  • Getting closure on the delivery of the device role-based access capability in the new user interface. Expected in 8.14.
  • Staying relatively current in terms of patch updates. Getting too far behind makes the process painful and complex.
  • ScienceLogic Engineering has moved to semi-annual releases.
  • We are timed to upgrade shortly after each semi-annual release.
  • Device role-based access in the new user interface. (As an Enterprise Use Case customer, we want our teams to see most information across organizational boundaries without having to elevate rights inappropriately and risk exposure of credentials to unrelated organization members.)
  • 8.14 will be our production platform on which we deploy our Integration Service (replacing the now end-of-lifed SyncServer) to handle our basic ServiceNow / CMDB integrations.
  • Enabling our transition away from bolt-on ServiceNow integrations into Integration Service integrations so that we can simplify our integration costs and platform dependencies.
Yes - We are already at the Enterprise level. We will not be changing that. (The form doesn't allow me to clarify my "No".)