Our global / internal monitoring foot print is 8 production stacks in dual data centers with 50% collection capacity allocated to each data center with minimal numbers of collection groups.
General Collection is our default collection group.
Special Collection is for monitoring our ASA and other hardware that cannot be polled by a large number of IP addresses, so this collection group is usually 2 collectors).
Because most of our stacks are in different physical data centers, we cannot use the provided HA solution. We have to use the DR solution (DRBD + CNAMEs).
We routinely test power in our data centers (yearly).
Because we have to use DR, we have a hand-touch to flip nodes and change the DNS CNAME half of the times when there is an outage (by design). When the outage is planned, we do this ahead of the outage so that we don't care that the Secondary has dropped away from the Primary.
Hopefully, we'll be able to find a way to meet our constraints and improve our resiliency and reduce our hand-touch in future releases. For now, this works for us and our complexity. (I hear that the HA option is sweet. I just can't consume that.)