Understanding failover and recovery

After a previously degraded or unavailable server has recovered, it should become eligible to receive traffic within the interval configured by the health-check-frequency property, which is 30 seconds by default. However, failover and recovery behavior also depends on the load-balancing algorithm in use.

The load-balancing algorithm provides an ordered list of servers to check, with the number of servers in the list based on the maximum number of retry attempts. The PingDirectoryProxy server checks whether affinity should be used and, if so, whether an affinity is set for that load-balancing algorithm.

If there is an affinity to a particular server and that server is classified as available, that server is always the first in the list.

Next, the PingDirectoryProxy server creates a two-dimensional matrix of servers based on:

The health check state, with available preferred over degraded and unavailable not considered at all
Location, with backend servers in the same location as the PingDirectoryProxy server most preferred, then servers in the first failover location, then the second, and so forth

Within each of these sets, the load-balancing algorithm selects servers in order from most preferred to least preferred, using whatever logic that algorithm implements. Ideally, at least one server in the local data center is classified as available. The load-balancing algorithm keeps selecting servers until enough of them have been selected to satisfy the maximum number of possible retries.

The load-balancing algorithm includes a configuration option that lets you choose whether to prefer location over availability or availability over location; for example, whether a local degraded server is more or less preferred than a remote available server.

By default, the algorithm prefers available servers over degraded ones, even if it has to go to another data center to access them. You can change the load-balancing algorithm to try to stay in the same data center as long as at least one server there is available or degraded.
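
The ordering described above can be illustrated with a short Python sketch. It is not the PingDirectoryProxy implementation: the Backend class, the candidate_order function, and the prefer_location and affinity parameters are hypothetical names chosen for this example, and ties within a health and location cell are simply left in input order where a real algorithm would apply its own logic.

```python
from dataclasses import dataclass

# Lower rank means more preferred. Unavailable servers are never candidates.
HEALTH_RANK = {"available": 0, "degraded": 1}


@dataclass(frozen=True)
class Backend:
    name: str
    health: str         # "available", "degraded", or "unavailable"
    location_rank: int  # 0 = local, 1 = first failover location, and so on


def candidate_order(servers, max_attempts, prefer_location=False, affinity=None):
    """Return up to max_attempts backend names, most preferred first."""
    eligible = [s for s in servers if s.health in HEALTH_RANK]

    def sort_key(s):
        health, location = HEALTH_RANK[s.health], s.location_rank
        # Default: availability outranks location. The flag reverses that.
        return (location, health) if prefer_location else (health, location)

    # sorted() is stable, so servers in the same cell keep their input order.
    ordered = sorted(eligible, key=sort_key)

    # An affinity server that is classified as available always goes first.
    if affinity is not None:
        pinned = [s for s in ordered
                  if s.name == affinity and s.health == "available"]
        ordered = pinned + [s for s in ordered if s not in pinned]

    return [s.name for s in ordered[:max_attempts]]
```

With one degraded local server and one available remote server, the default call lists the remote server first, while prefer_location=True lists the local degraded server first. An unavailable server never appears in either result.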

The PingDirectoryProxy server does both proactive and reactive health checking.

Proactive health checking: The PingDirectoryProxy server periodically (by default, every 30 seconds) runs a full set of tests against each backend server. The results of these tests are used to determine the overall health check state (available, degraded, or unavailable) and score (an integer value from 10 to 0).
Reactive health checking: The PingDirectoryProxy server runs a smaller set of health checks against a server if an operation forwarded to that server did not complete successfully.

Proactive health checking can be used to promote and demote the health of a server. Reactive health checking can only be used to demote the health of a server. As a result, if a server is determined to be unavailable, then it will remain that way until a subsequent proactive health check determines that it has recovered. If a server is determined to be degraded, it might not become available until the next proactive health check, but it could be downgraded to unavailable by a reactive check if other failures are encountered against that server.
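
The promote and demote rules can be modeled as a small state-update function. This is a minimal sketch of the rule as described here, not the product's code: the apply_check function and the "proactive" and "reactive" labels are assumptions made for illustration.

```python
# Severity order: a higher number means a less healthy state.
SEVERITY = {"available": 0, "degraded": 1, "unavailable": 2}


def apply_check(current, observed, kind):
    """Return the new health state after a check reports `observed`.

    kind is "proactive" or "reactive" (labels chosen for this sketch).
    """
    if kind == "proactive":
        # A proactive check can promote or demote a server.
        return observed
    if kind == "reactive":
        # A reactive check can only demote; it never improves the state.
        return observed if SEVERITY[observed] > SEVERITY[current] else current
    raise ValueError(f"unknown check kind: {kind}")
```

For example, a reactive result of "available" leaves an unavailable server unavailable, whereas the same result from a proactive check restores it. A reactive result of "unavailable" still moves a degraded server to unavailable.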

Both proactive and reactive health check assignments take effect immediately and are considered for all subsequent requests routed to the load-balancing algorithm. If a server is considered degraded, then it’s immediately considered less desirable than available servers in the same data center, and possibly less desirable than available servers in more remote data centers. If a server is considered unavailable, then it’s not eligible to be selected until it is reclassified as available or degraded.