HA Automatic Failover and Failback

Definitions

Automatic Failover: The process of switching over from the Primary server, while it is active, to the Secondary server when the system detects the failure of a critical component on the Primary server. For this to occur, the Secondary server must be operational.
Automatic Failback: The process of returning the system to the normal operational state, in which the Primary server is active and the Secondary server is passive. This automatic switchover occurs only when the Secondary server is active, the Primary server is operational, and the Automatic Failover feature has been enabled.

Overview

Automatic failover is the process in which Prognosis Server detects a failure on the Primary server by itself and automatically switches over the active and passive servers.

When the system detects a failure of a critical component on the Primary server, and the Secondary server is functioning properly, an automatic switchover will occur. This operation will make the Secondary server active since the Primary server is unresponsive. 

After an automatic failover has occurred, the automated failover detection is temporarily disabled to prevent any further switchovers until the cause has been investigated and resolved.

After an automatic failover has occurred, take the following steps to return the system to the normal operating state:

  1. Analyze the Primary server and bring it back online.
  2. Re-enable the automated failover detection (see Enabling HA Autofailback).
  3. The system then returns to the normal operating state by automatically switching over the servers so that the Primary server is active again.
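The failover and failback rules described above can be sketched as a small state model. This is only an illustration under stated assumptions: the `HAPair` class, its fields and the action names are hypothetical and are not part of Prognosis or of IRHAMGR.

```python
from dataclasses import dataclass


@dataclass
class HAPair:
    """Hypothetical model of an HA pair; not a Prognosis API."""
    active: str = "primary"          # which server is currently active
    detection_enabled: bool = True   # automated failover detection switch

    def evaluate(self, primary_ok: bool, secondary_ok: bool) -> str:
        """Apply one monitoring result and return the action taken."""
        if not self.detection_enabled:
            return "none"  # detection stays off until it is re-enabled
        if self.active == "primary" and not primary_ok and secondary_ok:
            self.active = "secondary"
            self.detection_enabled = False  # block further switchovers
            return "failover"
        if self.active == "secondary" and primary_ok:
            self.active = "primary"
            return "failback"
        return "none"

    def reenable_detection(self) -> None:
        """Step 2 above: re-enable detection once the Primary is fixed."""
        self.detection_enabled = True


pair = HAPair()
actions = [
    pair.evaluate(primary_ok=False, secondary_ok=True),  # Primary fails
    pair.evaluate(primary_ok=True, secondary_ok=True),   # detection is off
]
pair.reenable_detection()
actions.append(pair.evaluate(primary_ok=True, secondary_ok=True))
print(actions)  # ['failover', 'none', 'failback']
```

In this sketch the Primary coming back online does nothing on its own. The failback happens only after detection has been re-enabled, which mirrors the order of the steps above.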

The system uses the following mechanisms to monitor and detect system failures.

Enable Auto Failover

The automatic failover feature is not enabled by default and requires some manual setup. To enable it, perform the following steps.

Log on to the Web Application - Administration Tool on the Management Server.

In the navigation panel on the left, click Home, then select the Management Server of the HA Pair from the server navigation panel.

Start the Threshold

This threshold collects health information from the HA Pair servers and ensures that, even in the case of an unplanned outage, heartbeat information is still received from both servers. For instructions, see Thresholds and Alerts for HA.

Start the IRHAMGR Process

The IRHAMGR process receives heartbeat information from the threshold and, depending on the state of the system, decides whether to perform a switchover. For instructions, see Starting the HA Manager.

Customizing the Automatic Switchover

With the default settings, automatic detection and failover take about 50 seconds before the cutover occurs. If this is not desirable, the threshold can be set up for a faster or slower response time.

There are 7 conditions in the threshold. The following descriptions explain how they are used and what can be changed to reconfigure the response times.

Each condition name is listed below, followed by its description.

<Server>Up

There are two of these, one each for the Active and Passive servers. This configures the number of heartbeats to receive before notifying the IRHAMGR that the machine is up.

<Server>Down

There are two of these, one each for the Active and Passive servers. This configures the number of heartbeats to receive before notifying the IRHAMGR that the machine is down.

<Server>Down Heartbeat

Waiting for N intervals helps to delay decisions, but we also don't want to fail over to a machine that is unstable. This condition sends a message to IRHAMGR every time it receives a down heartbeat, so that IRHAMGR knows immediately when a machine might not be stable enough for a failover.

The Down Heartbeat condition doesn't tell IRHAMGR that the server is down. Rather, it tells IRHAMGR that the server's state is uncertain and that it should wait for more information before performing any actions.
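The interplay of the Up, Down and Down Heartbeat conditions for one server can be sketched as follows. This is a minimal illustration, not the threshold's actual implementation. The `ServerCondition` class and the message names `UP`, `DOWN` and `UNSURE` are hypothetical, and it is an assumption that the heartbeats counted must be consecutive.

```python
from typing import List, Optional


class ServerCondition:
    """Hypothetical model of the three heartbeat conditions for one server."""

    def __init__(self, log_after: int = 5) -> None:
        self.log_after = log_after              # heartbeats needed before reporting
        self.last_state: Optional[str] = None   # state of the latest heartbeat
        self.streak = 0                         # consecutive heartbeats in that state

    def on_heartbeat(self, is_up: bool) -> List[str]:
        """Process one heartbeat and return the messages sent to IRHAMGR."""
        messages: List[str] = []
        state = "UP" if is_up else "DOWN"
        if state == self.last_state:
            self.streak += 1
        else:
            self.last_state, self.streak = state, 1
        if not is_up:
            # Down Heartbeat: sent on every down heartbeat (assumed name).
            messages.append("UNSURE")
        if self.streak == self.log_after:
            # Up/Down: sent once the configured count is reached.
            messages.append(state)
        return messages


cond = ServerCondition(log_after=3)
log = [cond.on_heartbeat(up) for up in (True, False, False, False)]
print(log)  # [[], ['UNSURE'], ['UNSURE'], ['UNSURE', 'DOWN']]
```

The first down heartbeat already produces an `UNSURE` message, while `DOWN` is reported only after the third one. That is the difference between the Down Heartbeat condition and the Down condition.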

HA Auto-Failover Logging

Checks WVLOG for any errors relating to HA auto-failover within the last hour. The default interval is 10 seconds, with user acknowledgment and no off event. This means that the alert is closed when the user acknowledges it, regardless of whether the problem has been resolved.

Default destinations are problem summary (PROBSUM), SNMP Trap and Dispatch Manager. Further customization can be done for other destinations if needed.

Customization

The number of heartbeats received before UP or DOWN is reported to IRHAMGR and the heartbeat interval are both customizable. The default is 5 heartbeats with a 10 second interval, which means 50 seconds before UP or DOWN is reported.

These variables can be customized in the timing tab of each Threshold Condition.

  • The number of Heartbeats can be configured in the Log after section
  • The interval of each heartbeat is configured in the Check Interval section
  • This will need to be done for all 6 conditions separately (<Server>Up, <Server>Down and <Server>Down Heartbeat)
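The effect of these two settings on the response time is simple arithmetic, sketched below. The function and parameter names are hypothetical. They only mirror the Log after and Check Interval settings described above.

```python
def reporting_delay_seconds(log_after: int = 5, check_interval_s: int = 10) -> int:
    """Seconds before UP or DOWN is reported: heartbeats times interval."""
    return log_after * check_interval_s


print(reporting_delay_seconds())      # 50 (the defaults: 5 heartbeats, 10 seconds)
print(reporting_delay_seconds(3, 5))  # 15 (a faster, illustrative setting)
```

Lower values give a faster response but leave less time to ride out a short interruption, so any change is a trade-off between speed and stability.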

Availability Monitoring

Prognosis High Availability relies on Availability Monitoring to ensure that its entities are up and operational.

There are two key components to Availability Monitoring: the High Availability Monitoring nodes, which are responsible for self-monitoring of the critical Prognosis components, and the Management Server, which is responsible for monitoring the heartbeat of the self-monitoring components running on the Monitoring Servers.

Introduced in version 11.9, these Availability Monitoring components are automatically configured for High Availability Pairs when linked using the Web Application - Administration tool. When upgrading from earlier versions, these entries will need to be entered manually.

For the High Availability Monitoring Server, the following Availability options need to be included to allow for High Availability components:

SUBSYS AVAILABILITY

...

! Start Monitor HA as an application
PORT ADDPORT (1970)
PROCEXE ADDEXE (irhamgr)
PROCEXE ADDEXE (irhasync)
APPLICATION ADDEXE( "PrognosisHealthHA", "irhamgr" )
APPLICATION ADDEXE ( "PrognosisHealthHA", "irhasync" )
APPLICATION ADDEXE( "PrognosisCriticalHA", "irhamgr" )
APPLICATION ADDEXE ( "PrognosisCriticalHA", "irhasync" )
APPLICATION ADDAPP( "PrognosisHealth", "PrognosisHealthHA" )
APPLICATION ADDAPP ( "PrognosisCritical", "PrognosisCriticalHA" )
! End Monitor HA as an application

For the High Availability Management Server, a new entry must be made for each Prognosis High Availability Pair to ensure that the Autofailover Threshold can trigger the failover and failback conditions.

SUBSYS AVAILABILITY

...

! Start <pairName>
APPLICATION ADDAPP (HAPAIR-<pairName>, \<primaryNodeName>.PrognosisCritical, 51)
APPLICATION ADDAPP (HAPAIR-<pairName>, \<secondaryNodeName>.PrognosisCritical, 49)
! End <pairName>
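When several pairs have to be entered manually, the per-pair entries can be generated from the template above. The helper below is a hypothetical convenience sketch and not a Prognosis tool. It only fills in the three placeholders and copies the values 51 and 49 verbatim from the template. The pair and node names in the example are made up.

```python
def ha_pair_entries(pair_name: str, primary_node: str, secondary_node: str) -> str:
    """Return the AVAILABILITY entries for one HA Pair as plain text."""
    primary_ref = "\\" + primary_node + ".PrognosisCritical"
    secondary_ref = "\\" + secondary_node + ".PrognosisCritical"
    lines = [
        f"! Start {pair_name}",
        f"APPLICATION ADDAPP (HAPAIR-{pair_name}, {primary_ref}, 51)",
        f"APPLICATION ADDAPP (HAPAIR-{pair_name}, {secondary_ref}, 49)",
        f"! End {pair_name}",
    ]
    return "\n".join(lines)


# Illustrative names only; substitute the real pair and node names.
entries = ha_pair_entries("PairA", "NODE1", "NODE2")
print(entries)
```

The generated text still has to be placed inside the SUBSYS AVAILABILITY section of the Management Server configuration by hand.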