WCCILevent - REDU/SEVERE - EventManager, evMain, Redundant peer recovery timeout - aborting recovery (active recovery)

This article explains a log message that can appear during startup of a redundant system when the recovery of the event manager fails. The message is written to the PVSS_II.log file.

WCCILevent (0), 2014.09.24 14:03:35.975, REDU, SEVERE, 54, Unexpected state, EventManager, evMain, Redundant peer recovery timeout - aborting recovery

Log message with symbolic names:

WCCILevent (0), <TIMESTAMP>, REDU, SEVERE, 54, Unexpected state, EventManager, evMain, Redundant peer recovery timeout - aborting recovery

The message is written on the running system, which performs the active recovery, when the allowed recovery time is exceeded. The maximum recovery time is set with the following entry in the [event] section of the config.redu file (value in seconds):

activeRecoveryTimeout = 3600

The timer starts when the recovery is initiated by the project on the other server, which is starting up. Within the timeout, the database recovery, the project startup and the event manager recovery (exchange of buffered data) must all complete.

The timeout can be reached when the database recovery and/or the project startup takes very long (slow network, insufficient read/write performance of the hard disk) or when a large amount of buffered data needs to be exchanged.
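
The relationship between the three phases and the timeout can be illustrated with a small calculation. This is only a sketch: the function name and all durations are hypothetical, and it assumes the three phases run one after another, so their durations simply add up. Real durations have to be measured on your own system.

```python
def recovery_headroom(timeout_s: int,
                      db_recovery_s: int,
                      project_startup_s: int,
                      event_recovery_s: int) -> int:
    """Return the remaining seconds before activeRecoveryTimeout expires.

    A negative result means the phases together exceed the timeout,
    which is the situation in which the active system aborts the recovery.
    Assumption: the phases are sequential and their durations add up.
    """
    total_s = db_recovery_s + project_startup_s + event_recovery_s
    return timeout_s - total_s


# Hypothetical example values, not measurements from a real project.
slow_system = recovery_headroom(3600, 1800, 900, 1200)
fast_system = recovery_headroom(3600, 600, 300, 900)
```

With these made-up numbers the first case overruns the timeout by 300 seconds, while the second leaves 1800 seconds of headroom.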

To change the timeout, set the entry in a config.redu file stored in your project.

If the timeout is reached, you will see the following block of log messages. They show that the local system aborts the recovery and that the event manager informs the data manager:

WCCILevent (0), <TIMESTAMP>, REDU, SEVERE, 54, Unexpected state, EventManager, evMain, Redundant peer recovery timeout - aborting recovery
WCCILdata (0), <TIMESTAMP>, REDU, WARNING, 0, , Recovery request aborted from event.

On the other server, the project writes a corresponding block of log messages. The first message shows that the recovery abort from the peer server was received. The data manager then closes the connection to the data manager on the other server and stops, so that the recovery and startup can be initiated again:

WCCILdata (0), <TIMESTAMP>, REDU, WARNING, 0, , Recovery aborted from data.
WCCILdata (0), <TIMESTAMP>, SYS, INFO, 181, Closing connection to (SYS: 0 Data -num 0 CONN: 2)
WCCILdata (0), <TIMESTAMP>, REDU, WARNING, 54, Unexpected state, DataManager, passiveRecovery, Lost connection to other replica while receiving updates.
WCCILdata (0), <TIMESTAMP>, SYS, INFO, 39, Connection lost, MAN: (SYS: 0 Data -num 0 CONN: 2), Connection closed
WCCILdata (0), <TIMESTAMP>, SYS, INFO, 2, Manager Stop
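
To check a log file for this situation, the timeout message can be searched for automatically. The following is a minimal sketch, not a vendor tool: the line layout in the regular expression is inferred only from the sample lines in this article, and the names ReduLogEntry and find_recovery_timeouts are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple


class ReduLogEntry(NamedTuple):
    manager: str
    timestamp: str
    severity: str
    text: str


# Assumed layout, taken from the sample lines above:
# "<manager> (<num>), <timestamp>, <type>, <severity>, <code>, <rest>"
LINE_RE = re.compile(
    r"^(?P<manager>WCCIL\w+) \(\d+\), (?P<timestamp>[^,]+), "
    r"(?P<type>\w+), (?P<severity>\w+), (?P<code>\d+), (?P<text>.*)$"
)

TIMEOUT_TEXT = "Redundant peer recovery timeout - aborting recovery"


def find_recovery_timeouts(lines: Iterable[str]) -> List[ReduLogEntry]:
    """Return all REDU entries that report the recovery timeout."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # line does not follow the assumed layout
        if match["type"] == "REDU" and TIMEOUT_TEXT in match["text"]:
            hits.append(ReduLogEntry(match["manager"], match["timestamp"],
                                     match["severity"], match["text"]))
    return hits
```

The function accepts any iterable of lines, so an open file object for the PVSS_II.log file of the project can be passed in directly.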