NGA - High Availability Configuration (HAC)

This chapter provides an overview of the High Availability Configuration for redundant or non-redundant WinCC OA projects (High Availability Configuration - HAC) and explains the roles of the components etcd and Patroni, which are required to set up HAC.

The HAC is based on the streaming replication and hot standby capabilities of a PostgreSQL® cluster, which is managed and monitored by the Patroni tool. The HAC ensures that the PostgreSQL® cluster remains available and consistent, even in the event of node or network failures.

As shown in the figures Patroni Architecture - Single / Redundant Project below, in the HAC the PostgreSQL® backends of the Next Generation Archiver (NGA) communicate directly with a two-node PostgreSQL® cluster. Each of the PostgreSQL® nodes is monitored and managed by a corresponding Patroni instance. These operate independently of WinCC OA. Patroni manages the PostgreSQL® processes, handles node failover, and relies on PostgreSQL®'s built-in streaming replication and hot standby capabilities (https://www.enterprisedb.com/docs/supported-open-source/patroni/).

In a HAC, Patroni is configured to use etcd (/ˈɛtsiːdiː/, “distributed etc directory”), a highly consistent distributed key-value store, to store cluster information such as Patroni and PostgreSQL® cluster configuration, the leader node of Patroni, and diagnostics. This architecture supports the election of the Patroni leader node and failover in the PostgreSQL® cluster without manual intervention.

Figure 1. Patroni Architecture - Single / Redundant Project

Patroni

On each PostgreSQL® database server, an associated Patroni instance runs as a service. It controls the local PostgreSQL® processes and ensures consistency by preventing split-brain scenarios. Together, the Patroni instances and their associated PostgreSQL® nodes form a Patroni cluster. In a healthy cluster, one Patroni instance is elected leader, and the associated PostgreSQL® instance is promoted to primary within the PostgreSQL® cluster.
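The per-node setup described above is driven by a local Patroni configuration file. The following is a minimal, illustrative sketch of such a file; all host names, ports, paths, and credentials are placeholders, not WinCC OA defaults:

```yaml
# Illustrative patroni.yml for one database server (values are placeholders)
scope: wincc_oa_ha            # cluster name, shared by both Patroni instances
name: db-node-1               # unique name of this node within the cluster

restapi:
  listen: 0.0.0.0:8008
  connect_address: db-node-1:8008

etcd3:                        # the DCS endpoints Patroni uses
  hosts: db-node-1:2379,db-node-2:2379,etcd-node-3:2379

postgresql:
  listen: 0.0.0.0:5432
  connect_address: db-node-1:5432
  data_dir: /var/lib/pgsql/data
  authentication:
    replication:
      username: replicator
      password: change_me
```

The second database server uses the same `scope` but its own `name` and addresses; the shared `scope` is what makes the two Patroni instances form one cluster.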

All other PostgreSQL® instances operate in standby mode (allowing read-only queries only) and continuously replicate the primary's write-ahead log (WAL) via streaming replication, applying the streamed transactions in near real time. This ensures that any standby node can be promoted quickly to primary within the PostgreSQL® cluster in case of a primary node or network failure.
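The replication behavior described above is governed by cluster-wide settings that Patroni keeps in the DCS and applies to every PostgreSQL® node. A hedged, illustrative fragment (example values, not WinCC OA defaults):

```yaml
# Illustrative fragment of the cluster-wide configuration Patroni stores
# in the DCS (values are examples only)
bootstrap:
  dcs:
    ttl: 30                     # leader key time-to-live in seconds
    loop_wait: 10               # seconds between Patroni housekeeping cycles
    postgresql:
      parameters:
        wal_level: replica      # required for streaming replication
        hot_standby: "on"       # standbys accept read-only queries
        max_wal_senders: 10     # concurrent WAL streaming connections
```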

Etcd plays a crucial role in this process: it serves as a consensus mechanism for leader election and stores the current runtime state and configuration of the Patroni cluster.

Each Patroni instance is responsible for initializing its associated PostgreSQL® node. Depending on its role, the Patroni instance uses:

  • initdb to initialize the primary node of the PostgreSQL® cluster
  • pg_basebackup to initialize a standby node from the primary of the PostgreSQL® cluster
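Both initialization paths above map to sections of the Patroni configuration. A minimal, illustrative sketch of the relevant keys:

```yaml
# Illustrative fragment: how Patroni is told to initialize nodes
bootstrap:
  initdb:                     # used once, on the node that becomes primary
    - encoding: UTF8
    - data-checksums

postgresql:
  create_replica_methods:     # used whenever a standby is (re)created
    - basebackup              # runs pg_basebackup against the current primary
```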

Finally, Patroni manages the promotion and demotion of PostgreSQL® nodes to primary and standby, respectively, according to the roles of the associated Patroni instances within the Patroni cluster. This enables automatic failover within the PostgreSQL® cluster and ensures its high availability.

etcd

Etcd is a distributed key-value store that provides a reliable mechanism for storing the data required by distributed systems. An etcd cluster consists of a preferably odd number of nodes, since leader election and quorum decisions require a majority. Etcd gracefully handles the election of the etcd leader node during network partitions or etcd node failures.

As shown in the figures Patroni Architecture - Single / Redundant Project below, the HAC deploys three etcd nodes, which together serve as the distributed configuration store (DCS) for Patroni. Each etcd node runs as a service on a separate machine to ensure high availability. The etcd cluster can tolerate the failure of one node and still maintain a majority quorum of 2 out of 3.
Note:
In general, the number of etcd or Patroni nodes can be increased as required; in the WinCC OA HAC, however, the number of etcd nodes is fixed at three.

To ensure stable and symmetric system behavior in the event of a WinCC OA server failure, no etcd node is allowed on either of the two WinCC OA servers. Therefore, in the HAC, two etcd nodes are distributed on the two database servers, while the third etcd node is hosted on a dedicated additional machine.
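This three-node placement is reflected in each etcd node's configuration file. A hedged, illustrative sketch for the node on the first database server; all host names and ports are placeholders:

```yaml
# Illustrative etcd configuration for one of the three nodes
# (host names and ports are placeholders)
name: etcd1
data-dir: /var/lib/etcd
listen-peer-urls: http://0.0.0.0:2380
listen-client-urls: http://0.0.0.0:2379
initial-advertise-peer-urls: http://db-server-1:2380
advertise-client-urls: http://db-server-1:2379
# Two members on the database servers, the third on the dedicated machine:
initial-cluster: etcd1=http://db-server-1:2380,etcd2=http://db-server-2:2380,etcd3=http://extra-node:2380
initial-cluster-state: new    # "new" when bootstrapping the cluster
```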

For more information on NGA in general, see the NGA documentation.