Split brain solution using pg_rewind in PostgreSQL

Summary
In this blog I will explain pg_rewind use case to solve split brain after a database failover happens.
Split Brain Syndrome
Split Brain is often used to describe the scenario when two or more nodes in a cluster, lose connectivity with one another but then continue to operate independently of each other.
Streaming replication
Streaming replication, a standard feature of PostgreSQL, allows the updated information on the primary server to be transferred to the standby server in real time, so that the databases of the primary server and standby server can be kept in sync.
Repmgr Extension
repmgr is an open-source toolset from 2ndQuadrant, a leading specialist in PostgreSQL-related technologies and services. The product is used to automate, enhance, and manage PostgreSQL streaming replication.
repmgrd (daemon)
repmgrd is a daemon that runs on each PostgreSQL node, monitoring the local node, and (unless it’s the primary node) the upstream server (the primary server or with cascading replication, another standby) which it’s connected to repmgrd can be configured to provide failover capability in case the primary upstream node becomes unreachable, and/or provide monitoring data to the repmgr meta database.
Database Architecture
I configured two nodes PostgreSQL database with streaming replication technique and installed repmgr tool to achieve auto failover in PostgreSQL. After that started repmgrd daemon process.
[postgres@fahad archive]$ repmgr daemon status ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen ----+----------------------+---------+-----------+----------------------+---------+-------+---------+-------------------- 1 | athar.postgres-1.com | primary | * running | | running | 60943 | no | n/a 2 | fahad.postgres-2.com | standby | running | athar.postgres-1.com | running | 70493 | no | 2 second(s) ago
Scenario:
I forcefully killed the postmaster process on the primary database to do the failover. After failover, the old primary was stopped, and the other node started working as primary. Then I restarted the old primary manually and found that both nodes’ roles were primary.
This is one example of how a split-brain syndrome occurs in PostgreSQL.
[postgres@fahad ~]$ repmgr cluster show --verbose NOTICE: using provided configuration file "/var/lib/pgsql/repmgr.conf" INFO: connecting to database ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+----------------------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------- 1 | athar.postgres-1.com | primary | ! running | | default | 100 | 1 | host=192.168.202.228 user=postgres dbname=postgres connect_timeout=2 2 | fahad.postgres-2.com | primary | * running | | default | 100 | 2 | host=192.168.202.229 user=postgres dbname=postgres connect_timeout=2 WARNING: following issues were detected - node "athar.postgres-1.com" (ID: 1) is running but the repmgr node record is inactive [postgres@athar ~]$ repmgr daemon status ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen ----+----------------------+---------+----------------------+----------+---------+------+---------+-------------------- 1 | athar.postgres-1.com | primary | * running | | running | 5084 | no | n/a 2 | fahad.postgres-2.com | standby | ! running as primary | | running | 2267 | no | n/a WARNING: following issues were detected - node "fahad.postgres-2.com" (ID: 2) is registered as standby but running as primary
Solution:
As mentioned above, we have seen a split brain between clusters. Both nodes work as primary and independent.
Here we need the old primary to rejoin the replica configuration and rewind that database to sync the log sequence from the current primary. This will discard any transactions that could have arrived at the old primary after the automatic switchover happened and was started. Before rewind, just stop all database services on node1.
rewind dry run to verify all pre-requisites are met
[postgres@athar ~]$ repmgr node rejoin -d 'host=192.168.202.229 dbname=postgres user=postgres' --config-files=/var/lib/pgsql/12/data/postgresql.conf --verbose --force-rewind --dry-run NOTICE: using provided configuration file "/var/lib/pgsql/repmgr.conf" INFO: replication slots in use, 2 free slots on node 20 INFO: replication connection to the rejoin target node was successful INFO: local and rejoin target system identifiers match DETAIL: system identifier is 6967965196868380905 NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2 DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 0/A000028 INFO: prerequisites for using pg_rewind are met INFO: temporary archive directory "/tmp/repmgr-config-archive-athar.postgres-1.com" created WARNING: specified file "/var/lib/pgsql/12/data//var/lib/pgsql/12/data/postgresql.conf" not found, skipping INFO: 0 files would have been copied to "/tmp/repmgr-config-archive-athar.postgres-1.com" INFO: temporary archive directory "/tmp/repmgr-config-archive-athar.postgres-1.com" deleted INFO: pg_rewind would now be executed DETAIL: pg_rewind command is: pg_rewind -D '/var/lib/pgsql/12/data' --source-server='host=192.168.202.229 user=postgres dbname=postgres connect_timeout=2' INFO: prerequisites for executing NODE REJOIN are met
Run rewind
[postgres@athar ~]$ repmgr node rejoin -d 'host=192.168.202.229 dbname=postgres user=postgres' --config-files=/var/lib/pgsql/12/data/postgresql.conf --verbose --force-rewind NOTICE: using provided configuration file "/var/lib/pgsql/repmgr.conf" NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2 DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 0/A000028 INFO: prerequisites for using pg_rewind are met WARNING: specified file "/var/lib/pgsql/12/data//var/lib/pgsql/12/data/postgresql.conf" not found, skipping INFO: 0 files copied to "/tmp/repmgr-config-archive-athar.postgres-1.com" NOTICE: executing pg_rewind DETAIL: pg_rewind command is "pg_rewind -D '/var/lib/pgsql/12/data' --source-server='host=192.168.202.229 user=postgres dbname=postgres connect_timeout=2'" NOTICE: 0 files copied to /var/lib/pgsql/12/data INFO: directory "/tmp/repmgr-config-archive-athar.postgres-1.com" deleted INFO: creating replication slot as user "postgres" NOTICE: setting node 1's upstream to node 2 WARNING: unable to ping "host=192.168.202.228 user=postgres dbname=postgres connect_timeout=2" DETAIL: PQping() returned "PQPING_NO_RESPONSE" NOTICE: starting server using "pg_ctl -w -D '/var/lib/pgsql/12/data' start" INFO: node "athar.postgres-1.com" (ID: 1) is pingable INFO: node "athar.postgres-1.com" (ID: 1) has attached to its upstream node NOTICE: NODE REJOIN successful DETAIL: node 1 is now attached to node 2
Cluster Status after rewind
[postgres@athar ~]$ repmgr daemon status ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen ----+----------------------+---------+-----------+----------------------+---------+------+---------+-------------------- 1 | athar.postgres-1.com | standby | running | fahad.postgres-2.com | running | 6345 | no | 0 second(s) ago 2 | fahad.postgres-2.com | primary | * running | | running | 5184 | no | n/a