Recently, we worked with a customer seeking assistance in monitoring their MySQL 5.7 InnoDB cluster. Following a network outage that caused a split-brain scenario with no writable primaries, they required a robust monitoring strategy to prevent future downtime.
While I initially expected to pull node roles (Primary vs. Secondary) directly from the performance_schema tables—a standard feature in version 8.0—it turns out this functionality was never back-ported to 5.7. Despite documentation suggesting otherwise, the MEMBER_ROLE column is missing in 5.7, leading me to file a bug report with Oracle for either a back-port or a documentation update.
Our primary strategy focuses on tracking the online node count. For a cluster to remain writable, it requires a quorum (at least two nodes, though three are preferred for fault tolerance).
Here is how a healthy 5.7 cluster appears in the MySQL Shell:
```
node1:3306 ssl JS > cluster.status();
{
    "clusterName": "innodb_cluster",
    "defaultReplicaSet": {
        "status": "OK",
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
        "topology": {
            "node1:3306": {
                "mode": "R/W",
                "role": "HA",
                "status": "ONLINE"
            },
            "node2:3306": {
                "mode": "R/O",
                "role": "HA",
                "status": "ONLINE"
            },
            "node3:3306": {
                "mode": "R/O",
                "role": "HA",
                "status": "ONLINE"
            }
        }
    }
}
```
In the performance_schema, this translates to the following:
```sql
SELECT * FROM performance_schema.replication_group_members;
```

| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_STATE |
| --- | --- | --- | --- |
| group_replication_applier | ...bcf1 | node2 | ONLINE |
| group_replication_applier | ...00ec | node1 | ONLINE |
| group_replication_applier | ...0c64 | node3 | ONLINE |
To simulate the customer's failure, I used iptables to drop traffic on the cluster communication adapter:
```shell
iptables -A INPUT -p tcp -i eth1 -j DROP
```
The error log immediately captures the split:
```
[Warning] Member with address node3:3306 has become unreachable.
[Note] Group membership changed to node2:3306, node1:3306.
```

To monitor this via SQL, use this count query:
```sql
SELECT COUNT(*)
FROM performance_schema.replication_group_members
WHERE MEMBER_STATE = 'ONLINE';
```
Alert Thresholds:

- 3 ONLINE members: healthy; the cluster can tolerate one node failure.
- 2 ONLINE members: warning; quorum still holds, but one more failure leaves the cluster without a writable primary.
- Fewer than 2 ONLINE members: critical; quorum is lost and the cluster is no longer writable.
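These thresholds can be sketched as a small POSIX-shell helper. The `alert_level` function and its names are hypothetical, and the thresholds assume a three-node cluster; the `mysql` invocation in the comment is a placeholder for your own connection settings:

```shell
#!/bin/sh
# Map the number of ONLINE members to an alert level (3-node cluster assumed).
alert_level() {
    count="$1"
    if [ "$count" -ge 3 ]; then
        echo "OK"         # full redundancy: can tolerate one failure
    elif [ "$count" -eq 2 ]; then
        echo "WARNING"    # quorum holds, but no further failures are tolerated
    else
        echo "CRITICAL"   # quorum lost: no writable primary
    fi
}

# In production, feed it the real count from the query above, e.g.:
# count=$(mysql -N -B -e "SELECT COUNT(*) FROM performance_schema.replication_group_members WHERE MEMBER_STATE = 'ONLINE'")
alert_level 2   # prints WARNING
```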
A more targeted approach is to verify that at least one PRIMARY node is active. Since InnoDB clusters can technically support multiple primaries (though not by default), we check for "at least one" instead of "exactly one."
You can identify the current primary by joining the global_status with replication_group_members:
```sql
SELECT COUNT(*)
FROM performance_schema.replication_group_members
WHERE member_id = (
        SELECT variable_value
        FROM performance_schema.global_status
        WHERE variable_name = 'group_replication_primary_member'
    )
    AND MEMBER_STATE = 'ONLINE';
```
A result of 1 confirms the current primary is online and functional.
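The same idea works as an automated check: wrap the query result in a helper that alerts when no online primary is found. This is a sketch; `check_primary` is my own naming, and exit code 2 follows the common Nagios-style CRITICAL convention rather than anything MySQL-specific:

```shell
#!/bin/sh
# Alert when the primary-verification query returns 0 (no online primary).
check_primary() {
    if [ "$1" -ge 1 ]; then
        echo "OK: an online primary exists"
    else
        echo "CRITICAL: no online primary"
        return 2    # Nagios-style CRITICAL exit code
    fi
}

# In production the argument would come from the query above, e.g.:
# n=$(mysql -N -B -e "SELECT COUNT(*) FROM performance_schema.replication_group_members WHERE ... AND MEMBER_STATE = 'ONLINE'")
check_primary 1   # prints "OK: an online primary exists"
```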
Because the cluster is designed to promote a new primary automatically as long as quorum exists, monitoring the cluster.status() JSON output is the most comprehensive method. Even for a 5.7 cluster, MySQL Shell 8.0 is recommended for its enhanced scripting capabilities.
Using mysqlsh combined with the jq command-line tool allows for precise status parsing:
| Cluster State | Command | Result |
| --- | --- | --- |
| Healthy | `mysqlsh root@127.0.0.1 -- cluster status \| jq .defaultReplicaSet.status` | `"OK"` |
| Degraded | `mysqlsh root@127.0.0.1 -- cluster status \| jq .defaultReplicaSet.status` | `"OK_NO_TOLERANCE"` |
Handling "NO_QUORUM" States:
When a cluster loses quorum, mysqlsh outputs warnings that can break JSON parsing. You can use grep to clean the output before passing it to jq:
```shell
mysqlsh root@127.0.0.1 -- cluster status \
    | grep -v WARNING \
    | grep -v 'Write operations on the InnoDB cluster will not be allowed.' \
    | jq .defaultReplicaSet.status
```
Result: "NO_QUORUM"
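Putting this together, the jq output can be classified into the alerting buckets above. The `classify` helper is a hypothetical sketch; note that without jq's `-r` flag the status value keeps its JSON quotes, which the case patterns below account for:

```shell
#!/bin/sh
# Classify defaultReplicaSet.status as reported by jq (quoted JSON strings).
classify() {
    case "$1" in
        '"OK"')                              echo "healthy" ;;
        '"OK_PARTIAL"'|'"OK_NO_TOLERANCE"')  echo "degraded" ;;
        '"NO_QUORUM"')                       echo "critical" ;;
        *)                                   echo "unknown" ;;
    esac
}

# In production the status would come from the pipeline above, e.g.:
# status=$(mysqlsh root@127.0.0.1 -- cluster status | grep -v WARNING \
#     | grep -v 'Write operations on the InnoDB cluster will not be allowed.' \
#     | jq .defaultReplicaSet.status)
classify '"NO_QUORUM"'   # prints critical
```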
By implementing one or more of these three options—Node Counting, Primary Verification, or JSON Status Parsing—you can gain reliable insights into your InnoDB cluster's health and ensure your monitoring system alerts you before a total partition occurs.