
MySQL crashes on DDL statement: a lesson on purge threads

Written by Pythian Marketing | Dec 5, 2017 5:00:00 AM

Restructuring your database environments can be a high-stakes operation. Recently, I’ve encountered several reports of DDL activity causing MySQL crash scenarios. These range from dropping multiple databases in rapid succession to the more complex case of large-scale table migrations during version upgrades.

The Case Study: Upgrading to MySQL 5.7

In a recent upgrade to MySQL 5.7, we encountered a significant failure while running mysql_upgrade. The process triggered an ALTER TABLE ... FORCE on a 2.2 TB table to rebuild it in the newer temporal data format, which supports fractional-second (microsecond) precision.
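The rebuild that mysql_upgrade triggers is equivalent to running a statement like the following against each affected table (the table name here is hypothetical):

```sql
-- FORCE performs a full table rebuild even with no schema change,
-- rewriting every row in the current row/temporal format. On a
-- multi-terabyte table this means rewriting the entire table --
-- and generating a comparable volume of redo log.
ALTER TABLE orders_history FORCE;
```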

The issue appeared at a critical juncture: the intermediate table had been completely filled with data, and MySQL was attempting to swap the existing table for the new one. Instead of a smooth transition, the system stalled and eventually crashed.

Identifying the Symptoms in the Error Log

A review of the InnoDB monitor output in the error log revealed a fatal bottleneck. The diagnostic info highlighted three specific warnings:

  1. Redo Log Capacity Exceeded: [ERROR] InnoDB: The age of the last checkpoint is 379157140, which exceeds the log group capacity 377483674
  2. Long Semaphore Waits: Warnings indicated that threads were waiting for an X-lock on an RW-latch (created in dict0dict.cc).
  3. Intentional Crash: [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server...
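These symptoms can be observed live while a long-running DDL is in flight, using standard InnoDB monitoring commands (generic commands, not specific to this incident):

```sql
-- The LOG section shows the checkpoint age; the SEMAPHORES section
-- shows long RW-latch waits like the dict0dict.cc one above.
SHOW ENGINE INNODB STATUS\G

-- Compare the reported checkpoint age against total redo capacity,
-- roughly innodb_log_file_size * innodb_log_files_in_group.
SHOW VARIABLES LIKE 'innodb_log_file%';
```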

Root Cause Analysis: Purge Events and Checkpoints

The key takeaway from the log is that a purge event (row0purge.cc) on the data dictionary (dict0dict.cc) produced a semaphore wait that exceeded the 600-second threshold. When InnoDB detects a semaphore wait longer than 10 minutes, it assumes the server is hung and intentionally crashes it to prevent data corruption or further inconsistency.

Furthermore, the checkpoint age warning indicated that redo log accumulation was exceeding InnoDB’s ability to flush changes to persistent storage. This created a massive I/O bottleneck during the table swap phase.

Strategic Solutions to Prevent DDL Crashes

To resolve this and ensure the ALTER TABLE could complete successfully, we implemented three major changes:

1. Expanding Redo Log Capacity

We increased the redo log space to 8 GB. Initial testing indicated that an ALTER TABLE ... FORCE on a 2.2 TB table writes approximately that much data to the redo log. Providing this "breathing room" prevents the checkpoint age from exceeding the log group capacity.
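A minimal my.cnf sketch for 8 GB of total redo space (the file split is illustrative; total capacity is innodb_log_file_size times innodb_log_files_in_group):

```ini
# In 5.7 this change requires a clean shutdown and restart,
# after which InnoDB recreates the log files at the new size.
[mysqld]
innodb_log_file_size      = 4G
innodb_log_files_in_group = 2
```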

2. Optimizing Memory and Buffer Pool

We increased both the overall system memory and the InnoDB Buffer Pool size. This allowed for faster data processing and reduced the time required for the intermediate table to be populated and swapped.
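In MySQL 5.7 the buffer pool can also be resized online, so the increase itself does not require a restart. A sketch, with an illustrative size:

```sql
-- 5.7+ resizes the buffer pool online, in increments of
-- innodb_buffer_pool_chunk_size; the 64 GB value is illustrative
-- and should be sized to the host's available memory.
SET GLOBAL innodb_buffer_pool_size = 64 * 1024 * 1024 * 1024;
```

To persist the new size across restarts, the same value still needs to be set in my.cnf.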

3. Tuning the Purge Threads

The most critical change was adjusting the innodb_purge_threads variable. While MySQL 5.7 defaults this to 4, we reduced it to 1.

Expert Tip: According to the MySQL Reference Manual, you should keep this setting low so that purge threads do not contend with each other for access to busy tables.

Reducing the threads eliminated the contention that caused the 600-second semaphore wait on the data dictionary, allowing the table swap to complete without triggering a fatal error.
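Because innodb_purge_threads is read-only at runtime in 5.7, the change goes in my.cnf and takes effect on the next restart:

```ini
# Revert to the 5.6 default of a single purge thread to avoid
# contention on the data dictionary during heavy DDL.
[mysqld]
innodb_purge_threads = 1
```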

Conclusion

The takeaway: if you are running MySQL 5.7, consider changing innodb_purge_threads from its new default value of 4 back to the 5.6 default of 1 to avoid purge thread contention, unless you have a strong need for multiple purge threads. This also raises the question of why the default was changed from 1 to 4 in 5.7, since under heavy DDL it can behave as an unsafe value.
