Database replication is quite an important component of the database architectures of many of our clients. Managed properly, it offers real-time redundancy and fault-tolerance, and allows for increased scalability. When replication goes wrong, however, it can turn into a morass of data conflicts, lag and strange little mysteries to distract your DBA team.
PalominoDB recently helped a client out of one of these strange little replication mysteries, and we document it here to help you avoid getting stuck in the same sort of situation.
The symptoms: Our slave database (which we’ll call DB_Slave) was lagging. The logs revealed that the value of Seconds_Behind_Master was jumping from 0 to a random high value, and then back to 0.
What we knew: DB_Slave connects to the master (which we’ll call DB_Master), and DB_Master sends binlog events to DB_Slave.
Because this data is pushed from the master rather than pulled from the slave, there was a small window where, in theory, a replication channel could be broken, and DB_Slave wouldn’t notice until “slave-net-timeout” seconds (with a default of 3600) passed.
We also knew that Seconds_Behind_Master is calculated not as the difference between the current datetime and the datetime of the binlog event being processed, but as the difference between the time in master binlog and current executed relay log. That means that if DB_Master is not sending binlog events to DB_Slave (for example, due to a network issue), the slave could be lagging behind the master but not be aware of it at all.
Further, if DB_Slave then received some binlog events from DB_Master, and realized at that point that it was lagging, it could still fail to notice if DB_Master once again stopped pushing binlog events.
Investigation: Our theory that there was in fact a network issue was supported by some evidence from the logs. We saw that SHOW SLAVE STATUS didn’t show any progress in Master_Log_File and Read_Master_Log_Pos : that is, DB_Slave was not getting any data from DB_Master.
We also found this in the processlist of DB_Master:
104062295 slave_user [IP Address]:10124 NULL Binlog Dump 1140 Writing to net NULL
The status “Writing to net” means that DB_Master is sending (or trying to send) data to DB_Slave.
We suspected this might be an issue related to network bandwidth, but when we ran a series of STOP SLAVE / START SLAVE processes, we noticed that the number of threads in status “Writing to net” was increasing, with up to 20 concurrent open threads. Connections simply weren’t being closed. Further investigation revealed that those connections were staying in CLOSE_WAIT status.
Based on our observations we concluded that the limited network bandwidth was at fault, and we needed a solution to compress the transmission between the master and slave databases. Luckily MySQL provides a convenient parameter, which is dynamic and is set on the slave: slave_compressed_protocol=1
mysql> show global variables like ‘slave_compressed_protocol’;
| Variable_name | Value |
| slave_compressed_protocol | OFF |
1 row in set (0.00 sec)
mysql> stop slave;
mysql> set global slave_compressed_protocol=1;
Query OK, 0 rows affected (0.00 sec)
mysql> start slave;
| Variable_name | Value |
| slave_compressed_protocol | ON |
Resolution: We set slave_compressed_protocol=1 and restarted the slave. Replication started catching up at a very surprising speed. Even though it was catching up, we noticed the same behavior we’d noticed in the past: Seconds_Behind_Master was jumping from 0 to a random high value, and then back to 0, the io_thread was behind, and we didn’t see a high load on the server.
As you can see in these graphs, network traffic did not increase much, but CPU usage and commands/sec had surprisingly high jumps.
In short, DB_Slave was processing more queries because the IO thread was receiving more data from DB_Master, thanks to compression.
Interested in working with Rene? Schedule a tech call.