Replication Issues: Never purge logs before slave catches them!!
A few days ago one of our customers contact us to inform a problem with one of their replication servers. This server was reporting this issue: Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. After brief research we found the customer had deleted some binary logs from the master and relay logs from slave to release space since they were having space issues. The customer requested us to get slave working again without affecting the production environment. Our goal here was to figure out how to restore replication without going through an entire rebuild of slave, although this may have been the easiest solution, it had been necessary to lock the master to take a good backup, restore the binary logging, and then restore the slave of a proper replication point. This kind of things starts to be complicated when we talk about more than 1TB to rebuild. Unfortunately customer didn't backed up relay-logs, but fortunately customer was backed up binary logs so that would be our starting point. First of all, we had to get the slave status to know when exactly it stops replication:
mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: xxx.xxxxx.xxx Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.003079 Read_Master_Log_Pos: 39705328 Relay_Log_File: xxxxxxxxxxxx-relay-bin.002256 Relay_Log_Pos: 4 Relay_Master_Log_File: mysql-bin.002671 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Skip_Counter: 0 Exec_Master_Log_Pos: 1073741674 Relay_Log_Space: 438126861751 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL 1 row in set (0.00 sec)
Ok, with that information in our minds we could start with modifications, and this should be our action plan:
-
Restore backed up binary logs in Master.
-
Modify mysql-bin.index file in Master to make binary logs “visible”
-
Reset slave to restart replication information.
-
Restart replication.
-
Customer doesn't have enough space in default folder to restore all deleted binary logs (about 250GB) so they had to mount a NFS folder.
-
There isn't a good documented way to refresh cached information about binary logs reported with SHOW BINARY LOGS statement.
-
Easy one.
-
Easy one.
-
Once we had the list of each recovered binary log (wich was from mysql-bin.002671 to mysql-bin.002877) we moved to next step.
-
We modified mysql-bin.index file to include lost binaries. After finish it looks this way:
mysql> show binary logs; +------------------+------------+ | Log_name | File_size | +------------------+------------+ | mysql-bin.002671 | 1073742130 | | mysql-bin.002672 | 1073741934 | | mysql-bin.002673 | 1073742001 | | mysql-bin.002674 | 1073742857 | | mysql-bin.002675 | 1073741944 | ...and so onAt this point thing looks good so we continued with steps 3 and 4 and try to get replication back. We took notes about master information: [pythian@XXXslave data]$cat master.info Password: 14 mysql-bin.003079 160765462 XXXX.XXXXXX.XXX rep1 pwd 3306 60 0 And start with slave reset: mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Master_Host: xxx.xxxxx.xxx Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.003079 Read_Master_Log_Pos: 160765462 Relay_Log_File: xxxxxxxxxxxx-relay-bin.002256 Relay_Log_Pos: 4 Relay_Master_Log_File: mysql-bin.002671 Slave_IO_Running: No Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 1073741674 Relay_Log_Space: 438247922177 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL 1 row in set (0.00 sec) mysql> reset slave; Query OK, 0 rows affected, 244 warnings (6 min 47.97 sec) mysql> show warnings; +-------+------+--------------------------------------------------------------------+ | Level | Code | Message | +-------+------+--------------------------------------------------------------------+ | Error | 6 | Error on delete of './xxxxxxxxxxxx-relay-bin.002256' (Errcode: 2) | | Error | 6 | Error on delete of './xxxxxxxxxxxx-relay-bin.002257' (Errcode: 2) | | Error | 6 | Error on delete of './xxxxxxxxxxxx-relay-bin.002258' (Errcode: 2) | and so on, which were expected warnings since relay logs were deleted.
Continuing changes: mysql> change master to master_host ='xxxx.xxxxxx.xxx', master_user='repl', master_password='pwd', master_log_file='mysql-bin.002671', master_log_pos=1073741674; Query OK, 0 rows affected (0.16 sec) mysql> START SLAVE; Query OK, 0 rows affected (0.00 sec) mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Master_Host: oxxxx.xxxxxx.xxx Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.002671 Read_Master_Log_Pos: 1073741674 Relay_Log_File: xxxxxxxxxxxx-relay-bin.000001 Relay_Log_Pos: 98 Relay_Master_Log_File: mysql-bin.002671 Slave_IO_Running: No Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 1073741674 Relay_Log_Space: 98 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL 1 row in set (0.00 sec)1 row in set (0.00 sec)Something doesn't work as expected. [pythian@xxxxxxxxxxxx data]$ tail -20 xxxxxxxxxxxx.xxxxx.xxx.err ... 100929 11:21:47 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.002671' at position 1073741674, relay log './xxxxxxxxxxxx-relay-bin.000001' position: 4 100929 11:21:47 [Note] Slave I/O thread: connected to master 'repl@oxxxx.xxxxxx.xxx:3306', replication started in log 'mysql-bin.002671' at position 1073741674 100929 11:21:47 [ERROR] Error reading packet from server: Could not find first log file name in binary log index file ( server_errno=1236) 100929 11:21:47 [ERROR] Got fatal error 1236: 'Could not find first log file name in binary log index file' from master when reading data from binary log 100929 11:21:47 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.002671', position 1073741674 100929 11:22:08 [Note] Error reading relay log event: slave SQL thread was killed This errors tell us references to a binary log on master was missed. The weirdest part here is when we perform a SHOW BINARY LOGS statement in master it actually showed us the binaries we wanted to use, so at this point we were stuck in a non-sense point because some cached information in master about binary logs needs to be refreshed dynamically (we just couldn't restart master). Solution for this was dig into source code to know how MySQL deal with logs, we found this interesting function (funny part it red)
-- void MYSQL_LOG::new_file(bool need_lock)
{
char new_name[FN_REFLEN], *new_name_ptr, *old_name;
enum_log_type save_log_type;
DBUG_ENTER("MYSQL_LOG::new_file");
if (!is_open())
{
DBUG_PRINT("info",("log is closed"));
DBUG_VOID_RETURN;
}
if (need_lock)
pthread_mutex_lock(&LOCK_log);
pthread_mutex_lock(&LOCK_index);
safe_mutex_assert_owner(&LOCK_log);
safe_mutex_assert_owner(&LOCK_index);
/*
if binlog is used as tc log, be sure all xids are "unlogged",
so that on recover we only need to scan one - latest - binlog file
for prepared xids. As this is expected to be a rare event,
simple wait strategy is enough. We're locking LOCK_log to be sure no
new Xid_log_event's are added to the log (and prepared_xids is not
increased), and waiting on COND_prep_xids for late threads to
catch up.
*/
if (prepared_xids)
{
tc_log_page_waits++;
pthread_mutex_lock(&LOCK_prep_xids);
while (prepared_xids)
pthread_cond_wait(&COND_prep_xids, &LOCK_prep_xids);
pthread_mutex_unlock(&LOCK_prep_xids);
}
/* Reuse old name if not binlog and not update log */
new_name_ptr= name;
/*
If user hasn't specified an extension, generate a new log name
We have to do this here and not in open as we want to store the
new file name in the current binary log file.
*/
if (generate_new_name(new_name, name))
goto end;
new_name_ptr=new_name;
if (log_type == LOG_BIN)
{
if (!no_auto_events)
{
/*
We log the whole file name for log file as the user may decide
to change base names at some point.
*/
THD *thd = current_thd; /* may be 0 if we are reacting to SIGHUP */
Rotate_log_event r(thd,new_name+dirname_length(new_name),
0, LOG_EVENT_OFFSET, 0);
r.write(&log_file);
bytes_written += r.data_written;
}
/*
Update needs to be signalled even if there is no rotate event
log rotation should give the waiting thread a signal to
discover EOF and move on to the next log.
*/
signal_update();
}
old_name=name;
save_log_type=log_type;
name=0; // Don't free name
close(LOG_CLOSE_TO_BE_OPENED | LOG_CLOSE_INDEX);
/*
Note that at this point, log_type != LOG_CLOSED (important for is_open()).
*/
/*
new_file() is only used for rotation (in FLUSH LOGS or because size >
max_binlog_size or max_relay_log_size).
If this is a binary log, the Format_description_log_event at the beginning of
the new file should have created=0 (to distinguish with the
Format_description_log_event written at server startup, which should
trigger temp tables deletion on slaves.
*/
/* reopen index binlog file, BUG#34582 */
if (!open_index_file(index_file_name, 0))
open(old_name, save_log_type, new_name_ptr,
io_cache_type, no_auto_events, max_size, 1);
my_free(old_name,MYF(0));
end:
if (need_lock)
pthread_mutex_unlock(&LOCK_log);
pthread_mutex_unlock(&LOCK_index);
DBUG_VOID_RETURN;
}
This function is called by FLUSH LOGS command, which close and re opens all MySQL logs (including binaries) and refresh cached binary index. Additionally before perform this command we went through these steps:
-
Moved the last log that slaves catches ( mysql-bin.002671) to it's original position.
-
Modified again the mysql-bin.index to point this file to proper path.
-
Perform FLUSH LOGS and FLUSH PRIVILEGES command to force server to refresh the information about binary logs and clients.
-
Went to slave and restart replication (and cross fingers of course)
Share this
Previous story
← What's in your Exadata Smart Flash Cache?
You May Also Like
These Related Stories
When SHOW SLAVE STATUS and the error log Disagree
When SHOW SLAVE STATUS and the error log Disagree
Apr 25, 2008
3
min read
MySQL Recipes: Promoting a Slave to Master or Changing Masters
MySQL Recipes: Promoting a Slave to Master or Changing Masters
Dec 8, 2006
2
min read
How to verify if a slave running in MySQL 5.7
How to verify if a slave running in MySQL 5.7
Jul 18, 2016
1
min read
No Comments Yet
Let us know what you think