Replication Issues: Never purge logs before slave catches them!!

May 10, 2011 / By Francisco Bordenave

Tags: , , ,

A few days ago one of our customers contact us to inform a problem with one of their replication servers.

This server was reporting this issue:

Last_Error: Could not parse relay log event entry. The possible reasons are: the master’s binary log is corrupted (you can check this by running ‘mysqlbinlog’ on the binary log), the slave’s relay log is corrupted (you can check this by running ‘mysqlbinlog’ on the relay log), a network problem, or a bug in the master’s or slave’s MySQL code. If you want to check the master’s binary log or slave’s relay log, you will be able to know their names by issuing ‘SHOW SLAVE STATUS’ on this slave.

After brief research we found the customer had deleted some binary logs from the master and relay logs from slave to release space since they were having space issues.

The customer requested us to get slave working again without affecting the production environment.

Our goal here was to figure out how to restore replication without going through an entire rebuild of slave, although this may have been the easiest solution, it had been necessary to lock the master to take a good backup, restore the binary logging, and then restore the slave of a proper replication point.

This kind of things starts to be complicated when we talk about more than 1TB to rebuild.

Unfortunately customer didn’t backed up relay-logs, but fortunately customer was backed up binary logs so that would be our starting point.

First of all, we had to get the slave status to know when exactly it stops replication:

mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: xxx.xxxxx.xxx
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.003079
Read_Master_Log_Pos: 39705328
Relay_Log_File: xxxxxxxxxxxx-relay-bin.002256
Relay_Log_Pos: 4
Relay_Master_Log_File: mysql-bin.002671
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
Skip_Counter: 0
Exec_Master_Log_Pos: 1073741674
Relay_Log_Space: 438126861751
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

Ok, with that information in our minds we could start with modifications, and this should be our action plan:

  1. Restore backed up binary logs in Master.
  2. Modify mysql-bin.index file in Master to make binary logs “visible”
  3. Reset slave to restart replication information.
  4. Restart replication.

All of this sound easy but each steps had it’s own complexity. Let’s take a look on each one:

  1. Customer doesn’t have enough space in default folder to restore all deleted binary logs (about 250GB) so they had to mount a NFS folder.
  2. There isn’t a good documented way to refresh cached information about binary logs reported with SHOW BINARY LOGS statement.
  3. Easy one.
  4. Easy one.

Now a deepest explanation of each issue we’ve found during our work:

  1. Once we had the list of each recovered binary log (wich was from mysql-bin.002671 to mysql-bin.002877) we moved to next step.
  2. We modified mysql-bin.index file to include lost binaries. After finish it looks this way:

[pythian@oasdb data]$ cat mysql-bin.index

/XXXX_logs/Logs/mysql-bin.002671

/XXXX_logs/Logs/mysql-bin.002672

….

./mysql-bin.002878

./mysql-bin.002879

./mysql-bin.002880

… and so on

Notice we have to add absolute path to some location since they weren’t in the expected folder

We’ve checked if changes takes effect:

mysql> show binary logs;
+------------------+------------+
| Log_name         | File_size  |
+------------------+------------+
| mysql-bin.002671 | 1073742130 |
| mysql-bin.002672 | 1073741934 |
| mysql-bin.002673 | 1073742001 |
| mysql-bin.002674 | 1073742857 |
| mysql-bin.002675 | 1073741944 |
...and so on

At this point thing looks good so we continued with steps 3 and 4 and try to get replication back.

We took notes about master information:

[pythian@XXXslave data]$cat master.info

Password:
14
mysql-bin.003079
160765462
XXXX.XXXXXX.XXX
rep1
pwd
3306
60
0

And start with slave reset:

mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: xxx.xxxxx.xxx
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.003079
Read_Master_Log_Pos: 160765462
Relay_Log_File: xxxxxxxxxxxx-relay-bin.002256
Relay_Log_Pos: 4
Relay_Master_Log_File: mysql-bin.002671
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1073741674
Relay_Log_Space: 438247922177
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
mysql> reset slave;
Query OK, 0 rows affected, 244 warnings (6 min 47.97 sec)
mysql> show warnings;
+——-+——+——————————————————————–+
| Level | Code | Message |
+——-+——+——————————————————————–+
| Error | 6 | Error on delete of ‘./xxxxxxxxxxxx-relay-bin.002256′ (Errcode: 2) |
| Error | 6 | Error on delete of ‘./xxxxxxxxxxxx-relay-bin.002257′ (Errcode: 2) |
| Error | 6 | Error on delete of ‘./xxxxxxxxxxxx-relay-bin.002258′ (Errcode: 2) |

and so on, which were expected warnings since relay logs were deleted.

Continuing changes:

mysql> change master to master_host ='xxxx.xxxxxx.xxx', master_user='repl', master_password='pwd', master_log_file='mysql-bin.002671', master_log_pos=1073741674;

Query OK, 0 rows affected (0.16 sec)
mysql> START SLAVE;
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: oxxxx.xxxxxx.xxx
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.002671
Read_Master_Log_Pos: 1073741674
Relay_Log_File: xxxxxxxxxxxx-relay-bin.000001
Relay_Log_Pos: 98
Relay_Master_Log_File: mysql-bin.002671
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1073741674
Relay_Log_Space: 98
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

1 row in set (0.00 sec)Something doesn’t work as expected.

[pythian@xxxxxxxxxxxx data]$ tail -20 xxxxxxxxxxxx.xxxxx.xxx.err

100929 11:21:47 [Note] Slave SQL thread initialized, starting replication in log ‘mysql-bin.002671′ at position 1073741674, relay log ‘./xxxxxxxxxxxx-relay-bin.000001′ position: 4
100929 11:21:47 [Note] Slave I/O thread: connected to master ‘repl@oxxxx.xxxxxx.xxx:3306′, replication started in log ‘mysql-bin.002671′ at position 1073741674
100929 11:21:47 [ERROR] Error reading packet from server: Could not find first log file name in binary log index file ( server_errno=1236)
100929 11:21:47 [ERROR] Got fatal error 1236: ‘Could not find first log file name in binary log index file’ from master when reading data from binary log
100929 11:21:47 [Note] Slave I/O thread exiting, read up to log ‘mysql-bin.002671′, position 1073741674
100929 11:22:08 [Note] Error reading relay log event: slave SQL thread was killed

This errors tell us references to a binary log on master was missed. The weirdest part here is when we perform a SHOW BINARY LOGS statement in master it actually showed us the binaries we wanted to use, so at this point we were stuck in a non-sense point because some cached information in master about binary logs needs to be refreshed dynamically (we just couldn’t restart master).

Solution for this was dig into source code to know how MySQL deal with logs, we found this interesting function (funny part it red)

-- void MYSQL_LOG::new_file(bool need_lock)
{
  char new_name[FN_REFLEN], *new_name_ptr, *old_name;
  enum_log_type save_log_type;

  DBUG_ENTER("MYSQL_LOG::new_file");
  if (!is_open())
  {
    DBUG_PRINT("info",("log is closed"));
    DBUG_VOID_RETURN;
  }

  if (need_lock)
    pthread_mutex_lock(&LOCK_log);
  pthread_mutex_lock(&LOCK_index);

  safe_mutex_assert_owner(&LOCK_log);
  safe_mutex_assert_owner(&LOCK_index);

  /*
    if binlog is used as tc log, be sure all xids are "unlogged",
    so that on recover we only need to scan one - latest - binlog file
    for prepared xids. As this is expected to be a rare event,
    simple wait strategy is enough. We're locking LOCK_log to be sure no
    new Xid_log_event's are added to the log (and prepared_xids is not
    increased), and waiting on COND_prep_xids for late threads to
    catch up.
  */
  if (prepared_xids)
  {
    tc_log_page_waits++;
    pthread_mutex_lock(&LOCK_prep_xids);
    while (prepared_xids)
      pthread_cond_wait(&COND_prep_xids, &LOCK_prep_xids);
    pthread_mutex_unlock(&LOCK_prep_xids);
  }

  /* Reuse old name if not binlog and not update log */
  new_name_ptr= name;

  /*
    If user hasn't specified an extension, generate a new log name
    We have to do this here and not in open as we want to store the
    new file name in the current binary log file.
  */
  if (generate_new_name(new_name, name))
    goto end;
  new_name_ptr=new_name;

  if (log_type == LOG_BIN)
  {
    if (!no_auto_events)
    {
      /*
        We log the whole file name for log file as the user may decide
        to change base names at some point.
      */
      THD *thd = current_thd; /* may be 0 if we are reacting to SIGHUP */
      Rotate_log_event r(thd,new_name+dirname_length(new_name),
                         0, LOG_EVENT_OFFSET, 0);
      r.write(&log_file);
      bytes_written += r.data_written;
    }
    /*
      Update needs to be signalled even if there is no rotate event
      log rotation should give the waiting thread a signal to
      discover EOF and move on to the next log.
    */
    signal_update();
  }
  old_name=name;
  save_log_type=log_type;
  name=0;    // Don't free name
close(LOG_CLOSE_TO_BE_OPENED | LOG_CLOSE_INDEX);

  /*
     Note that at this point, log_type != LOG_CLOSED (important for is_open()).
  */

  /*
     new_file() is only used for rotation (in FLUSH LOGS or because size >
     max_binlog_size or max_relay_log_size).
     If this is a binary log, the Format_description_log_event at the beginning of
     the new file should have created=0 (to distinguish with the
     Format_description_log_event written at server startup, which should
     trigger temp tables deletion on slaves.
  */

  /* reopen index binlog file, BUG#34582 */
 if (!open_index_file(index_file_name, 0))
    open(old_name, save_log_type, new_name_ptr,
         io_cache_type, no_auto_events, max_size, 1);
  my_free(old_name,MYF(0));

end:
 if (need_lock)
    pthread_mutex_unlock(&LOCK_log);
  pthread_mutex_unlock(&LOCK_index);

  DBUG_VOID_RETURN;
}

This function is called by FLUSH LOGS command, which close and re opens all MySQL logs (including binaries) and refresh cached binary index. Additionally before perform this command we went through these steps:

  1. Moved the last log that slaves catches (mysql-bin.002671) to it’s original position.
  2. Modified again the mysql-bin.index to point this file to proper path.
  3. Perform FLUSH LOGS and FLUSH PRIVILEGES command to force server to refresh the information about binary logs and clients.
  4. Went to slave and restart replication (and cross fingers of course)

This was the result:

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: oxxxx.xxxxxx.xxx
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.002672
Read_Master_Log_Pos: 563985312
Relay_Log_File: xxxxxxxxxxxx-relay-bin.000005
Relay_Log_Pos: 131915515
Relay_Master_Log_File: mysql-bin.002672
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 131915378
Relay_Log_Space: 563985741
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 946273
1 row in set (0.00 sec)

Which show us a better picture.

Conclusion: this issue was caused by a customer error, I mean, is not a good idea delete logs that wasn’t cached by slave. In case this is needed at least be sure to keep relay logs in slave to keep replication working.

Logs should be deleted ONLY with PURGE BINARY LOGS command instead of delete them using the OS commands.

Dig into source code could give you a good approach of some MySQL internals and usually it will helps more than on line documentation.

And last, but not least, I would like to say thanks to Marco Tusa for his great support in code review.

One Response to “Replication Issues: Never purge logs before slave catches them!!”

  • Rob says:

    This is brilliant… nice workaround to avoid having to bounce the master. I understand the need for FLUSH LOGS, but why do you also include FLUSH PRIVILEGES?

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>