Pythian has openings for MySQL and MS SQL Server DBAs in each of our offices in Ottawa, Canada; Boston, USA; Dubai, UAE; and Hyderabad, India. If you are a MySQL and/or SQL Server DBA and would like to evaluate this opportunity, please send us your résumé with an introductory paragraph to hr@pythian.com.

When SHOW SLAVE STATUS lies

April 6th, 2008 - by Paul Moen

Over-the-Top Tales from the Trenches.
Motto: Bringing order to the chaos of every day DBA life.

So you have got your nice MySQL Master-Slave replication pair setup. Everything is sweet, then the master dies/restarts or you have a slightly extended network outage.

Your monitoring software (in our case Avail) fires off a page and you are rudely interrupted from reading the Pythian blog.

These real world interruptions, what can I say… it pays the bills.

Anyway being the rounded DBA or enlightened DBA as Babette would say, you are capable of handling any type of database. You log into the machine and check out why the slave threw an error or if your monitoring is slow, why the slave is lagging by 2 hours.

You run SHOW SLAVE STATUS\G

mysql> show slave status \G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: x.x.x.x
                Master_User: rep
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: mysql-bin.000480
        Read_Master_Log_Pos: 690470773
             Relay_Log_File: db2-relay-bin.000028
              Relay_Log_Pos: 683977007
      Relay_Master_Log_File: mysql-bin.000480
           Slave_IO_Running: Yes
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB: avail,avail
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error: Could not parse relay log event entry. The possible reasons are: the master’s binary log is corrupted
                 		(you can check this by running ‘mysqlbinlog’ on the binary log), the slave’s relay log is corrupted
                 		(you can check this by running ‘mysqlbinlog’ on the relay log), a network problem,
                 		or a bug in the master’s or slave’s MySQL code. If you want to check the master’s binary log or slave’s relay log,
                 		you will be able to know their names by issuing ‘SHOW SLAVE STATUS’ on this slave.
               Skip_Counter: 0
        Exec_Master_Log_Pos: 126
            Relay_Log_Space: 690471192
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

The bits which are important are in red.

Now the normal procedure for this kind of error is to get the Exec_Master_Log_Pos and tell the slave to restart from there. Using a command like:

change master to Master_Log_File='mysql-bin.000480', Master_Log_Pos=126;
start slave;

In this case you can kiss goodbye to your slave as a consistent copy of your master database. SHOW SLAVE STATUS has basically lied through its teeth. The master_log_pos is not 126.

To quote my mate Yannick:

this is the error log file entries when the problem happened:

080206 16:03:48 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
080206 16:03:48 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000480′ position 683976897
080206 16:03:48 [Note] Slave: connected to master 'rep@x.x.x.x:3306',replication resumed in log 'mysql-bin.000480' at position 683976897
080206 16:04:31 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 1966107, event_type: 16
080206 16:04:31 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
080206 16:04:31 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted
(you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running
'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master'
s binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0
080206 16:04:31 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE
START". We stopped at log ‘mysql-bin.000480′ position 126

Comparing both output, “show slave status” tells us that “I’m stuck at …”

Master_Log_File: mysql-bin.000480
Exec_Master_Log_Pos: 126

while the error log file says at first that it is stuck at:

log 'mysql-bin.000480' position 683976897

then later in the error log is says it is stuck at:

mysql-bin.000480' position 126

So If I had to choose I would choose to restart the slave at log ‘mysql-bin.000480′ position 683976897.
If not you are going to try to re-run a full binlog that has been already applied.

Yannick’s Rules for dodgy master log positions when a slave gets stuck on the relay log corruption:

  1. If the position is less then let says 500, suspect that the position is wrong (it might be right but it needs to be checked out).
  2. Always go and check the MySQL error log to see what it has to say about position.
  3. If the position in the error log as well as in the “show slave status” agree more or less (both are a big number), then normally I would pick the one from “show slave status”.
  4. If not, then I would pick the one in the error log.

NOTE: you must understand that it is possible for the slave to be stuck at a small position.
This would happen if the master did change binlog and the slave got stuck just after the switch of the master binlog, so you have to be careful.

Have Fun

Paul

Bookmark online using:These icons link to social bookmarking sites where readers can share and discover new web pages.
  • del.icio.us
  • digg
  • Reddit
  • Spurl
  • Furl
  • blogmarks

Tags:

4 Responses to “When SHOW SLAVE STATUS lies”

  1. kamiar kanani Says:

    I think these are tow different errors, one inside I/O thread and another in sql thread. error related to I/O thread occurred at :
    ‘mysql-bin.000480′ position 683976897
    and error related to SQL thread at :
    ‘mysql-bin.000480′ position 126

  2. Harrison Fisk Says:

    What version was this on?

    I recently worked with the replication team to find and fix bugs of this nature. The resulting fixes were pushed into 5.0.54 and 5.1.23, and I haven’t seen one case of SHOW SLAVE STATUS being wrong in this manner since then.

    The bugs in question were:

    http://bugs.mysql.com/bug.php?id=31190
    http://bugs.mysql.com/bug.php?id=12691

    Thanks!

  3. paulm Says:

    Hi Harrison,

    Thanks for the update. This was MySQL 5.0.41

  4. Log Buffer #92: a Carnival of the Vanities for DBAs Says:

    […] Here on the Pythian Group Blog, Paul Moen posted about a situation in which SHOW SLAVE STATUS lies. […]

Leave a Reply

Filling out the following captcha not only allows us to cut down on automated blogspam but also helps digitize books. Please feel free to send comments on this approach directly to Paul at vallee@pythian.com.

NOTE: After submitting your comment, verify that it is added to the blog. New comments will be marked as "waiting for moderation" (we only moderate for spam). If the level of spam is as low as we hope, we will bypass this step.