THE WORLD DISCUSSES #PYTHIAN ON TWITTER. HAVE A QUESTION? USE OUR HASHTAG AND ASK AWAY.

When SHOW SLAVE STATUS lies

Over-the-Top Tales from the Trenches.
Motto: Bringing order to the chaos of every day DBA life.

So you have got your nice MySQL Master-Slave replication pair setup. Everything is sweet, then the master dies/restarts or you have a slightly extended network outage.

Your monitoring software (in our case Avail) fires off a page and you are rudely interrupted from reading the Pythian blog.

These real world interruptions, what can I say… it pays the bills.

Anyway being the rounded DBA or enlightened DBA as Babette would say, you are capable of handling any type of database. You log into the machine and check out why the slave threw an error or if your monitoring is slow, why the slave is lagging by 2 hours.

You run SHOW SLAVE STATUS\G

mysql> show slave status \G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: x.x.x.x
                Master_User: rep
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: mysql-bin.000480
        Read_Master_Log_Pos: 690470773
             Relay_Log_File: db2-relay-bin.000028
              Relay_Log_Pos: 683977007
      Relay_Master_Log_File: mysql-bin.000480
           Slave_IO_Running: Yes
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB: avail,avail
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted
                 		(you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted
                 		(you can check this by running 'mysqlbinlog' on the relay log), a network problem,
                 		or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log,
                 		you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
               Skip_Counter: 0
        Exec_Master_Log_Pos: 126
            Relay_Log_Space: 690471192
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

The bits which are important are in red.

Now the normal procedure for this kind of error is to get the Exec_Master_Log_Pos and tell the slave to restart from there. Using a command like: Read the rest of this entry . . .

Oracle Standby Recovery Rate Monitoring

So you have created your standby database using the RMAN DUPLICATE command, you have set the ARCHIVE_LAG_TARGET to maintain a minimum lag target, and you have sorted out those nasty datafile missing errors using automatic file management. You’ve even added standby redo logs to improve the Mean Time To Recovery (MTTR).

Now management are demanding to know why the standby periodically lags during the day. They have you on the phone and they are asking what the current redo apply rate is.

Luckily, you read the Pythian blog regularly, or like most people google for a solution or script to help out so you don’t reinvent the wheel.

Management’s first question was, does the standby redo apply rate lag during the day and if so, when. The following SQL script joins v$archived_log to v$dataguard_status, which contains the most recent messages from Data Guard. We join to that table to get the timestamp for when a specific archive log started to be applied.

rem Reports standby apply rate with lag
rem
select TIMESTAMP,completion_time "ArchTime",
SEQUENCE#,round((blocks*block_size)/(1024*1024),1) "SizeM",
round((TIMESTAMP-lag(TIMESTAMP,1,TIMESTAMP) OVER (order by TIMESTAMP))*24*60*60,1) "Diff(sec)",
round((blocks*block_size)/1024/ decode(((TIMESTAMP-lag(TIMESTAMP,1,TIMESTAMP)
OVER (order by TIMESTAMP))*24*60*60),0,1,
(TIMESTAMP-lag(TIMESTAMP,1,TIMESTAMP) OVER (order by TIMESTAMP))*24*60*60),1) "KB/sec",
round((blocks*block_size)/(1024*1024)/ decode(((TIMESTAMP-lag(TIMESTAMP,1,TIMESTAMP)
OVER (order by TIMESTAMP))*24*60*60),0,1,
(TIMESTAMP-lag(TIMESTAMP,1,TIMESTAMP) OVER (order by TIMESTAMP))*24*60*60),3) "MB/sec",
round(((lead(TIMESTAMP,1,TIMESTAMP) over (order by TIMESTAMP))-completion_time)*24*60*60,1) "Lag(sec)"
from v$archived_log a, v$dataguard_status dgs
where a.name = replace(dgs.MESSAGE,'Media Recovery Log ','')
and dgs.FACILITY = 'Log Apply Services'
order by TIMESTAMP desc;

Read the rest of this entry . . .

Oracle Standby redo Logs

So you have created your standby database using the RMAN DUPLICATE command, you have set the minimum log switch individual using ARCHIVE_LAG_TARGET, and you have sorted out those nasty “datafile missing” errors using automatic file management.

Management is now happy with the Mean Time To Recovery (MTTR)… but not really. They review the documentation generated from your database failover test and see that the time to switchover is slow and that there is still a possibility of losing some data.

Enter standby redo logs. These are required to enable real time apply of redo data onto the standby. This is not the same level of redundancy or availability of Oracle RAC, but getting close.

Essentially, the standby redo logs are populated with redo information as fast as the primary redo logs, rather than waiting for the redo log to be archived and shipped to the standby. So the loss of data in the event of a failover is minimized.

Rather than regurgitate the documentation, I have attached the links to the specific parts of the documentation. Since I’m focused on providing ways for you to “get the job done”, I have also included some SQL that uses dynamic SQL to generate a series of ALTER DATABASE ADD STANDBY LOGFILE commands to run on both the primary and standby databases.

Read the rest of this entry . . .

Oracle: Standby Automatic File Management

So you have followed the recipes to create a standby database, setup the database to switch logs automatically, and now, as with any good database, the space required to support the application grows and grows and grows.

Eventually, using your monitoring software (in my case, Pythian’s avail or dailies), you get an alert suggesting that you will need to add additional space. In this case you are going to add a datafile to a tablespace, or add a brand new tablespace.

You add the datafile on the primary and minutes later are paged with these kind of errors:

File #5 added to control file as 'UNNAMED00005' because
the parameter STANDBY_FILE_MANAGEMENT is set to MANUAL
The file should be manually created to continue.
MRP0: Background Media Recovery terminated with error 1274
Fri Apr  8 10:00:22 2005
Some recovered datafiles maybe left media fuzzy
Media recovery may continue but open resetlogs may fail

Basically, the standby is having some trouble applying the changes it has received from the primary redo log because the datafile is not there.

You can use the Oracle parameter standby_file_management along as:

  1. Your standby database doesn’t have major differences in datafile locations.
  2. The standby database is not using raw devices to store your datafiles.

This parameter set to AUTO means that Oracle will handle creating the new datafile on the standby if you add one or more on the primary.

But right now you are stuck with this new datafile and nowhere to go.

Method 1

Add the datafile to the standby database.

Oracle documentation: 9iR2, 10gR1 and 10gR2.

  1. Determine the impact of taking the tablespace offline!
  2. Take the tablespace offline OR put in backup mode: ALTER TABLESPACE name BEGIN BACKUP;
  3. Make a copy of the file locally.
  4. Bring the tablespace online OR take out of backup mode: ALTER TABLESPACE name END BACKUP;
  5. Copy the file to the standby server.
  6. ALTER SYSTEM ARCHIVE LOG CURRENT;

Method 2

ALTER DATABASE CREATE DATAFILE (on the standby).

  1. RECOVER MANAGED STANDBY DATABASE CANCEL;
  2. select file#,name from v$datafile where name like '%UNNAMED%';
  3. ALTER DATABASE CREATE DATAFILE 'unnamed datafile' AS 'proper filename';
  4. RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

Save yourself the hassle, just alter system set STANDBY_FILE_MANAGEMENT = AUTO;

Update: Thanks to Sverre for pointing out you can use tablespace hotbackup mode instead of tablespace offline.

Have Fun!
Paul.

Related Articles:

RMAN Duplicate command to create physical standby

Using ARCHIVE_LAG_TARGET to automatically switch logs

Working as an Onsite DBA Together with Pythian

I was reading a couple of items on gapingvoid

which inspired me to use our blog for a conversation with our customers (and with potential Pythian DBAs) about what it’s like to work with Pythian.

What exactly does an on-site DBA get from working with Pythian Remote DBAs?

  1. Sleep.
  2. Support: phone, IM or email some who either knows how to help or knows someone who knows how to help.
  3. Teammates: You can delegate tasks which are either too hard or too easy onto Pythian.
  4. Accountability. Every minute, every hour is documented for all to see, internally and by individual clients.
  5. Understanding: Someone who knows what being a DBA is like.

You are on-site. Picture the Pythian on-call team standing behind you or beside you; or working with you, or working on task parallel to yours or even on a task you started and need to delegate to get finished.

Behind those frontline DBAs is another 32 Senior DBAs. They are experienced in most Oracle technologies (Apps, RAC, Standbys, Dataguard), Microsoft SQL Server, MySQL, and PostgreSQL. Some of these DBAs are so good they are invited to present at DBA and database conferences worldwide.

These DBAs have been refined in the hellfires of years of problems and projects, are sharpened by experience and armed with both superior tools and a searchable past-support library of over 150,000 problem resolutions.

Picture the whole 24 hours in the day being available to be used by teams of DBAs around the world, so your problem is resolved faster. If Oracle Severity 1 Support Requests (SRs) require people to be available 24×7, we can do it.

As a remote DBA, my primary job is to make your job easier. This is because my goals are aligned with your goals. I too want my job to be easier and more relaxing. I too would rather work on proactive or project work, diminishing reactive firefighting and emergencies to a bare minimum.

Have Fun,
Paul

RMAN Recipes: Switch Oracle Logs Automatically with ARCHIVE_LAG_TARGET

So you have swallowed the standby bait. You have used RMAN duplicate to create a sparkling new standby and things are looking rosy. Then, when you check the lag between the primary and standby (in MAXIMUM PERFORMANCE mode), you discover the standby is miles behind.

You were expecting a solution that kept the standby as close to the primary as possible without the expense of making the primary wait until all changes are applied on the standby, as happens in MAXIMUM PROTECTION mode or MAXIMUM AVAILABILITY mode.

Oracle (at least from 9iG) rides to the rescue again. The parameter ARCHIVE_LAG_TARGET tells Oracle to make sure to switch a log every n seconds. It can be dynamically set using ALTER SYSTEM, e.g.:

ALTER SYSTEM SET ARCHIVE_LAG_TARGET = 600 SCOPE=BOTH;

This sets the maximum lag to 10 mins.

No more stuffing around with shell scripts to run essentially: ALTER SYSTEM SWITCH LOGFILE;. Oracle does it for you.

Feedback for Oracle: make the default something other than 0, maybe 1800. I am a crusty old DBA (well, not that crusty compared to some) and nowadays I expect stuff to work like this out-of-the-box. DWIM!

Have Fun!

Paul.

Update: Thanks to Howard for the correction, this has been around since 9i

Links: More info about Oracle data protection modes

RMAN reports, or What Is the ETA of My Backup?

Over-the-Top Tales from the Trenches.
Motto: Bringing order to the chaos of every day DBA life.

Dear Diary,

Are we there yet? How much farther?

Ever heard this during a recovery of a downed database?

The poor thing has either a minor scratch, requires immediate attention, or is dead on arrival (DOA).

You could cut the tension in the air with a knife, and you have the dreaded watcher1 standing so close you can see their face mirrored on your monitor, making your typing accuracy go to 10%.

Of course you are prepared. You have many recovery tests under your belt, and you know what the log apply speed is… right? Your nifty little SQL script even knows from the size of the archive redo logs what a good approximate ETA for a recovery is.

Forget wading through pages of rman outputs, tallying up each backup piece duration, Oracle 10G comes riding to the rescue, with some nifty v$rman_ views.

So what is the ETA of my backup?

Read the rest of this entry . . .

MySQL Recipes: Connections per hostname Using Pager

Over-the-Top Tales from the Trenches.
Motto: Bringing order to the chaos of every day DBA life.

This diary has made you conversant in the world of RMAN standbys, duplicates and corruptions… exposed you to the world the hidden nature of shmmax. You have seen the nastiness of Oracle MONTHS_BETWEEN. What more worlds and mysteries are left for us to conquer? Do we weep like Alexander, or board a boat and discover America?

Actually, my aim in any posting is to make your job as a DBA in Oracle, SQLserver and MySQL easier. This is not a contest on how complex or internal we can get. This is about getting the job done, so you can get back to

  1. playing with your kids/dog/partner.
  2. writing articles in a blog.
  3. generally not worrying about databases.

Ever wondered how to get a quick count of the number of connections per hostname from MySQL? Welcome to the arcane (but extremely powerful) world of the MySQL command line pager.

Read the rest of this entry . . .

MySQL: Tuning filesorts and temporary tables

Goal:

Getting rid of filesorts and temporary tables by tuning MySQL queries.

Background:

Filesorts and temp tables are a necessary evil in MySQL, used when MySQL must sort the data before returning the output to the user. They are the most common issue with slow queries in MySQL, the main reason being that if the output is too large, you can kiss goodbye in-memory performance, and say hello to disk access.

Common User and Manager symptoms:

  1. Sore throats due to excessive swearing at poor database performance.
  2. Sore hip pockets due to lack of scalability requiring continuous hardware purchases.
  3. Sore users who are sitting on high speed bandwidth and have to wait more than 1.5 secs for a response from any web page. Google has raised the bar, time to ditch the straddle technique and go for the Fosbury Flop.

Appropriate Medicine:

The quickest way to get rid of temp tables and filesorts, is to have already done the sorting work for MySQL in the form of an appropriate index.

Caveat:

Think of good indexes as an investment in improving read performance at the expense of write performance. Let your database usage guide what indexes to add. Please reread that last sentence — it is important.

Read the rest of this entry . . .

Oracle Gotcha: months_between 31-days-per-month assumption

Over-the-Top Tales from the Trenches.
Motto: Bringing order to the chaos of everyday DBA life.

Dear Diary,

Do you want to know a secret? Understanding it will prevent pain and gnashing of teeth, and also leave your face comfortably free of egg.

Here it is: Some Oracle functions assume there are 31 days for each month of the year!

Check out the fractional difference between the function MONTHS_BETWEEN which uses a 31 day month for all months and using the underlining Oracle DATE based on days:

SQL> select months_between(trunc(sysdate),to_date('2006-02-28','YYYY-MM-DD')) from dual;

MONTHS_BETWEEN(TRUNC(SYSDATE),TO_DATE('2006-02-28','YYYY-MM-DD'))
-----------------------------------------------------------------
7.25806452

SQL> select (trunc(sysdate) - to_date('2006-02-28','YYYY-MM-DD'))/365.242199*12  from dual;

(TRUNC(SYSDATE)-TO_DATE('2006-02-28','YYYY-MM-DD'))/365.242199*12
-----------------------------------------------------------------
7.19522554

The moral of the story: be careful when using MONTHS_BETWEEN for fractional dates, e.g. when using the output to go back to days.

Also, if you use a managed standby with a DELAY, which is not the latest version — i.e. less than 9.2.0.6 — there is a nasty bug related to a similar assumption, which will break the standby. So just when you thought your standby solution was robust and free of issues…

See metalink DocID 280909.1. You can fix this bug by patching to 9.2.0.6 or above, or by upgrading to a higher release such as 10G.

Update 2007/09/10: Dominic over at Orastory has a nice story about the INTERVAL function as well

Have Fun!

Paul

Start NowWith Pythian - database design, management and emergency handling capabilities...

Live Updates

pythian: RT @sheeri: #confoo talk "Bending Queries to your Will with EXPLAIN" slides http://bit.ly/explainslides & handout
more



Testimonials

  • Serge Racine

    DBA, Brookfield Energy

    We are very satisfied by the service given to us by Andre and Shakir in support of our recent data quality and reorganization initiative.... more