Moving Oracle Datafiles to a ZFS Filesystem with the Correct Recordsize

Aug 19, 2009 / By Don Seiler

Full credit for this tale should go to my colleague Christo Kutrovsky for the inspiration and basic plan involved.

We recently migrated a large database from Solaris SPARC to Solaris x86-64. All seemed to go well with the migration, but in the next few weeks, we noticed some I/O issues cropping up. Some research led us to find that the ZFS filesystem used to hold the datafiles was killing us on I/O. The default “recordsize” setting for ZFS was 128k.

$ /usr/sbin/zfs get recordsize zfs-data
NAME      PROPERTY    VALUE     SOURCE
zfs-data  recordsize  128K      default

An Oracle database typically uses 8k for the block size, but in this case it was 16k. We saw basically the same thing that Neelakanth Nadgir described in his blog post, Databases and ZFS:

With ZFS, not only was the throughput much lower, but we used more [than] twice the amount of CPU per transaction, and we are doing 2x times the IO. The disks are also more heavily utilized. We noticed that we were not only reading in more data, but we were also doing more IO operations [than] what is needed.

The fix is to set the ZFS recordsize for a datafile filesystem to match the Oracle instance’s db_block_size. We also read in the ZFS Best Practices Guide that redo logs should be in a separate filesystem with the default ZFS recordsize of 128k. We already had them separate, so we just needed to get our datafiles on a ZFS filesystem with a 16k recordsize.

The first step is to change the ZFS recordsize for the current filesystem. Easy enough, but the change in recordsize will only apply to new files written there. It doesn’t do anything for our existing terabyte-plus of datafiles. This means we need to move the files to a new ZFS filesystem with the correct recordsize already set. That’s where the magic happens.

My first thought was that we’d need a whole other slab of disk. Then I learned that we can simply create a new ZFS filesystem on the same pool of disks as the old filesystem. Obviously we won’t be able to hold two complete copies of the datafiles, but this lets us migrate groups of files at a time (for example, a tablespace). When each move completes, disk space is released from the old filesystem, so the total disk used stays essentially unchanged.
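A minimal sketch of that setup, assuming the pool from the earlier example (the filesystem name here is illustrative, not from our actual system). The key point is that recordsize must be set before any datafiles land on the filesystem, since it only applies to blocks written after the property is set:

```shell
# Sketch (filesystem name is illustrative): create a sibling filesystem in
# the same pool with recordsize set up front, before any datafiles exist.
zfs create -o recordsize=16K zfs-data/oradata16k

# Confirm the property took effect
zfs get recordsize zfs-data/oradata16k
```

Because both filesystems draw from the same pool, no new storage needs to be provisioned for the migration.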

The rest is just a matter of moving datafiles. Now, we all know you can’t just move or copy Oracle datafiles from a running instance. We also know that clients want as little downtime as possible, so doing a complete shutdown/move/mount/rename/open is out of the question. We could offline a tablespace, move its datafiles, rename them in the database, then online the tablespace again. Here’s a sample scenario, starting with tablespace and datafile creation:

SQL> create tablespace zfstest
  2          datafile '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf'
  3          size 128m;

Tablespace created.

SQL> alter tablespace zfstest
  2          add datafile '/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf'
  3          size 128m;

Tablespace altered.

SQL> create table zfsobjects tablespace zfstest
  2          as select * from dba_objects;

Table created.

Assume the following:

  • /u01/app/oracle/oradata/orcl/zfs128k/ is a ZFS filesystem with default 128k recordsize
  • /u01/app/oracle/oradata/orcl/zfs16k/ is a ZFS filesystem with a 16k recordsize

So here we have datafiles created on a ZFS filesystem with the default 128k recordsize, and a small table just to test data access. To move the entire tablespace in one go, we do this:

SQL> alter tablespace zfstest offline;

Tablespace altered.

SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf 
     /u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf

SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf' to 
'/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf';

Database altered.

SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf 
/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf

SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf' to 
'/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf';
Database altered.

SQL> alter tablespace zfstest online;

Tablespace altered.

SQL> select count(*) from zfsobjects;

  COUNT(*)                                                                      
----------                                                                      
     68844

Pretty simple. We only escape to the OS once per file, to run the mv command. If you want to be safe, you could use cp instead of mv and keep the originals until you’ve verified the copies. Just know that disk space on the old filesystem won’t be released until you delete the old files.

Note that this operation could take a very long time for large tablespaces, since the whole tablespace stays offline for the duration of the move. We could try a somewhat similar approach of taking one datafile offline at a time and moving it. We’d just need to perform a quick recovery before bringing it back online. That would look like this:

SQL> alter database datafile '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf' offline;

Database altered.

SQL> 
SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf 
/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf

SQL> 
SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf' to 
'/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf';

Database altered.

SQL> 
SQL> recover datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf';
Media recovery complete.
SQL> 
SQL> alter database datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf' online;

Database altered.

SQL> 
SQL> alter database datafile 
'/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf' offline;

Database altered.

SQL> 
SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf 
/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf

SQL> 
SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf' 
to '/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf';

Database altered.

SQL> 
SQL> recover datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf';
Media recovery complete.
SQL> 
SQL> alter database datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf' online;

Database altered.

SQL> 
SQL> select count(*) from zfsobjects;

  COUNT(*)                                                                      
----------                                                                      
     68844

Seems fine. Note that your instance needs to be in ARCHIVELOG mode to be able to take datafiles offline. Obviously, if someone tried to access data on an offline datafile they’d get this error:

SQL> select count(*) from zfsobjects;
select count(*) from zfsobjects
                     *
ERROR at line 1:
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf'

So if a datafile is particularly large, this can still mean a few minutes of some data being offline. We have one more tool at our disposal that provides a quicker turnaround: RMAN’s BACKUP AS COPY command. As the name says, it creates a copy of a datafile as a backup, and it does so while the current datafile (and its data) remains online. We then just need to offline, switch, recover, and online each datafile, which takes very little time compared to the time spent moving a large datafile. That process would be something like this.

First, we want to know the file numbers of our datafiles, because the RMAN procedure doesn’t provide an easy way to preserve datafile names when we back up an entire tablespace (which we’re going to do):

SQL> select file_id, file_name
  2  from dba_data_files
  3  where tablespace_name='ZFSTEST';

   FILE_ID FILE_NAME
---------- -------------------------------------------------------
	 6 /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf
	 7 /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf

So we have file numbers 6 and 7. Now we create our RMAN command file and execute it:

RMAN> backup as copy
2> 	tablespace zfstest format '/u01/app/oracle/oradata/orcl/zfs16k/%U';
3> 
4> sql 'alter database datafile 6 offline';
5> switch datafile 6 to copy;
6> recover datafile 6;
7> sql 'alter database datafile 6 online';
8> 
9> sql 'alter database datafile 7 offline';
10> switch datafile 7 to copy;
11> recover datafile 7;
12> sql 'alter database datafile 7 online';
13> 
Starting backup at 18-AUG-09
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=134 device type=DISK
channel ORA_DISK_1: starting datafile copy
input datafile file number=00006 name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf
output file name=/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-6_01kn0irh tag=TAG20090818T140640 RECID=2 STAMP=695225208
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:15
channel ORA_DISK_1: starting datafile copy
input datafile file number=00007 name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf
output file name=/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-7_02kn0is0 tag=TAG20090818T140640 RECID=3 STAMP=695225222
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:07
Finished backup at 18-AUG-09

sql statement: alter database datafile 6 offline

datafile 6 switched to datafile copy 
"/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-6_01kn0irh"

Starting recover at 18-AUG-09
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:00

Finished recover at 18-AUG-09

sql statement: alter database datafile 6 online

sql statement: alter database datafile 7 offline

datafile 7 switched to datafile copy 
"/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-7_02kn0is0"

Starting recover at 18-AUG-09
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:00

Finished recover at 18-AUG-09

sql statement: alter database datafile 7 online

Recovery Manager complete.

Voila! The tablespace is moved with minimal downtime:

SQL> select file_id, file_name
  2  from dba_data_files
  3  where tablespace_name='ZFSTEST';

   FILE_ID
----------
FILE_NAME
--------------------------------------------------------------------------------
	 6
/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-6_01kn0irh

	 7
/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-7_02kn0is0

SQL> select count(*) from zfsobjects;

  COUNT(*)
----------
     68844

Obviously, you can generate all of these statements from SQL scripts, and you’ll definitely want to do so if your tablespace has a large number of datafiles. If you are scripting, you could opt to back up one datafile at a time if you need to preserve file names. In this case, I wasn’t too concerned.
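One way to sketch that scripting (this is an illustration, not the exact script we used): a small shell function that emits the per-file offline/switch/recover/online block for each FILE_ID. The IDs here are hypothetical; in practice you would spool them from dba_data_files as shown earlier:

```shell
# Sketch: generate the per-datafile RMAN commands for a list of FILE_IDs.
# In practice, spool the IDs from dba_data_files rather than hard-coding.
gen_rman_cmds() {
  for fno in "$@"; do
    printf "sql 'alter database datafile %s offline';\n" "$fno"
    printf "switch datafile %s to copy;\n" "$fno"
    printf "recover datafile %s;\n" "$fno"
    printf "sql 'alter database datafile %s online';\n" "$fno"
  done
}

# Emit the commands for datafiles 6 and 7, ready to append after the
# BACKUP AS COPY line in an RMAN command file.
gen_rman_cmds 6 7
```

Redirect the output into your RMAN command file after the BACKUP AS COPY statement, and the per-file steps are taken care of no matter how many datafiles the tablespace has.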

UPDATE: Teammate Marc Billette pointed out that the old datafiles are not removed when doing a datafile copy and switch. I had left them on disk, still consuming space! Once we are done with all the datafile switches, our original datafiles become the copies, and we can list and delete them in RMAN like this:

RMAN> list datafilecopy all;

List of Datafile Copies
=======================

Key     File S Completion Time Ckp SCN    Ckp Time       
------- ---- - --------------- ---------- ---------------
4       6    A 18-AUG-09       629982     18-AUG-09      
        Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf

5       7    A 18-AUG-09       630001     18-AUG-09      
        Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf

RMAN> delete datafilecopy all;

allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=135 device type=DISK
List of Datafile Copies
=======================

Key     File S Completion Time Ckp SCN    Ckp Time       
------- ---- - --------------- ---------- ---------------
4       6    A 18-AUG-09       629982     18-AUG-09      
        Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf

5       7    A 18-AUG-09       630001     18-AUG-09      
        Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf

Do you really want to delete the above objects (enter YES or NO)? YES
deleted datafile copy
datafile copy file name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf RECID=4 STAMP=695225227
deleted datafile copy
datafile copy file name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf RECID=5 STAMP=695225228
Deleted 2 objects

Thanks to Marc for catching this early in the published life of the article!

I hope this post helps other people who run into this problem with Oracle on ZFS (I’ve heard from one on Twitter already), or people just wanting to move datafiles. Again, thanks to Pythian DBA Christo Kutrovsky for laying this process out for us!

Note: this process was tested on Oracle 10gR2 (10.2.0.2) and 11gR1 (11.1.0.6).

8 Responses to “Moving Oracle Datafiles to a ZFS Filesystem with the Correct Recordsize”

  • Peter Moore says:

    Hello.

    Great article with some interesting ideas.

    One question though: Is there not the possibility that the datafile is being written to while the “backup as copy” is running (and until the point when the file is taken offline)?

    How do we ensure that the *new* copy of the file is up-to-date?

  • Don Seiler says:

    Peter: That’s why you have to run the “recover datafile X” command after the switch and before putting the new copy online.

  • [...] 10 – Using backup copy to move files after changing the recordsize on filesystem ? Don Seiler-Moving Oracle Datafiles to a ZFS Filesystem with the Correct Recordsize [...]

  • Moving Oracle Datafiles to a ZFS Filesystem with the Correct Recordsize « die Seilerwerks says:

    [...] leave a comment » Originally posted on The Pythian Group blog. [...]

  • Michael says:

    What you are suggesting here will not work for system level tablespaces such as SYSTEM, SYSAUX, UNDO. These will require the entire database to be down to move. They cannot be taken offline while the database is running. You’d still have a database outage!

    Also, taking individual datafiles offline in a running system can cause logical data problems that might not be immediately recognizable in a high-throughput application with referential integrity. For example, you could generate an order record in one tablespace and then, when attempting to assign the order to a customer in an offline tablespace, end up with an orphaned order.

    We are encountering this same issue right now. Luckily, the system is not live yet. Good articles such as yours led us to decide to change the ZFS recordsize before we went live (thank goodness)!!!

  • Don Seiler says:

    Michael, you can do it with UNDO. Simply create a new UNDO tablespace in the new ZFS mount. Then change the system undo setting. Wait for all operations to use the new UNDO and you’ll be able to drop the old one.

    This is precisely what I did, same goes for TEMP.

  • Caner says:

    Hi,
    Nice document.
    Unfortunately I am facing the same I/O problem on my disk box. I have just migrated to the ZFS filesystem and the reads are excessive. I am planning to change the recordsize to 8K, and I want to know if this will solve the problem.
    Thanks

  • [...] Next, I limited the ZFS ARC cache to about 100MB and re-ran my tests. This time, I was able to get the disks to do something, but I maxed out at about 6.000 IOPS when I realized that the ZFS blocksize was set to 128kB which is propably a pretty stupid idea when doing random read tests on 8k data blocks. Oh, and I was not the first one to notice this, read about it in more detail from Don Seiler. [...]
