ASM multi-disk performance

May 12, 2006 / By Christo Kutrovsky

If you have the ability to combine disk spindles at both the SAN level and Oracle (ASM) level, which one is better?

Should you combine all your spindles on the SAN, present 1 big disk to the OS and give that to ASM? Or should you present each individual disk spindle to ASM and let ASM do the mirroring?

One item should help you decide very quickly: ASM does not offer RAID 5, and that’s what most people would like to run for its low cost.

Another item is performance. Modern disks can easily push 50 to 70 MB/sec in sequential reads. Combine the output of 3 drives and you get up to 210 MB/sec, which is approximately the bandwidth limit of 2 Gbit Fibre Channel. That is, of course, under an optimal disk setup.
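The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the 200 MB/sec figure is the nominal rate for a 2 Gbit link quoted later in this post, and the function name is made up for the example.

```python
import math

# Nominal payload rate of one 2 Gbit Fibre Channel link (approximate figure)
FC_2GBIT_MB_PER_SEC = 200

def drives_to_saturate(link_mb_per_sec, drive_mb_per_sec):
    """Smallest number of drives whose combined sequential rate reaches the link rate."""
    return math.ceil(link_mb_per_sec / drive_mb_per_sec)

for rate in (50, 70):
    n = drives_to_saturate(FC_2GBIT_MB_PER_SEC, rate)
    print(f"{rate} MB/sec per drive: {n} drives, {n * rate} MB/sec combined")
```

With these assumptions, 3 to 4 drives reading sequentially are enough to fill one link, so a 14-spindle array is limited by the controller rather than by the disks.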

So imagine, as a DBA, you have full freedom on how to divide your hard disk devices. Don’t you wish it was like that for all DBAs?

I happen to have both setups: one disk group with one big disk (array) and another disk group with 2 disks (arrays). They are on the same machine, attached to the same database. All 3 arrays are RAID 5 with 256 KB striping. To visualize:

(14 x 36 GB) => RAID 5 LUN => ORA_DISK_1 => ASM DISK GROUP A

(14 x 36 GB) => RAID 5 LUN => ORA_DISK_2
(14 x 36 GB) => RAID 5 LUN => ORA_DISK_3
ORA_DISK_2 + ORA_DISK_3 together => ASM DISK GROUP B

There’s 1 more detail that matters – the machine has 2 Fibre Channel controllers with 2 Gbit of bandwidth each (~200 MB/sec). The LUNs are split equally, alternating between the controllers. For the LUNs I am testing, in the disk group with 2 LUNs, each LUN is on a separate controller.

So I created the same tablespace with the same table on those 2 disk groups. And ran the following tests:

Full table scan of a 15 GB table:
– Disk Group A (1 disk) – 136 seconds – ~110 MB/sec
– Disk Group B (2 disks) – 184 seconds – ~81 MB/sec
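The rates above are simply size divided by elapsed time. A minimal sketch, assuming the "15 GB" table is counted as 15,000 MB (that assumption reproduces the rounded figures quoted):

```python
def throughput_mb_per_sec(size_mb, seconds):
    """Average read rate for a scan of size_mb megabytes taking the given time."""
    return size_mb / seconds

TABLE_MB = 15_000  # assumption: 15 GB taken as 15,000 MB

# Elapsed seconds for the serial full table scan, per disk group
serial_runs = {"A (1 disk)": 136, "B (2 disks)": 184}

for group, secs in serial_runs.items():
    rate = throughput_mb_per_sec(TABLE_MB, secs)
    print(f"Disk group {group}: {rate:.1f} MB/sec")
```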

Surprised? The disk group with 2 disks is slower! These results are consistent, and confirmed with diagnostic output from iostat. You may start to wonder why 2 would be slower than 1. It should be twice as fast!

I will have to give an example of this. Imagine you go to the library. In this specific library, you don’t get access to the books directly. You go to the desk and request them. The librarian goes and fetches the books you want. You, being a smart guy, ask for multiple books at the same time, since you know they are in the same area – thus you are saving time.
Now imagine there were 2 librarians. Now you have 2 people to ask for books, but what you do is ask one, wait for your books, then ask the other librarian, alternating between them. You never ask both at the same time; it is either one or the other. You won’t get your books faster, you will get them at the same speed!
In our situation we got slower with 2 “librarians”. Why? Well, it happened that our “librarians” were really smart, and when they went to get the books, they decided to bring an extra set, in case you asked for it next. So when you had 1 “librarian”, it worked great, and some of the books you asked for were already available. But now that you have 2 “librarians” to ask, by the time you come back to the librarian who just brought you books, he has decided that you don’t need the extra set and has returned it.
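The first half of the analogy, that alternating between 2 librarians one request at a time gains nothing, can be sketched as a toy timing model. This is an illustration with made-up numbers, not a measurement, and it does not model the read-ahead effect that made the 2-LUN case slower.

```python
import math

def one_at_a_time(requests, service_ms, librarians):
    """Ask one librarian, wait for the answer, then ask the next one.

    Only one request is ever outstanding, so the number of librarians
    has no effect on the total time (the argument is deliberately unused).
    """
    return requests * service_ms

def all_at_once(requests, service_ms, librarians):
    """Keep every librarian busy: requests are split evenly and served concurrently."""
    return math.ceil(requests / librarians) * service_ms

REQUESTS, SERVICE_MS = 100, 10  # hypothetical workload
for n in (1, 2):
    print(f"{n} librarian(s): one at a time {one_at_a_time(REQUESTS, SERVICE_MS, n)} ms, "
          f"all at once {all_at_once(REQUESTS, SERVICE_MS, n)} ms")
```

In this model, a second librarian only helps when more than one request is outstanding at the same time.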

Now the same test, but in parallel. My parallelism level is 8, full scan of the 15 GB table:
ASM disk group A (1 disk) – 78 seconds – ~192 MB/sec
ASM disk group B (2 disks) – 41 seconds – ~365 MB/sec

Now I am sending more requests SIMULTANEOUSLY, so I get to use the fact that I have 2 LUNs on separate controllers. In addition, it helped my 1 LUN disk group by providing a constant flow of requests.
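To put the serial and parallel timings side by side, here is a small sketch that computes the speedup from parallelism and how close each disk group gets to its nominal controller bandwidth. It assumes 15,000 MB for the table and 200 MB/sec per controller, with one controller per LUN as described above; the dictionary layout and function name are just for illustration.

```python
TABLE_MB = 15_000        # assumption: 15 GB taken as 15,000 MB
LINK_MB_PER_SEC = 200    # nominal rate of one 2 Gbit controller

# Elapsed seconds from the tests above; "luns" is also the number of controllers used
runs = {
    "A": {"luns": 1, "serial": 136, "parallel": 78},
    "B": {"luns": 2, "serial": 184, "parallel": 41},
}

def summarize(run):
    """Return (speedup from parallelism, parallel MB/sec, fraction of nominal bandwidth)."""
    speedup = run["serial"] / run["parallel"]
    rate = TABLE_MB / run["parallel"]
    utilization = rate / (run["luns"] * LINK_MB_PER_SEC)
    return speedup, rate, utilization

for name, run in runs.items():
    speedup, rate, utilization = summarize(run)
    print(f"Group {name}: {speedup:.2f}x faster in parallel, "
          f"{rate:.0f} MB/sec, {utilization:.0%} of nominal bandwidth")
```

Under these assumptions both groups end up above 90% of their nominal controller bandwidth once enough requests are in flight.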

And then the final test I ran was an RMAN backup validate of the tablespace, which simply reads all the data. Since it’s 1 big tablespace, no parallelism is available, but that’s not important. Unfortunately, I ran the tablespace backup tests at a later point, so the sizes are different:

ASM disk group A (1 disk) – 17’500 MB in 96 seconds – 182 MB/sec
ASM disk group B (2 disks) – 46’700 MB in 135 seconds – 345 MB/sec

Even though that speed looks amazing, the real rate is actually a bit higher, as RMAN takes 1-2 seconds after the copy before taking the timing estimate. According to iostat I reached 196 MB/sec in group A, and 392 MB/sec in group B.

And this is just 1 backup. Both the backup and the full table scan were limited by disk, so why the difference?

The reason is ASYNC IO.

RMAN uses ASYNC IO extensively, keeping 32 read requests of 1 MB each in the read queue. This is clearly visible in iostat. ASYNC IO allows RMAN to keep requests in the queue while processing results as they come in. This allows the disk IO subsystem to fetch them very efficiently.

Think about it, if you go to the librarian and give him a list of all the books you need, he will get them in the most efficient way for him.
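To make the idea of "keeping 32 requests of 1 MB in the queue" concrete, here is a rough Python sketch. It is not how RMAN is implemented: it uses a thread pool as a stand-in for kernel async IO, it relies on os.pread (available on Unix-like systems), and the function name and defaults are made up for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024 * 1024  # 1 MB per read request, matching the size seen in iostat
DEPTH = 32           # number of read requests kept in flight

def read_with_queue(path, chunk=CHUNK, depth=DEPTH):
    """Read a whole file with up to `depth` reads outstanding; return total bytes read."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offsets = range(0, size, chunk)
        with ThreadPoolExecutor(max_workers=depth) as pool:
            # All offsets are queued up front; the pool runs at most `depth` reads at once.
            # os.pread takes an explicit offset, so threads can share one file descriptor.
            buffers = pool.map(lambda off: os.pread(fd, chunk, off), offsets)
            return sum(len(buf) for buf in buffers)
    finally:
        os.close(fd)
```

A synchronous reader would instead issue one read, wait for it, and only then issue the next, which is the "one librarian at a time" pattern from the analogy.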

Conclusion? Discussion and feedback is open in the comments!

22 Responses to “ASM multi-disk performance”

  • Hi Christo,

    Indeed, the difference in the non-parallel full table scan is quite a surprise. Do you have any evidence that prefetching in the SAN box was actually the cause?

    It would also be useful to test concurrent random single-block IO on a high-volume table (accessing rows through index lookups).

    Anyway, I would expect throughput to double, as there are twice as many physical spindles and controllers (which is the case, excluding the “incident” with the first full table scan). It would be interesting to see what happens with the same volume, the same physical spindles behind it and the same controllers – would ASM striping slow things down?

    Regards,
    Alex

  • It’s interesting to note that although you have 2 gigabit – with a nominal throughput of about 200 Megabytes, the better serial test only gets to 110 MB.

    Can you also test with one LUN defined as 28 * 36GB ? It would be interesting to see if this has some impact on the parallel and rman tests.

    Your high-volume tests are very helpful – particularly since they show that you really can get 200MByte down a 2GBit connection. To date I have not seen a production system managing better than 50% of nominal (although I finally heard of one just a couple of days ago).

    Regards

    Jonathan Lewis

  • Christo Kutrovsky says:

    Alex,

    I don’t quite understand what part the SAN prefetching could be helping with, could you elaborate?

    I cannot mess with the way the spindles are set up, i.e. have a big one with 28. But I can try random IO on the small one and on the combined one.

  • Christo Kutrovsky says:

    Jonathan,

    As I said to Alex, I cannot re-arrange the spindles. There’s actual data on them.

    I am 99% sure that 1 LUN with 28 spindles will have a similar sequential read speed to the one with 14, as our limit is the controller bandwidth. 1 LUN is exclusively serviced by 1 controller.

    If only full table scans were doing async IO.

    I was amazed I could be so close to theoretical saturation.

  • “I don’t quite understand what part the SAN prefetching could be helping with, could you elaborate?”

    If I understood your librarian comparison correctly, then prefetching is the feature slowing down the single-threaded full table scan on 2 LUNs – you ask for, say, 10 books and he goes and picks up 20 in case you might be interested. So the storage box actually fetches more blocks than you ask for, and this makes it slower. Is that what you meant?

  • Christo Kutrovsky says:

    Ok I understand now.

    I was going the opposite way. When you are reading relatively sequentially on a single LUN, some SAN (or hard disk) pre-fetching is happening, which optimizes overall speed.
    When using 2 LUNs, this optimization either happens to a lesser degree, or not at all.

    Thinking logically, it makes perfect sense to test with 28 spindles to determine whether it’s the spindle or SAN pre-fetching that’s not working quite so well.

    What’s on my list for testing though is larger ASM allocation units when using 2 or more LUNs. Based on my previous pure-IO testing, doing IO with 1Mb reads is just about optimal for 1 spindle. So If I have 14, it should be more like 8Mb or even 16 Mb. But I am not sure of the full implications of this, so it’s for another time.

  • Goran Bogdanovic says:

    Hi Christo,
    Great test!
    I am surprised that the SAN read-ahead feature, with two LUNs defined, can cause a slow-down on sequential disk reads.
    You did not specify how you defined ASM DISK GROUP B. I guess that you defined it with external redundancy, and that ASM stripes files across both disks/LUNs in the disk group using the template for a datafile (coarse striping, stripe size 1M). Since the hardware stripe size is less than the ASM stripe size, the question is also whether this can cause a slow-down when you have a disk group with two disks.

    Also, since ASM allows kernelized asynchronous I/O, the FTS should also have the “chance” to use this benefit, and not only RMAN, so the answer that ASYNC IO is the reason for the difference between the backup and the full table scan sounds a bit questionable to me.

    Cheers,

    Goran

  • Christo Kutrovsky says:

    Both disk groups were specified with External redundancy, with the default coarse (1mb) striping for datafiles.

    The hardware stripe being less than the ASM stripe is not a problem, because ASM does not read in blocks of its stripe size. It only uses that to map data, and this is how much data is defined by a single “reference”. Very similar to extent management inside a datafile.

    Just because you can do asynchronous IO doesn’t mean it’s being used. You can do kernelized async IO on ext3 (in the 2.6 kernel) and on raw devices; it’s not just ASM.
    But the application has to have extra code to use async IO. The application must use different system (or ASM) calls to request data in asynchronous mode. I guess the code for a full table scan does not have that functionality, but the code to read whole datafiles has it.

    You can easily confirm this with “iostat -x 5”. On a system where you are the only user, watch the disk queue when you run a full table scan query, and when you have an RMAN backup validate running.

  • “So If I have 14, it should be more like 8Mb or even 16 Mb.”
    Well, having an 8K block with 128 MBRC gives you a 1MB IO size. I am not aware that Oracle can do more than 128 blocks of MBRC on any platform (doesn’t mean it can’t, though), so I guess the only option is to increase your block size, which might not always be feasible.

    Regarding async IO: IO itself doesn’t go through ASM instance (it only manages metadata requests and changes) but rather via OS’ standard Raw IO or ASMLib IO that should support async IO.

  • Christo Kutrovsky says:

    Right, a FTS cannot use 8 mb or 16 mb, but RMAN can. Actually it will fire a number of 1 mb requests, and those will benefit from the fact that there’s 8mb in one chunk.

    Yes, that’s right, it doesn’t actually go through the ASM instance. The ASM instance only reports where the segments begin and end. But you understand what I meant.

  • Azhar says:

    Maximize the number of disks in a disk group for maximum data distribution and higher I/O bandwidth.

    It is better to have more than one LUN in one disk group.

  • Christo Kutrovsky says:

    Azhar,

    Are you quoting the documentation? Because that’s the point of my blog post: it’s not necessarily better to have multiple disks. It really depends on your workload and your SAN’s features (such as multi-stream read-ahead).

  • Max says:

    Chris,
    I wanted to clarify a few things regarding your comment “because ASM does not read in blocks of its stripe size”.

    When you say ASM is using a stripe size of 1MB (the default for datafiles etc.), doesn’t it issue IO requests in multiples of 1MB? What should I see in the iostat data? Shouldn’t bps/tps = 1MB for all the requests?

    Can you please clarify with some simple tests/examples?

    In my AIX environment I see less than 1MB for the IO request size (i.e. “bps” divided by “tps” from iostat).

    -Max.

  • […] It’s a combination of good managers, thrust, and desire for performance. And you know what? They are getting their 400 Mb/sec. The new server is reaching 800 with the dual 4gbit fibre […]

  • Vijay says:

    Kutrovsky,

    IO also depends on the SCSI queue depth and HBA card parameters.

    A simple test example for SCSI IO waits: take one LUN of 500GB, where the device is /dev/sda on Linux and the default SCSI queue depth is 32. Any requests beyond 32 at a given time will be queued.
    If the same 500GB is divided into 5 LUNs (sda, sdb, sdc, sdd, sde), then the same disks are capable of 32*5 requests at a time.
    Tuning at both the SCSI layer and the HBA layer will give better performance.

    -Vijay

    • Vijay,

      A deeper queue is not necessarily an advantage. It really depends on the use case. I wouldn’t go to a 5 LUN setup, if I can have one, for a 5% disk IO improvement in a corner case.

      Remember, to fill up 5×32 IOs, you need to have 160 sessions requesting IO at the same time.

      • goran says:

        To my knowledge, queue depth is not related to the number of sessions but rather to the number of IO requests issued … taking ASYNC IO into account, it can take far fewer than 160 sessions to fill up a 5×32 queue depth.

        goran

  • Tina says:

    Question: I am the SAN administrator and was forced to give disk to the Oracle team at a time when I was really low on disk. They had a Raid-5 array with 5 disks they were using for a file system. The only disk I had available was a Raid-5 array that had 3 disks. I carved up the disk and presented it, and the Linux admin mounted it as an ASM volume. Now, they are complaining about database backups taking twice as long. I admit that the spindle count is super low — I would never have carved it up with so few spindles. We are trying to work out a strategy to give them back disk Raid-5 with 5 disks. But, my concern is ASM vs. regular file system. Is there any Oracle tuning that needs to take place in order to operate optimally while using ASM disk? I am not sure if our dba’s know how to tune the database, so if you could suggest a couple of links and give me a brief overview, I’d be grateful.

  • Hello Tina,

    There is nothing specific to do on the SAN side for ASM.

    For the database, it’s best to give dedicated spindles, in order to avoid interference from other databases/applications.

    Oracle has great performance accounting tools, but if other applications are touching the same disks, Oracle won’t be able to tell this.

    On the flip side, it is very easy to resolve your situation. All you have to do is give them another LUN with the 5 spindles, and ask them to add the new one and drop the old one.

    ASM will automatically move the data from the old one to the new one (online) and once finished, you can de-allocate the old lun. I recommend a reboot to ensure all file descriptors are closed.

    Whether this will solve your DBA team’s backup timing I cannot say.

    There are best practices for ASM, in particular about aligning the partitions on the LUNs. See my other blog posts for details.

  • Bestin says:

    Hi,

    We have an Oracle RAC cluster on HP-UX. There is one 900 GB database on a single 1.2TB LUN. This is on an EVA 4100.
    We are planning to move this data by ASM rebalancing. We will connect the new SAN (EVA 6400) with 100GB * 15 LUNs. We need to disconnect the old SAN (EVA 4000). All activity has to be completed in 24 hours. Is it possible?

    Thanks
    BEstin

    • I don’t think you should be asking this question on the internet. Rather, you should prepare a test case and run it.

      If you are to move 900GB over 24 hours, that averages about 10 MB/sec – sounds doable to me.

      But the entire operation is online, so I don’t see the reason for the time constraint.
