De-Confusing SSD (for Oracle Databases)

Nov 22, 2011 / By Gwen Shapira

Tags: ,

You never forget your first SSD.
For me, the first time I really *noticed* SSDs was when one of my customers encountered serious corruption on one of their databases and we had to restore an entire database. It was not a small database, around 300G in total file size. After I started RMAN restore and recovery process, the customer asked the inevitable question: “How long will this take?”. I replied that I’m not familiar with the performance of their storage, but from my experience a restore of this size can be expected to take 5 hours. Imagine my surprise when the restore was done after an hour.

This was enough to convince me that SSD is magic, and that if you have money and IO problem, just go SSD. Of course, if that was the end of the story, I wouldn’t have much of a blog post.

When I tried to share my excitement with other DBAs, I found out that SSD is actually kind of scary. Soon I became extremely confused by everything I’ve seen and heard:

SSD is fast for reads, but not for writes. Its fast for random writes, but not for sequential writes. You shouldn’t use it for redo, except that Oracle do that on their appliances. SSD gets slow over time. SSD has a limited lifespan and is unreliable. Performance depends on exactly which SSD you use. You can have PCI or SATA or even SAN. You can use SSD for flash cache, but only specific versions, maybe. You can have MLC or SLC. It can be enterprise or home grade.

With every conversation, the confusion grew. Until I finally had enough, and also had some spare time, and could sit down and untangle the web. Here’s what I found out:

First, there are two types of SSD – Flash based (also called NAND flash) and RAM based. For all practical purposes, RAM based doesn’t actually exist, and is used to confuse DBAs a bit more. Just ignore it for now – all SSDs are NAND flash.

NAND flash SSDs arrive in two flavors – multi-level (MLC) and single-level (SLC). It sounds like multi-level is better, but that’s wrong. MLC is slower and cheaper, SLC is faster and more expensive. Enterprise-grade SSDs are SLC.

SSD’s base memory unit is a cell, which holds 1 bit in SLC and 2 bits in MLC. Cells are organized in pages (usually 4k) and pages are organized in blocks (512K). Data can be read and written in pages, but is always deleted in blocks. This will become really important in a moment.

SSD is indeed very fast for reads. SSD can read 4K page in 25 *microseconds*. Remember that with spinning disks, 1ms read is very reasonable, and significantly longer read times from SAN storage. SSD is around 4 orders of magnitude slower than RAM and 3 orders of magnitude faster than spinning disks.

SSD is also very fast for writes. 4K block write takes 250 microseconds. Much slower than SSD reads, but much faster than writing to magnetic disks.

However, for SSD, almost all write-time statistics are misleading.

SSD is much faster than spinning disk for random writes on new media but it doesn’t have much advantage when it comes to sequential writes. The main advantage of SSD over magnetic disks is that it doesn’t have to spin. Writing to any area on the device is just as fast. When the magnetic disk doesn’t have to spin either because it is writing all the data to the same location, SSD is not much faster. This is why DBAs do not advise placing redo logs on SSD – it is all sequential writes, so there won’t be any performance improvement. Performance may actually degrade if the redo logs were places on a volume with large number of underlying disks. SSD’s parallelism is limited to the number of channels the device supports, usually fewer than 16.

The second problem with SSD is writing is *erasing* is very slow. Erasing is very slow because even if you want to delete one page, the SSD can only delete the entire block. The controller needs to read the entire block, erase everything and write back only the bits you want to keep. To make things even more fun, each SSD cell can only be written a certain number of times before it becomes unusable.

To keep the re-write overhead low, manufacturers use several techniques:

  • They use every cell before resorting to deleting existing cells
  • They will over-provision the SSD (i.e. build 2G SSD but only show the OS 1.8G), so writes can be completed quickly using the “spare space” and then the required deletes can be completed asynchronously in the background.

As you can see, this requires the controller to be pretty smart and maintain “free lists” of pages that can be used for writing, and pages that need cleaning. This is one of the major differences between different SSD devices – how well the controller manages the writing and erasing cycles on the device.

To compare SSDs, you can check the “write amplification” number for each device. This number shows how many Mbytes are actually written when you attempt to write 1M (on average). The extra writes are due to the cleanup processes involved. Devices with amplification factor that is closer to 1 will have better write performance.

When looking at SSD benchmarks, it is important to remember that writing on clean device is faster by definition than writing on used device, so make sure you are looking at the right numbers. If you run your own benchmarks, expect throughput to drop significantly in the first 30 minutes of the test, and only consider the numbers you see following the initial drop.

Of course, once the device is nearly full, the controller has much harder time finding free pages and it has to perform more cleanup on every write. This is the SSD version of fragmentation. When considering benchmarks, also check how full the device was while the benchmark ran.

The SSD performance blog is an excellent resource with many well-ran SSD benchmarks. You can use their results, but its even better to learn from their methodology.

The limit on how many times a cell can be used is not as scary as it initially sounds. On enterprise grade SSD cells can take around 1M writes, which is usually enough to last around 20 years. Since most disks aren’t expected to last over 3 years anyway, I wouldn’t worry about that. Except that – this estimate assumes that you use all parts of the at a uniform rate. This isn’t a natural pattern of using data – we usually use some of the data much more frequently than we use other data. The SSD controller usually contains logic to make sure it spreads the write-erase activity around regardless of how you use the data.

Devices also differ in the interfaces they offer. The common interfaces are either PCI-E or SATA. PCI-E is faster, offers more channels (better parallelism) and is more expensive. SATA is cheaper and fits into more legacy hardware.

Now that we know a bit about how SSDs work and how to choose your SSD device, its time for the important question: How do I use it to make my Oracle Database faster. Here are several options, from most effective to least recommended:

* If you are not IO-bound, re-consider your decision to use SSD. If you don’t see significant number of physical reads and sequential read wait events in your AWR report, you won’t notice much performance improvements from using SSD.

* If you can afford it, put all your data files on SSD. This is a quick way to make sure you get excellent read and write performance.

* If you have more data than space on your SSD, you want to place on the SSD segments that will benefit most from its random-read and random-write performance. Look for segments where you do most of the reading and writing, prefer segments with more reads than writes, and those that have sequential read (i.e. random access) to scattered read (i.e. full scans). I have some queries that I use to pick the segments, and I also use the statistics in AWR reports.

* If the most-used segments are the current partition in a partitioned table (a very frequent scenario), you’ll want to write a job that will create the partition on the SSD, but move it to the magnetic storage when it stops being active.

* If you can’t decide on which segments to use, you can use your SSD as a secondary cache to your SGA. The feature is called “Database Smart Flash Cache”, not to be confused with “Exadata Smart Flash Cache”. When you enable this feature, blocks that are evicted from the SGA are written to the SSD, where they can later be retrieved much faster than they can be from disk. This means that writes will not get any performance benefit from SSD, in the same way that a larger SGA won’t benefit writes either.

If your system is IO-bound and read-heavy, this is a very easy way to benefit from a smaller SSD without deeply inspecting your workload patterns and moving segments around. However, according to Guy Harrison’s tests, the performance benefits from using SSD this way are not as impressive as selecting the right tables and placing them on SSD.

Oracle’s white paper on the feature says: “The Sun Storage F5100 Flash Array and the Sun Flash Accelerator F20 PCIe Card provide a natural fit for Oracle Database Smart Flash Cache”, which led some people to believe that these are the only devices on which the feature is supported. This is not true – Oracle supports Database Smart Flash Cache on any SSD, they just recommend using their own hardware.

* Placing redo logs on SSD is not recommended. Exadata now has a “Smart Flash Logging” feature that uses redo logs on SSD. Note that it uses redo logs *also* on SSD. This feature allows Oracle to write redo in parallel to a file on SSD and a file on the magnetic disk, and finish the operation when one of the calls is successful. This is a no-lose situation that can only improve performance. This is not the case if you place your redo logs on both SSD and magnetic disks yourself, in which case the redo write will finish when the slowest device finishes writing, making it a no-win situation.

Even if you place all copies of your redo logs on the SSD, you are still likely to see performance degradation. This is expected because as we discussed before, SSD doesn’t have any performance benefit for sequential writes of the type redo logs usually experience, and it may even end up being slower. Our experience at Pythian with customers attempting to place redo logs on SSD confirms this issue, and so do the tests ran by Guy Harrison.

I hope this post is useful for someone and that I’m not the only DBA who just finished catching up on the entire SSD story. There is also the possibility that my understanding of SSD is still incomplete, in which case I’ll be happy if someone steps up and corrects me.

For reference, these are the sites I used to learn more on SSD:

http://en.wikipedia.org/wiki/Solid-state_drive

http://www.anandtech.com/show/2829

http://en.wikipedia.org/wiki/Flash_memory#NAND_flash

http://www.ssdperformanceblog.com/

http://guyharrison.squarespace.com/ssdguide/

http://www.storagesearch.com/ssdmyths-endurance.html

38 Responses to “De-Confusing SSD (for Oracle Databases)”

  • Excellent post, Gwen! I put my SSD’s to full use, but agree with no redo should live on them, absolutely love temp on it!

  • Jeff Holt says:

    “If you don’t see significant number of physical reads and sequential read wait events in your AWR report, you won’t notice much performance improvements from using SSD.”

    This almost sounds like you’re saying that if AWR shows “physical reads and sequential reads wait events” ARE significant, then you WILL noticed much performance improvement.

    That might be true if those events meant what you think they mean and only if reading doesn’t dominate some business transaction (e.g., restore) that’s being monitored.

    All those events mean is that an operating system function call was made. It doesn’t mean that the function call had to go to disk. The reads could have been satisfied by the filesystem cache.

    The information needed to determine the number of reads that come from disk is simply not stored in AWR. In fact, it’s not really stored anywhere (not even an extended sql trace file). Certainly, if you knew which reads finished in less that 500 microseconds, then you would probably know which ones went to disk and which ones did not go to disk. But the only way you can know that information with any degree of certainty is to trace the session.

    Respectfully.
    Jeff

    • Gwen Shapira says:

      Thanks Jeff. I really meant what I said: If you don’t see an IO bottleneck, you won’t benefit from SSD.
      Thanks for clarifying that the reverse doesn’t always hold. We recently saw this again with one of our customers.

      Physical IO are not always very physical :) And the workload type is also an important consideration.

      Now if Method-R can release a tool that advises on SSD decisions based on trace files…

      • Greg Rahn says:

        “I really meant what I said: If you don’t see an IO bottleneck, you won’t benefit from SSD.”

        That really isn’t true though is it? One may not be at a “bottleneck” but if you make something take less time, it can make things go faster. For example, maybe my storage yields 5ms IO times, but SSD yields sub 1ms. IOs at 5ms we’re necessarily a bottleneck but now my operations are 5X faster.

        • Gwen Shapira says:

          It sounds like in your scenario everything was working fine, but SSDs added extra oomph.

          I had in mind a customer where CPU was a serious bottleneck during high-load hours. Moving to SSD didn’t improve performance, it just caused longer run-queues and more “log file sync” waits.

          • Greg Rahn says:

            That makes sense — if a workload is CPU bound and you reduce IO times it will be “more” CPU bound. It’s akin to crashing a car into a wall — SSD just accelerated the speed at which the crash happens.

    • Greg Rahn says:

      “But the only way you can know that information with any degree of certainty is to trace the session.”

      Can’t one apply the same logic to ASH or AWR data — why do you need trace data? One can basically chose the same reasonable threshold in which rotating rust could return an IO, if it is below that threshold, it didn’t come from the platter, if it was above it, it did.

    • Jeff is right about the possible file system buffering effect. So if we are going to talk about possible cache hierarchy mucking up an AWR report we should mention array head cache too. Both will push waits well below 1ms. Jeff is right that AWR is telling you that a system call had to be made not what happened downwind to artificially speed up the wait. The converse is true. Your db file sequential reads above, say, 10ms can often be a reflexion of time spent in runable state waiting to get on a CPU.

      So to make the AWR numbers useful, let’s all join the 21st century and make sure we are using direct I/O (takes care of Jeff’s point) and for my point make sure you examine CPU contention during periods of high I/O wait. If the conditions intersect then part of the I/O wait is CPU starvation. You then have to drop to I/O stats at the OS level and examine I/O queue depth and wait times in such tools as iostat or better yet collectl

  • Gregory says:

    Interesting thoughts considering ODA FAQ seems to advise, at least from my understanding, to store redologs on SSD. It says “There are also four 73GB solid state disks for
    high performance processing of redo logs.”

  • Gregory says:

    Guy Harrison blog sounds more like a question than a *not recommended* scenario to me. Not everybody runs a 36 SAS superfast disk quarter exadata rack or the latest EMC.

    • Gwen Shapira says:

      One of my customers moved their redo logs from NFS with SATA drives (32 disks, I think). Performance degraded a bit.

      I’ve yet to see a case where performance improved as a result of moving redo to SSD.

      • Hi Gwen

        I really appreciate this article and the comments.

        I wonder if SSD would benefit redo logs on smaller, busy system, or at least those with fewer spindles available to the DB. If the writing to redo logs is writing to the same physical storage as is used to support other files/activities, their workload would cause the heads to seek away from the redo log data and then have to seek back. Testing is nearly always done on either dedicated test systems or shared test systems with low concurrent workload. Using a set of cheaper, smaller SSDs dedicated to the redo might well work in those situations?

        Testing on the usual sort of test rigs I’ve seen would not have concurrent load that would cause the contention (at disk level) with the redo logs that I am suggesting. However, I would not be surprised to see SSDs reducing this contention issue in live situations with shared storage. The problem with shared storage is that it’s, well, shared!

        • Gwen Shapira says:

          Hi Martin,

          Keeping in mind that SSD are still significantly more expensive than spinning disks, why not get two HDDs, use them as dedicated storage for the redo, and solve the moving head problem this way?

  • Hi Gwen,

    A few points if you’ll allow.

    1. Memory has not been microsecond-latency since before the turn of the century. I just blogged a bit about that point here: http://goo.gl/vKvoH

    2. I feel you left something important off your list of possible paths to SSD and that would be EMC FAST. Not here to sell anything, but these arrays do figure out what to dynamically place on SSD and it isn’t redo…and it (FAST) works.

    3. NEVER, EVER use Database Smart Flash Cache (DSFC) until you’ve maxed out the DRAM supported by your platform. Those utterly stupid Oracle whitepapers show tiny little SGAs on QPI-based servers augmented with FLASH. Oh, and the workloads in the one paper I have in mind is nearly read-only. Foolishness. Find me a production system that benefits from DSFC. I dare ya :-)

    4. Redo is not a disk problem (http://goo.gl/kZ900). The reason Exadata needed the Flash redo “feature” is because it can’t scale random writes. Redo was gets starved out. Yes, I said Exadata cannot scale random writes. The product offers 25,000 random WIOPS (net after normal redundancy ASM) on a full rack. That’s a ridiculously low ceiling on random writes for a 96 processor system.

    • Gwen Shapira says:

      About #1 – Ouch. Yes, I think end of last century is where I got my memory numbers. I did not realize that RAM changed a bit since. Looks like we are nicely in the low nanoseconds now. Thanks for the correction and I’ll update my post.

  • Simon says:

    Interesting Gwen… Was just recently testing the effects of SSDs on performance on our 1/4 Rack Exadata, and what you say here is pretty much what I’ve seen. I have seen very poor writes for the SSDs compared to normal disk. Wasn’t expecting that, but this post has help me to understand it a lot better. Thanks!

  • At one point I was investigating the possibility of leveraging thumb drives for a performance boost when building VMs on laptops. Found this site – not so much about Enterprise-grade SSDs, more about the consumer-grade pocket glash drives… but I still think it’s a very helpful and informative resource for the whole technology:

    https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey?action=show&redirect=WorkingGroups%2FKernelConsolidation%2FProjects%2FFlashCardSurvey

  • alex says:

    Re: SSDs for redo. I did my own test and can confirm that SSDs are pointless for redo… on my configuration :)

    http://alexodba.blogspot.com/2011/11/ssd-efd-devices-for-oracle-redo.html

  • Fuad Arshad says:

    A very well written post indeed. We actually went thru an exercise of determing read heavy tables and did see a significant improvement .

  • Luc says:

    sequential read (i.e. random access) to scattered read (i.e. full scans)

    The other way around.

    • Gwen Shapira says:

      No, it isn’t :)
      A bit confusing, but when Oracle accesses data through an index, its random access on the disk, but called “sequential read” in Oracle events, because the data is placed sequentially in memory. When Oracle is performing a full scan of table or index, the access on the disk is sequential, but the wait event is “scattered read” because the data is then scattered through memory.

      See AskTom for more detailed explanation: http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:2613874700346162097

      • >> when Oracle accesses data through an index, its random access

        …let’s not forget Index Fast Full Scan which is large, sequential, PGA-buffered and, for those who are interested, off-loadable to Exadata via Smart Scan.

      • >because the data is placed sequentially in memory.

        Actually that is the polar opposite of what happens. A db_file_sequential read is buffered in the SGA which is a list of scattered, aged buffers. That is, in fact, why scattered reads into the SGA (non-contiguous multi-block read or ncmbr) is such a difficult thing to get right.

        If “sequential in memory” is a term that alludes to physically contiguous RAM, that is a totally different topic. Just suffice it to say that Oracle on commodity platforms (with Unix derivations) never gets physically contiguous memory. They might on Windows and some aged legacy Unix environments did offer such optimizations but those were often times application skin in the game because the I/O device drivers had so much trouble with scatter/gather DMAs. That’s all a long time ago.

  • Cary Millsap says:

    Luc: It’s easier to understand (and remember) once you know that an Oracle “sequential read” isq a pread call, and an Oracle “scattered read” is a readv call. Once you read the man page for readv, all becomes clear, and you’ll know why the Oracle kernel developers named these timed events the way they did.

    • Cary, I hesitate to sound argumentative, but readv() and pread() are specific library routines available only on certain platforms and not used in all cases even on platforms that offer the routines.

      It’s up to the port (OSDs) to implement the underlying code to satisfy skgf*() routines. There are async port requirements for both single block and vectored (multiblock) and multi-block+multi-FD requirements. There are variable like whether filesystemio_options is set for asynchronous I/O. There’s the variable of whether or not a particular file in a particular file system with particular open args (direct, etc) can be accessed by a particular rouinte (some don’t support async I/O unless buffered with certain memory characteristics, etc). The branch-code in the OSDs can get pretty complex.

      Throughout Oracle history, ports have wandered through BSD, POSIX listio, LibC and hard-wired vendor-specific I/O libraries like we had at Sequent. Then there’s Oracle Disk Manager I/O library (Veritas, PolyServe), ASMLib and Direct NFS (leverages Oracle Disk Manager library) and libcell (Exadata iDB) all of which show nearly sanskrit-like calls to the OS (to the untrained eye) for I/O. Some implementations used vendor libraries to wrap readv() to implement multi-FD asnync vectored, etc, etc. And, of course, let’s not forget Oracle’s 3rd largest revenue platform: Windows where the calls are greek to me but certainly not LibC (pread()/readv()).

      The history behind the moniker db_file_sequential_read for random single block is that the *calls* occur sequentially in the foreground. Quite an unfortunate moniker considering the importance of the term “sequential” in an I/O context.

      So, after this moment of nightmarish reminiscence back to my days of toiling with port-level, highly-optimized, multi-file-descriptor asynchronous I/O I’ll crawl back into my hole and toil over Symmetrical MPP DW/BI here at EMC Data Computing Division :-)

  • Vishal Desai says:

    Q1 “SSD is much faster than spinning disk for random writes on new media but it doesn’t have much advantage when it comes to sequential writes.”

    — If we put only temporary tablespaces on SSD’s for large warehouse database,’direct path write temp’ with block counts(p3)=1 will be faster but ‘direct path write temp’ with higher block counts(p3) values wait times will not benefit from SSD?

    Q2. The second problem with SSD is writing is *erasing* is very slow.

    — Please explain how erasing will work with temporary tablespaces placed on SSD drives.

    Q3. Where does wait times fit into deciding what to put on SSD drives?

    Can I put tables/partitions with large sequential scan (scattered in database term) and high wait times on SSD considering bottlenecks are at Disk/SAN level?

    — segments with sequential scan (scattered in database term) with wait times 25ms-40ms – good candidate for SSD
    — segments with sequential scan (scattered in database term) with wait times <10ms – not so good candidate for SSD

    Thanks,
    Vishal

  • […] recently read an interesting article titled De-Confusing SSD (for Oracle Databases) that inspired several interesting comments.  I think that the following two statements in the […]

  • Karl Flanagan says:

    Great post Gwen, thanks.

    I see there is now TLC (Triple Level Cell) NAND coming out early next year which will increase the storage capacity of the cells by 50%.

  • […] for Oracle Gwen Shapira has investigated SSDs and Flash and published her findings as “De-Confusing SSD (for Oracle Databases)“.   The post *and* the comments are worth a […]

  • Mike says:

    Actually DDR SSDs have been around for years and have seen very successful implementations where redo logs cause a majority of wait events. Also, with features such as special log luns designed to enhance write speeds at the costr of space and native 512 byte access modern SSDs can be used quite successfully for redo logs.

  • […] Gwen Shapira: “De-Confusing SSD (for Oracle Databases)” […]

  • Matt Morris says:

    It still very much depends on how the NAND(flash) is integrated and deployed. SSDs, typically found in a disk drive form factor do not deliver the same level of performance as other engineered platforms. How the NAND is aggregated, the controller technology leveraged to manage it. Similar to Kevin, not trying to sell …, read the technology page on violin’s website and then do the same for other technologies. While it still NAND, SSD is a different form factor and as such will/does have different performance characteristics than other Flash implementations.
    A review of TPC, VMWare benchmarks will show how all-flash arrays can used for all data types under load. I have also published results using Kevin’s SLOB kit for testing our arrays with low-end 2 socket intel servers. Others at my company have also blogged about the use of SLOB generating processing massive amounts redo to an-all flash array.

    Like most technologies “it depends”, while I do support hybrid implementation models – your overall system design still needs to be balanced, if a portion of your Storage platform is operating at 1millisecond. Some operations will be limited to the slowest component.

    And yes, CPU bound applications, still benefit from extremely low-latent devices…. Perhaps just not how you might expect them too… :).

    • Gwen Shapira says:

      Completely agree on disk drive form. I didn’t cover this in the blog post, but I do talk about this in my presentation (http://www.slideshare.net/gwenshap/ssd-collab13).
      Basically disk drive form has severe performance drawbacks, SATA/SAS-on-PCI seem to still be the majority of the drives, and I did not have a chance to experiment with SAS-on-PCI vs. pure PCI (of the type Fusion-IO is now promoting).

      SSD-in-SAN are another story with their own set of pros and cons. I’m not a big fan of reducing disk latency just to add network latency, but for customers with RAC, they are the only game in town. As you said, it depends :)

      Thanks for sharing your observations.

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>