De-Confusing SSD (for Oracle Databases)
You never forget your first SSD.
For me, the first time I really *noticed* SSDs was when one of my customers encountered serious corruption on one of their databases and we had to restore an entire database. It was not a small database, around 300G in total file size. After I started RMAN restore and recovery process, the customer asked the inevitable question: “How long will this take?”. I replied that I’m not familiar with the performance of their storage, but from my experience a restore of this size can be expected to take 5 hours. Imagine my surprise when the restore was done after an hour.
This was enough to convince me that SSD is magic, and that if you have money and IO problem, just go SSD. Of course, if that was the end of the story, I wouldn’t have much of a blog post.
When I tried to share my excitement with other DBAs, I found out that SSD is actually kind of scary. Soon I became extremely confused by everything I’ve seen and heard:
SSD is fast for reads, but not for writes. Its fast for random writes, but not for sequential writes. You shouldn’t use it for redo, except that Oracle do that on their appliances. SSD gets slow over time. SSD has a limited lifespan and is unreliable. Performance depends on exactly which SSD you use. You can have PCI or SATA or even SAN. You can use SSD for flash cache, but only specific versions, maybe. You can have MLC or SLC. It can be enterprise or home grade.
With every conversation, the confusion grew. Until I finally had enough, and also had some spare time, and could sit down and untangle the web. Here’s what I found out:
First, there are two types of SSD – Flash based (also called NAND flash) and RAM based. For all practical purposes, RAM based doesn’t actually exist, and is used to confuse DBAs a bit more. Just ignore it for now – all SSDs are NAND flash.
NAND flash SSDs arrive in two flavors – multi-level (MLC) and single-level (SLC). It sounds like multi-level is better, but that’s wrong. MLC is slower and cheaper, SLC is faster and more expensive. Enterprise-grade SSDs are SLC.
SSD’s base memory unit is a cell, which holds 1 bit in SLC and 2 bits in MLC. Cells are organized in pages (usually 4k) and pages are organized in blocks (512K). Data can be read and written in pages, but is always deleted in blocks. This will become really important in a moment.
SSD is indeed very fast for reads. SSD can read 4K page in 25 *microseconds*. Remember that with spinning disks, 1ms read is very reasonable, and significantly longer read times from SAN storage. SSD is around 4 orders of magnitude slower than RAM and 3 orders of magnitude faster than spinning disks.
SSD is also very fast for writes. 4K block write takes 250 microseconds. Much slower than SSD reads, but much faster than writing to magnetic disks.
However, for SSD, almost all write-time statistics are misleading.
SSD is much faster than spinning disk for random writes on new media but it doesn’t have much advantage when it comes to sequential writes. The main advantage of SSD over magnetic disks is that it doesn’t have to spin. Writing to any area on the device is just as fast. When the magnetic disk doesn’t have to spin either because it is writing all the data to the same location, SSD is not much faster. This is why DBAs do not advise placing redo logs on SSD – it is all sequential writes, so there won’t be any performance improvement. Performance may actually degrade if the redo logs were places on a volume with large number of underlying disks. SSD’s parallelism is limited to the number of channels the device supports, usually fewer than 16.
The second problem with SSD is writing is *erasing* is very slow. Erasing is very slow because even if you want to delete one page, the SSD can only delete the entire block. The controller needs to read the entire block, erase everything and write back only the bits you want to keep. To make things even more fun, each SSD cell can only be written a certain number of times before it becomes unusable.
To keep the re-write overhead low, manufacturers use several techniques:
- They use every cell before resorting to deleting existing cells
- They will over-provision the SSD (i.e. build 2G SSD but only show the OS 1.8G), so writes can be completed quickly using the “spare space” and then the required deletes can be completed asynchronously in the background.
As you can see, this requires the controller to be pretty smart and maintain “free lists” of pages that can be used for writing, and pages that need cleaning. This is one of the major differences between different SSD devices – how well the controller manages the writing and erasing cycles on the device.
To compare SSDs, you can check the “write amplification” number for each device. This number shows how many Mbytes are actually written when you attempt to write 1M (on average). The extra writes are due to the cleanup processes involved. Devices with amplification factor that is closer to 1 will have better write performance.
When looking at SSD benchmarks, it is important to remember that writing on clean device is faster by definition than writing on used device, so make sure you are looking at the right numbers. If you run your own benchmarks, expect throughput to drop significantly in the first 30 minutes of the test, and only consider the numbers you see following the initial drop.
Of course, once the device is nearly full, the controller has much harder time finding free pages and it has to perform more cleanup on every write. This is the SSD version of fragmentation. When considering benchmarks, also check how full the device was while the benchmark ran.
The SSD performance blog is an excellent resource with many well-ran SSD benchmarks. You can use their results, but its even better to learn from their methodology.
The limit on how many times a cell can be used is not as scary as it initially sounds. On enterprise grade SSD cells can take around 1M writes, which is usually enough to last around 20 years. Since most disks aren’t expected to last over 3 years anyway, I wouldn’t worry about that. Except that – this estimate assumes that you use all parts of the at a uniform rate. This isn’t a natural pattern of using data – we usually use some of the data much more frequently than we use other data. The SSD controller usually contains logic to make sure it spreads the write-erase activity around regardless of how you use the data.
Devices also differ in the interfaces they offer. The common interfaces are either PCI-E or SATA. PCI-E is faster, offers more channels (better parallelism) and is more expensive. SATA is cheaper and fits into more legacy hardware.
Now that we know a bit about how SSDs work and how to choose your SSD device, its time for the important question: How do I use it to make my Oracle Database faster. Here are several options, from most effective to least recommended:
* If you are not IO-bound, re-consider your decision to use SSD. If you don’t see significant number of physical reads and sequential read wait events in your AWR report, you won’t notice much performance improvements from using SSD.
* If you can afford it, put all your data files on SSD. This is a quick way to make sure you get excellent read and write performance.
* If you have more data than space on your SSD, you want to place on the SSD segments that will benefit most from its random-read and random-write performance. Look for segments where you do most of the reading and writing, prefer segments with more reads than writes, and those that have sequential read (i.e. random access) to scattered read (i.e. full scans). I have some queries that I use to pick the segments, and I also use the statistics in AWR reports.
* If the most-used segments are the current partition in a partitioned table (a very frequent scenario), you’ll want to write a job that will create the partition on the SSD, but move it to the magnetic storage when it stops being active.
* If you can’t decide on which segments to use, you can use your SSD as a secondary cache to your SGA. The feature is called “Database Smart Flash Cache”, not to be confused with “Exadata Smart Flash Cache”. When you enable this feature, blocks that are evicted from the SGA are written to the SSD, where they can later be retrieved much faster than they can be from disk. This means that writes will not get any performance benefit from SSD, in the same way that a larger SGA won’t benefit writes either.
If your system is IO-bound and read-heavy, this is a very easy way to benefit from a smaller SSD without deeply inspecting your workload patterns and moving segments around. However, according to Guy Harrison’s tests, the performance benefits from using SSD this way are not as impressive as selecting the right tables and placing them on SSD.
Oracle’s white paper on the feature says: “The Sun Storage F5100 Flash Array and the Sun Flash Accelerator F20 PCIe Card provide a natural fit for Oracle Database Smart Flash Cache”, which led some people to believe that these are the only devices on which the feature is supported. This is not true – Oracle supports Database Smart Flash Cache on any SSD, they just recommend using their own hardware.
* Placing redo logs on SSD is not recommended. Exadata now has a “Smart Flash Logging” feature that uses redo logs on SSD. Note that it uses redo logs *also* on SSD. This feature allows Oracle to write redo in parallel to a file on SSD and a file on the magnetic disk, and finish the operation when one of the calls is successful. This is a no-lose situation that can only improve performance. This is not the case if you place your redo logs on both SSD and magnetic disks yourself, in which case the redo write will finish when the slowest device finishes writing, making it a no-win situation.
Even if you place all copies of your redo logs on the SSD, you are still likely to see performance degradation. This is expected because as we discussed before, SSD doesn’t have any performance benefit for sequential writes of the type redo logs usually experience, and it may even end up being slower. Our experience at Pythian with customers attempting to place redo logs on SSD confirms this issue, and so do the tests ran by Guy Harrison.
I hope this post is useful for someone and that I’m not the only DBA who just finished catching up on the entire SSD story. There is also the possibility that my understanding of SSD is still incomplete, in which case I’ll be happy if someone steps up and corrects me.
For reference, these are the sites I used to learn more on SSD: