Posted by Alex Gorbachev on Feb 26, 2010
A couple of weeks ago I wrote a short blog post about SAN storage failures and how people are blinded by all the bells and whistles that are supposed to make storage arrays 100% reliable and failsafe. My conclusion was that there is no way to avoid storage failures, and that a better approach is to anticipate those failures and be ready to handle them with minimal service impact.
I referenced a wake-up call from the CTO of an Australian hosting company. Let me quote it again:
The outage, blamed on an IBM storage array, saw the company’s chief technology officer promise “significant changes to the way we deploy and manage our storage environment”.
Today, I stumbled across another article that shows their solution to the storage reliability problem. From Melbourne IT on $18m Oracle revamp:
… to improve the reliability of its operational support systems at a cost of $7 million over three years, which has also seen it switch storage vendors from IBM to EMC. Data corruption that had occurred on its IBM storage systems were blamed for a several day outage experienced at the company’s WebCentral web-hosting business.
So we see that, instead of learning the right lesson, they concluded, “This IBM storage stuff isn’t reliable, and the EMC sales folks convinced me that theirs is better. Now my storage will not fail.” The “significant changes to the way we deploy and manage our storage environment” amounted to a mere change of vendor.
Well, data recovery services will be flourishing!
Posted by Alex Gorbachev on Feb 11, 2010
How many times have we heard storage administrators assure us (fueled by the SAN vendor’s claims) that their top-of-the-line SAN arrays simply cannot fail? Unfortunately, reality proves this wrong, and we see it regularly with our customers.
At the time of writing, one of our DBA teams has just completed a failover to the standby database after a database crash caused by a SAN issue. A few hours have passed, and parts of these databases are still not available on the former primary host, but traffic is being handled just fine on the standby. This customer provides SaaS-type services. Imagine what hours of downtime would do to them and their clients.
Unfortunately, people get bitten by this overestimated (god-like I’d say) SAN reliability. It must, however, be said: SANs do fail!
Do you want a wake-up call like this one for your executives?
The outage, blamed on an IBM storage array, saw the company’s chief technology officer promise “significant changes to the way we deploy and manage our storage environment”.
Since I mentioned one Australian example, here is one more storage failure scenario described by our friends at Open Query. There are many cases from literally any industry, and some of them are rather complicated while others are just plain obvious.
Is there a silver bullet? Well, not as a solution, but as a concept, yes: simply admit that SANs do fail. This is what should drive infrastructure design for business continuity. Actually, I should extend it to a broader design principle, that everything fails, but that’s another story.