Jun 20, 2009

When SANs Go Bad

They sometimes go bad in completely unpredictable ways. Here's a problem I have now seen twice in production. A host boots up nicely and mounts file systems from the SAN. At some point the SAN connectivity (for example, a Fibre Channel switch or controller) fails in such a way that the storage goes away but the file system still appears visible to applications.

This kind of problem is an example of a Byzantine fault, where a system does not fail cleanly but instead starts to behave in a completely arbitrary manner. It seems that you can get into a state where the in-memory representation of the file system inodes is intact but the underlying storage is non-responsive. The non-responsive file system in turn can make operating system processes go a little crazy. They continue to operate but show bizarre failures or hang. The result is problems that may not be detected, much less diagnosed, for hours.

What to do about this type of failure? Here are some ideas.
  1. Be careful what you put on the SAN. Log files and other local data should not go onto the SAN. Use local files with syslog instead (a minimal sketch follows this list). Think about it: your application is sick and trying to tell you about it by writing a log message to a non-responsive file system. In fact, if you have a robust scale-out architecture, don't use a SAN at all. Use database replication and/or DRBD instead to protect your data.
  2. Test the SAN configuration carefully, especially failover scenarios. What happens when the host fails over from one path to another? What happens when another host picks up the LUN from a "failed" host? Do you have fencing properly enabled?
  3. Actively look for SAN failures. Write test files to each mounted file system and read them back as part of your regular monitoring (see the probe sketch below). That way you know that the file system is fully "live."
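
For the first idea, here is a minimal sketch of what logging to local syslog might look like from application code. It uses Python's standard SysLogHandler; the logger name "myapp" and the /dev/log socket path are assumptions you would adjust for your own application and platform.

    import logging
    import logging.handlers

    # Send log messages to the local syslog daemon instead of writing them
    # to a file that might live on SAN-backed storage. "/dev/log" is the
    # usual syslog socket on Linux; adjust for your platform.
    logger = logging.getLogger("myapp")
    handler = logging.handlers.SysLogHandler(address="/dev/log")
    handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("application started")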
The last idea gets at a core issue with SAN failures: they are rare, so they're not the first thing people think of when there is a problem. The first time this happened on one of my systems it was around 4 a.m. It took a really long time to figure out what was going on. We didn't exactly feel like geniuses when we finally checked the file system.
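
To make the third idea concrete, here is a minimal sketch of such a probe in Python. The mount points, timeout, and probe file name are assumptions; wire the result into whatever monitoring system you already use. The probe runs in a child process so that a write hung on dead storage stalls the child rather than the monitor itself.

    import os
    import time
    import multiprocessing

    # Hypothetical SAN mount points and timeout; adjust for your environment.
    MOUNTS = ["/san/data01", "/san/data02"]
    TIMEOUT_SECS = 15

    def probe(mount):
        # Write a small file, force it to disk, read it back, and clean up.
        path = os.path.join(mount, ".san_probe")
        payload = str(time.time()).encode()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        with open(path, "rb") as f:
            if f.read() != payload:
                raise RuntimeError("read-back mismatch on %s" % mount)
        os.remove(path)

    def check(mount):
        # Run the probe in a child process. A write against dead storage can
        # block in uninterruptible sleep; then the child hangs, not the monitor.
        p = multiprocessing.Process(target=probe, args=(mount,))
        p.start()
        p.join(TIMEOUT_SECS)
        if p.is_alive():
            p.terminate()
            return False
        return p.exitcode == 0

    if __name__ == "__main__":
        for m in MOUNTS:
            print("%s: %s" % (m, "OK" if check(m) else "NOT RESPONDING"))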

SANs are great technology, but there is an increasingly large "literature" of SAN failures on the net, such as this overview from Arjen Lentz and this example of a typical failure. You need to design mission-critical systems with SAN failures in mind; if you can't, consider avoiding SANs entirely.

6 comments:

Anonymous said...

Can you explain a bit more about your fabric? Did you do multipathing? What kind of cards, switches, and on what OS?

I've generally found a well-architected SAN to be no better or worse than RAID hanging off of a local controller.

Arjen Lentz said...

Among the myriad of problems around SANs and their causes:
- companies use SANs as a backup strategy.
- The word "SAN" has an implicit "enterprise" label on it, so it must be good.
- In many people's minds, SANs can't/won't fail. I kid you not.
- SANs are often difficult to monitor; they tend to insist on you just receiving SNMP traps. So, when the SAN dies, it'll let you know. All extra foo in there aside, there's something fundamentally wrong about that logic.

@anonymous if it's no better, why does it cost more ;-)

Robert Hodges said...

@anonymous
I don't have full technical details as I was diagnosing the software problems, but both systems were (as far as I know) dual-pathed using Fibre Channel switches. In the first case the failure was on Solaris, in the second on Linux. In both cases the problems seem to have arisen as a result of improperly handled failures on the SAN.

I have also used RAID but never seen this type of behavior, though obviously RAID fails as well. The problem, as I mentioned, is that for non-specialists like most of us these failures are rare enough that it's hard to generalize.

Anonymous said...

Huh. Thanks for scaring me, then! We're doing fairly textbook architecture on decent hardware (MDS switches, NetApps, etc.). Even in my testing, we never saw a case where all hell didn't break loose when we ripped out fiber cabling or power from the switches or the NetApps.

It's nice to know that there are some cases where the OS freaks out and DTWT.

Thanks for the reply!

Robert Hodges said...

@anonymous
You are most welcome.

Not to pile on the horror stories, but here is another one I experienced with my applications: we had a power failure on the fabric that took down the Fibre Channel switches. Solaris applications not only did not notice the problem but kept writing to the fabric. We lost 20 minutes of data in Oracle before anybody noticed a problem. This was a while back and may not be a problem for newer technology (we were using Brocade switches at the time), but it's another illustration of the surprises that await the unwary.
