Friday, August 22, 2008

VMware and NFS on NetApp Filers

Ran across a very disturbing issue this week, one which corrupted nearly half of our virtual machines - certainly all of the ones under any load to speak of.

If you run VMware ESX 3i against a NetApp filer using storage pools defined as NFS mounts (of which a given cluster can only have 32, by the way), and you're a good Systems Engineer who follows NetApp's best-practices guidelines for configuring ESX to work optimally with their products, then you'll turn on a nifty little switch called NFS.Lock.Disable, changing the default value of "0" to "1".
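For reference, that switch lives under each host's Advanced Settings, and undoing it (which is where this story ends up) looks roughly like the following. I'm reconstructing the exact command names and option path from memory, so treat this as a sketch to verify against VMware's docs, not gospel:

```shell
# Classic ESX 3.x (service console): check the current value, then
# revert to the default of 0 (locking enabled).
esxcfg-advcfg -g /NFS/LockDisable
esxcfg-advcfg -s 0 /NFS/LockDisable

# ESX 3i has no service console; the Remote CLI equivalent is roughly
# (<host> is a placeholder for your ESX host):
vicfg-advcfg --server <host> --get /NFS/LockDisable
vicfg-advcfg --server <host> --set 0 /NFS/LockDisable
```

Either way, the change takes effect per host, so you have to hit every ESX host in the cluster.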

The document in question from October, 2007, used to be on the NetApp site (and maybe VMware's as well), but is now relegated to one of those cool sites that leech stuff from the web and hold it forever.

http://www.docstoc.com/docs/304765/Network-Appliance-and-VMware-Virtual-Infrastructure-3-Storage-Best-Practices


Then, about 10 months later (that is, 2 months ago), it changed.

http://media.netapp.com/documents/tr-3428.pdf

Suddenly, no mention of NFS.Lock.Disable.


This document on the VMware website, interestingly titled as it is, makes no mention of the setting. It's authored by NetApp as well, and is dated May 2007.

http://www.vmware.com/files/pdf/partners/netapp_esx_best_practices_whitepaper.pdf


So what gives? Why so many documents and versions on the same topic, by so many authors? Who knows. All we know is that "best practices" for VMware ESX 3 customers using NetApp and NFS storage pools was to set NFS.Lock.Disable to 1.

The problem is that VMware ESX 3i isn't what ESX 3 used to be: from the poor retention of logs (not my opinion; a statement of fact from VMware support), to the singular dependency on file locks to ensure split-brain conditions don't occur, all the way to patches that somehow slip past QC with license-detonation code included.

All of this means when you do what VMware and NetApp told you to do in order to deploy their products together successfully, you put your data at risk.

And we did, and ours was, and it sucked.

Split brain means, in essence, that the system in charge of deciding what VM goes where (VCenter in this case) is unaware or uncertain of what ESX host holds which guest operating systems. In this condition, two separate ESX hosts can be - simultaneously - running the exact same guest OS instances. The bad news is, as you might imagine, that one half of that "brain" doesn't know what the other half is doing. What results is the utter annihilation of your filesystem, and depending on where and how you keep files, a very long process of restoring to a known good state.

The symptoms are beyond bizarre. VCenter shows a guest VM as being on a different host just about every second. Opening up a VM's console may give two people completely different screens, because you're actually looking at different "real" instances of the same virtual machine. Shutting down VCenter doesn't make things better; connecting directly to the ESX host will show you a wildly fluctuating number of guests running.

The only remedy is to use some neat "unsupported" (wink wink) console commands on the ESX hosts and 'kill' the offending VMs. The faster you do that, the less badly your data will be corrupted... no matter how you slice it, it stinks.
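If you ever land in the same spot: on the unsupported console, the trick is to find the world ID of the duplicate VM and kill that world. From memory, it went something like this - check it against VMware's KB articles before you depend on it:

```shell
# List running VMs with their world IDs (this also collects a
# debug bundle as a side effect).
vm-support -x

# Kill the offending VM's world by ID; it escalates from a soft
# shutdown to a hard kill. <world_id> is whatever -x reported.
vm-support -X <world_id>
```

Do it on both hosts that think they own the VM, then bring the guest back up on exactly one of them.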

VMware's Platinum support was surprisingly disappointing. Our first tech "hadn't been trained on 3i yet", and it took a while to get to someone who was. Like, hours. The RCLI is lacking commands that ESX 3 used to have, which vexes them. Following a host reboot, logs aren't kept. At all. They're not kept at all. What? Yes. Used to be in 3, but not in 3i. Nice. Once we got back up, it took ages and a lot of pretty irate e-mail to get someone to do some post-mortem analysis. Ultimately, we heard the details here straight from the horse's mouth. They're trying to eradicate the versions of that document that advise NFS.Lock.Disable - word never made it to us, somehow. They say they've known about the problem for about 30 days, which seems unrealistic. They say about 12 customers have had the same exact issues. And, unequivocally, they say to turn off that NFS.Lock.Disable shit, post haste.

My hope is that people who have deployed with this setting seriously consider changing it, and perhaps ask NetApp WTF? (or better yet, share their experiences in the comment area below). And additionally, my hope is that we're able to convince VMware that the deficiencies with 3i are totally unacceptable (no matter how insignificant they may seem to those with direct-attached or FC-attached disk). They could take a cue from Novell regarding heartbeat and keepalives to make sure direct host-to-host communication is used as a failsafe against these goofy file locks before allowing a VM to start on a new host.