Friday, August 22, 2008

VMWare and NFS on NetApp Filers

Ran across a very disturbing issue this week, which caused the corruption of nearly half of our virtual machines - certainly all of the ones which were experiencing any load to speak of.

If you run VMWare ESX 3i against a NetApp filer using storage pools defined as NFS mounts (of which a given cluster can only have 32, by the way), and you're a good Systems Engineer and follow the NetApp best practices guidelines for configuring ESX to work optimally with their products, then you'll turn on this nifty little switch called NFS.Lock.Disable by changing the default value of "0" to "1".

The document in question from October, 2007, used to be on the NetApp site (and maybe VMware's as well), but is now relegated to one of those cool sites that leech stuff from the web and hold it forever.

Then about 10 months later, or 2 months ago, it changed.

Suddenly, no mention of NFS.Lock.Disable.

This document, on the VMware website, however interestingly titled, makes no mention of the setting. It's authored by NetApp as well, and is dated May 2007.

So what gives? Why so many documents and versions on the same topic, by so many authors? Who knows. All we know is that "best practices" for VMWare ESX 3 customers using NetApp and NFS storage pools was to set NFS.Lock.Disable to 1.

The problem is that VMWare ESX 3i isn't what ESX 3 used to be. From the poor retention of logs (not my opinion; statement of fact from VMware support), to the singular dependency on file locks to ensure split brain conditions don't occur, all the way to patches that somehow slip past QC with license-detonation code included.

All of this means when you do what VMware and NetApp told you to do in order to deploy their products together successfully, you put your data at risk.

And we did, and ours was, and it sucked.

Split brain means, in essence, that the system in charge of deciding what VM goes where (VCenter in this case) is unaware or uncertain of what ESX host holds which guest operating systems. In this condition, two separate ESX hosts can be - simultaneously - running the exact same guest OS instances. The bad news is, as you might imagine, that one half of that "brain" doesn't know what the other half is doing. What results is the utter annihilation of your filesystem, and depending on where and how you keep files, a very long process of restoring to a known good state.
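To make the role of the lock concrete, here is a minimal Python sketch. It is not VMware's actual mechanism, just an illustration under assumptions: two hypothetical hosts (`esx-a` and `esx-b`) try to power on the same guest, and an exclusive lock file is the only thing that stops the second one. The helper names `try_power_on` and `owners`, and the lock file name `vm.lck`, are made up for this example. The `use_lock=False` path stands in for running with locking disabled.

```python
import os
import tempfile

def try_power_on(vm_dir, host, use_lock=True):
    """Return True if `host` may start the VM stored in vm_dir (hypothetical helper)."""
    if not use_lock:
        # Stand-in for locking disabled: nothing stops a second owner.
        return True
    lock_path = os.path.join(vm_dir, "vm.lck")  # hypothetical lock file name
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(host)  # record which host owns the guest
    return True

def owners(use_lock):
    """List which of two hypothetical hosts end up running the same guest."""
    with tempfile.TemporaryDirectory() as vm_dir:
        return [h for h in ("esx-a", "esx-b") if try_power_on(vm_dir, h, use_lock)]

if __name__ == "__main__":
    print("locking on: ", owners(True))
    print("locking off:", owners(False))
```

With locking on, only the first host gets the guest. With locking off, both hosts believe they own it, which is the split brain described above.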

The symptoms are beyond bizarre. VCenter shows a guest VM as being on a different host just about every second. Opening up a VM's console may give two people completely different screens, because you're actually looking at different "real" instances of the same virtual machine. Shutting down VCenter doesn't make things better; connecting directly to the ESX host will show you a wildly fluctuating number of guests running.

The only remedy is to use some neat "unsupported" (wink wink) console commands on the ESX hosts, and 'kill' the offending VMs. The faster you do that, the less badly your data will be corrupted. No matter how you slice it, it stinks.

VMware's Platinum support was surprisingly disappointing. Our first tech "hadn't been trained on 3i yet", and it took a while to get to someone who was. Like, hours. The RCLI is lacking commands that ESX 3 used to have, which vexes them. Following a host reboot, logs aren't kept. At all. They're not kept at all. What? Yes. Used to be in 3, but not in 3i. Nice. Once we got back up, it took ages and a lot of pretty irate e-mail to get someone to do some post-mortem analysis. Ultimately, we heard the details here straight from the horse's mouth. They're trying to eradicate the versions of that document that advise NFS.Lock.Disable - word never made it to us, somehow. They say they've known about the problem for about 30 days, which seems unrealistic. They say about 12 customers have had the same exact issues. And, unequivocally, they say to turn off that NFS.Lock.Disable shit, post haste.

My hope is that people who have deployed with this setting seriously consider changing it, and perhaps ask NetApp WTF? (or better yet, share their experiences in the comment area below). And additionally, my hope is that we're able to convince VMware that the deficiencies with 3i are totally unacceptable (no matter how insignificant they may seem to those with direct-attached or FC-attached disk). They could take a cue from Novell regarding heartbeat and keepalives to make sure direct host-to-host communication is used as a failsafe against these goofy file locks before allowing a VM to start on a new host.
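As a rough sketch of the heartbeat idea, and only that, the Python below shows one possible policy: a new host starts a VM only when the previous owner has been silent on a direct host-to-host heartbeat for longer than some timeout and no lock is present. The function name `may_start_vm`, its parameters, and the 30-second default are all assumptions for illustration; nothing here reflects how VMware or Novell actually implement it.

```python
def may_start_vm(lock_present, last_heartbeat, now, timeout=30.0):
    """Decide whether a new host may start a VM (hypothetical policy).

    lock_present   -- True if the on-disk lock for the VM still exists
    last_heartbeat -- time the previous owner was last heard from directly
    now            -- current time, on the same clock as last_heartbeat
    timeout        -- seconds of silence before the owner counts as dead (assumed value)
    """
    owner_silent = (now - last_heartbeat) > timeout
    # Start only when both signals agree that the previous owner is gone.
    return owner_silent and not lock_present

if __name__ == "__main__":
    # Owner heard from 10 seconds ago, no lock: still refuse to start.
    print(may_start_vm(lock_present=False, last_heartbeat=0.0, now=10.0))
    # Owner silent for 31 seconds, no lock: allow the start.
    print(may_start_vm(lock_present=False, last_heartbeat=0.0, now=31.0))
```

The point of the second signal is that even with file locking disabled, a host that is still answering heartbeats would block a duplicate start.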


hejish said...

We hit the exact same problem this week- in the form of data corruption. We have also had the bizarre behavior with HA (which we've left turned off ever since - many months so far).

One more point - if you change the setting, you must reboot your virtuals for the setting to take effect. There is no way to check if the setting is in effect in the virtuals (if you have changed it in the past) - we rebooted all virtuals after setting this back to 0.

One additional problem: Turning this feature off means deleting snapshots of volumes supporting the virtuals (in the netapp) can cause virtuals to stop responding for a few minutes! That is better than data corruption, but it is clearly something that _must_ be resolved, and we have no ETA.

The communications issues, the internal training issues - these are all issues that I would think vmware could easily fix and have not in about 6 months.

The problem with ESXi with no logs is no different than the problem with ESX with logs - they both have the same failure profile.

fletch said...

Hi, I found your posting after searching on "vmware netapp nfs lock" - we had the exact same experience this week (vm corruption) - timeline:
1) we follow the infamous 3428 netapp best practice doc from 2007 - including the "netapp vmware snapshot integration" script
2) may 2008: we have vm timeouts during the vmware snapshot removal part of the script - vmware points us to the 2007 doc where it says to disable NFS locking - we do so with a little hesitation - "is there any risk doing this"? - it solves the vm timeouts and it's in the best practice doc, so we keep it.
3) June: we have a horrible split brain double registration VM experience - exacerbated by HA starting VMs multiple times during a network outage. VMWare has us turn HA off until they can identify a bug and provide a patch (promised the patch in July)
4) Aug: we had severe corruption of a mail server VM - corruption went back several snapshots and we needed to go back 14 snapshots (days) to a usable version - I opened the Platinum case 10pm and it was still listed as unassigned 11 hours later until I called in (I was busy going back 14 snapshots one at a time)

Finding your post has validated what we suspected:
1) locking best practice changed without notification and vm corruption happens when it's disabled
2) the multiple registration bug is unresolved

We have reverted to the updated best practice, but now we have vms timing out during the Netapp script by Vaughn Stewart, which is still in the latest best practice doc on page 61 (I will be cross posting a link to his blog at Netapp)


hejish said...

Further point - you are at risk of data corruption even without HA enabled, and thus without the split-brain problem.

StewDaPew said...

As one of the authors of TR3428 maybe I can shed some light on the changes in version 4.1.

VMware has an issue with the VMSnap process and their locking mechanism when it is running over NFS. We published a workaround that addressed this issue; however, the workaround could introduce risk if one did not implement all of the best practices associated with HA.

In order to provide the most reliable environment for customers we were asked to remove the workaround.

I have seen customers drop the phase of calling a VMware backup prior to executing a NetApp Snapshot.

Let's hope a fix for this issue gets posted by VMware in the very near future. Until then I believe this is a solid workaround.

ZEN Master said...

All - I really, really appreciate the information and comments. This is extremely helpful to me in my efforts to drive our problem to resolution with VMWare, and NetApp if necessary.

It would be great if StewDaPew could elaborate on what he felt the "workaround" to be. It seems as if he's suggesting we use NFS.LockDisable=0, but whether or not this negatively impacts snapshot backup scripts like the ones published by NetApp is still unclear.

Thanks again.

hejish said...

What I interpret Stewdapew to mean, with some extrapolation/interpretation of my own which he did not say:
1) Yes, use NFS.lockdisable=0
2) Although the netapp best practices changed, some people have noted a continued horrible effect remains -- virtuals can fail to respond for a period of time while using the best practices script! By editing the script in a manner not documented in best practices, this horrible side effect is completely eliminated. However, you get a new side-effect - you don't have any vendor guarantee that those netapp snapshots will do you any good.

ZEN Master said...

Based on additional discussions I've had with VMWare, I agree with your analysis.

I am informed that the patch in question, designed to correct the issue with the snapshot process, should be out literally any day. I was directed to look at the VMWare SelfService patch site and await its appearance. From the e-mail I received, the patch should be named #ESX350-200808401-BG for ESX 3.5, and #ESXe350-200808501-I-SG for 3i.

fletch said...

Yes, the lockdisable setting reverts to 0 and
"I have seen customers drop the phase of calling a VMware backup prior to executing a NetApp Snapshot."

Means you comment out (until the patch comes out) those steps in the script which issue VMWare commands - effectively it's just a netapp snapshot until the patches come out - thanks for posting the patch numbers - I'm going to go check for those now - there were 13 released in the last 24 hours

fletch said...

Yes, ESX350-200808401-BG is indeed one of the 13 patches released yesterday - I am applying it to a test server now and will compare its snapshot creation/deletion performance to a server without the patch and report back


Rick said...

Talked with a NetApp engineer and a VMware Engineer at VMworld last week about this same issue. Both recommend to keep NFS.LockDisable set to 1 -- the real issue is with HA spawning the VMs when it thinks a host failed.

They recommend setting the HA settings to shut down the VM; this will ensure the vmdk isn't in use when the backup host kicks off those VMs. They also recommended increasing the HA threshold from the standard 14 seconds, and adding additional isolation addresses in HA Advanced settings. This will cause the additional IPs to be pinged before HA thinks there is a failure (das.isolationaddress)

fletch said...

WAIT! If you re-read the Netapp TR3428 and the original post, that may address the dual-registration of VMs, but the data corruption issue still necessitates the lockdisable be set to 0:
"They say they've known about the problem for about 30 days, which seems unrealistic. They say about 12 customers have had the same exact issues. And, unequivocally, they say to turn off that NFS.Lock.Disable shit, post haste.

My hope is that people who have deployed with this setting seriously consider changing it, and perhaps ask NetApp WTF?"

Rick said...

Well, the corruption is caused by multiple VMs being spawned. When locking is disabled and multiple VMs are spawned, they all have full access to the VMDK files, which in turn causes corruption. Changing your HA default to Shutdown will lower the risk of this happening.

NetApp originally suggested NFS.LockDisable to be set to 1 to resolve an issue with removing VM Snapshots. With NFS Locking the snapshots (on an NFS datastore) would freeze the VMs for 15-20 seconds while the deltas were being committed. NetApp found that disabling Locking resolved this....BUT AT WHAT COST?!

Well, it looks like VMware has released a patch (ESX350-200808401-BG) which supposedly helps speed up snapshot deletion on NFS datastores. Looks like we can now have Locking enabled and have fast snapshot removals! I haven't fully tested but I am hopeful.

Rick said...