Snapshot removal issues of a large VM

VMware specific discussions

Re: Snapshot removal issues of a large VM

by joergr » Wed Nov 10, 2010 9:11 pm

I bet this problem is related to ESX 4.0 or to storage.

It would be extremely interesting if you could test this with ESXi 4.1 and share the results. Furthermore, it would be even more interesting if you could provide disk I/O data captured during the snapshot commit (especially read and write latency, IOPS, and disk read and write rates). You can get all of these from within the vSphere Client or Veeam Monitor.

Best regards,
Joerg
joergr
Expert
 
Posts: 377
Liked: 39 times
Joined: Tue Jun 08, 2010 2:01 pm
Full Name: Joerg Riether

Re: Snapshot removal issues of a large VM

by tsightler » Wed Nov 10, 2010 10:26 pm

A couple of things to look at:

1. Several users have reported that having other pre-existing, non-Veeam snapshots on the VM can cause it to hang.

2. Like Joergr, I would agree that the problem is likely storage related. The snapshot removal process significantly lowers the total IOPS available to the VM, both because of additional locks on the VMFS storage caused by the increase in metadata updates and because of the added I/O load of the snapshot removal process itself. In most environments, if you're already over 30-40% IOPS load on your target storage, which isn't uncommon with a busy SQL/Exchange server, then the snapshot removal process will easily push that past the 80% mark, and likely much higher. Most storage arrays see a significant latency penalty once IOPS load passes 80%, which will of course be detrimental to application performance.

As an example, my older Equallogic storage arrays typically provide 3.5ms read latency when running at a 40% IOPS load, but at 90% IOPS load, which can happen during snapshot removal, read latency spikes to 7.5-10ms. That effectively means that the I/Os required for Exchange are 2-3x slower, but this doesn't tell the whole story of how much Exchange might slow down. If Exchange cannot get the IOPS required to satisfy a user request before the next user request is made, the queue will grow, because MAPI requests are served more slowly than they come in. This is just like an Interstate that is unable to handle the traffic flowing onto it. If you're planning to use a replication solution that depends on VMware snapshots, you have to know that your storage can not only serve the IOPS for normal operations, but can also maintain that performance during snapshot operations.

We've had some success by increasing the storage shares on our IOPS-heavy VMs, but I don't think this is a great solution.

3. One suggestion would be to upgrade to ESX 4.1. We've seen huge improvement in the snapshot consolidation process with ESX 4.1 on our Equallogic SAN clusters, and I highly suggest it. If this isn't possible, at least apply all of the latest ESX 4.0 patches.
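
The queueing argument in point 2 can be sketched with a few lines of Python. This is only an illustration under made-up assumptions: the demand (400 IOPS), the normal deliverable rate (1000 IOPS) and the reduced rate during snapshot removal (350 IOPS) are hypothetical numbers, not measurements from any array, and the function name is invented for this example.

```python
def backlog_over_time(arrival_iops, service_iops, seconds):
    """Return the number of outstanding requests at the end of each second.

    Simplified model: requests arrive and are served at constant rates,
    and anything the storage cannot serve within a second stays queued.
    """
    backlog = 0.0
    history = []
    for _ in range(seconds):
        # The queue grows by the shortfall, and can never drop below zero.
        backlog = max(0.0, backlog + arrival_iops - service_iops)
        history.append(backlog)
    return history


# Hypothetical figures: plenty of headroom in normal operation...
normal = backlog_over_time(arrival_iops=400, service_iops=1000, seconds=5)
# ...but less deliverable IOPS than demand while the snapshot is committed.
during_commit = backlog_over_time(arrival_iops=400, service_iops=350, seconds=5)

print("normal:       ", normal)
print("during commit:", during_commit)
```

As long as the deliverable rate stays above demand, the backlog stays at zero. Once it drops below demand, the backlog grows every second for as long as the commit lasts, which is why a modest latency increase can turn into application timeouts.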
tsightler
Veeam Software
 
Posts: 4768
Liked: 1737 times
Joined: Fri Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: Snapshot removal issues of a large VM

by Gostev » Wed Nov 10, 2010 10:55 pm

Tom, do you already have your Equallogic SAN on 5.0.2 firmware? Because in that case, the improvement you see may come from VAAI, which is not available on every storage array.
Gostev
Veeam Software
 
Posts: 21390
Liked: 2349 times
Joined: Sun Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Snapshot removal issues of a large VM

by matarvai » Thu Nov 11, 2010 3:47 am

I will upgrade ESXi to 4.1 at the weekend. Let's see if that helps.
matarvai
Enthusiast
 
Posts: 30
Liked: never
Joined: Wed Apr 07, 2010 9:49 am
Full Name: Marko Tarvainen

Re: Snapshot removal issues of a large VM

by joergr » Thu Nov 11, 2010 7:31 am

Great, please keep us updated once you have upgraded to ESXi 4.1. Could you also provide some details about your DAS? Vendor? SAS or SATA? 15K/10K/7K? Number of spindles? RAID level? And the interface to your host?

best regards,
Joerg
joergr
Expert
 
Posts: 377
Liked: 39 times
Joined: Tue Jun 08, 2010 2:01 pm
Full Name: Joerg Riether

Re: Snapshot removal issues of a large VM

by matarvai » Thu Nov 11, 2010 7:33 am

Here is the information about our DAS. Vendor: HP, P400 controller, 8x 146GB 10K SAS, in two RAID-10 arrays of four disks each.
matarvai
Enthusiast
 
Posts: 30
Liked: never
Joined: Wed Apr 07, 2010 9:49 am
Full Name: Marko Tarvainen

Re: Snapshot removal issues of a large VM

by joergr » Thu Nov 11, 2010 7:52 am

Good lord, that is just 4(!) 10K spindles for direct I/O access, which is quite bad for performance in my opinion. Don't expect miracles with this setup, but believe me, ESXi 4.1 will get more out of it than ESX 4.0.
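
To give a feel for how little headroom four spindles provide, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are not from this thread: roughly 130 IOPS per 10K SAS spindle (a common rule of thumb, not a measurement of the P400 setup), a RAID-10 write penalty of 2, a 50/50 read/write mix, and no controller cache. The function name is hypothetical.

```python
def raid10_effective_iops(spindles, iops_per_spindle, read_fraction):
    """Estimate front-end IOPS of a RAID-10 set for a given read/write mix.

    Each logical write costs two back-end writes (one per mirror half),
    while each logical read costs one back-end read.
    """
    raw_iops = spindles * iops_per_spindle
    write_penalty = 2
    write_fraction = 1.0 - read_fraction
    return raw_iops / (read_fraction + write_penalty * write_fraction)


# Assumed rule-of-thumb value, not measured on the hardware in this thread.
ASSUMED_IOPS_PER_10K_SPINDLE = 130

estimate = raid10_effective_iops(
    spindles=4,
    iops_per_spindle=ASSUMED_IOPS_PER_10K_SPINDLE,
    read_fraction=0.5,
)
print(f"Estimated front-end IOPS for one 4-disk RAID-10 array: {estimate:.0f}")
```

Under these assumptions a single array lands at a few hundred IOPS, so a busy VM plus the extra load of a snapshot commit can saturate it quickly. Real numbers depend heavily on cache, block size and access pattern.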

best regards,
Joerg
joergr
Expert
 
Posts: 377
Liked: 39 times
Joined: Tue Jun 08, 2010 2:01 pm
Full Name: Joerg Riether

Re: Snapshot removal issues of a large VM

by tsightler » Thu Nov 11, 2010 2:48 pm

Gostev wrote: Tom, do you already have your Equallogic SAN on 5.0.2 firmware? Because in that case, the improvement you see may come from VAAI, which is not available on every storage array.


Nope, we're not brave enough to jump on that bandwagon yet, not after the reliability issues we experienced after jumping on the RAID 6 train with the Equallogic 4.x firmware far too early. We'll let some other customers be the beta testers for these new features for a while, as EQL has already proven to us, with the disaster that was the original 5.x release, that their QC is not that great.
tsightler
Veeam Software
 
Posts: 4768
Liked: 1737 times
Joined: Fri Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: Snapshot removal issues of a large VM

by Gostev » Thu Nov 11, 2010 3:10 pm

Ah, so the huge improvements in the snapshot consolidation process that you see are actually solely due to ESX 4.1? Interesting information!
Concerning the firmware, I heard that EQL QC actually took the time to test it properly this time. People seem to be very happy with 5.0.2...
Gostev
Veeam Software
 
Posts: 21390
Liked: 2349 times
Joined: Sun Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Snapshot removal issues of a large VM

by tsightler » Thu Nov 11, 2010 6:01 pm

Everything is a judgment call, but I have no faith in Equallogic firmware at this point. We started off with 3.x firmware and the arrays seemed rock solid. We were actually part of the 4.x beta program and were a very early adopter. The 4.x train was only released in Aug 2008, just 2 years ago. We hit quite a few issues as we progressed through the 4.x code, and EQL averaged a code release a month, which is simply too often for an enterprise storage array. Some of the issues were minor, others were catastrophic. I'm sure there were many EQL customers who were very happy with the stability of the firmware version that ate our data, but we weren't too happy with it.

QC should be able to shake out obvious issues (which they didn't with 5.0.1) but has a much harder time flushing out long-term stability issues. Many times we managed to run firmware for 4-6 months before we had an issue. The 5.0.2 code hasn't even been out that long, so you can't really make statements about its long-term stability yet.

I'm not saying that 5.0.2 is horrible, it may very well be the best code EQL has ever produced, but we're not going to be jumping ship from our finally stable 4.x code just yet. I'll probably give it a year or so first. I think that's the difference. Some people measure stability as a matter of weeks or months, but I measure enterprise stability in terms of years.
tsightler
Veeam Software
 
Posts: 4768
Liked: 1737 times
Joined: Fri Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: Snapshot removal issues of a large VM

by matarvai » Tue Nov 16, 2010 8:31 am

I updated our ESXi to 4.1 and Veeam to 5.0, and now the snapshot removal issue has disappeared. Thanks everyone!
matarvai
Enthusiast
 
Posts: 30
Liked: never
Joined: Wed Apr 07, 2010 9:49 am
Full Name: Marko Tarvainen

Re: Snapshot removal issues of a large VM

by joergr » Tue Nov 16, 2010 9:24 am

you are welcome ;-)
joergr
Expert
 
Posts: 377
Liked: 39 times
Joined: Tue Jun 08, 2010 2:01 pm
Full Name: Joerg Riether

[MERGED] CBT and VM unresponsiveness

by eriktxstate » Thu Mar 24, 2011 8:40 pm

We've had a problem with VMs becoming unresponsive, and it was narrowed down to the very second that Veeam releases a snapshot. We're using CBT for our VMs, hoping that it would help with backup times. Additionally, we're currently running Veeam B&R 5.0.1.198, VMware ESXi 4.1 on our hosts, and vSphere 4.1.


The issue came up with our Windows admins, who kept seeing clustered VMs bark in SCCM about losing a node in a cluster. I've noticed it on several Linux-based systems, witnessed the unresponsiveness (up to 10 minutes!!!) myself, and thought I was crazy. I've suggested bumping up the timeout values for those cluster checks on the Windows systems, but it seems odd that this is a persistent issue that's becoming more and more troublesome.

Now, I've seen the issue with NFS and CBT in VMware 4.1, here:

http://kb.vmware.com/selfservice/micros ... 0168771546


But all of our datastores are on an 8Gb SAN. This may be exclusively a VMware issue, but I need help pointing the finger. Veeam, is there anything you have heard of regarding long snapshot release times? I'm only guessing this is due to CBT because that's the only thing I can think of at this stage that would be causing the issue. When I make a regular snapshot, there are no issues.

Thanks,
Erik
eriktxstate
Influencer
 
Posts: 22
Liked: never
Joined: Wed Sep 29, 2010 4:40 pm
Full Name: Erik Redding

Re: Snapshot removal issues of a large VM

by eriktxstate » Wed Mar 30, 2011 4:29 pm

My above post got merged into this discussion, but I'm still not convinced it's related to size. I think it's more or less related to CBT being slow.
eriktxstate
Influencer
 
Posts: 22
Liked: never
Joined: Wed Sep 29, 2010 4:40 pm
Full Name: Erik Redding

Re: Snapshot removal issues of a large VM

by Gostev » Wed Mar 30, 2011 4:43 pm

Erik, it was merged into this topic because this and similar issues are being discussed here, regardless of VM size (don't pay attention to the topic name). It is good to have everything in the same place.

Did you read some of the suggestions below on what else can cause unresponsiveness (such as existing snapshots)? 10 minutes of unresponsiveness is actually a real issue that you have to address. Most people reporting issues here have pretty minor ones (occasional application timeouts).
Gostev
Veeam Software
 
Posts: 21390
Liked: 2349 times
Joined: Sun Jan 01, 2006 1:01 am
Location: Baar, Switzerland
