Snapshot removal issues of a large VM

Post by **curruscanis** » Jan 19, 2010 4:42 pm

I have an issue in my environment where snapshot removal takes a very long time on a VM that is 500+ GB. After the Veeam backup process calls to remove the snapshot, the VM will go offline from the network's perspective and stay in a snapshot removal state for almost an hour, sometimes more. Is there anything I can do to keep the VM online during this process, or to make the process take less time?

Thanks.

Post by **tsightler** » Jan 19, 2010 5:48 pm

What version and patch level of VMware are you running? VMs should generally stay online during snapshot removal, except for a few seconds while the final commits are made. We back up several VMs that are 500+ GB, including one that's 1.2TB, and I've never seen this issue.

Post by **jgremillion** » Jan 19, 2010 7:43 pm

We have this problem occasionally when performing backups of large GroupWise VMs. It seems to happen only when we have a Post Office that has been pretty busy during the backup window. I asked VMware about this, and they said that if a VM was being used heavily while the snapshot existed and the backup was running, it can take quite a while for the snapshot to consolidate before it's removed. This can and will affect the VM's performance.

Their solution was not to perform a backup of a busy VM during heavy use periods.

Post by **tsightler** » Jan 19, 2010 8:09 pm

Well, I can certainly understand "affect the VM's performance", but he's saying "VM will go offline from the network's perspective..." for "...almost an hour, sometimes more." That's a little more than a performance issue. We back up some very busy VMs, including our Exchange VM. It's almost 400GB now and it's pretty busy almost all the time. It's not unusual for it to grow a multi-gigabyte snapshot that takes 30-40 minutes to remove, even when backing it up during a "quiet" time. Still, I've never seen a system go completely offline for an hour. That sounds like a serious problem to me.

Post by **jgremillion** » Jan 19, 2010 8:35 pm

Well, I hope it doesn't happen to you either, because it ain't pretty when everyone starts yelling and calling.

Users' mailboxes on the affected POs are pretty much inaccessible until the snapshot is consolidated and removed.

I've had two VMs become pretty much unusable during the snapshot removal time. The first time it happened I was skeptical, but around the third time it happened I pretty much decided that I need to start the backup earlier.

One thing that may be the culprit is that all of our large GroupWise (on Windows) VMs use virtual RDMs. I wonder if it's an issue with consolidating the snapshot of the RDM to the raw disk?

Post by **jgremillion** » Jan 19, 2010 8:37 pm

And yes, I've had this happen for almost that long. I had one that was stuck for 45 minutes. Talk about a major panic around here.

Post by **tsightler** » Jan 19, 2010 8:51 pm

I wasn't trying to claim that it couldn't happen, or that you didn't have it happen, only that I believe that it shouldn't happen. To be fair, I've seen similar problems back in the ESX 3.5 days and earlier. There were some known issues with snapshot removal that could cause this. But since 3.5 U2 (I think U2, I guess it might have been U3) snapshot removal has been overhauled completely. It now uses helper snapshots in a loop until the final snapshot is small, and thus the "stun" time should be short. Veeam 4.x also has "safe snapshot removal" that lets you get similar behavior from older versions of ESX.

I guess what I'm saying is, if you're seeing this with current VMware versions, well, that still seems like a problem, perhaps something unique to your environment (slow storage, the virtual RDMs you mention -- we use VMDKs, etc). In other words, I'm fully buying that it can happen, but if it were happening to me, I don't think I'd let VMware off the hook with the "don't perform a backup of a busy VM" excuse. What if I were using snapshots for other purposes?
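
For readers who want to see why the helper-snapshot loop described above keeps the final stun short, here is a minimal sketch in Python. It is a simplified model and not VMware's actual algorithm: the function name, the fixed commit and write rates, and the 16 MB cutoff are all assumptions made up for illustration.

```python
def consolidation_passes(snapshot_mb, commit_mb_s, write_mb_s,
                         final_mb=16.0, max_passes=50):
    """Toy model of iterative snapshot consolidation (all inputs hypothetical).

    Each pass commits the current delta in the background while guest
    writes land in a fresh helper snapshot. The loop stops once the
    remaining delta is small enough to commit during a brief stun.

    Returns (passes, background_seconds, final_delta_mb), or None if the
    delta does not shrink below final_mb within max_passes.
    """
    if commit_mb_s <= 0:
        raise ValueError("commit rate must be positive")
    delta = float(snapshot_mb)
    elapsed = 0.0
    passes = 0
    while delta > final_mb:
        if passes == max_passes:
            return None  # guest writes keep pace with the commit rate
        pass_seconds = delta / commit_mb_s
        elapsed += pass_seconds
        # The helper snapshot grew by whatever the guest wrote meanwhile.
        delta = write_mb_s * pass_seconds
        passes += 1
    return passes, elapsed, delta


# A fairly quiet guest: the delta shrinks quickly from pass to pass.
quiet = consolidation_passes(6000, commit_mb_s=40, write_mb_s=2)
# A guest writing as fast as storage can commit: the loop never converges.
busy = consolidation_passes(6000, commit_mb_s=40, write_mb_s=40)
```

In this model the quiet guest needs two background passes and leaves only 15 MB for the final stunned commit, while the busy guest returns `None`. That is one plausible reading of why slow storage or a heavily written VM stretches snapshot removal, but it says nothing about why a VM would drop off the network entirely.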

Post by **Gostev** » Jan 19, 2010 11:19 pm

ESX4 does have an issue where removing a snapshot causes long VM freezes, but it only happens if another snapshot already exists on the VM before you create, and then try to delete, an additional snapshot. The VM freeze is proportional to the size of the first (existing) snapshot, no matter how big the second snapshot has grown. So please check whether you have other snapshots on your VM.

This is the only issue I am aware of that may cause significant downtime on a production VM during snapshot removal with ESX4. If you do not have an additional snapshot, then your VM definitely should not become inaccessible for more than a few seconds during snapshot removal, no matter what the snapshot size is. I've personally done stress testing on this (snapshot removal while copying large files to the VM). The way snapshot removal is implemented in ESX4 ensures that large snapshots do not result in longer VM freezes (except for the issue/bug with extra snapshots described above).

Post by **jgremillion** » Jan 20, 2010 2:45 am

I do not have any other snapshots when this happens. It only happens when the snapshot from VBR is being removed. No other snapshots.

Post by **curruscanis** » Jan 20, 2010 5:12 pm

First, my version of ESX: 4.0.0 Build 164009
vCenter version: 4.0.0 Build 162856

I do have some items in Snapshot Manager for my large VM: two levels of Consolidate Helper-0.

Is this a remnant of failed backups?

Thank you all for your help and suggestions.

Post by **tsightler** » Jan 20, 2010 6:19 pm

I have never seen the problem you describe with ESX 4, but of course that doesn't mean it doesn't exist. Are you running the latest VMware Tools? Do you have "VMware Tools Quiesce" disabled? You might want to make sure that the VMware Tools sync driver is not installed or is disabled; having this legacy service enabled has been known to cause hangs during snapshot removal. Just a few thoughts.

I'd also suggest that you remove the snapshots that are currently on the VM via the snapshot manager GUI. It's likely that they are leftovers from failed backups.

Post by **Gostev** » Jan 21, 2010 12:41 pm

Jack, also make sure you are using the latest Veeam Backup version (4.1), as with the previous release a Consolidate Helper-0 snapshot could be left behind if you stopped the backup job manually.

Tom is correct that you should remove the snapshot manually. If you do not have this option available in the snapshot manager GUI, create an extra (new) snapshot first; you will then be able to remove the helper snapshot.

Thanks!

Post by **Ace T** » Mar 22, 2010 11:42 am

We have this issue on 4.1 with a large VM on ESX4 that is sitting in snapshot removal and is not available on the network. What can I do to resolve this?

Post by **Gostev** » Mar 22, 2010 11:56 am

Amit, the only possible cause we know about is described in my post above (20 Jan 2010). If this does not apply to you, it would be better to open a support case with VMware to investigate why snapshot removal causes such long VM lockups. Veeam Backup merely issues the command to remove the snapshot, so this is similar to removing a snapshot manually with VMware Infrastructure Client. The actual process is fully handled by the ESX host.

This is definitely not "normal" behavior. No matter how large the VM or snapshot is, this should not be happening.

Post by **Ace T** » Mar 22, 2010 2:06 pm

Hi Gostev,

VMware could not see where the problem was and advised me to wait until the operation completed before making sure all snapshots are removed from the VM. They had a look through all the logs and said it was just a very slow snapshot removal process, but they were not sure why the VM could not be pinged. This is a large VM, and the snapshot removal is still running now and has been going on for over 4 hours. I can see the Consolidate Helper snapshot, but there are 3 snapshots in total, so it is taking a while to clear them all.

Post by **nira99077** » Mar 25, 2010 12:37 pm

Hi Gostev

Let me say in advance, sorry for the long novel that follows, but as I am not the only one experiencing this issue, I thought the more information the better.

I experienced this exact issue today on the 3 VMs that I replicate to my DR site. I thought I would share what I have found so far, so hopefully this can be addressed from Veeam's end, as I believe Veeam and VMware issues could be combining to cause it. BTW, I have also opened a support case for this issue.

Firstly some history:
I am running ver 4.1 of Veeam and vSphere 4.0.

VM1 has been replicating happily since installing Veeam a couple of weeks ago. During business hours, a replication pass averages between 1 and 2 hours, so I would not expect the snapshot to be all that big.
VM2 had been trying to complete its initial WAN replication after being seeded from a removable disk for about a week. It was running for a couple of days prior to a power outage the other day and was kicked off again after the power outage, but still had not completed. I would expect this snapshot to have been quite big.
VM3 was also in the middle of a retry (power outage again) of an expected large pass, as I had been advised by support to defrag VM3's guest OS to try to address replications taking a long time to complete. I would also expect this snapshot to have been quite big.

Today, users contacted support and reported that the 3 replicated servers (DB, Mail and File) were not responding, which I confirmed. While investigating, I found that for a still unknown reason (currently with support), all 3 replication jobs failed at the same time. Each VM was in the process of removing snapshots. In the VMware Snapshot Manager, each VM had both a Veeam backup snapshot and a Consolidate Helper-0 snapshot. When the snapshot removal finished, all 3 VMs returned to normal operation.

Previously, during normal snapshot removals, I have not had this issue on these 3 servers. (Previous Veeam and Vizioncore replications have run on these servers for around 6 months).

What I would like to know is:

1) Is it as a result of Veeam and the previous replication passes failing that the Consolidate Helper-0 snapshots existed?
2) Why would these snapshots exist if the VMs were not trying to remove snapshots when the replication passes failed (power outage in my case)?
3) Apart from manually checking each VM in vCenter, is there any way Veeam can check and advise when a Consolidate Helper-0 snapshot has not been successfully deleted, so that its removal can be triggered manually to avoid this issue?
4) Support recommended enabling "Safe Snapshot removal", which was already on with default settings (100M). Is there a recommended minimum level this can be set to, or would it make no difference if there is a "stale" Consolidate Helper-0 snapshot already on the VM?
5) Is having "Safe Snapshot removal" enabled likely to be why the Consolidate Helper-0 snapshot was created in the first place?

Thanks Adrian

Post by **Gostev** » Mar 25, 2010 2:27 pm

Hello Adrian,

1. Consolidate Helper snapshots are always created by ESX hosts during snapshot removal, but they should not persist under normal conditions (even if the replication job fails).
2. It looks like a network or vCenter connection failure prevented Veeam Backup from issuing the snapshot removal command (this can be confirmed by our support with debug logs). The Veeam Backup snapshot will be removed automatically during the next Veeam Backup job pass, so we will take care of that one. However, the Consolidate Helper snapshot is not something Veeam Backup directly creates and manages. This snapshot is created during the snapshot removal process and should be cleaned up by ESX. But I am guessing a network or power issue during snapshot removal might cause this snapshot to remain? It is best to ask VMware to investigate their logs to understand why the Consolidate Helper snapshot remains.
3. This sounds like a feature we could add, and it seems useful to me, even though it is uncommon to see this happening during normal operation. I will investigate this with devs.
4. I guess they missed the fact that you are on vSphere. Enabling this feature on ESX4 will have no effect. This feature is designed for pre-ESX3.5 U2 hosts to help with consolidation of large snapshots. It is no longer needed, as ESX now has built-in logic for safe removal of large snapshots.

Thanks!
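
As a rough illustration of the check asked about in question 3, here is a minimal Python sketch that walks a snapshot tree and flags leftover helper snapshots. The nested-dict layout and the function names are hypothetical stand-ins for whatever snapshot listing your tooling exports; this is not a Veeam or VMware API. Only the "Consolidate Helper" name comes from the posts above.

```python
def list_snapshots(nodes):
    """Return every snapshot name in a nested tree, parents before children."""
    names = []
    for node in nodes:
        names.append(node["name"])
        names.extend(list_snapshots(node.get("children", [])))
    return names


def stale_helpers(nodes, marker="consolidate helper"):
    """Return the names of snapshots that look like leftover helper snapshots."""
    return [name for name in list_snapshots(nodes) if marker in name.lower()]


# Hypothetical tree resembling the two-level case reported earlier in the thread.
tree = [
    {"name": "Consolidate Helper-0", "children": [
        {"name": "Consolidate Helper-0", "children": [
            {"name": "Veeam backup snapshot"},
        ]},
    ]},
]

leftovers = stale_helpers(tree)
if leftovers:
    print("Remove before the next job:", ", ".join(leftovers))
```

Run before each job against an exported snapshot list, a check like this would at least warn that a helper snapshot is still present, so it can be removed manually before the next snapshot removal.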

Post by **rchew** » Mar 25, 2010 3:32 pm

curruscanis wrote: After the Veeam backup process calls to remove the snapshot, the VM will go offline from the network's perspective...

Finally... I'm glad that I'm not the only one experiencing this problem. I've been working with Veeam Support for the last 3 months on this issue (amongst others) to no avail. I'm curious if we share similar environments.

I'm running...
- ESX 4.0.0, 208167
- 2 x IBM x3650 Servers
- Source and Target storage is over NFS
- All network connections run to a pair of Cisco 3750 cross-stack switches running etherchannel.
- VM guests being backed up are about 70GB each.
- Veeam VBR is installed on a VM within the HA Pair.
- Veeam Replication set up to use the VMware "Network" vStorage API
- VM Tools Quiescence disabled / VSS Quiescence Enabled

Interesting thing to note. When I moved the source VM over to local storage, the network disconnect occurrences were much shorter. The usual disconnects were around 20 - 30 seconds. On local storage, about 5 seconds.

The other issue I've had that you may want to check: some of my VMs are resetting after the snapshot removal. This is a hard reset. I discovered this while looking at the Windows event logs for clues on the network disconnects. Since this was not occurring on all my VMs, I'm not sure if it's related. I just thought I would throw it out there to see if anyone else was experiencing this problem.

Post by **Gostev** » Mar 25, 2010 3:48 pm

rchew wrote: Interesting thing to note. When I moved the source VM over to local storage, the network disconnect occurrences were much shorter. The usual disconnects were around 20 - 30 seconds. On local storage, about 5 seconds.

Does it approximately match VM freeze times in the VMware VM log? It is pretty easy to read, so check it out.

Also (I don't know if our support suggested this), you may want to open a support case with VMware on this as well, as Veeam Backup merely issues the snapshot removal command (same as if you initiated snapshot removal using VIC). Actual processing of this command is fully handled by the corresponding ESX host. VMware may be able to find the reason much faster, as they actually own this code.

Post by **rchew** » Mar 25, 2010 5:31 pm

Gostev wrote: Does it approximately match VM freeze times in the VMware VM log?

It always happens right after the snapshot removal. Also...when I perform manual snapshots and removals, I do not see the network disconnect behaviour.

Gostev wrote: Also (I don't know if our support suggested this), you may want to open a support case with VMware on this as well, as Veeam Backup merely issues the snapshot removal command (same as if you initiated snapshot removal using VIC). Actual processing of this command is fully handled by the corresponding ESX host. VMware may be able to find the reason much faster, as they actually own this code.

I have just initiated support on the VMware side. Unfortunately, our support contract is through IBM, so I don't have direct access to VMware yet. We are changing this structure soon. However...since you have multiple customers experiencing this problem, I think Veeam should also pursue this problem with VMware. As a VMware partner, wouldn't you have much better access to some of their resources? Furthermore, since manual snapshots and removals do not exhibit this behavior, it is difficult to point the finger at VMware.

Post by **Gostev** » Mar 25, 2010 7:07 pm

I am not sure if you missed my question, but were you able to check the VMware logs for the affected VMs? These are the log files created on the datastore next to each VM. They record the duration of VM stun cycles during snapshot commit operations. Basically, if the VM remains stunned for a few seconds, this results in a network drop in the guest OS. This is the first thing I would check.
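
To make the log check concrete, here is a minimal Python sketch that pulls stun durations out of VM log lines and flags the long ones. The pattern `vm stopped for <N> us` and the sample lines are assumptions; the exact message wording and timestamp format differ between ESX versions, so adjust the regular expression to what your own log actually contains.

```python
import re

# Assumed message format; verify it against your own VM log before relying on it.
STUN_RE = re.compile(r"vm stopped for (\d+) us")


def stun_seconds(log_lines):
    """Return the stun duration in seconds for every matching log line."""
    durations = []
    for line in log_lines:
        match = STUN_RE.search(line)
        if match:
            durations.append(int(match.group(1)) / 1_000_000)
    return durations


def long_stuns(log_lines, threshold_s=5.0):
    """Return only the stuns long enough to plausibly drop guest connections."""
    return [s for s in stun_seconds(log_lines) if s >= threshold_s]


# Hypothetical sample lines, for illustration only.
sample = [
    "Mar 25 15:31:02.101: vcpu-0| Checkpoint_Unstun: vm stopped for 812345 us",
    "Mar 25 15:44:10.557: vcpu-0| Checkpoint_Unstun: vm stopped for 23500000 us",
    "Mar 25 15:44:11.002: vmx| unrelated message",
]

for seconds in long_stuns(sample):
    print(f"stun of {seconds:.1f} s")
```

Comparing the flagged durations with the times users reported disconnects shows whether the network drops line up with stun cycles or come from something else, such as quiescence in the guest.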

rchew wrote: It always happens right after the snapshot removal. Also...when I perform manual snapshots and removals, I do not see the network disconnect behaviour.

A VIC snapshot does not quiesce the VM. Freeze/unfreeze during quiescence is something that can potentially affect the guest OS. Did you try running the Veeam backup job with both Veeam VSS and VMware Tools quiescence disabled in the Advanced job settings, to see if the issue goes away? This would be the closest behavior to a VIC snapshot. Also, when testing, make sure you wait long enough before removing the snapshot, so that it becomes large enough (as in the case of a backup, which takes some time).

rchew wrote: However...since you have multiple customers experiencing this problem, I think Veeam should also pursue this problem with VMware. As a VMware partner, wouldn't you have much better access to some of their resources?

We do open support cases directly with VMware for any problem we can reproduce internally (to be able to show them). For example, we were first to open a support case and bring this issue to VMware's attention. But for problems which are not reproducible and affect only a few specific deployments, VMware needs to work directly with the affected customer, as this requires a direct WebEx session between the VMware SE and the customer to troubleshoot.

rchew wrote: Furthermore, since manual snapshots and removals do not exhibit this behavior, it is difficult to point the finger at VMware.

Every time I saw this issue before, there was no difference between removing the snapshot through VIC and letting Veeam Backup remove it. I would be surprised if it were otherwise, because both tools issue the same RemoveSnapshot VMware API call to initiate snapshot removal on ESX. Please PM me your support case number so that I can look up more details on your situation.

Post by **tsightler** » Apr 15, 2010 5:07 pm

OK, so as a followup to this, we actually did have a similar problem happen yesterday. Around 11:30AM we started receiving complaints from users about poor response from Outlook, especially for messages with attachments, and we started to investigate. What we found was that, due to an administrative error, a Veeam full backup ran during part of the business day. Our 400+GB Exchange server was backed up starting around 8:00AM, and the backup completed around 11:00AM. In the process the VMware snapshot grew to over 6GB. At the end of the backup Veeam initiated the snapshot removal process. During this snapshot removal process Exchange was very slow to respond to requests, even timing out some Outlook connections.

While this was an unusual issue that occurred primarily due to an administrative error, we still wondered if anything we could do with VMware could potentially prevent this issue in the future. We decided to try bumping the CPU reservation and shares up significantly as, while the snapshot removal was taking place, the VM showed very high CPU utilization within the VM, but very low utilization within ESX. This made us think that the VMware snapshot removal process may place some cap on the resource utilization during snapshot removal, however, it should not be allowed to cap the performance below the reservation level.

Today we performed the same full backup just as a test. We kicked off a full backup starting at 8AM and, predictably, it ended around 11AM. The snapshot growth today was around 5.8GB. When the snapshot removal process started the system definitely slowed down, but not nearly as much as yesterday. Performance of the Outlook client was still slower than normal, sometimes taking a few seconds to open messages with attachments, but nothing like yesterday. The increase of CPU reservation seemed to have a significant impact.

We're going to try increasing the CPU reservation to max. We're thinking this might be the key to keeping busy VMs responsive during background snapshot commits. Has anyone else tried using CPU reservation settings to keep VMs responsive during snapshot removal and seen any positive results?

We also still wonder if Veeam's "safe snapshot removal" might still be useful in a scenario like this. It's strange, but creating a new snapshot, and removing an old snapshot via the vCenter console doesn't seem to have the same negative performance impact as the "delete all snapshots" option. It might be worth trying.

Also, note that I never experienced the complete loss of connectivity or response that the OP of this thread reported, but this performance issue was still enough to be noticed by some users.

Post by **grizo** » Apr 15, 2010 8:58 pm

Not sure if this has been mentioned as a solution already, but I did run into this issue and resolved it by re-installing VMware Tools. Also, there was a fix for vSphere 4.0.1 related to this issue. http://kb.vmware.com/selfservice/micros ... Id=1017458

Post by **KiwiJJ** » Apr 19, 2010 10:19 pm

Hi,
To add my 2 cents' worth: we have similar problems with SQL 2005 running on a Windows 2003 Server that is constantly being used (i.e. no downtime). When I go to back up this server it freezes, the application servers lose contact with it, and the batch jobs fail. Also, when the job finishes and the snapshot is being committed, the server freezes again and I cannot even log in to it. I have turned off VM quiescence and am using VSS.
I have even tried copying the data that needs to be backed up to a drive on the server that is not used (E:\) except to store this data, and then using Veeam to back up only this drive, excluding the C:\ and D:\ drives. The server still freezes. Gostev, why would this happen if I am only backing up a single drive that is not being used?

regards,

John

Post by **Gostev** » Apr 19, 2010 11:28 pm

John, this is because the freeze happens not due to backup activities, but due to some issues with snapshot creation and deletion. VMware snapshots affect whole VM, not just specific disks.

Post by **arthurp** » Aug 10, 2010 4:08 pm

rchew wrote: Interesting thing to note. When I moved the source VM over to local storage, the network disconnect occurrences were much shorter. The usual disconnects were around 20 - 30 seconds. On local storage, about 5 seconds.

We do not experience server disconnect issues, but rather a slowdown of our busy Exchange 2003 VM (160GB), as described by tsightler. It is more prominent on messages with attachments. In fact, we have three brief slowdowns in every replication cycle: one at the point when the original snapshot is initiated by Veeam, a second (the most noticeable one) when Veeam initiates snapshot removal, and a third closer to the end of the replication cycle, probably at the point when the consolidate helper is removed.

Now, my post goes to the quote above. We had this server on local storage and snapshot removal hardly ever took over 10 minutes. Since we moved it over to the SAN (IBM DS3300) snapshot removal routinely takes over 20 minutes, e.g., we just had a job where it took 15 minutes to create a replica, but 18 minutes to remove the snapshot. I will have more stats later this week (hopefully), but it seems to be a consistent behaviour on all our servers as they are moved to SAN.

It is the latest Veeam, VMware 4.0.2.

Thank you, Arthur

Post by **arthurp** » Aug 11, 2010 4:43 pm

I find that snapshot removal takes considerably longer when a VM is running off the SAN as opposed to local storage.

To illustrate the situation, here are the following details:
The VM, with two 30GB drives, is our AV (Symantec) and WSUS server. As such, it gets busy at times when Symantec downloads new definitions and pushes them to clients. The same applies to WSUS around MS Tuesday updates.

When we run it off the SAN (IBM DS3300) on an IBM x3650M3, the average replication time is 5:23, of which 1:23 is snapshot removal. When the VM is moved to local storage on an IBM x3500 (a far less powerful box), the replication and snapshot removal durations are 4:11 and 0:11 respectively. These results are averages of 50 replications (we run these every 20 minutes) over comparable daily intervals.
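
To put those averages side by side, here is a small Python sketch of the arithmetic. It assumes the figures are in minutes:seconds; the helper names are made up for this example.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string into a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def removal_share(total, removal):
    """Fraction of one replication cycle spent on snapshot removal."""
    return to_seconds(removal) / to_seconds(total)


san_share = removal_share("5:23", "1:23")    # SAN (IBM DS3300)
local_share = removal_share("4:11", "0:11")  # local storage
slowdown = to_seconds("1:23") / to_seconds("0:11")

print(f"SAN:   {san_share:.1%} of the cycle is snapshot removal")
print(f"Local: {local_share:.1%} of the cycle is snapshot removal")
print(f"Snapshot removal takes about {slowdown:.1f}x longer on the SAN")
```

On these numbers, snapshot removal accounts for roughly a quarter of each cycle on the SAN against under 5% on local storage, and the removal step itself takes about 7.5 times longer. The rest of the replication time (4:00 in both cases) is unchanged, so the extra time comes entirely from snapshot removal.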

Now, this is not a problem for this particular VM, as the end user will not notice the difference for AV and Windows updates. Data from this VM was used for illustration only. However, the same situation applies to all other VMs, including domain controllers, file and print servers, and Exchange 2003 and SQL2005 servers. We attempt to use Veeam in "near CDP" mode and this is a significant roadblock. We can't do "near CDP" on SQL as it interferes with native SQL backups. The file server is not particularly responsive during snapshot removal, but this is probably something we can live with. The biggest issue is Exchange, as we used to run hourly backups throughout the day. Since we moved it to the SAN, regular operation is unaffected, but snapshot removal routinely takes over 15 minutes (up to 30), during which the end-user experience is quite bad. This is especially true for mail with attachments.

I will follow up with more data as we move our VMs around. In the meantime, any advice is very much appreciated.

Thank you, Arthur

Post by **matarvai** » Nov 10, 2010 4:47 pm

We have the same downtime issues now. No problems for almost a year, but now our busy Exchange server is having issues when backing up. When Veeam is removing the snapshot, it causes 15-20 minutes of downtime for Exchange, and Outlook clients lose connection for this period. Are there any ideas on what could reduce the downtime during snapshot removal?

Post by **joergr** » Nov 10, 2010 7:37 pm

The old rule: the more disk load on the VM and the longer the time frame, the bigger the snapshot and the slower the snapshot commit.

Back up during times without high disk load, so the snapshot won't grow that much and can be committed very fast. 20-30 minutes offline is something I have never seen before. Could you by any chance check whether this behaviour also occurs when using ESXi 4.1? VMware did A LOT, especially when it comes to snapshot handling, with ESXi 4.1.

If you mentioned it already, mea culpa, but could you describe exactly what you use (ESX version, iSCSI/FC and, if iSCSI, HBA or software, SAN vendor and model)?

Best regards,
Joerg

Post by **matarvai** » Nov 10, 2010 8:02 pm

We are using ESXi 4.0 on this server at the moment, and we're using DAS. We had the same problem in February this year, but we did something then that corrected the problem. I just can't remember what the fix was.
