Snapshot removal issues of a large VM

VMware specific discussions

Re: Snapshot removal issues of a large VM

by nira99077 » Thu Mar 25, 2010 12:37 pm

Hi Gostev

Let me say in advance, sorry for the long novel that follows, but as I am not the only one experiencing this issue, I thought the more information the better.

I experienced this exact issue today on the 3 VMs that I replicate to my DR site. I thought I would share what I have found so far, so that hopefully it can be addressed from Veeam's end, as I believe a Veeam issue and a VMware issue could be combining to cause it. BTW, I have also opened a support case for this issue.

Firstly some history:
I am running ver 4.1 of Veeam and vSphere 4.0.

VM1 has been replicating happily since I installed Veeam a couple of weeks ago. During business hours a replication pass averages between 1 and 2 hours, so I would not expect the snapshot to be all that big.
VM2 had been trying for about a week to complete its initial WAN replication after being seeded from a removable disk. It was running for a couple of days prior to a power outage the other day and was kicked off again after the power outage, but still had not completed. I would expect this snapshot to have been quite big.
VM3 was also in the middle of a retry (power outage again) of an expected large pass as I had been advised by support to defrag VM3's guest OS to try and address replications taking a long time to complete. I would also expect this snapshot would have been quite big.

Today, users contacted support and reported that the 3 replicated servers (DB, Mail and File) were not responding, which I confirmed. While investigating, I found that for a still unknown reason (currently with support), all 3 replication jobs had failed at the same time. Each VM was in the process of removing snapshots. In the VMware Snapshot Manager, each VM had both a Veeam backup snapshot and a Consolidate Helper-0 snapshot. When the snapshot removal finished, all 3 VMs returned to normal operation.

Previously, during normal snapshot removals, I have not had this issue on these 3 servers. (Previous Veeam and Vizioncore replications have run on these servers for around 6 months).

What I would like to know is:

1) Is it as a result of Veeam and the previous replication passes failing that the Consolidate Helper-0 snapshots existed?
2) Why would these snapshots exist if the VMs were not trying to remove snapshots when the replication passes failed (power outage in my case)?
3) Apart from manually checking each VM in vCenter, is there any way Veeam can check and advise when a Consolidate Helper-0 snapshot has not been successfully deleted, so that removal can be triggered manually to avoid this issue?
4) Support recommended enabling "Safe Snapshot removal", which was already on with default settings (100M). Is there a recommended minimum level this can be set to, or would it make no difference if there is a "stale" Consolidate Helper-0 snapshot already on the VM?
5) Is having "Safe Snapshot removal" enabled likely to be why the Consolidate Helper-0 snapshot was created in the first place?

Thanks Adrian
Adrian
nira99077
Novice
 
Posts: 3
Liked: never
Joined: Thu Feb 25, 2010 11:53 pm
Full Name: Adrian Simpson

Re: Snapshot removal issues of a large VM

by Gostev » Thu Mar 25, 2010 2:27 pm

Hello Adrian,

1. Consolidate Helper snapshots are always created by ESX hosts during snapshot removal, but they should not persist under normal conditions (even if replication job fails).
2. It looks like a network or vCenter connection failure prevented Veeam Backup from issuing the snapshot removal command (this can be confirmed by our support with debug logs). The Veeam Backup snapshot will be removed automatically during the next Veeam Backup job pass, so we will take care of this one. However, the Consolidate Helper snapshot is not something Veeam Backup directly creates and manages; this snapshot is created during the snapshot removal process and should be cleaned up by ESX. But I am guessing a network or power issue during snapshot removal might cause this snapshot to remain? It is best to ask VMware to investigate their logs to understand why the Consolidate Helper snapshot remains.
3. This sounds like a feature we could add, seems useful to me - even though it is uncommon during normal operation to see this happening. I will investigate this with devs.
4. I guess they missed the fact that you are on vSphere. Enabling this feature on ESX4 will have no effect. This feature is designed for pre-ESX3.5 U2 hosts to help with consolidation of large snapshots. This is no longer needed as ESX now has built-in logic for safe removal of large snapshots.
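
Regarding point 3, here is a minimal sketch of the kind of check being asked for: walk a VM's snapshot tree and report any snapshot whose name starts with "Consolidate Helper". It is not a Veeam feature and not a definitive implementation. The attribute names `name` and `childSnapshotList` are assumed to mirror the snapshot tree objects of the vSphere API; the `snap` helper and the example tree are hypothetical stand-ins so the sketch runs without a vCenter connection.

```python
from types import SimpleNamespace

# Name prefix taken from the "Consolidate Helper-0" snapshot described in this thread.
HELPER_PREFIX = "Consolidate Helper"


def find_helper_snapshots(snapshot_trees, path=()):
    """Return the paths of all snapshots whose name starts with HELPER_PREFIX.

    Each node is assumed to expose `name` and `childSnapshotList`
    (an assumption modelled on the vSphere snapshot tree objects).
    """
    found = []
    for node in snapshot_trees or []:
        node_path = path + (node.name,)
        if node.name.startswith(HELPER_PREFIX):
            found.append(" / ".join(node_path))
        children = getattr(node, "childSnapshotList", None)
        found.extend(find_helper_snapshots(children, node_path))
    return found


def snap(name, children=()):
    """Hypothetical stand-in for a snapshot tree node, for illustration only."""
    return SimpleNamespace(name=name, childSnapshotList=list(children))


# Hypothetical tree matching what was seen in Snapshot Manager in this case.
example_tree = [snap("Veeam backup snapshot", [snap("Consolidate Helper-0")])]

for hit in find_helper_snapshots(example_tree):
    print("Leftover helper snapshot:", hit)
```

Against a real environment the same function could be fed the root snapshot list of each VM, and any hit reported outside of a running job would be a candidate for manual cleanup.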

Thanks!
Gostev
Veeam Software
 
Posts: 21390
Liked: 2349 times
Joined: Sun Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Snapshot removal issues of a large VM

by rchew » Thu Mar 25, 2010 3:32 pm

curruscanis wrote: After the Veeam backup process calls to remove the snapshot the VM will go off line from the network's perspective...


Finally...I'm glad that I'm not the only one experiencing this problem. I've been working with Veeam Support for the last 3 months on this issue (amongst others) to no avail. I'm curious if we share similar environments.

I'm running...
- ESX 4.0.0, 208167
- 2 x IBM x3650 Servers
- Source and Target storage is over NFS
- All network connections run to a pair of Cisco 3750 cross-stack switches running etherchannel.
- VM guests being backed up are about 70GB each.
- Veeam VBR is installed on a VM within the HA Pair.
- Veeam Replication set up to use the VMware "Network" vStorage API
- VM Tools Quiescence disabled / VSS Quiescence Enabled

Interesting thing to note. When I moved the source VM over to local storage, the network disconnect occurrences were much shorter. The usual disconnects were around 20 - 30 seconds. On local storage, about 5 seconds.

The other issue I've had that you may want to check: some of my VMs are resetting after the snapshot removal. This is a hard reset. I discovered this while looking at the Windows event logs for clues on the network disconnects. Since this was not occurring on all my VMs, I'm not sure if it's related. I just thought I would throw it out there to see if anyone else was experiencing this problem.
rchew
Influencer
 
Posts: 20
Liked: never
Joined: Wed Dec 16, 2009 7:02 pm
Full Name: Raymond Chew

Re: Snapshot removal issues of a large VM

by Gostev » Thu Mar 25, 2010 3:48 pm

rchew wrote: Interesting thing to note. When I moved the source VM over to local storage, the network disconnect occurrences were much shorter. The usual disconnects were around 20 - 30 seconds. On local storage, about 5 seconds.

Does it approximately match the VM freeze times in the VMware VM log? The log is pretty easy to read, so it is worth checking.

Also (I don't know if our support suggested this), you may want to open a support case with VMware on this as well, as Veeam Backup merely issues the snapshot removal command (the same as if you initiated snapshot removal using VIC). Actual processing of this command is fully handled by the corresponding ESX host. VMware may be able to find the reason much faster, as they actually own this code.

Re: Snapshot removal issues of a large VM

by rchew » Thu Mar 25, 2010 5:31 pm

Gostev wrote: Does it approximately match the VM freeze times in the VMware VM log?


It always happens right after the snapshot removal. Also...when I perform manual snapshots and removals, I do not see the network disconnect behaviour.

Gostev wrote: Also (I don't know if our support suggested this), you may want to open a support case with VMware on this as well, as Veeam Backup merely issues the snapshot removal command (the same as if you initiated snapshot removal using VIC). Actual processing of this command is fully handled by the corresponding ESX host. VMware may be able to find the reason much faster, as they actually own this code.


I have just initiated support on the VMware side. Unfortunately, our support contract is through IBM, so I don't have direct access to VMware yet. We are changing this structure soon. However...since you have multiple customers experiencing this problem, I think Veeam should also pursue this problem with VMware. As a VMware partner, wouldn't you have much better access to some of their resources? Furthermore, since manual snapshots and removals do not exhibit this behavior, it is difficult to point the finger at VMware.

Re: Snapshot removal issues of a large VM

by Gostev » Thu Mar 25, 2010 7:07 pm

I am not sure if you missed my question, but were you able to investigate the VMware logs for the affected VMs? These are the log files created on the datastore next to each VM. They show how long the VM stun cycles last during snapshot commit operations. Basically, if a VM remains stunned for a few seconds, this results in a network drop in the guest OS. This is the first thing I would check.
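
For anyone who wants to script that log check, here is a rough sketch that pulls stun durations out of a VM log and flags the long ones. It rests on assumptions: the regular expression assumes the log reports each stun as a line containing "Checkpoint_Unstun: vm stopped for N us", the sample lines are made up for illustration, and the 5 second threshold is an arbitrary stand-in for "a few seconds". Check all three against your own vmware.log before relying on it.

```python
import re

# Assumed line shape; verify it against your own vmware.log.
STUN_RE = re.compile(r"Checkpoint_Unstun: vm stopped for (\d+) us")

# Arbitrary stand-in for "a few seconds"; adjust to taste.
LONG_STUN_SECONDS = 5.0


def stun_durations_seconds(lines):
    """Return the stun durations found in the given log lines, in seconds."""
    durations = []
    for line in lines:
        match = STUN_RE.search(line)
        if match:
            durations.append(int(match.group(1)) / 1_000_000)
    return durations


# Made-up sample lines, for illustration only.
sample = [
    "Mar 25 12:01:07.123: vcpu-0| Checkpoint_Unstun: vm stopped for 412345 us",
    "Mar 25 12:01:09.456: vmx| some unrelated message",
    "Mar 25 12:14:30.789: vcpu-0| Checkpoint_Unstun: vm stopped for 21500000 us",
]

for seconds in stun_durations_seconds(sample):
    note = "  <-- long stun" if seconds >= LONG_STUN_SECONDS else ""
    print(f"{seconds:.2f} s{note}")
```

To run it against a real log, replace `sample` with the lines of the vmware.log file copied from the VM's folder on the datastore, then compare the flagged stuns with the times of the network disconnects.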

rchew wrote:It always happens right after the snapshot removal. Also...when I perform manual snapshots and removals, I do not see the network disconnect behaviour.

A VIC snapshot does not quiesce the VM. Freeze/unfreeze during quiescence is something that can potentially affect the guest OS. Did you try running the Veeam backup job with both Veeam VSS and VMware Tools quiescence disabled in the Advanced job settings, to see if the issue goes away? This would be the closest behavior to a VIC snapshot. Also, when testing, make sure you wait long enough before removing the snapshot, so that it becomes large enough (as in the case of a backup, which takes some time).

rchew wrote:However...since you have multiple customers experiencing this problem, I think Veeam should also pursue this problem with VMWare. As a VMWare partner, wouldn't you have much better access to some of their resources?

We do open support cases directly with VMware for any problem we can reproduce internally (to be able to show them). For example, we were the first to open a support case and bring this issue to VMware's attention. But for problems which are not reproducible and affect only a few specific deployments, VMware needs to work directly with the affected customer, as this requires a direct WebEx session between a VMware SE and the customer to troubleshoot.

rchew wrote:Furthermore, since manual snapshots and removals do not exhibit this behavior, it is difficult to point the finger at VMWare.

Every time I saw this issue before, there was no difference between removing the snapshot through VIC and letting Veeam Backup remove it. I would be surprised if it were otherwise, because both tools issue the same RemoveSnapshot VMware API call to initiate snapshot removal on ESX. Please PM me your support case number so that I can look up more details on your situation.

Re: Snapshot removal issues of a large VM

by tsightler » Thu Apr 15, 2010 5:07 pm

OK, so as a followup to this, we actually did have a similar problem happen yesterday. Around 11:30AM we started receiving complaints of users having poor response from Outlook, especially for messages with attachments, and we started to investigate. What we found was that, due to an administrative error, a Veeam full backup had run during part of the business day. Our 400+GB Exchange server was backed up starting around 8:00AM and completed around 11:00AM. In the process the VMware snapshot grew to over 6GB. At the end of the backup Veeam initiated the snapshot removal process. During this snapshot removal process Exchange was very slow to respond to requests, even timing out some Outlook connections.

While this was an unusual issue that occurred primarily due to an administrative error, we still wondered if anything we could do with VMware could potentially prevent this issue in the future. We decided to try bumping the CPU reservation and shares up significantly as, while the snapshot removal was taking place, the VM showed very high CPU utilization within the VM, but very low utilization within ESX. This made us think that the VMware snapshot removal process may place some cap on the resource utilization during snapshot removal, however, it should not be allowed to cap the performance below the reservation level.

Today we performed the same full backup just as a test. We kicked off a full backup starting at 8AM and, predictably, it ended around 11AM. The snapshot growth today was around 5.8GB. When the snapshot removal process started the system definitely slowed down, but not nearly as much as yesterday. Performance of the Outlook client was still slower than normal, sometimes taking a few seconds to open messages with attachments, but nothing like yesterday. The increase of CPU reservation seemed to have a significant impact.

We're going to try increasing the CPU reservation to max. We're thinking this might be the key to keeping busy VMs responsive during background snapshot commits. Has anyone else tried using CPU reservation settings to keep VMs responsive during snapshot removal and seen any positive results?

We also still wonder if Veeam's "safe snapshot removal" might still be useful in a scenario like this. It's strange, but creating a new snapshot, and removing an old snapshot via the vCenter console doesn't seem to have the same negative performance impact as the "delete all snapshots" option. It might be worth trying.

Also, note that I never experienced the complete loss of connectivity or response that the OP of this thread reported, but this performance issue was still enough to be noticed by some users.
tsightler
Veeam Software
 
Posts: 4768
Liked: 1737 times
Joined: Fri Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: Snapshot removal issues of a large VM

by grizo » Thu Apr 15, 2010 8:58 pm

Not sure if this has been mentioned as a solution already, but I did run into this issue and resolved it by re-installing VMTools. Also, there was a fix for vSphere 4.0.1 related to this issue: http://kb.vmware.com/selfservice/micros ... Id=1017458
grizo
Novice
 
Posts: 3
Liked: never
Joined: Fri Mar 12, 2010 7:44 pm
Full Name: Gary Rizo

Re: Snapshot removal issues of a large VM

by KiwiJJ » Mon Apr 19, 2010 10:19 pm

Hi,
To add my 2 cents' worth: we have similar problems with SQL 2005 running on a Windows 2003 Server that is constantly in use (i.e. no downtime). When I go to back up this server it freezes, the application servers lose contact with it, and the batch jobs fail. Also, when the job finishes and the snapshot is being committed, the server freezes again and I cannot even log in to it. I have turned off VM quiescence and am using VSS. I have even tried copying the data that needs to be backed up to a drive on the server (E:\) that is not used for anything except storing this data, and then using Veeam to back up only this drive, excluding the C:\ and D:\ drives. The server still freezes. Gostev, why would this happen if I am only backing up a single drive that is not being used?

regards,

John
KiwiJJ
Enthusiast
 
Posts: 84
Liked: 1 time
Joined: Tue Feb 16, 2010 8:05 pm
Location: New Zealand
Full Name: John Jones

Re: Snapshot removal issues of a large VM

by Gostev » Mon Apr 19, 2010 11:28 pm

John, this is because the freeze is caused not by backup activities, but by some issue with snapshot creation and deletion. VMware snapshots affect the whole VM, not just specific disks.

Re: Snapshot removal issues of a large VM

by arthurp » Tue Aug 10, 2010 4:08 pm

rchew wrote: Interesting thing to note. When I moved the source VM over to local storage, the network disconnect occurrences were much shorter. The usual disconnects were around 20 - 30 seconds. On local storage, about 5 seconds.


We do not experience server disconnect issues, but rather a slowdown of our busy Exchange 2003 VM (160GB), as described by tsightler. It is more prominent on messages with attachments. In fact we have three brief slowdowns for every replication cycle: one at the point when the original snapshot is initiated by Veeam, a second (the most noticeable one) when Veeam initiates snapshot removal, and a third closer to the end of the replication cycle, probably at the point when the consolidate helper is removed.

Now, my post goes to the quote above. We had this server on local storage and snapshot removal hardly ever took over 10 minutes. Since we moved it over to the SAN (IBM DS3300) snapshot removal routinely takes over 20 minutes, e.g., we just had a job where it took 15 minutes to create a replica, but 18 minutes to remove the snapshot. I will have more stats later this week (hopefully), but it seems to be a consistent behaviour on all our servers as they are moved to SAN.

It is the latest Veeam, VMware 4.0.2.

Thank you, Arthur
arthurp
Influencer
 
Posts: 23
Liked: never
Joined: Mon Jan 11, 2010 9:18 pm
Full Name: Arthur Pizyo

Re: Snapshot removal issues of a large VM

by arthurp » Wed Aug 11, 2010 4:43 pm

I find that snapshot removal is considerably longer when VM is running off SAN as opposed to local storage.

To illustrate the situation, here are the details:
The VM, with two 30GB drives, is our AV (Symantec) and WSUS server. As such, it gets busy at times when Symantec downloads new definitions and pushes those to clients. The same applies to WSUS around MS Tuesday updates.

When we run it off the SAN (IBM DS3300) on an IBM x3650M3, the average replication time is 5:23 (min:sec), of which 1:23 is snapshot removal. When the VM is moved to local storage on an IBM x3500 (a far less powerful box), the figures for replication and snapshot removal are 4:11 and 0:11 respectively. These results are averages of 50 replications (we run these every 20 minutes) over comparable daily intervals.
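
To make the comparison explicit, the figures above can be split into the snapshot removal part and everything else. This is only a worked calculation on the averages quoted in this post, assuming the times are minutes:seconds; the function names are made up for illustration.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '5:23' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def breakdown(total, snapshot_removal):
    """Split a replication cycle into snapshot removal and the remainder."""
    total_s = to_seconds(total)
    removal_s = to_seconds(snapshot_removal)
    return {
        "copy_s": total_s - removal_s,          # everything except snapshot removal
        "removal_s": removal_s,
        "removal_share": removal_s / total_s,   # fraction of the cycle spent on removal
    }


san = breakdown("5:23", "1:23")     # VM on the SAN
local = breakdown("4:11", "0:11")   # VM on local storage

for label, row in (("SAN", san), ("local", local)):
    print(f"{label}: copy {row['copy_s']} s, snapshot removal {row['removal_s']} s "
          f"({row['removal_share']:.0%} of the cycle)")
```

With these figures the non-snapshot portion works out to 240 seconds on both storage types, so the whole difference between the two averages is snapshot removal time (83 seconds on the SAN versus 11 seconds on local storage).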

Now, this is not a problem for this particular VM as the end user will not notice the difference for AV and Windows update. Data from this VM was used for illustration only. However, the same situation stands for all other VMs, including domain controllers, file and print servers, Exchange 2003 and SQL2005 servers. We attempt to use Veeam in "near CDP" mode and this is a significant roadblock. We can't do "near CDP" on SQL as it interferes with native sql backups. File server is not particularly responsive during snapshot removal, but this is probably something we can live with. The biggest issue is Exchange as we used to run hourly backups throughout the day. Once we moved it to SAN, regular operation is unaffected, but snapshot removal is routinely over 15 minutes (up to 30), during which end-user experience is quite bad. This is especially true for mail with attachments.

I will follow up with more data as we move our VMs around. In the meantime, any advice is very much appreciated.

Thank you, Arthur

Re: Snapshot removal issues of a large VM

by matarvai » Wed Nov 10, 2010 4:47 pm

We have the same downtime issues now. No problems for almost a year, but now our busy Exchange server is having issues when backing up. When Veeam removes the snapshot, it causes 15-20 min of downtime for Exchange, and Outlook clients lose connection for this period. Are there any ideas on what could reduce the downtime during snapshot removal?
matarvai
Enthusiast
 
Posts: 30
Liked: never
Joined: Wed Apr 07, 2010 9:49 am
Full Name: Marko Tarvainen

Re: Snapshot removal issues of a large VM

by joergr » Wed Nov 10, 2010 7:37 pm

The old rule: The more disk load on the vm and the longer the time frame, the bigger the snapshot, the slower the snapshot commit.

Back up during times of low disk load, so that the snapshot won't grow that much and can be committed very fast. 20-30 mins offline is something I have never seen before. Could you by any chance check whether this behaviour also occurs when using ESXi 4.1? VMware did A LOT, especially when it comes to snapshot handling, with ESXi 4.1.
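
As a back-of-the-envelope illustration of that rule, the snapshot size can be estimated from how long the snapshot stays open. This sketch assumes growth is linear in time, which is a simplification, and takes its rate from the figures tsightler posted earlier in this thread (roughly 6GB of snapshot growth over a 3 hour backup of a busy Exchange server). The function names are made up, and commit time itself is not modelled since it depends on the storage.

```python
def growth_rate_gb_per_hour(snapshot_gb, open_hours):
    """Average snapshot growth rate observed over one backup window."""
    return snapshot_gb / open_hours


def estimated_snapshot_gb(rate_gb_per_hour, open_minutes):
    """Estimate snapshot size, assuming growth is linear in time (a simplification)."""
    return rate_gb_per_hour * open_minutes / 60


# Rate derived from the earlier report: about 6GB over an 8:00AM to 11:00AM backup.
rate = growth_rate_gb_per_hour(6.0, 3.0)

for minutes in (20, 60, 180):
    size = estimated_snapshot_gb(rate, minutes)
    print(f"snapshot open {minutes:3d} min -> roughly {size:.1f} GB to commit")
```

The real write rate varies a lot with time of day, which is exactly why moving the job to a quiet window shrinks the snapshot that has to be committed.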

If you mentioned it already, mea culpa, but could you describe exactly what you use (ESX version, iSCSI/FC, and if iSCSI whether HBA or software, SAN vendor and model)?

Best regards,
Joerg
joergr
Expert
 
Posts: 377
Liked: 39 times
Joined: Tue Jun 08, 2010 2:01 pm
Full Name: Joerg Riether

Re: Snapshot removal issues of a large VM

by matarvai » Wed Nov 10, 2010 8:02 pm

We are using ESXi 4.0 on this server at the moment, and we're using DAS. We had the same problem in February this year, but we did something then that corrected the problem. I just can't remember what the fix was.
