-
- Novice
- Posts: 4
- Liked: never
- Joined: Jul 29, 2009 6:17 pm
- Full Name: Jack
- Contact:
Snapshot removal issues of a large VM
I have an issue within my environment where a VM that is 500+ GB is taking a very long time for snapshot removal. After the Veeam backup process calls to remove the snapshot, the VM will go offline from the network's perspective and be in a snapshot removal state for almost an hour, sometimes more. Is there anything that I can do to keep the VM online during this process or make the process take less time?
Thanks.
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Snapshot removal issues of a large VM
What version and patch level of VMware are you running? VMs should generally stay online during snapshot removal except for a few seconds as the final commits are made. We back up several VMs that are 500+GB, including one that's 1.2TB, and I've never seen this issue.
-
- Enthusiast
- Posts: 87
- Liked: never
- Joined: Oct 20, 2009 2:49 pm
- Full Name: Joe Gremillion
- Contact:
Re: Snapshot removal issues of a large VM
We have this problem occasionally when performing backups of large GroupWise VMs. It seems to only happen when we have a Post Office that has been pretty busy during the backup window. I asked VMware about this and they said that if a VM is being used heavily while the snapshot exists and the backup is running, it can take quite a while for the snapshot to consolidate before it's removed. This can and will affect the VM's performance.
Their solution was not to perform a backup of a busy VM during heavy use periods.
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Snapshot removal issues of a large VM
Well, I can certainly understand "affect the VM's performance", but he's saying "VM will go offline from the network's perspective..." for "...almost an hour, sometimes more." That's a little more than a performance issue. We back up some very busy VMs, including our Exchange VM. It's almost 400GB now and it's pretty busy almost all the time. It's not unusual for it to grow a multi-gigabyte snapshot that takes 30-40 minutes to remove, even backing it up during a "quiet" time. Still, I've never seen a system go completely offline for an hour. That sounds like a serious problem to me.
-
- Enthusiast
- Posts: 87
- Liked: never
- Joined: Oct 20, 2009 2:49 pm
- Full Name: Joe Gremillion
- Contact:
Re: Snapshot removal issues of a large VM
Well, I hope it doesn't happen to you either, because it ain't pretty when everyone starts yelling and calling. Users' mailboxes on the affected POs are pretty much inaccessible until the snapshot is consolidated and removed.
I've had two VMs become pretty much unusable during the snapshot removal time. The first time it happened I was skeptical, but around the third time it happened I pretty much decided that I need to start the backup earlier.
One thing that may be the culprit is that all of our large GroupWise (on Windows) VMs use virtual RDMs. I wonder if it's an issue with consolidating the snapshot of the RDM to the raw disk?
-
- Enthusiast
- Posts: 87
- Liked: never
- Joined: Oct 20, 2009 2:49 pm
- Full Name: Joe Gremillion
- Contact:
Re: Snapshot removal issues of a large VM
And yes, I've had this happen for almost that long. I had one that was stuck for 45 minutes. Talk about a major panic around here.
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Snapshot removal issues of a large VM
I wasn't trying to claim that it couldn't happen, or that you didn't have it happen, only that I believe that it shouldn't happen. To be fair, I've seen similar problems back in the ESX 3.5 days and earlier. There were some known issues with snapshot removal that could cause this. But since 3.5 U2 (I think U2, I guess it might have been U3) snapshot removal was overhauled completely and now uses helper snapshots in a loop until the final snapshot is small, and thus the "stun" time should be short. Veeam 4.x also has "safe snapshot removal" that lets you get similar behavior from older versions of ESX.
I guess what I'm saying is, if you're seeing this with current VMware versions, well, that still seems like a problem, perhaps something unique in your environment (slow storage, the virtual RDMs you mention -- we use VMDKs, etc). In other words, I'm fully buying that it can happen, but if it were happening to me, I don't think I'd let VMware off the hook with the "don't perform a backup of a busy VM" excuse. What if I were using snapshots for other purposes?
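As a rough illustration of why the helper-snapshot loop described above should keep the final "stun" short, here is a minimal toy model in Python. It is not VMware's implementation: the function name, the threshold, the pass limit and all the rates are hypothetical values chosen only to show the shape of the behavior. Each pass commits the current delta in the background while a helper snapshot absorbs new guest writes; the VM is only stunned for the last, small delta.

```python
def simulate_consolidation(delta_mb, commit_mb_s, write_mb_s,
                           threshold_mb=16.0, max_passes=10):
    """Toy model of iterative snapshot consolidation (all numbers hypothetical).

    delta_mb     -- size of the snapshot delta to commit
    commit_mb_s  -- rate at which the host can commit delta data
    write_mb_s   -- rate at which the guest keeps writing new data
    Returns (background_passes, final_stun_seconds).
    """
    if commit_mb_s <= 0:
        raise ValueError("commit_mb_s must be positive")
    passes = 0
    # Keep committing in the background until the remaining delta is small.
    while delta_mb > threshold_mb and passes < max_passes:
        commit_seconds = delta_mb / commit_mb_s
        # While this pass runs, the helper snapshot grows with new writes.
        delta_mb = write_mb_s * commit_seconds
        passes += 1
    # The VM is stunned only while the last remaining delta is committed.
    return passes, delta_mb / commit_mb_s


if __name__ == "__main__":
    passes, stun = simulate_consolidation(6000, 100, 10)
    print(f"{passes} background passes, final stun of about {stun:.2f}s")
```

In this model the delta shrinks on every pass as long as the storage commits faster than the guest writes. If the guest writes faster than the storage can commit (slow storage, very busy VM), the delta never shrinks and the final stun becomes long, which is one way to picture the environment-specific cases discussed here.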
-
- Chief Product Officer
- Posts: 31816
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Snapshot removal issues of a large VM
ESX4 does have an issue where removing a snapshot causes long VM freezes, but this only happens if another snapshot already exists on the VM before you create, and then try to delete, an additional snapshot. The VM freeze is proportional to the size of the first (existing) snapshot, and does not depend on how big the second snapshot has grown. So please check if you have other snapshots on your VM.
This is the only issue I am aware of which may cause significant downtime on a production VM during snapshot removal with ESX4. If you do not have an additional snapshot, then your VM definitely should not become inaccessible for more than a few seconds during the snapshot removal, no matter what the snapshot size is - I've personally done stress testing on this (snapshot removal while copying large files to the VM). The way snapshot removal is implemented in ESX4 ensures that large snapshots do not result in longer VM freezes (except for the issue/bug with extra snapshots present, described above).
-
- Enthusiast
- Posts: 87
- Liked: never
- Joined: Oct 20, 2009 2:49 pm
- Full Name: Joe Gremillion
- Contact:
Re: Snapshot removal issues of a large VM
I do not have any other snapshots when this happens. This only happens when the snapshot from VBR is being removed. No other snapshots.
-
- Novice
- Posts: 4
- Liked: never
- Joined: Jul 29, 2009 6:17 pm
- Full Name: Jack
- Contact:
Re: Snapshot removal issues of a large VM
First my version of ESX: 4.0.0 Build 164009
vCenter version: 4.0.0 Build 162856
I do have some items in my snapshot manager for my large VM: I have two levels of Consolidate Helper-0.
Is this a remnant of failed backups?
Thank you all for your help and suggestions.
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Snapshot removal issues of a large VM
I have never seen the problem you describe with ESX 4, but of course that doesn't mean that it might not exist. Are you running the latest VMware Tools? Do you have the "VMware Tools Quiesce" option disabled? You might want to make sure that the VMware Tools sync driver is not installed or is disabled; having this legacy service enabled has been known to cause hangs during snapshot removal. Just a few thoughts.
I'd also suggest that you remove the snapshots that are currently on the VM. It's likely that those are leftovers from failed backups and I would suggest you remove them via the snapshot manager GUI.
-
- Chief Product Officer
- Posts: 31816
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Snapshot removal issues of a large VM
Jack, also make sure you are using the latest Veeam Backup version (4.1), as with the previous release a Consolidate Helper-0 snapshot could be left behind if you stopped the backup job manually.
Tom is correct that you should remove the snapshot manually. If you do not have this option available in the snapshot manager GUI, you should create an extra (new) snapshot first, then you will be able to remove the helper snapshot.
Thanks!
-
- Enthusiast
- Posts: 35
- Liked: never
- Joined: Dec 02, 2009 8:32 am
- Full Name: Amit Panchal
- Contact:
Re: Snapshot removal issues of a large VM
We have this issue on 4.1 with a large VM on ESX4 that sits removing snapshots and is not available on the network. What can I do to resolve this?
-
- Chief Product Officer
- Posts: 31816
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Snapshot removal issues of a large VM
Amit, the only possible cause we know about is described in my post above (20 Jan 2010). If this does not apply to you, it would be better for you to open a support case with VMware to investigate why snapshot removal causes such long VM locks. Veeam Backup is merely issuing the command to remove the snapshot, so this is similar to removing the snapshot manually with VMware Infrastructure Client. The actual process is fully handled by the ESX host.
This is definitely not "normal" behavior. It does not matter how large the VM or snapshot is, this should not be happening.
-
- Enthusiast
- Posts: 35
- Liked: never
- Joined: Dec 02, 2009 8:32 am
- Full Name: Amit Panchal
- Contact:
Re: Snapshot removal issues of a large VM
Hi Gostev,
VMware could not see where the problem was and advised me to wait until the operation completed before making sure all snapshots are removed from the VM. They had a look through all the logs and said it was just a very slow snapshot removal process, but they were not sure why the VM could not be pinged. This is a large VM, and the snapshot removal is still running now and has been going on for over 4 hours. I can see the Consolidate Helper snapshot, but there are 3 snapshots in total so it is taking a while to clear them all.
-
- Novice
- Posts: 3
- Liked: never
- Joined: Feb 25, 2010 11:53 pm
- Full Name: Adrian Simpson
- Contact:
Re: Snapshot removal issues of a large VM
Hi Gostev
Let me say in advance, sorry for the long novel that follows, but as I am not the only one experiencing this issue, I thought the more information the better.
I experienced this exact issue today on the 3 x VM's that I replicate to my DR site. I thought I would share what I have found so far so hopefully this issue can be addressed from Veeam's end as I believe it could be both a Veeam and VMware issue combining to cause the issue. BTW, I have also opened a support case for this issue.
Firstly some history:
I am running ver 4.1 of Veeam and vSphere 4.0.
VM1 has been replicating happily since installing Veeam a couple of weeks ago and, during business hours, a replication pass averages between 1 and 2 hours so I would not expect the snapshot to be all that big.
VM2 had been trying to complete its initial WAN replication after being seeded from a removable disk for about a week. It was running for a couple of days prior to a power outage the other day and was kicked off again after the power outage but still had not completed. I would expect this snapshot would have been quite big.
VM3 was also in the middle of a retry (power outage again) of an expected large pass as I had been advised by support to defrag VM3's guest OS to try and address replications taking a long time to complete. I would also expect this snapshot would have been quite big.
Today, users contacted support and reported that the 3 replicated servers (DB, Mail and File) were not responding, which I confirmed. While investigating, I found that for a still unknown reason (currently with support), all 3 replication jobs failed at the same time. Each VM was in the process of removing snapshots. In the VMware Snapshot manager, each VM had both a Veeam backup snapshot and a Consolidate Helper-0 snapshot. When the snapshot removal finished, all 3 VMs returned to normal operation.
Previously, during normal snapshot removals, I have not had this issue on these 3 servers. (Previous Veeam and Vizioncore replications have run on these servers for around 6 months).
What I would like to know is:
1) If it is as a result of Veeam and the previous replication passes failing that the Consolidate Helper-0 snapshots existed.
2) Why would these snapshots exist if the VMs were not trying to remove snapshots when the replication passes failed (Power Outage in my case).
3) Apart from manually checking each VM in vCenter, is there any way Veeam can check and advise when a Consolidate Helper-0 snapshot has not been successfully deleted so this can be triggered manually to avoid this issue.
4) Support recommended enabling "Safe Snapshot removal" which was already on with default settings (100M). Is there a recommended minimum level this can be set to, or would it make no difference if there is a "stale" Consolidate Helper-0 snapshot already on the VM.
5) Is having "Safe Snapshot removal" enabled likely to be why the Consolidate Helper-0 snapshot was created in the first place.
Thanks Adrian
-
- Chief Product Officer
- Posts: 31816
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Snapshot removal issues of a large VM
Hello Adrian,
1. Consolidate Helper snapshots are always created by ESX hosts during snapshot removal, but they should not persist under normal conditions (even if replication job fails).
2. It looks like a network or vCenter connection failure prevented Veeam Backup from issuing the snapshot removal command (this can be confirmed by our support with debug logs). The Veeam Backup snapshot will be removed automatically during the next Veeam Backup job pass, so we will take care of this one. However, the Consolidate Helper snapshot is not something Veeam Backup directly creates and manages; this snapshot is created during the snapshot removal process and should be cleaned up by ESX. But I am guessing a network or power issue during snapshot removal might cause this snapshot to remain? It is best to ask VMware to investigate their logs to understand why the Consolidate Helper snapshot remains.
3. This sounds like a feature we could add, seems useful to me - even though it is uncommon during normal operation to see this happening. I will investigate this with devs.
4. I guess they missed the fact that you are on vSphere. Enabling this feature on ESX4 will have no effect. This feature is designed for pre-ESX3.5 U2 hosts to help with consolidation of large snapshots. This is no longer needed as ESX now has built-in logic for safe removal of large snapshots.
Thanks!
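The check asked about in question 3 could be sketched roughly as follows. This is only an illustration, not a Veeam feature or an official script: it assumes each snapshot node exposes a `name` and a `childSnapshotList`, which mirrors the shape of the snapshot tree in the vSphere API, and it uses mock objects in place of a real vCenter connection. The helper name `find_snapshots` and the example snapshot names are hypothetical.

```python
from types import SimpleNamespace


def find_snapshots(tree, needle="Consolidate Helper"):
    """Return the path of every snapshot whose name contains `needle`.

    `tree` is a list of root snapshot nodes; each node is assumed to have
    a `name` attribute and an optional `childSnapshotList` attribute.
    """
    hits = []

    def walk(nodes, prefix):
        for node in nodes:
            path = prefix + [node.name]
            if needle.lower() in node.name.lower():
                hits.append(" / ".join(path))
            # Descend into child snapshots, if any.
            walk(getattr(node, "childSnapshotList", None) or [], path)

    walk(tree, [])
    return hits


if __name__ == "__main__":
    # Mock tree standing in for what a VM's snapshot manager would show.
    helper = SimpleNamespace(name="Consolidate Helper-0", childSnapshotList=[])
    root = SimpleNamespace(name="Veeam backup snapshot", childSnapshotList=[helper])
    for path in find_snapshots([root]):
        print("Leftover helper snapshot:", path)
```

Run against each VM on a schedule, a non-empty result would be the cue to remove the leftover snapshot manually before the next job runs.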
-
- Influencer
- Posts: 20
- Liked: never
- Joined: Dec 16, 2009 7:02 pm
- Full Name: Raymond Chew
- Contact:
Re: Snapshot removal issues of a large VM
curruscanis wrote: After the Veeam backup process calls to remove the snapshot the VM will go offline from the network's perspective...
Finally...I'm glad that I'm not the only one experiencing this problem. I've been working with Veeam Support for the last 3 months on this issue (amongst others) to no avail. I'm curious if we share similar environments.
I'm running...
- ESX 4.0.0, 208167
- 2 x IBM x3650 Servers
- Source and Target storage is over NFS
- All network connections run to a pair of Cisco 3750 cross-stack switches running etherchannel.
- VM Guest being backed up are about 70GB each.
- Veeam VBR is installed on a VM within the HA Pair.
- Veeam Replication set up to use the VMware "Network" vStorage API
- VM Tools Quiescence disabled / VSS Quiescence Enabled
Interesting thing to note. When I moved the source VM over to local storage, the network disconnect occurrences were much shorter. The usual disconnects were around 20 - 30 seconds. On local storage, about 5 seconds.
The other issue I've had that you may want to check: some of my VMs are resetting after the snapshot removal. This is a hard reset. I discovered this while looking at the Windows event logs for clues on the network disconnects. Since this was not occurring on all my VMs, I'm not sure if it's related. I just thought I would throw it out there to see if anyone else was experiencing this problem.
-
- Chief Product Officer
- Posts: 31816
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Snapshot removal issues of a large VM
rchew wrote: Interesting thing to note. When I moved the source VM over to local storage, the network disconnect occurrences were much shorter. The usual disconnects were around 20 - 30 seconds. On local storage, about 5 seconds.
Does it approximately match the VM freeze times in the VMware VM log? It is pretty easy to read, check it out?
Also (I don't know if our support suggested this), you may want to open a support case with VMware on this as well, as Veeam Backup merely issues the snapshot removal command (same as if you would initiate snapshot removal using VIC). Actual processing of this command is fully handled by the corresponding ESX host. VMware may be able to find the reason much faster as they actually own this code.
-
- Influencer
- Posts: 20
- Liked: never
- Joined: Dec 16, 2009 7:02 pm
- Full Name: Raymond Chew
- Contact:
Re: Snapshot removal issues of a large VM
"Does it approximately match VM freeze times in VMware VM log?"
It always happens right after the snapshot removal. Also...when I perform manual snapshots and removals, I do not see the network disconnect behaviour.
"Also (I don't know if our support suggested this), but you may want to open support case with VMware on this as well, as Veeam Backup merely issues snapshot removal command (same as if you would initiate snapshot removal using VIC). Actual processing of this command is fully handled by the corresponding ESX. VMware may be able to find the reason much faster as they actually own this code."
I have just initiated support on the VMware side. Unfortunately, our support contract is through IBM so I don't have direct access to VMware yet. We are changing this structure soon. However...since you have multiple customers experiencing this problem, I think Veeam should also pursue this problem with VMware. As a VMware partner, wouldn't you have much better access to some of their resources? Furthermore, since manual snapshots and removals do not exhibit this behavior, it is difficult to point the finger at VMware.
-
- Chief Product Officer
- Posts: 31816
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Snapshot removal issues of a large VM
I am not sure if you missed my question, but were you able to investigate the VMware logs for the affected VMs? These are the log files created on the datastore next to each VM. They provide information on the duration of VM stun cycles during snapshot commit operations; basically, if a VM remains stunned for a few seconds, this results in a network drop in the guest OS. This is the first thing I would check.
rchew wrote: It always happens right after the snapshot removal. Also...when I perform manual snapshots and removals, I do not see the network disconnect behaviour.
A VIC snapshot does not quiesce the VM. Freeze/unfreeze during quiescence is something that can potentially affect the guest OS. Did you try to run the Veeam backup job with both Veeam VSS and VMware Tools quiescence disabled in the Advanced job settings, and see if the issue goes away? This would be the closest behavior to a VIC snapshot. Also, when testing, make sure you wait enough time before removing the snapshot, so it becomes large enough (like in the case of a backup, which takes some time).
rchew wrote: However...since you have multiple customers experiencing this problem, I think Veeam should also pursue this problem with VMWare. As a VMWare partner, wouldn't you have much better access to some of their resources?
We do open support cases directly with VMware for any problem we can reproduce internally (to be able to show them). For example, we were first to open a support case and bring this issue to VMware's attention. But for problems which are not reproducible and affect a few specific deployments only, VMware needs to work directly with the affected customer, as this requires a direct webex session between the VMware SE and the customer to troubleshoot.
rchew wrote: Furthermore, since manual snapshots and removals do not exhibit this behavior, it is difficult to point the finger at VMWare.
Every time I saw this issue before, there was no difference between removing the snapshot through VIC and letting Veeam Backup remove it. I would be surprised if this was otherwise, because both tools issue the same RemoveSnapshot VMware API call to initiate snapshot removal on ESX. Please PM me your support case number so that I can look up more details on your situation.
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Snapshot removal issues of a large VM
OK, so as a followup to this, we actually did have a similar problem happen yesterday. Around 11:30AM we started receiving complaints of users having poor response from Outlook, especially for messages with attachments, and we started to investigate. What we found was that, due to an administrative error, a Veeam full backup ran during and overlapped with part of the business day. Our 400+GB Exchange server was backed up starting around 8:00AM and completed around 11:00AM. In the process the VMware snapshot grew to over 6GB. At the end of the backup Veeam initiated the snapshot removal process. During this snapshot removal process Exchange was very slow to respond to requests, even timing out some Outlook connections.
While this was an unusual issue that occurred primarily due to an administrative error, we still wondered if anything we could do with VMware could potentially prevent this issue in the future. We decided to try bumping the CPU reservation and shares up significantly as, while the snapshot removal was taking place, the VM showed very high CPU utilization within the VM, but very low utilization within ESX. This made us think that the VMware snapshot removal process may place some cap on the resource utilization during snapshot removal, however, it should not be allowed to cap the performance below the reservation level.
Today we performed the same full backup just as a test. We kicked off a full backup starting at 8AM and, predictably, it ended around 11AM. The snapshot growth today was around 5.8GB. When the snapshot removal process started the system definitely slowed down, but not nearly as much as yesterday. Performance of the Outlook client was still slower than normal, sometimes taking a few seconds to open messages with attachments, but nothing like yesterday. The increase of CPU reservation seemed to have a significant impact.
We're going to try increasing the CPU reservation to max. We're thinking this might be the key to keeping busy VMs responsive during background snapshot commits. Has anyone else tried using CPU reservation settings to keep VMs responsive during snapshot removal and seen any positive results?
We also still wonder if Veeam's "safe snapshot removal" might still be useful in a scenario like this. It's strange, but creating a new snapshot, and removing an old snapshot via the vCenter console doesn't seem to have the same negative performance impact as the "delete all snapshots" option. It might be worth trying.
Also, note that I never experienced the complete loss of connectivity or response that the OP of this thread reported, but this performance issue was still enough to be noticed by some users.
-
- Novice
- Posts: 3
- Liked: never
- Joined: Mar 12, 2010 7:44 pm
- Full Name: Gary Rizo
- Contact:
Re: Snapshot removal issues of a large VM
Not sure if this has been mentioned as a solution already, but I ran into this issue and resolved it by reinstalling VMware Tools. Also, there was a fix for vSphere 4.0.1 related to this issue: http://kb.vmware.com/selfservice/micros ... Id=1017458
-
- Expert
- Posts: 105
- Liked: 2 times
- Joined: Feb 16, 2010 8:05 pm
- Full Name: John Jones
- Location: New Zealand
Re: Snapshot removal issues of a large VM
Hi,
To add my 2 cents' worth: we have similar problems with SQL 2005 running on a Windows 2003 Server that is constantly in use (i.e. no downtime). When I go to back up this server it freezes, the application servers lose contact with it, and the batch jobs fail. Also, when the job finishes and the snapshot is being committed, the server freezes again and I cannot even log in to it. I have turned off VM quiescence and am using VSS.
I have even tried copying the data that needs to be backed up to a drive on the server (E:\) that is used only to store this data, and then using Veeam to back up only this drive while excluding the C:\ and D:\ drives. The server still freezes. Gostev, why would this happen if I am only backing up a single drive that is not being used?
regards,
John
-
- Chief Product Officer
- Posts: 31816
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Snapshot removal issues of a large VM
John, this is because the freeze is caused not by backup activity, but by issues with snapshot creation and deletion. VMware snapshots affect the whole VM, not just specific disks.
-
- Influencer
- Posts: 23
- Liked: never
- Joined: Jan 11, 2010 9:18 pm
- Full Name: Arthur Pizyo
- Contact:
Re: Snapshot removal issues of a large VM
We do not experience server disconnect issues, but rather a slowdown of our busy Exchange 2003 VM (160GB), as described by tsightler. It is more prominent on messages with attachments. In fact we have three brief slowdowns for every replication cycle: one when the original snapshot is initiated by Veeam, a second (the most noticeable one) when Veeam initiates snapshot removal, and a third closer to the end of the replication cycle, probably at the point when the consolidate helper is removed.
rchew wrote: Interesting thing to note. When I moved the source VM over to local storage, the network disconnect occurrences were much shorter. The usual disconnects were around 20 - 30 seconds. On local storage, about 5 seconds.
Now, my post goes to the quote above. We had this server on local storage and snapshot removal hardly ever took over 10 minutes. Since we moved it over to the SAN (IBM DS3300) snapshot removal routinely takes over 20 minutes, e.g., we just had a job where it took 15 minutes to create a replica, but 18 minutes to remove the snapshot. I will have more stats later this week (hopefully), but it seems to be a consistent behaviour on all our servers as they are moved to SAN.
It is the latest Veeam, VMware 4.0.2.
Thank you, Arthur
-
- Influencer
- Posts: 23
- Liked: never
- Joined: Jan 11, 2010 9:18 pm
- Full Name: Arthur Pizyo
- Contact:
Re: Snapshot removal issues of a large VM
I find that snapshot removal takes considerably longer when a VM is running off the SAN as opposed to local storage.
To illustrate the situation, here are the details:
The VM, with two 30GB drives, is our AV (Symantec) and WSUS server. As such, it gets busy at times when Symantec downloads new definitions and pushes those to clients. The same applies to WSUS around MS Tuesday updates.
When we run it off the SAN (IBM DS3300) on an IBM x3650M3, average replication time is 5:23, of which 1:23 is snapshot removal. When the VM is moved to local storage on an IBM x3500 (a far less powerful box), the replication and snapshot removal times are 4:11 and 0:11 respectively (minutes:seconds). These results are averages of 50 replications (we run these every 20 minutes) over comparable daily intervals.
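As a rough check of the figures above, here is a minimal Python sketch that works out how much of each replication cycle goes to snapshot removal. It assumes the times are minutes:seconds; the helper names `to_seconds` and `removal_share` are made up for illustration and are not part of any Veeam or VMware tooling.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '5:23' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def removal_share(total: str, removal: str) -> float:
    """Fraction of the replication cycle spent on snapshot removal."""
    return to_seconds(removal) / to_seconds(total)


# Averages quoted in the post (assumed to be minutes:seconds).
san_share = removal_share("5:23", "1:23")
local_share = removal_share("4:11", "0:11")
slowdown = to_seconds("1:23") / to_seconds("0:11")

print(f"SAN:   {san_share:.1%} of the cycle is snapshot removal")
print(f"Local: {local_share:.1%} of the cycle is snapshot removal")
print(f"Snapshot removal takes {slowdown:.1f}x longer on the SAN")
```

On these numbers, snapshot removal is roughly a quarter of the cycle on the SAN versus under 5% on local storage, while the rest of the replication (4:00 in both cases) is unchanged.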
Now, this is not a problem for this particular VM, as the end user will not notice the difference for AV and Windows updates. Data from this VM was used for illustration only. However, the same situation holds for all other VMs, including domain controllers, file and print servers, and Exchange 2003 and SQL 2005 servers. We attempt to use Veeam in "near CDP" mode and this is a significant roadblock. We can't do "near CDP" on SQL as it interferes with native SQL backups. The file server is not particularly responsive during snapshot removal, but this is probably something we can live with. The biggest issue is Exchange, as we used to run hourly backups throughout the day. Since we moved it to the SAN, regular operation is unaffected, but snapshot removal routinely takes over 15 minutes (up to 30), during which the end-user experience is quite bad. This is especially true for mail with attachments.
I will follow up with more data as we move our VMs around. In the meantime, any advice is very much appreciated.
Thank you, Arthur
-
- Enthusiast
- Posts: 30
- Liked: never
- Joined: Apr 07, 2010 9:49 am
- Full Name: Marko Tarvainen
- Contact:
Re: Snapshot removal issues of a large VM
We have the same downtime issues now. No problems for almost a year, but now our busy Exchange server is having issues during backup. When Veeam removes the snapshot it causes 15-20 minutes of downtime for Exchange, and Outlook clients lose their connection for this period. Are there any ideas on what could reduce the downtime during snapshot removal?
-
- Veteran
- Posts: 391
- Liked: 39 times
- Joined: Jun 08, 2010 2:01 pm
- Full Name: Joerg Riether
- Contact:
Re: Snapshot removal issues of a large VM
The old rule: The more disk load on the vm and the longer the time frame, the bigger the snapshot, the slower the snapshot commit.
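To put the rule in numbers: an earlier post in this thread reported roughly 5.8GB of snapshot growth over a backup that ran from 8AM to about 11AM. The sketch below turns that into an average growth rate and projects snapshot size for other backup windows. It assumes a constant write rate, which real workloads will not have, and it does not model commit time at all; the function names are illustrative only.

```python
def growth_rate_gb_per_hour(growth_gb: float, hours: float) -> float:
    """Average snapshot growth rate over the backup window."""
    return growth_gb / hours


def projected_snapshot_gb(rate_gb_per_hour: float, hours: float) -> float:
    """Projected snapshot size, ASSUMING the write rate stays constant."""
    return rate_gb_per_hour * hours


# Figures from the earlier post: about 5.8GB of growth in about 3 hours.
rate = growth_rate_gb_per_hour(5.8, 3.0)
print(f"Average growth: {rate:.2f} GB/hour")

# Hypothetical backup windows, for comparison only.
for window_hours in (1, 3, 6):
    size = projected_snapshot_gb(rate, window_hours)
    print(f"{window_hours}h window -> about {size:.1f} GB to commit")
```

The point is only that snapshot size scales with both write load and how long the snapshot stays open, so shortening the backup window or moving it to a quieter period shrinks what has to be committed.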
Back up during times of low disk load, so the snapshot won't grow that much and can be committed very quickly. 20-30 minutes offline is something I have never seen before. Could you by any chance check whether this behaviour also occurs when using ESXi 4.1? VMware did A LOT, especially when it comes to snapshot handling, with ESXi 4.1.
If you mentioned it already, mea culpa, but could you describe exactly what you use (ESX version, iSCSI/FC, and if iSCSI, HBA or software initiator, SAN vendor and model)?
Best regards,
Joerg
-
- Enthusiast
- Posts: 30
- Liked: never
- Joined: Apr 07, 2010 9:49 am
- Full Name: Marko Tarvainen
- Contact:
Re: Snapshot removal issues of a large VM
We are using ESXi 4.0 on this server at the moment, and we're using DAS. We had the same problem in February this year, but we did something then that corrected it. I just can't remember what the fix was.