egroeg
Enthusiast
Posts: 55
Liked: never
Joined: Sep 23, 2010 2:36 pm
Full Name: George Kenny
Contact:

Guest VM halts during replication snapshot

Post by egroeg »

Dear community.
  • Veeam B&R v5
  • VMware Essentials Plus
  • Replication via Virtual Appliance Mode
During replication from a LAN-connected host, my guest VM actually halts activity, so any users on Terminal Services get their sessions stuck.

My replication method is via "Virtual Appliance" mode.

Is this normal behaviour - my understanding was that during the replication the guest is snapshotted online which shouldn't cause interruption?
Alexey D.

Re: Guest VM halts during replication snapshot

Post by Alexey D. »

Hello George,

Yes, snapshotting can cause sessions to get stuck like this. How loaded is your VM, and how many users are on it?
As a workaround, I would recommend doing this replication during "quiet" hours.
egroeg
Enthusiast
Posts: 55
Liked: never
Joined: Sep 23, 2010 2:36 pm
Full Name: George Kenny
Contact:

Re: Guest VM halts during replication snapshot

Post by egroeg »

Thank you for this - much appreciated!

I will schedule the replication at a convenient time.
crowntech
Novice
Posts: 4
Liked: never
Joined: Dec 02, 2010 5:24 pm
Full Name: Jason Carter
Contact:

SQL Gets Disconnected during Replication

Post by crowntech »

[Merged with existing discussion]

I have a client that is using Veeam 5 to replicate SQL offsite. When the job launches and the snap is created, all the users get disconnected from the application. Any ideas?
HouseofPang
Lurker
Posts: 2
Liked: never
Joined: Jan 07, 2011 5:26 pm
Full Name: Hanson Pang
Contact:

ESX4.1 + CBT + NFS + Snapshot = VM Freeze

Post by HouseofPang »

[Merged with existing discussion]

Anyone else running into this issue? Backing up a VM on NFS storage with vSphere 4.1 and CBT, guests can freeze for up to 3 minutes during the snapshot removal stage.

kb1031106

http://kb.vmware.com/selfservice/micros ... 0143922415
TrevorBell
Veteran
Posts: 357
Liked: 17 times
Joined: Feb 13, 2009 10:13 am
Full Name: Trevor Bell
Location: Worcester UK
Contact:

Re: ESX4.1 + CBT + NFS + Snapshot = VM Freeze

Post by TrevorBell »

Hi,

How big is the VM? Did you enable "safe removal of snapshots" in the advanced settings? When I backed up my Exchange for the first time I saw this happen, as I didn't have that option ticked.
Also, regarding the KB you are referring to: Veeam automatically tells you if you need to do this and enable CTK.

trev
Alexey D.

Re: Guest VM halts during replication snapshot

Post by Alexey D. »

Jason, Hanson,

Please refer to this post: Re: Many VM's dropping packets losing pingstate during rep/b for a more detailed explanation.

Trevor, a reminder that the "Safe snapshot removal" option should mostly be used with pre-ESX 3.5 U2 hosts.
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Guest VM halts during replication snapshot

Post by tsightler »

Alexey D. wrote:Trevor, a reminder that the "Safe snapshot removal" option should mostly be used with pre-ESX 3.5 U2 hosts.
I see Veeam reps state this a lot, but I actually think this option is still valuable even for newer versions. It's true that VMware versions 3.5 U2 and later use a snapshot removal method that is very similar to "safe snapshot removal" anyway, but for some reason it's not quite the same. I think when VMware is removing snapshots it still "throttles" the guest OS somewhat, probably to attempt to limit growth of the new snapshot. We see this on our Exchange server quite noticeably if a backup/replication is performed during a busy portion of the day.

As an example, if a Veeam backup is run on our Exchange server and takes 30-40 minutes, we can easily end up with 6-8GB of snapshots that have to be removed. Because the Exchange server is still very busy while snapshot removal is taking place, it can take anywhere from 60-90 minutes for VMware to complete removing the snapshot. During that time users get very poor performance, and mail can sometimes become backlogged.

If we use "safe snapshot removal", it takes even longer to remove the snapshot, but for some reason the impact on users is almost unnoticed. I don't know for sure why Veeam's method of safe snapshot removal allows for more performance than VMware's native method, but it appears that VMware restricts the performance of the VM to try to make sure it is making progress in removing that snap, while Veeam doesn't have this side effect. We've had some success mitigating the impact of VMware's snapshot removal process by setting very high reservations for potentially busy VMs, but it's not a huge difference.
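
For a rough sense of scale, the Exchange figures above (6-8GB of snapshot accumulated over a 30-40 minute backup) can be turned into an average change rate. This is only back-of-envelope arithmetic in Python; the function name is made up for illustration, and it assumes the snapshot grows at a steady rate, which a real workload won't do.

```python
def snapshot_growth_mb_per_s(snapshot_gb, backup_minutes):
    """Average snapshot growth in MB/s, assuming steady growth (a simplification)."""
    return snapshot_gb * 1024 / (backup_minutes * 60)

# Bounds from the figures quoted above: 6-8GB accumulated over 30-40 minutes.
low = snapshot_growth_mb_per_s(6, 40)   # slowest growth: least data, longest window
high = snapshot_growth_mb_per_s(8, 30)  # fastest growth: most data, shortest window
print(f"{low:.2f} to {high:.2f} MB/s")
```

That is only a few MB/s of changed data on average, for whatever that is worth when comparing against what the datastore can sustain.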
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: SQL Gets Disconnected during Replication

Post by tsightler »

crowntech wrote:[Merged with existing discussion]
I have a client that is using Veeam 5 to replicate SQL offsite. When the job launches and the snap is created, all the users get disconnected from the application. Any ideas?
HouseofPang wrote:[Merged with existing discussion]
Anyone else running into this issue? Backing up a VM on NFS storage with vSphere 4.1 and CBT, guests can freeze for up to 3 minutes during the snapshot removal stage.
I'm not sure it's correct to merge these two issues, as they appear to be two different problems. The first is about clients getting disconnected when the snap is created, and the other reports a freeze during snapshot removal.

Snapshot creation really shouldn't be causing clients to be disconnected. Are you using Veeam VSS and is it working correctly? Depending on how sensitive your client application is, it may be possible that VSS is freezing the system for a little too long, although normally it is only a few seconds. We've never seen this be a problem in our environment, but if the system is very busy the VSS freeze can take longer. If you are using VSS I'd try it without that option (and without VMware tools quiescence as well) just to see if they still get kicked out.

Now, snapshot removal is a different issue altogether. Normally there is a performance impact during snapshot removal, which can vary significantly based on the load on the VM, especially I/O load, but the "pause" should usually only last a couple of seconds as the final "stun" freezes the system to remove the final snapshot. I'd get VMware to look at the logs from the VM and tell you why a snapshot removal task is causing such a long hang.
crowntech
Novice
Posts: 4
Liked: never
Joined: Dec 02, 2010 5:24 pm
Full Name: Jason Carter
Contact:

Re: Guest VM halts during replication snapshot

Post by crowntech »

We are using Veeam VSS and we see in the Windows logs where the writer is started. Since this is a SQL server, if I disable VSS will the database be in a consistent state?
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Guest VM halts during replication snapshot

Post by tsightler »

I'm suggesting that you disable it only to see if that is causing your issue. The honest truth is, even without VSS your database is likely crash consistent, but I wouldn't suggest relying on Veeam without VSS. Still, it would be interesting to know if it's the VSS process that is causing the "halt" so running without it is a good test to see if it's VSS that's causing the problem, or just the act of creating a snapshot.
HouseofPang
Lurker
Posts: 2
Liked: never
Joined: Jan 07, 2011 5:26 pm
Full Name: Hanson Pang
Contact:

Re: SQL Gets Disconnected during Replication

Post by HouseofPang »

tsightler wrote: I'm not sure it's correct to merge these two issues, as they appear to be two different problems. The first is about clients getting disconnected when the snap is created, and the other reports a freeze during snapshot removal.

Snapshot creation really shouldn't be causing clients to be disconnected. Are you using Veeam VSS and is it working correctly? Depending on how sensitive your client application is, it may be possible that VSS is freezing the system for a little too long, although normally it is only a few seconds. We've never seen this be a problem in our environment, but if the system is very busy the VSS freeze can take longer. If you are using VSS I'd try it without that option (and without VMware tools quiescence as well) just to see if they still get kicked out.

Now, snapshot removal is a different issue altogether. Normally there is a performance impact during snapshot removal, which can vary significantly based on the load on the VM, especially I/O load, but the "pause" should usually only last a couple of seconds as the final "stun" freezes the system to remove the final snapshot. I'd get VMware to look at the logs from the VM and tell you why a snapshot removal task is causing such a long hang.

You are right, according to VMware support. My issue during snapshot removal has to do with NFS locking. According to them, the workaround is to disable CBT or move all storage off NFS... neither of which is really an option for me, lol.
Gostev
Chief Product Officer
Posts: 31804
Liked: 7298 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Guest VM halts during replication snapshot

Post by Gostev »

Agreed, this can hardly be classified as a "workaround"... absolutely unacceptable if you have already invested in NFS storage.
egroeg
Enthusiast
Posts: 55
Liked: never
Joined: Sep 23, 2010 2:36 pm
Full Name: George Kenny
Contact:

Re: Guest VM halts during replication snapshot

Post by egroeg »

Alexey D. wrote:Hello George,

Yes, snapshotting can cause sessions to get stuck like this. How loaded is your VM, and how many users are on it?
As a workaround, I would recommend doing this replication during "quiet" hours.
Hi Alexey D.

I've done this and it does work - but can I ask, "SHOULD" Veeam provide a seamless transition for replication - i.e. should my users be able to work unaffected whilst a replication is taking place?

My opinion is that it SHOULD be able to run a background replication, but my thoughts are that this should only be possible if the replication is using "SAN based" replication - whereas "Virtual Appliance" based replication/backup methods will add load to the task, and hence users on Terminal Services boxes etc. (where interaction is prevalent) will experience noticeable side effects.

Comments?
Gostev
Chief Product Officer
Posts: 31804
Liked: 7298 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Guest VM halts during replication snapshot

Post by Gostev »

Hi George,

To put it straight, Veeam has nothing to do with the snapshot removal process - this is handled solely by VMware (specifically by the ESX host). VMware completely isolates ISVs from being able to affect the process, and only gives us a single API call, and all it does is "ask" VMware to initiate snapshot removal. The rest is beyond our control.

Now, to answer your question - yes, a VMware snapshot "SHOULD" not affect users, and should provide a seamless transition from running the VM off the snapshot file back to running it off the VMDK files. Snapshot commit "SHOULD" not affect users, at least that's the promise of VMware snapshots. But of course there could be bugs and environmental issues causing the VMware snapshot commit procedure to affect applications inside VMs. These are best troubleshot with VMware directly.

For example, recently someone told me that VMware had confirmed there is currently an issue with snapshot removal on NFS storage which can cause significant timeouts on snapshot commit.

Thanks!
topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: Guest VM halts during replication snapshot

Post by topry »

For what it's worth, it is not exclusive to NFS. All of our datastores are VMFS and we are having this issue during snapshot creation and removal. Creation takes 15 to 30 seconds, depending on the VM and the underlying datastore - and every VM we have becomes non-responsive for 5+ seconds during this process. Removal times vary from 30 seconds to minutes and the VMs become unresponsive at varying times for 5-10 seconds throughout the process.

Turning off CBT is not a viable option since we want to do hourly backups. If we were going to turn it off, we could just as well leave it on and back up after hours, but that too is not a good option for us.

We have an open ticket with VMware on this. If anyone knows of a resolution other than disabling CBT, I would appreciate the info. I'm assuming not everyone is affected by this? For those that are not, are you using NFS/VMFS and what is your underlying datastore (disk type, RAID)?

I have tested on 15k SAS in RAID 1 as well as 'near SAS' SATA 6GB RAID5 LUNs. All experience the issue, the SAS for a shorter period of time of course due to their speed.
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Guest VM halts during replication snapshot

Post by tsightler »

Yes, I believe that it is quite normal for the VMs to become unresponsive for a few seconds. I think the people in this thread are experiencing freezes on the order of minutes. I see you state that you are experiencing freezes during the entirety of your snapshot removal process, correct? I have never seen this behavior. We've had a few occasions where our Exchange server was accidentally backed up during the business day and the snapshot removal process had a very negative impact on its performance, but it did not stop responding, although users complained of slowness and the occasional "trying to retrieve data from the MS Exchange server" message. During the very last "stun" there was a 10 second pause or so, and that's typically the worst we see.

I believe this to be "normal" for VMware ESX because of the way their technology works. I do believe that features like VAAI, which offload features like snapshots to the storage array, will significantly improve this issue but I have not tested these features on my storage systems yet.
topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: Guest VM halts during replication snapshot

Post by topry »

Fortunately, we are not seeing the multi-minute issue, just a few secs on snapshot creation and then multiple 5-10 second pauses during snapshot release. While our Exchange server is one of these, those pauses are not the cause of our concern; it's our SQL server and file server. Having those go non-responsive every hour for a few seconds is a non-starter. If that is normal behavior, how can you use Veeam to back up VMs during business hours?
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Guest VM halts during replication snapshot

Post by tsightler »

In our setup we see a few seconds pause on these servers but this largely goes unnoticed by the user community. A delayed SQL response doesn't kill connections or anything, just happens to delay any response to a client for a few seconds, but only if there happened to be someone that was actually making a connection at that exact instant. We've never even had a user complaint on anything other than Exchange.

One thing you want to make sure of is that you have your CPU, Memory, and I/O reservations high enough for the server. I've found that having these set to low levels may not reserve enough resources to keep the system responsive during snapshot removal.

Do you see the pauses when simply making and removing snapshots from vCenter, or is it only with Veeam?
topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: Guest VM halts during replication snapshot

Post by topry »

The delay is independent of Veeam. The concern is that the delay will only increase in duration. A pause for even a few seconds during certain high volume SQL and file i/o transactions isn't acceptable for us. Our memory and CPU usage does not appear to be the issue. I/O does not go above 40% except when the backups or replications are running. Regardless, I'm hoping this isn't normal behavior as having a system become unresponsive, even for a few seconds, isn't acceptable in most enterprise production environments.

I found the following excellent description of what occurs during snapshots: http://www.vmdamentals.com/?p=332
This data matches our experience. I am going to test his suggestion of using a different (internal) LUN as a working directory, using 15k SAS configured as RAID 10.

Has anyone else using Veeam utilized this method?
topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: Guest VM halts during replication snapshot

Post by topry »

I modified some vmx files, setting the workingDir to a dedicated LUN on 15k SAS on the server, as well as to a dedicated LUN on the MD3200i, but with a primary controller different from the LUN with the target VMs. In both cases, performance was slightly better than creating snapshots on the same LUN as the VM. With the limited testing we can do, the results are barely more than anecdotal, but showed a 20 to 25% reduction in both creation and release. The internal SAS was slightly faster, but not by as much as I expected (single digits over the dedicated iSCSI LUN on a separate controller).

The longer the backup takes, the more changes to the VM, the larger the resulting snapshot, the longer it will take VMware to integrate those changed blocks into the vmdk, the longer it takes to release the snapshot, and the more interruptions in service we experienced on release. Regardless of the disk source used (SAS, internal, SAN), every test I ran shows at least a minor interruption/delay during snapshot creation and release. I would like to test using SSD drives for the workingDir LUN, but the R710 controller apparently will only work with the ones sold by Dell and they are stupid expensive (1k for a 50GB and 2k for a 100GB).

If anyone is doing snapshots of VMs during high activity levels (database, file server), I would appreciate knowing if you are experiencing the same interruption and if not, what configuration you have.
topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: Guest VM halts during replication snapshot

Post by topry »

Wanted to add one more follow-up on this if you are considering using a separate datastore/LUN for your workingDir:

1) Make sure you create this datastore with a block size large enough to support the maximum vmdk you will be snapshotting. If you choose a 1MB block size, you will not be able to create a snapshot for a 512GB disk.

2) After changing the workingDir you must remove/re-add the VM from inventory, which also requires you to remove/re-add the VM to your Veeam job(s). This also appears to affect CBT, causing the next backup/rep to be a full backup.
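
A small Python sketch of the block-size check in point 1. The limits table is an assumption on my part: it is the commonly cited VMFS-3 mapping of block size to maximum file size, quoted from memory and not confirmed anywhere in this thread, so verify it against VMware's documentation for your version. The function name is made up for illustration.

```python
# Assumed VMFS-3 limits, quoted from memory rather than from this thread:
# block size in MB -> approximate maximum file size in GB. Verify before relying on them.
VMFS3_MAX_FILE_GB = {1: 256, 2: 512, 4: 1024, 8: 2048}

def min_block_size_mb(vmdk_gb):
    """Smallest block size (MB) whose assumed file-size limit is strictly above vmdk_gb."""
    for block_mb in sorted(VMFS3_MAX_FILE_GB):
        # Strict comparison: real limits sit slightly below the round figures,
        # so a disk exactly at a limit is treated as not fitting.
        if vmdk_gb < VMFS3_MAX_FILE_GB[block_mb]:
            return block_mb
    raise ValueError(f"{vmdk_gb}GB exceeds the largest assumed VMFS-3 file size")

for size_gb in (100, 300, 512):
    print(size_gb, "GB ->", min_block_size_mb(size_gb), "MB block size")
```

The strict comparison is deliberately conservative, so a 512GB disk is pushed to the next block size up instead of being allowed to sit exactly on the limit.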
Vitaliy S.
VP, Product Management
Posts: 27371
Liked: 2799 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: Guest VM halts during replication snapshot

Post by Vitaliy S. »

topry wrote:2) After changing the workingDir you must remove/re-add the VM from inventory, which also requires you to remove/re-add the VM to your Veeam job(s). This also appears to affect CBT, causing the next backup/rep to be a full backup.
That's correct. Each time you re-add a VM to VMware inventory it gets a unique ID, so a backup job treats this VM as a new one, thus making a Full run.
alexlihk
Lurker
Posts: 2
Liked: never
Joined: Mar 16, 2011 12:31 pm
Full Name: Alex Li
Contact:

Re: Guest VM halts during replication snapshot

Post by alexlihk »

I ran some tests with a separate disk for snapshots, and the performance is noticeably better. One important detail: the OS disk is set to independent (so it is not included in the snapshot).
On the other hand, I found that the OS keeps running but the network is interrupted:
While the snapshot is being taken, MSSQL itself keeps running smoothly with little delay (looping one transaction per second, each result comes back in under 1.2s); however, 1 "ping timeout" occurs. Snapshot removal causes a 5-6s delay and 2-3 "ping timeouts".
It seems the network timeout is the most critical issue...
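
For anyone who wants to repeat this kind of measurement, here is a minimal Python sketch. It is not the test used above: it swaps ICMP ping for a once-per-second TCP connect (which needs no special privileges), and the function names, the tolerance value, and the host name in the usage note are all made up for illustration.

```python
import socket
import time

def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def find_stalls(reply_times, interval=1.0, tolerance=0.5):
    """Return (last_reply_time, gap_seconds) for each gap longer than interval + tolerance."""
    stalls = []
    for prev, cur in zip(reply_times, reply_times[1:]):
        gap = cur - prev
        if gap > interval + tolerance:
            stalls.append((prev, gap))
    return stalls

def watch(host, port, duration=60.0, interval=1.0):
    """Probe once per interval for duration seconds; return timestamps of successful probes."""
    replies = []
    start = time.monotonic()
    while time.monotonic() - start < duration:
        if tcp_probe(host, port, timeout=interval):
            replies.append(time.monotonic() - start)
        time.sleep(interval)
    return replies
```

Something like `find_stalls(watch("sqlserver.example", 1433))`, run while a snapshot is created and removed, would list each window in which the guest stopped answering. The host name is a placeholder, and 1433 is just the default SQL Server port.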
topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: Guest VM halts during replication snapshot

Post by topry »

My testing had similar results - I can run a CPU-intensive process within the VM itself and it will experience a slight slowdown during a snapshot but is not totally blocked/locked/stopped. However, network I/O is always interrupted (a few hundred milliseconds minimum on creation to several thousand on release). The more disk I/O during the snapshot creation/release, the longer it takes, but the network I/O interruption will start/stop as VMware processes the snapshot file(s).

I opened a support ticket with VMware when I discovered this and their response was 'this is expected behavior' - when the snapshot is created and again when released, they quiesce the system, which will pause all activity. For network I/O intensive applications/systems like databases and file servers, this can be a problem. While our interruptions remain in the few second range (total), we have modified some processes to work around this. I also tested without CBT enabled, but still experienced a slight pause in network connectivity - not that disabling it would be a viable solution for our current implementation.

The recommendation in the article linked above to use a separate/fast LUN for snapshotting is the only thing I have found that shortens the interruption. Once Veeam modifies replication so that using that option does not cause issues with the replicated image, we will modify our database and file servers to use that method.

Gostev - perhaps you could add something on this topic to your FAQs?
alexlihk
Lurker
Posts: 2
Liked: never
Joined: Mar 16, 2011 12:31 pm
Full Name: Alex Li
Contact:

Re: Guest VM halts during replication snapshot

Post by alexlihk »

Latest result: with the OS disk excluded from the snapshot (OS disk set to independent), network ping is no longer stunned... although the MSSQL delay averages 5.4s. If the OS can buffer this delay, the application may not be affected.

Here is a summary of our modifications:
1. Separate disk for snapshots (saves 30% of the stun)
2. Remove the OS disk from the snapshot (avoids the network stun)
stormlight
Enthusiast
Posts: 48
Liked: 3 times
Joined: Apr 28, 2011 5:34 pm
Full Name: JG
Contact:

Re: Guest VM halts during replication snapshot

Post by stormlight »

I assume that those of us who have Dell EqualLogic SANs, who can't control the I/O load on LUNs and have to let the Dell EQs do the magic themselves, can't really do anything else to fix this issue.
bobfink
Novice
Posts: 5
Liked: never
Joined: Jul 18, 2011 3:29 pm
Full Name: Bob Fink
Contact:

Lose Server Connection at end of Replication

Post by bobfink »

[merged]

On several servers we are losing connection momentarily after a replication occurs. The timing seems to coincide with the snapshot removal. A couple of our programs seem to be more sensitive than others and kick up errors and close on our users. Our SQL server is causing the majority of the disconnects.

We are running ESX 4.1 (433742) with Veeam running on a physical box, running the jobs in SAN mode. The data size for the SQL server is normally around 1 GB and the job runs for 5-6 minutes. It is scheduled to run every 30 minutes.

Any ideas on how to resolve this issue? I'm guessing this is going to be called a VMware issue, but I wanted to ask here in case someone has already resolved it.

Thanks for any input!
Bob
Beevoir
Expert
Posts: 144
Liked: never
Joined: May 06, 2010 11:13 am
Full Name: Mike Beevor
Contact:

Re: Lose Server Connection at end of Replication

Post by Beevoir »

Hi Bob,

The easiest way to test the connectivity issue is to take a normal VMware snapshot of the VM that is being a bit sensitive and see if it occurs then. If it does, it will likely be a VMware issue (since we use exactly the same API call for creating/removing snapshots).

If not, are you replicating locally across a LAN, or remotely across a WAN, and if you are replicating remotely, are you pulling the data from Production, or pushing it away from Production?
bobfink
Novice
Posts: 5
Liked: never
Joined: Jul 18, 2011 3:29 pm
Full Name: Bob Fink
Contact:

Re: Lose Server Connection at end of Replication

Post by bobfink »

We are pushing the replication locally across our LAN.

The hard part of testing is it only happens once a day or every other day and we are doing 48 replication jobs a day.
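
To put rough numbers on why this is hard to reproduce, here is a short Python sketch. It assumes every replication run fails independently with the same probability, which is a simplification; the function name and the 95% confidence level are my own choices for illustration.

```python
import math

def runs_needed(p_per_run, confidence=0.95):
    """Runs needed to see at least one failure with the given confidence,
    assuming each run fails independently with probability p_per_run."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_per_run))

jobs_per_day = 48
# One failure per day vs. one every other day, from the figures above.
for failures_per_day in (1.0, 0.5):
    p = failures_per_day / jobs_per_day
    n = runs_needed(p)
    print(f"p={p:.4f}: about {n} runs, or {n / jobs_per_day:.1f} days of the normal schedule")
```

Under those assumptions a manual snapshot test would need on the order of a few hundred create/remove cycles before a clean result means much.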