VM IO pause during backup

jamesharper-bsol · Jun 10, 2015 5:26 pm

Hi,

I've got a support case open (Case 00918770) which is getting more complicated by the day and slow moving. I'm hoping someone may have seen this and can provide additional pointers!

We are backing up Exchange 2013 on Hyper-V 2012 R2. There are two DAG nodes, one active and one passive; the passive node is being backed up but experiences IO pauses of up to 90 seconds.

Initially it would affect both DAG nodes and cause database failovers, but we have split out the active node onto its own dedicated LUN (there were other VMs sharing it previously). This has stabilized the system: we no longer get DB failovers and customers no longer get a disconnect/reconnect. However, the passive node still has the problem.

We have worked through multiple things with support:
http://www.veeam.com/kb1744 (cluster changes made no difference)
8.0 Update 2 installed
SAN firmware upgrade (Dell MD3620f to 8.20 - latencies are low, maxing 30-40ms, nowhere near 90 seconds)
The hardware is Cisco UCS & MDS FC switches.
OS & Hyper-V patches
Manual VSS within Exchange VM was fine with no issues
Only happens during backups: we have moved the backup window and the problem follows it

The VM does not go to a saved state; it keeps running and IO just stops. PerfMon graphs show all disk transfer counters drop to 0 for the 90 seconds while the disk queue rises slowly. Our current hypothesis is that CSV VSS snapshots are causing the pause.
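For anyone who wants to pull these stalls out of a PerfMon log rather than eyeball the graph, here is a minimal Python sketch. It assumes the "Disk Transfers/sec" counter has already been exported and parsed into (timestamp, value) pairs; the function name `find_stalls`, the 5-second threshold, and the synthetic sample data are illustrative assumptions, not anything from the support case.

```python
from datetime import datetime, timedelta


def find_stalls(samples, min_seconds=5.0):
    """Return (start, end) pairs where the counter stayed at 0 for at least min_seconds.

    samples: iterable of (datetime, float) pairs ordered by time, e.g. values of
    a "Disk Transfers/sec" counter taken from a PerfMon export.
    """
    stalls = []
    start = None  # time of the first zero sample in the current run
    last = None   # time of the most recent zero sample in the current run
    for ts, value in samples:
        if value == 0:
            if start is None:
                start = ts
            last = ts
        else:
            if start is not None and (last - start).total_seconds() >= min_seconds:
                stalls.append((start, last))
            start = None
    # Close a run of zeros that lasts until the end of the log.
    if start is not None and (last - start).total_seconds() >= min_seconds:
        stalls.append((start, last))
    return stalls


# Synthetic data (not from the case): one sample per second,
# with transfers at zero from second 10 through second 99.
t0 = datetime(2015, 6, 10, 2, 0, 0)
samples = [
    (t0 + timedelta(seconds=i), 0.0 if 10 <= i < 100 else 250.0)
    for i in range(120)
]

for start, end in find_stalls(samples):
    print(start, end, (end - start).total_seconds())
```

Matching the reported stall windows against the backup job log timestamps would show whether the pause lines up with snapshot creation, snapshot release, or neither.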

We have enabled the "Allow processing of multiple VMs with a single volume snapshot" option, which has reduced the frequency of the pauses (it was every night, now every few days). Even stranger, one night when the backup job containing this VM did not run (it was paused to allow tape backup), the pause happened while other jobs were running.

The VM's OS disk is on CSV1 and the DB is on CSV2. Other VMs that are backed up in the other jobs share CSV1, so we think it might be triggering something, although the pause happens to the DB, which is on the other LUN (CSV2). We will move the OS drive after business approval (it's politically sensitive after all the failovers), but this would be a workaround and would not indicate the root cause.

Any experience/pointers would be much appreciated.

Thanks,
James

ptoro · Aug 11, 2015 1:05 pm

bump

I'm also seeing this IO freeze on VMs that are not even being backed up BUT share the same CSV.

SOFS with SMB3

davidpollock · Sep 23, 2015 6:17 am

We're seeing an issue which sounds similar to this.

We have a Hyper-V cluster with 2 nodes. VMs are stored on CSVs.

During backups, VMs have been randomly failing because the cluster service detects them as being unresponsive. I believe this is because I/O to their system disk hangs. The VM which fails is always on the same CSV as a VM being backed up at the time. We've also reproduced the issue running backups via DPM so it seems more likely to be caused by the SAN or Microsoft VSS. Issue occurs during both on-host and off-host backups. From what I've seen, the issue seems to occur while the SAN is taking a snapshot of the CSV volume, or possibly straight afterwards.

We've logged calls with Microsoft, Veeam and HP. None of them have been able to resolve the issue so far. If you guys got anywhere I'd love to hear about it.

Some more info:
Cluster is running Server 2012 R2
SAN: HP StoreVirtual 4530s running LeftHand OS version 12.0.00.0725.0
HP StoreVirtual DSM MPIO driver v12.0.0.371.1 is installed on the Hyper-V hosts.

akselc · Sep 23, 2015 10:54 am

@davidpollock

Your description looks similar to a case I opened today, which is why I am reading forum posts now.
2 nodes, Hyper-V, HP SAN CSV.
When Veeam is running, the off-host proxy node (hv02) loses connection to the cluster, the VM virtual NIC cannot ping the other node, and VMs across nodes are not able to contact each other.
As a result, cluster volume1 is partially offline: hv02 cannot connect to it and only finds volume2 on the SAN, while hv03 (the other node) has managed to manually mount volume1 as a d: drive.
HP technical support has read the SAN logs and cannot find any errors.
So something is happening with the I/O when node hv02 (off-host proxy) is backing up VMs.

Hope Veeam looks at this ASAP.

foggy · Sep 23, 2015 12:26 pm

In the OP's case, reducing the number of concurrent VSS snapshots has helped to reduce the occurrence of the pauses.

Aksel, have you opened a case with Veeam technical support for this? Dave, posting your case ID here would help us in further tracking of the resolution. Thanks!

akselc · Sep 23, 2015 1:07 pm

Hi Foggy

Yes, Case # 01063492

I am currently looking through logs on my HV02 node, and I believe that I/O from the Veeam backup is causing, or at least triggering, the loss of connectivity.
These 2 patches may be helpful; I will investigate further and maybe install the patches below.

https://support.microsoft.com/en-us/kb/2870270
https://support.microsoft.com/en-us/kb/2813630

davidpollock · Sep 23, 2015 11:28 pm

Hi Foggy,

Here's our case number: 00937186

We've installed all of the updates on this page: https://support.microsoft.com/en-us/kb/2920151

I believe the issue is most likely somewhere in VSS... the recommendation from Microsoft support was simply to disable heartbeat monitoring on all VMs, which does prevent the VMs from restarting during backups but does not actually solve the issue of them becoming unresponsive.

akselc · Sep 24, 2015 2:54 pm

Both of the links
https://support.microsoft.com/en-us/kb/2870270
https://support.microsoft.com/en-us/kb/2813630

give me a message "Not applicable to your system" (or something like that) when I try to install them.
When downloading, I see that I can only choose the Windows 8 RTM version, but the description says it installs on Server 2012 Datacenter....

davidpollock · Sep 28, 2015 6:11 am

akselc, are you sure you're not running 2012 R2?

ptoro · Sep 30, 2015 3:04 am

Glad I'm not the only one seeing this issue.

Hyper-V to our SOFS cluster.

We see this only during the Veeam backup schedule, AND it happens to servers that are not being backed up by Veeam but share the same CSV as servers that are being backed up.

Our servers are all Windows 2012 R2 (they all have the latest updates, minus this month's set of patches).

Although it doesn't really do much harm to most servers, if you have a server that is in some sort of SQL cluster it will definitely mess with it. We see most IO freeze errors only on servers that have SQL or AD installed on them (they would be the most sensitive to an IO freeze and "complain" the most).

pterpumpkin · Dec 08, 2016 11:51 pm

Sorry for dragging up an old post!

We're also seeing this. When Veeam snapshots a CSV, IO pauses/queues with high latency for up to 30 seconds, affecting all VMs on the CSV.

Is this expected behavior?

foggy · Dec 09, 2016 12:29 pm

There is no resolution in any of the support cases mentioned in this thread, so I advise you to open your own case to investigate what is causing this behavior in your particular environment.

dmalishkin · Mar 20, 2017 8:17 pm

We are seeing the same issues on our CSVs. We have different types of SANs and it happens no matter what hardware we use. Our Hyper-V clusters are all 2012 R2.

Once Veeam is backing up a single VM, the entire CSV slows down so badly that in some cases machines crash or reboot.

Working with Veeam currently but not really seeing anything obvious.

Mike Resseler · Mar 20, 2017 8:25 pm

Hi Daniel,

Can you let us know your support case ID?

Are your Hyper-V clusters all patched (not only with the Windows updates, but also with our recommended hotfixes)?

kcm_aaron · Mar 27, 2017 4:31 pm

We have been having the exact same issue for several months now - during overnight VSS based snapshots, we are seeing Hyper-V VMs "fail", and reboot - according to policy. The VMs that crash are never part of the group of VMs being backed up, but do share storage on the CSV(s) involved in the snapshotting process.

We are actually in the process of moving to VEEAM, because of these issues we are seeing with our current backup product - NetApp SnapManager for Hyper-V (SMHV). SMHV uses a proprietary Data ONTAP VSS hardware provider when taking snapshots of the CSVs housing our VMs. Almost every night, at least one VM will fail during the overnight backup process. So, I started testing with VEEAM and found that I don't have the issue if I use the native Microsoft software VSS provider, but do have the issue when using the ONTAP VSS provider. This seems to point to the VSS provider being the issue, but I don't want to move to VEEAM yet, in case we're just masking the source of an ongoing problem.

I have updated all of our server firmware, SAN firmware, Windows updates and recommended hotfixes (as recommended by Microsoft, NetApp and VEEAM) and have even migrated all of our VM-related storage to freshly provisioned volumes/LUNs per NetApp support's recommendation, all to no avail. I have open cases with Cisco, Microsoft and NetApp, but they all point the finger at the other vendor.

Next, I will be disabling ODX on my Hyper-V hosts to see if that has any impact, but I just wanted to post our current situation here in case it helps anyone else. If there are any other suggestions out there, please let me know! Thanks!

Mike Resseler · Mar 28, 2017 6:55 am

Aaron,

You make a good point here... ODX. ODX was (sorry, is) a good idea and could be very useful, but it got really badly implemented because it depends on Microsoft and the storage/hardware vendor working together (probably the reason for the finger-pointing...)

In the past, many told us that disabling ODX indeed helped and solved the issues, so it is a good next test to run.

Do you also have other offloading techniques running/activated? Those could also be an issue.

PS: How was the performance with the software VSS provider? If that is acceptable, why not consider using that method instead of staying with the hardware VSS provider that is giving you issues?

nismoau · Mar 31, 2017 3:04 am

kcm_aaron wrote: Next, I will be disabling ODX on my Hyper-V hosts to see if that has any impact

We're also experiencing just about the exact same issue as you describe. Let us know how disabling ODX goes - I'm very interested!

We have also been able to isolate VM lockups during backups to happen only when the SAN Hardware VSS provider is used. Changing to using the Microsoft CSV VSS provider does not (seem to) exhibit the VM lockup issue - as I read it, very similar to what you're seeing.

Cheers!
Justin

Dreadnought · May 09, 2017 12:47 pm

Hi,

We're suffering the same issue as well.

We have a 16-node Hyper-V cluster. All nodes are 2012 R2, each with dual 16-core CPUs and 384 GB RAM. Resources on the cluster are not an issue; lots of performance monitoring and testing has been conducted to rule this out. ODX has always been disabled as it's dangerously unreliable. We are using Compellent SANs, IO latency is very low, and we don't have any performance issues with almost 400 VMs running on the cluster.

We've recently introduced Veeam to replace StorageCraft, which had no issues with VSS.

At the moment we have only migrated our own servers to Veeam, but we are seeing the VMs hang when VSS snapshots take place and at the end of the backup. VMs can also hang when they aren't being backed up but others on the same CSV are. It is most noticeable on Exchange and anything with a database.

We are using on-host proxies.

We are currently going through every node to see if we can find anything, but nothing so far: all patched, all NIC offload settings disabled, ODX disabled, Compellent firmware up to date.

This has put our entire migration project to Veeam on hold until we can resolve the problem. It would seem that in this instance Veeam just doesn't work.

I'll post here if we find anything that resolves the problem.

Mike Resseler · May 09, 2017 5:51 pm

Hi Jerry,

Did you already create a support case?

Mike

Dreadnought · May 10, 2017 3:28 pm

Hi Mike,

We haven't yet, as we are going through everything on our side first to make sure there isn't anything out of place that might be causing the issue. This has highlighted a lot of issues with Windows updates on the cluster nodes, which are at different levels of patching. Unfortunately, due to the way Microsoft manages its rollups, we aren't able to bring them all up to the same level as they are, so we are having to rebuild all the nodes. It's a bit of a ball ache, but it sorts that problem out at least.

We now have some of the nodes back in the cluster fully patched, with ODX off etc. as I previously mentioned. Our Compellents are fully patched, as is our switch fabric.

I've created a new CSV and storage-migrated all our own VMs to it (17 in total, which run in a single backup job), moved the CSV to one of the fully patched nodes, and live-migrated all our VMs to that same node. The backup job was then started. Initial testing looked promising: we didn't see any hangs in Outlook, and Exchange is very susceptible to the pauses. The backup job has been left running and we have seen some pauses, although not as bad as before. This could be because of the reduction in running jobs or the fact that I've moved our VMs to an isolated CSV.

Until we have all the nodes back in the cluster, I don't think we can say one way or the other whether we have found the issue. I'll post results here once we do have them all back in and I can conduct further testing. If the issue is still present at that point, I'll raise a support case.

Dreadnought · May 11, 2017 1:32 pm

Hi Mike,

We have completed further testing.

Nodes are fully updated, VMs are now using the latest version of Integration Services, all VMs reside on the same cluster node and the same CSV, and the CSV is owned by the node the VMs reside on.
The pause during snapshotting is still present. It doesn't appear to be as bad as it was, but it's still not going to be possible for us to move any of our customers to the platform as it stands. I'm going to log a support ticket with Veeam today.

If anyone has any ideas on how to resolve this issue, I'm all ears.

Thanks

Mike Resseler · May 11, 2017 1:40 pm

Jerry,

Thanks for the updates. Please post the support case ID here, along with the result after the investigation with support.

Thanks
Mike

Dreadnought · May 11, 2017 2:24 pm

Hi Mike,

Support case raised, log files submitted with case.

Case #02153463 was created.

thanks
Jerry

Mike Resseler · May 15, 2017 10:44 am

Thanks Jerry,

Keep us informed about the outcome please!

Cheers
Mike

Dreadnought · May 15, 2017 11:11 am

Will do.

I have a question regarding the GIP server that may or may not be contributing to the problem. Should a dedicated server be used as the GIP, or is it OK to use an existing server that performs other roles?
The reason I ask is that we initially tested with a dedicated GIP in the customer network. Monitoring showed that usage on the dedicated GIP was barely noticeable when backup jobs were running and the GIP was performing injection of the runtimes, so we decided to use an existing server within the customer network to perform the role of GIP (reduces costs etc.).

This means that the server performing the GIP role is also included in that customer's backup job, so it's performing snapshots of itself when the backup runs, as well as dealing with runtime injection for all the other servers in the job and with preparation of application-aware processing. I've not been able to find much info regarding best practice for the GIP server and what should and shouldn't be used.

Should we be using dedicated servers to perform the GIP role that aren't included in the backup jobs?

May 15, 2017 11:16 am

Actually, no. It shouldn't be a dedicated machine. In fact, in my setup my hosts are GIP proxies also.

The only question is what will happen when the VM being snapshotted is at the same time the GIP proxy. I am not sure what happens then, as I don't have that specific setup. So ask support about that (and I would like to know the answer also).

Dreadnought · May 15, 2017 3:51 pm

Well, just to rule it out, I've built a dedicated GIP on the customer network.

It hasn't made the slightest difference. I had the backups running this morning without any pause (6 seconds) in the VM. Now I'm back to the 6-second pause again. The VM is a busy Exchange server, admittedly, but its

I've moved all the Veeam servers to the same node that the VMs to be backed up reside on, so they use the virtual switch rather than breaking out onto the switch infrastructure, to rule out any type of network latency (the network has been checked and there are no performance issues or bottlenecks). All VMs reside on the same CSV and the same cluster node, and the CSV is also owned by that node. Perfmon shows no performance issues on the cluster node and the CSV shows very low read and write latency, so I know I don't have a performance issue anywhere.

All the VSS writers check out OK on the VM, shadow storage has plenty of space, and everything works as it should with no errors anywhere, yet I get a 6-second pause inside the VM when it thaws the snapshot. If I don't do an application-aware backup and use a crash-consistent backup instead, it works fine with no pause. Looking at the event log, it shows everything happening the same as if I were using application-aware processing: the snapshot freezes and thaws all in the space of 1 second and the logs all get truncated.

Frustrating to say the least.
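One way to put a number on the in-guest pause, independent of Outlook or PerfMon, is a small write probe run inside the VM during the backup window. The sketch below is only an illustration under assumptions: the names `probe` and `find_gaps`, the 512-byte record, and the 1-second threshold are made up for this example, and a gap it reports could come from anything that delays the process, not only from storage.

```python
import os
import time


def find_gaps(timestamps, threshold):
    """Return (start, length) for every interval between consecutive
    timestamps that is longer than threshold seconds."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur - prev))
    return gaps


def probe(path, duration, interval=0.1):
    """Append a small record to path and fsync it every `interval` seconds
    for `duration` seconds; return the time each write completed."""
    completed = []
    deadline = time.monotonic() + duration
    with open(path, "ab", buffering=0) as f:
        while time.monotonic() < deadline:
            f.write(b"x" * 512)
            os.fsync(f.fileno())  # force the write through to the disk
            completed.append(time.monotonic())
            time.sleep(interval)
    return completed
```

Running `find_gaps(probe(some_file_on_the_affected_volume, 3600), 1.0)` across a backup would list each stall longer than a second with its length, which can then be lined up against the freeze and thaw events in the job log. The file path and the one-hour duration are placeholders.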

Dreadnought · May 22, 2017 4:10 pm

Hi,

So, further testing done. To rule out an issue with the cluster nodes and the underlying Compellent storage, I've migrated our Exchange server to a standalone host, and the host has been added to the Veeam backup infrastructure. A new backup job was created for our Exchange server, which is on the standalone host with direct-attached storage. The host is dual CPU, 24 cores, 192 GB RAM, 24 x 146GB 15k disks, so it's by no means slow, and the only VM is the Exchange server.

I start the new backup job, the snapshot process starts, and pauses are experienced in Outlook; there are always 2 pauses as well. So we see the exact same behaviour on a standalone host.

Next test: our Veeam platform sits behind a virtual pfSense firewall, which segregates it from our cloud platform. To rule pfSense out of the equation, I've multi-homed the B&R and repo servers so they now sit in the cloud platform and can talk directly between the two without touching the virtual firewall. Again the backup was tested and the pause is experienced exactly as before.

In addition to the above, I have also changed settings within the B&R server as per Veeam support's request: storage latency control enabled and set to 10ms on both settings, max concurrent snapshots set to 1 on all volumes, and max tasks on the on-host proxies set to the default of 4.
Again I ran the backup with all of the above settings reduced, with the B&R and repo servers in their new configuration as well as the old configuration, with no change whatsoever; we still get the pause.

According to this article, https://www.veeam.com/kb1896, the pause can be caused by Hyper-V 2012 R2 using a saved-state backup if online backup is not available. The saved-state backup process causes the system uptime counter in Hyper-V Manager to reset, which we are seeing. Why we are seeing it, I have not the foggiest idea, nor how to find out why it's doing it and how to stop it.

I'm completely out of ideas now with this one. Any good suggestions as to where to look to try and resolve this will be appreciated.

Mike Resseler · May 22, 2017 5:40 pm

Just to be sure and on the safe side (if you already looked at it with our support team, I apologize for the double question...)

1. Is the child VM in the running state? (I am sure it is, but this is part of my standard research.)
2. Is the Snapshot File Location for the VM set to the same volume in the host operating system as the VHD files for the VM?
3. Are all volumes in the child VM basic disks, with no dynamic disks?
4. Do all disks in the child VM use a file system that supports snapshots (for example, NTFS, ReFS)?
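The four questions above can be written down as a simple checklist, which may help when going through many VMs. This Python sketch only encodes the questions as asked; the `VmFacts` fields are hypothetical and would have to be filled in by hand (or by whatever inventory tooling is available), and the file system set just repeats the two examples given in question 4.

```python
from dataclasses import dataclass

# File systems named as examples in question 4; not an exhaustive list.
SNAPSHOT_CAPABLE_FS = {"NTFS", "ReFS"}


@dataclass
class VmFacts:
    """Hand-collected facts about one VM (hypothetical field names)."""
    running: bool
    snapshot_location_on_vhd_volume: bool
    dynamic_disk_count: int
    filesystems: tuple


def online_backup_blockers(vm):
    """Return the checklist items this VM fails, as readable strings."""
    blockers = []
    if not vm.running:
        blockers.append("VM is not in the running state")
    if not vm.snapshot_location_on_vhd_volume:
        blockers.append("snapshot file location is on a different volume than the VHD files")
    if vm.dynamic_disk_count > 0:
        blockers.append("guest has dynamic disks")
    unsupported = sorted(fs for fs in set(vm.filesystems) if fs not in SNAPSHOT_CAPABLE_FS)
    if unsupported:
        blockers.append("file systems without snapshot support: " + ", ".join(unsupported))
    return blockers
```

An empty list means none of the four items is failing; it does not by itself prove that an online backup will be taken.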

nmdange · May 22, 2017 6:21 pm

Have you tested using Hyper-V Native Quiescence instead of Veeam Application-Aware processing?
