I've got a support case open (Case 00918770) that is getting more complicated by the day and is moving slowly. I'm hoping someone may have seen this and can provide additional pointers!
We are backing up Exchange 2013 on Hyper-V 2012 R2. There are two DAG nodes, one active and one passive. The passive node is being backed up but experiences IO pauses of up to 90 seconds.
Initially it would affect both DAG nodes and cause database failovers, but we have split the active node out onto its own dedicated LUN (other VMs were sharing it previously). This has stabilized the system: we no longer get DB failovers and customers no longer get a disconnect/reconnect. However, the passive node still has the problem.
We have worked through multiple things with support:
http://www.veeam.com/kb1744 (cluster changes made no difference)
8.0 Update 2 installed
SAN firmware upgrade (Dell MD3620f to 8.20 - latencies are low, maxing 30-40ms, nowhere near 90 seconds)
The hardware is Cisco UCS & MDS FC switches.
OS & Hyper-V patches
Manual VSS within the Exchange VM was fine, with no issues
Only happens during backups; we have moved the backup window and the problem follows it
The VM does not go into a saved state; it keeps running and IO just stops. PerfMon graphs show all disk transfer counters dropping to 0 for the 90 seconds while the disk queue rises slowly. Our current hypothesis is that CSV VSS snapshots are causing the pause.
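For anyone wanting to pin down the exact stall windows and line them up against job start times, here is a rough sketch of how the zero-transfer periods could be picked out of exported counter samples. It is only an illustration: the function name `find_io_stalls`, the `(timestamp, transfers_per_sec)` sample layout, the 15-second sample interval and the 30-second threshold are all assumptions, not anything from the support case or from PerfMon itself.

```python
from typing import List, Tuple


def find_io_stalls(
    samples: List[Tuple[float, float]],
    min_duration: float = 30.0,  # assumed threshold, in seconds
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps of runs where disk transfers are zero.

    `samples` is a hypothetical list of (timestamp_seconds, transfers_per_sec)
    pairs sorted by time, e.g. parsed from an exported counter log.
    Only runs lasting at least `min_duration` seconds are reported.
    """
    stalls = []
    start = None      # timestamp of the first zero sample in the current run
    last_zero = None  # timestamp of the most recent zero sample
    for ts, transfers in samples:
        if transfers == 0:
            if start is None:
                start = ts
            last_zero = ts
        else:
            # A non-zero sample closes the current run, if there is one.
            if start is not None and last_zero - start >= min_duration:
                stalls.append((start, last_zero))
            start = None
    # Handle a run that is still open when the samples end.
    if start is not None and last_zero - start >= min_duration:
        stalls.append((start, last_zero))
    return stalls


# Made-up data: one sample every 15 s, IO at zero from t=60 to t=150,
# plus a single isolated zero sample at t=210 that is too short to count.
samples = [
    (t, 0.0 if 60 <= t <= 150 or t == 210 else 120.0)
    for t in range(0, 301, 15)
]
stalls = find_io_stalls(samples)
```

The resulting windows could then be compared with the start and end times of each backup job to see which job, and which CSV, is active when a stall begins.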
We have enabled the "Allow processing of multiple VMs with a single volume snapshot" option, which has reduced the frequency of the pauses (it was every night, now it is every few days). Even stranger, on one night when the backup job containing this VM did not run (it was paused to allow a tape backup), the pause happened while other jobs were running.
The VM's OS disk is on CSV1 and its DB is on CSV2. Other VMs that are backed up in the other jobs share CSV1, so we think that might be triggering something, although the pause hits the DB, which is on the other LUN (CSV2). We will move the OS drive after business approval (it's politically sensitive after all the failovers), but this would only be a workaround and would not identify the root cause.
Any experience/pointers would be much appreciated.