I thought I'd share some information regarding an issue we've been experiencing since moving to ESX 5.5 and Veeam 7 R2a. Background: We've been running Veeam for just shy of 4 years. Our Veeam server was a 2008R2 box and it backed up a 3 host ESX 5.1 cluster. Now, some of our VMs are rather large... in the region of 8TB+ large (although using a bunch of 2TB VMDKs). The size of these VMs has never been an issue and things have been running really well.
So, we now have some servers that just cannot fit within the 2TB limit. In fact, we have to allow for an 11TB drive on one of our servers (this is the lowest granularity we can achieve). So, we built an ESXi 5.5 cluster on new hosts as soon as it was available, then built and tested some large VMs (Windows and RHEL) along with two new Veeam 7 servers (Windows 2012R2). Initial testing was good: backups and restores from both disk and tape, good. We're good to go!
In addition to these new servers we needed to accommodate our existing VMs and so needed to upgrade our 5.1 hosts. To do this we moved them into the new 5.5 vCenter and upgraded the hosts without much incident. This new cluster is served by the new Veeam servers and so we knew/accepted that we'd have to do full backups to begin with. This wasn't a concern as we probably don't run active fulls as often as we'd like because our VMs are so big.
On the whole this has been going well, but we did find that two of our older servers had the vSphere yellow warning triangle on them: one straight after the full seed, the other a few days after its initial seed. Each of these servers is about 4TB in size and contains 3 or 4 disks. That size is usual here, and many other servers of that size or larger worked correctly. The yellow triangle was the 'VM needs its disks consolidating' message, with no snaps present in Snapshot Manager. When trying to consolidate we got the following message:
Error: An error occurred while consolidating disks: msg.snapshot.error-FAILED. The maximum consolidate retries was exceeded for scsix:x.
We have spoken to VMware, who dug into the logs and couldn't understand why this was happening. We could still take and remove snaps afterwards, but one disk on each server was always left running on a snapshot. Next, on one of the servers, which has little change, we shut down, manually deleted the snap and connected to the flat disk. As soon as we took and removed another snapshot... consolidation was needed again. VMware seemed to indicate that we might need to shut down and clone the disk to remedy this. This wasn't something we could entertain, as this issue may present itself on any of our disks and these are production servers that cannot afford days of cloning time. Imagine cloning an 11TB disk.
So it was starting to look fairly hopeless, until Veeam identified the following in the log (VMware did highlight this, but didn't spend too much time on it):
2014-01-16T04:17:27.495Z| vcpu-0| I120: DISKLIB-LIB : Free disk space is less than imprecise space neeeded for combine (0x96a3b800 < 0x9b351000, in sectors). Getting precise space needed for combine...
2014-01-16T04:17:40.173Z| vcpu-0| I120: SnapshotVMXConsolidateHelperProgress: Stunned for 13 secs (max = 12 secs). Aborting consolidate.
2014-01-16T04:17:40.173Z| vcpu-0| I120: DISKLIB-LIB : DiskLibSpaceNeededForCombineInt: Cancelling space needed for combine calculation
2014-01-16T04:17:40.174Z| vcpu-0| I120: DISKLIB-LIB : DiskLib_SpaceNeededForCombine: failed to get space for combine operation: Operation was canceled (33).
2014-01-16T04:17:40.174Z| vcpu-0| I120: DISKLIB-LIB : Combine: Failed to get (precise) space requirements.
2014-01-16T04:17:40.174Z| vcpu-0| I120: DISKLIB-LIB : Failed to combine : Operation was canceled (33).
2014-01-16T04:17:40.174Z| vcpu-0| I120: SNAPSHOT: SnapshotCombineDisks: Failed to combine: Operation was canceled (33).
2014-01-16T04:17:40.178Z| vcpu-0| I120: DISKLIB-CBT : Shutting down change tracking for untracked fid 9428050.
2014-01-16T04:17:40.178Z| vcpu-0| I120: DISKLIB-CBT : Successfully disconnected CBT node.
2014-01-16T04:17:40.211Z| vcpu-0| I120: DISKLIB-VMFS : "/vmfs/volumes/50939781-d365a9b0-5523-001b21badd94/<SERVERNAME>/<SERVERNAME>-000002-delta.vmdk" : closed.
2014-01-16T04:17:40.213Z| vcpu-0| I120: DISKLIB-VMFS : "/vmfs/volumes/50939781-d365a9b0-5523-001b21badd94/<SERVERNAME>/<SERVERNAME>-flat.vmdk" : closed.
2014-01-16T04:17:40.213Z| vcpu-0| I120: SNAPSHOT: Snapshot_ConsolidateWorkItem failed: Operation was canceled (5)
2014-01-16T04:17:40.213Z| vcpu-0| I120: SnapshotVMXConsolidateOnlineCB: Synchronous consolidate failed for disk node: scsi0:2. Adding it to skip list.
We can see that there is a process that calculates whether there is enough free space for consolidation. When the free space is below a rough ("imprecise") estimate, it falls back to a precise calculation while the VM is stunned, and if that does not complete within 12 seconds it aborts the consolidate operation. After speaking to VMware we found that we couldn't extend this timer - it's hardcoded. Veeam suggested that this precise calculation is only needed if there is less free space available in the datastore than the size of the disk that needs consolidation. So we extended the LUN and datastore so we had enough free space and ran the consolidate task again. This time it worked instantly and has continued to work for the last few days.
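For anyone who wants to check their own vmware.log for the same pattern, here is a rough Python sketch that pulls the two hex values out of the "Free disk space is less than imprecise space" line and the timings out of the "Stunned for" line. The function names are my own, and the 512-byte sector size is an assumption on my part; the log only says "in sectors".

```python
import re

# Assumption: the log counts 512-byte sectors. The log line itself only says "in sectors".
SECTOR_BYTES = 512

# "ne+ded" tolerates the "neeeded" spelling seen in the log as well as "needed".
SPACE_RE = re.compile(
    r"Free disk space is less than imprecise space ne+ded for combine "
    r"\((0x[0-9a-fA-F]+) < (0x[0-9a-fA-F]+), in sectors\)"
)
STUN_RE = re.compile(r"Stunned for (\d+) secs \(max = (\d+) secs\)")


def space_shortfall_bytes(line):
    """Return (free, needed, shortfall) in bytes, or None if the line does not match."""
    match = SPACE_RE.search(line)
    if match is None:
        return None
    free = int(match.group(1), 16) * SECTOR_BYTES
    needed = int(match.group(2), 16) * SECTOR_BYTES
    return free, needed, needed - free


def stun_overrun(line):
    """Return (stunned_secs, max_secs), or None if the line does not match."""
    match = STUN_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    space_line = (
        "2014-01-16T04:17:27.495Z| vcpu-0| I120: DISKLIB-LIB : Free disk space is "
        "less than imprecise space neeeded for combine (0x96a3b800 < 0x9b351000, "
        "in sectors). Getting precise space needed for combine..."
    )
    free, needed, shortfall = space_shortfall_bytes(space_line)
    print(f"free: {free / 2**30:.1f} GiB, imprecise need: {needed / 2**30:.1f} GiB, "
          f"shortfall: {shortfall / 2**30:.1f} GiB")
```

If the 512-byte assumption holds, the line from our log works out to a shortfall of roughly 36.5 GiB against the imprecise estimate, so the datastore was not wildly short of space when the precise calculation kicked in.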
So I guess we have a work-around for the issue: extend the datastore so that free space is at least the size of the disk being consolidated. Obviously this isn't ideal when we might be speaking about 11TB VMDKs (22TB datastore!!).
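To see which disks might hit this before it bites, the rule of thumb Veeam gave us can be turned into a quick check. This is only a sketch of that rule of thumb (free space smaller than the disk size), not of what ESXi actually computes internally; the function name, disk names and sizes below are made up for illustration.

```python
TB = 10**12  # decimal terabytes, purely for readable example numbers


def disks_at_risk(datastore_free_bytes, disk_sizes):
    """Return the names of disks larger than the datastore's free space.

    disk_sizes maps a (hypothetical) disk name to its size in bytes.
    Per Veeam's suggestion, these are the disks for which the slow,
    precise space calculation would be triggered on consolidate.
    """
    return sorted(
        name for name, size in disk_sizes.items() if size > datastore_free_bytes
    )


if __name__ == "__main__":
    # Hypothetical datastore with 2 TB free and three disks on it.
    disks = {"data1.vmdk": 1 * TB, "data2.vmdk": 4 * TB, "data3.vmdk": 11 * TB}
    print(disks_at_risk(2 * TB, disks))
```

By this rule an 11TB VMDK needs 11TB free alongside it, which is where the 22TB datastore figure comes from.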
We still don't know why this happened, or should I say why it didn't happen before... The VM hasn't changed in size for months and we never had this issue on 5.1 or Veeam 6.5. It's on the same backend storage and the datastore latency hasn't really risen. I'm guessing something must have changed in connection with vSphere 5.5 snapshotting and this has made it less tolerant of our environment.
Anyway, I just thought I'd post this in case anyone else experiences this or similar issues, or has any further insight into it.