ESX 5.5: An error occurred while consolidating disks

by pgitdept » Fri Feb 07, 2014 11:48 am 7 people like this post

Hi All,

Case: #00510773

I thought I'd share some information regarding an issue we've been experiencing since moving to ESX 5.5 and Veeam 7 R2a. Background: We've been running Veeam for just shy of 4 years. Our Veeam server was a 2008R2 box and it backed up a 3-host ESX 5.1 cluster. Now, some of our VMs are rather large... in the region of 8TB+ large (although using a bunch of 2TB VMDKs). The size of these VMs has never been an issue and things have been running really well.

So, we now have some servers that just cannot fit within the 2TB limit. In fact, we have to allow for an 11TB drive on one of our servers (this is the lowest granularity we can achieve). So, we built an ESXi 5.5 cluster on new hosts as soon as it was available, then built and tested some large VMs (Windows and RHEL) along with two new Veeam 7 servers (Windows 2012R2). Initial testing was good: backups and restores from both disk and tape, good. We're good to go!

In addition to these new servers we needed to accommodate our existing VMs and so needed to upgrade our 5.1 hosts. To do this we moved them into the new 5.5 vCenter and upgraded the hosts without much incident. This new cluster is served by the new Veeam servers and so we knew/accepted that we'd have to do full backups to begin with. This wasn't a concern as we probably don't run active fulls as often as we'd like because our VMs are so big.

On the whole this has been going well, but we did find that two of our older servers had the vSphere yellow warning triangle on them - one right after the full seed, the other a few days after its initial seed. Each of these servers is about 4TB in size and contains 3 or 4 disks. That size is usual here, and many other servers of that size or larger worked correctly. The yellow triangle was the 'VM needs its disks consolidating' message with no snaps present in Snapshot Manager. When trying to consolidate we got the following message:

Error: An error occurred while consolidating disks: msg.snapshot.error-FAILED. The maximum consolidate retries was exceeded for scsix:x.

We have spoken to VMware, had the logs dug into, and they couldn't understand why this was happening. We could take and remove snaps afterwards, but couldn't stop one disk on each server from running on a snapshot. Next, on one of the servers, which has little change, we shut down, manually deleted the snap and connected to the flat disk. Once we took and removed another snapshot... consolidation was needed again. VMware seemed to indicate that we might need to shut down and clone the disk to remedy this. This wasn't something we could entertain, as this issue may present itself on any of our disks and these are production servers that cannot afford days of cloning time. Imagine cloning an 11TB disk. :shock:

So it was starting to look fairly hopeless, until Veeam identified the following in the log (VMware did highlight this, but didn't spend too much time on it):

2014-01-16T04:17:27.495Z| vcpu-0| I120: DISKLIB-LIB : Free disk space is less than imprecise space neeeded for combine (0x96a3b800 < 0x9b351000, in sectors). Getting precise space needed for combine...
2014-01-16T04:17:40.173Z| vcpu-0| I120: SnapshotVMXConsolidateHelperProgress: Stunned for 13 secs (max = 12 secs). Aborting consolidate.
2014-01-16T04:17:40.173Z| vcpu-0| I120: DISKLIB-LIB :DiskLibSpaceNeededForCombineInt: Cancelling space needed for combine calculation
2014-01-16T04:17:40.174Z| vcpu-0| I120: DISKLIB-LIB : DiskLib_SpaceNeededForCombine: failed to get space for combine operation: Operation was canceled (33).
2014-01-16T04:17:40.174Z| vcpu-0| I120: DISKLIB-LIB : Combine: Failed to get (precise) space requirements.
2014-01-16T04:17:40.174Z| vcpu-0| I120: DISKLIB-LIB : Failed to combine : Operation was canceled (33).
2014-01-16T04:17:40.174Z| vcpu-0| I120: SNAPSHOT: SnapshotCombineDisks: Failed to combine: Operation was canceled (33).
2014-01-16T04:17:40.178Z| vcpu-0| I120: DISKLIB-CBT : Shutting down change tracking for untracked fid 9428050.
2014-01-16T04:17:40.178Z| vcpu-0| I120: DISKLIB-CBT : Successfully disconnected CBT node.
2014-01-16T04:17:40.211Z| vcpu-0| I120: DISKLIB-VMFS : "/vmfs/volumes/50939781-d365a9b0-5523-001b21badd94/<SERVERNAME>/<SERVERNAME>-000002-delta.vmdk" : closed.
2014-01-16T04:17:40.213Z| vcpu-0| I120: DISKLIB-VMFS : "/vmfs/volumes/50939781-d365a9b0-5523-001b21badd94/<SERVERNAME>/<SERVERNAME>-flat.vmdk" : closed.
2014-01-16T04:17:40.213Z| vcpu-0| I120: SNAPSHOT: Snapshot_ConsolidateWorkItem failed: Operation was canceled (5)
2014-01-16T04:17:40.213Z| vcpu-0| I120: SnapshotVMXConsolidateOnlineCB: Synchronous consolidate failed for disk node: scsi0:2. Adding it to skip list.

We can see that there is a process that calculates whether there is enough free space for consolidation, and if this process does not complete in 12 seconds or less, it aborts the consolidate operation. After speaking to VMware we found that we couldn't extend this timer - it's hardcoded. Veeam suggested that this precise calculation is only needed if there is less free space available in the datastore than the size of the disk that needs consolidation. So we extended the LUN and datastore so we had enough free space and ran the consolidate task again. This time it worked instantly and has continued to work for the last few days.
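To put numbers on that first DISKLIB-LIB line, here is a minimal Python sketch that pulls the two hex values out of it and converts them to GiB. The function name and the regular expression are illustrative only, and the 512-byte sector size is an assumption; it does agree with the later log in this thread, where 0xa0ffb800 sectors is reported as 1318903 MB available.

```python
import re

# Assumption: "in sectors" means 512-byte sectors. This matches the later log
# in the thread (0xa0ffb800 sectors is shown there as 1318903 MB available).
SECTOR_BYTES = 512

# Matches the "(0x... < 0x..., in sectors)" part of the DISKLIB-LIB message.
_SPACE_CHECK = re.compile(r"\((0x[0-9a-fA-F]+) < (0x[0-9a-fA-F]+), in sectors\)")


def combine_space_shortfall(log_line):
    """Return (free_gib, needed_gib, shortfall_gib), or None if the line does not match."""
    match = _SPACE_CHECK.search(log_line)
    if match is None:
        return None
    free_sectors = int(match.group(1), 16)
    needed_sectors = int(match.group(2), 16)
    to_gib = SECTOR_BYTES / 2**30
    return (
        free_sectors * to_gib,
        needed_sectors * to_gib,
        (needed_sectors - free_sectors) * to_gib,
    )


line = (
    "2014-01-16T04:17:27.495Z| vcpu-0| I120: DISKLIB-LIB : Free disk space is less "
    "than imprecise space neeeded for combine (0x96a3b800 < 0x9b351000, in sectors). "
    "Getting precise space needed for combine..."
)
free_gib, needed_gib, shortfall_gib = combine_space_shortfall(line)
print(f"free {free_gib:.1f} GiB, imprecise estimate {needed_gib:.1f} GiB, "
      f"short by {shortfall_gib:.1f} GiB")
```

Under that sector-size assumption the datastore had roughly 1205 GiB free against an imprecise estimate of roughly 1242 GiB, so it was only about 37 GiB short of skipping the slow precise calculation altogether.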

So I guess we have a work-around for the issue - to extend the datastore. Obviously this isn't ideal when we might be speaking about 11TB VMDKs (22TB datastore!!).

We still don't know why this happened, or should I say why it didn't happen before... The VM hasn't changed in size for months and we never had this issue on 5.1 or Veeam 6.5. It's on the same backend storage and the datastore latency hasn't really risen. I'm guessing something must have changed in connection with vSphere 5.5 snapshotting and this has made it less tolerant of our environment.

Anyway, I just thought I'd post this in case anyone else experiences this or similar issues, or has any further insight into it.

Thanks
Adrian
pgitdept
Influencer
 
Posts: 17
Liked: 14 times
Joined: Thu Feb 03, 2011 10:29 am
Full Name: PGITDept

Re: ESX 5.5: An error occurred while consolidating disks

by m1m1n0 » Mon Feb 10, 2014 7:55 am 3 people like this post

Hello!

File a support request with VMware. They consolidate deltas differently in 5.5, in a way that is supposed to be gentler on VMs. Your VM is too write-intensive, and the consolidation process cannot keep up when the -delta file grows faster than it can commit the changes to the base. It is not Veeam's problem, it's the way ESXi 5.5 behaves compared to 5.1.

VMware will most likely recommend increasing the time for which the VM is allowed to be paused during delta consolidation. Don't be afraid that you will now have longer downtime when you remove a snapshot; 5.1 was doing that silently anyway.

And do this ASAP. The process of removing snapshots for VMs like yours is very stressful on the disk subsystem.

Source: gone through this myself

EDIT: In ESX 5.1, consolidation could keep failing the same way for some 30 attempts, but after that the host would pause your VM for up to 30 minutes to consolidate the delta. ESX 5.5 does not fall back to this mechanism, and the allowed pause time is 5 seconds IIRC.
m1m1n0
Novice
 
Posts: 5
Liked: 3 times
Joined: Mon Nov 18, 2013 9:13 am

Re: ESX 5.5: An error occurred while consolidating disks

by dahdco » Mon Feb 10, 2014 4:55 pm

We had the exact same error. It didn't occur until we upgraded our host to 5.5 (vCenter was upgraded about a week prior).

In our first instance we changed the SCSI controller from Paravirtual to LSI Logic SAS and were then able to consolidate, and the problem didn't return on that VM.

On our second occurrence we weren't able to take downtime to switch the controller from Paravirtual to LSI. We were able to consolidate by first taking a snapshot, then consolidating, then removing all snapshots. The consolidation failure was happening every night on this VM, and this fix would work every time in the morning. We ended up moving the VM config files (VMX etc.) to a new LUN (off of 3PAR to Compellent). Since doing this we haven't had the consolidation fail on this VM. Both the old and new LUN locations had less free space than the actual disk size (1.95TB), so the precise check is still occurring. This VM has five 1.9TB disks. The consolidation was always failing on the last one. I just checked the logs for this VM and see the following events. They almost all show a response of around 11 seconds. Looks like I'm barely missing the 12 second failure - repeatedly. Kind of scary.

2014-02-08T05:03:57.988Z| SnapshotVMXCombiner| I120: DISKLIB-LIB : Free disk space is less than imprecise space neeeded for combine (0xa0ffb800 < 0xe7db5800, in sectors). Getting precise space needed for combine...

2014-02-08T05:04:08.604Z| SnapshotVMXCombiner| I120: DISKLIB-LIB : Upward Combine 2 links at 1. Need 4 MB of free space (1318903 MB available)
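As a rough check of the "around 11 seconds" reading, the gap between the two log lines above can be computed from their timestamps. This is only a sketch: the helper names are made up for illustration, and treating the gap between these two lines as the time spent on the precise space check is an interpretation of the log, not something VMware documents here. The 12 second limit is taken from the "max = 12 secs" message in the first post.

```python
from datetime import datetime

# From "Stunned for 13 secs (max = 12 secs)" in the log in the first post.
STUN_LIMIT_SECS = 12


def log_time(line):
    """Parse the UTC timestamp that precedes the first '|' of a vmware.log line."""
    stamp = line.split("|", 1)[0].strip()
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps of two log lines."""
    return (log_time(end_line) - log_time(start_line)).total_seconds()


start = (
    "2014-02-08T05:03:57.988Z| SnapshotVMXCombiner| I120: DISKLIB-LIB : Free disk "
    "space is less than imprecise space neeeded for combine "
    "(0xa0ffb800 < 0xe7db5800, in sectors). Getting precise space needed for combine..."
)
end = (
    "2014-02-08T05:04:08.604Z| SnapshotVMXCombiner| I120: DISKLIB-LIB : Upward "
    "Combine 2 links at 1. Need 4 MB of free space (1318903 MB available)"
)

gap = elapsed_seconds(start, end)
print(f"precise check took about {gap:.3f} s, "
      f"{STUN_LIMIT_SECS - gap:.3f} s under the {STUN_LIMIT_SECS} s limit")
```

For these two lines the gap works out to 10.616 seconds, which leaves well under two seconds of margin. Running the same helper over a whole vmware.log would show how often a VM comes close to the limit.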
dahdco
Novice
 
Posts: 3
Liked: never
Joined: Fri Oct 07, 2011 2:31 pm
Full Name: Doug Heckman

Re: ESX 5.5: An error occurred while consolidating disks

by dahdco » Wed Feb 12, 2014 5:05 pm

Spoke too soon. The VM where we moved the config file to faster disk had another occurrence. VMware decided it was a locking issue and wanted me to reboot the backup server and try again. I did and it worked, but I think it's unrelated and things were just "faster" at the time and didn't trigger the 12 second timeout. I have logs where consolidation fails due to locking, and they look very different from these failures; I uploaded them to VMware today. Hopefully they'll actually look at the logs this time (pretty sure they didn't before contacting me).

So far I've had the error on 3 different VMs, across two different backup servers - we have 4. Two of the VMs had a single occurrence and the other has had 5+.
dahdco
Novice
 
Posts: 3
Liked: never
Joined: Fri Oct 07, 2011 2:31 pm
Full Name: Doug Heckman

Re: ESX 5.5: An error occurred while consolidating disks

by munklarsen » Sun Mar 09, 2014 3:04 pm 5 people like this post

Just wanted to mention this so that people don't blame Veeam for this :) We run TSM for VE in our environment and we have the same issues. Sometimes we can consolidate just fine, other times it doesn't work. What always works, however, is:

1) make snapshot (without snapping the memory)
2) consolidate
3) remove snap
4) consolidate
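The four steps above could be scripted roughly as follows. This is a sketch, not a tested procedure: the function name is hypothetical, and it assumes `vm` is a pyVmomi `vim.VirtualMachine` (whose `CreateSnapshot_Task` and `ConsolidateVMDisks_Task` methods, and the snapshot's `RemoveSnapshot_Task`, come from the vSphere API) and that `wait` blocks until a task finishes, for example pyVmomi's `pyVim.task.WaitForTask`. The small fake classes exist only so the call order can be shown without a vCenter connection.

```python
from types import SimpleNamespace


def consolidate_via_temp_snapshot(vm, wait, name="consolidate-helper"):
    """Run the snapshot / consolidate / remove / consolidate sequence on one VM."""
    # 1) make snapshot (without snapping the memory)
    task = vm.CreateSnapshot_Task(
        name=name,
        description="temporary snapshot to unstick consolidation",
        memory=False,
        quiesce=False,
    )
    wait(task)
    snapshot = task.info.result
    # 2) consolidate
    wait(vm.ConsolidateVMDisks_Task())
    # 3) remove snap
    wait(snapshot.RemoveSnapshot_Task(removeChildren=False))
    # 4) consolidate
    wait(vm.ConsolidateVMDisks_Task())


# --- Stand-ins for a dry run only; these are not part of any real API. ---
class FakeSnapshot:
    def __init__(self, log):
        self.log = log

    def RemoveSnapshot_Task(self, removeChildren):
        self.log.append("remove")
        return SimpleNamespace(info=SimpleNamespace(result=None))


class FakeVM:
    def __init__(self):
        self.log = []

    def CreateSnapshot_Task(self, name, description, memory, quiesce):
        self.log.append("snapshot")
        return SimpleNamespace(info=SimpleNamespace(result=FakeSnapshot(self.log)))

    def ConsolidateVMDisks_Task(self):
        self.log.append("consolidate")
        return SimpleNamespace(info=SimpleNamespace(result=None))


fake_vm = FakeVM()
consolidate_via_temp_snapshot(fake_vm, wait=lambda task: None)
print(fake_vm.log)
```

Against a real environment the same function would be called with a `vim.VirtualMachine` looked up through a connected service instance. Error handling is left out here, so a failed task would need to be caught by whatever `wait` implementation is used.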
munklarsen
Influencer
 
Posts: 13
Liked: 6 times
Joined: Thu Nov 15, 2012 11:02 pm
Full Name: Michael Munk Larsen

Re: ESX 5.5: An error occurred while consolidating disks

by teknomage » Wed Apr 30, 2014 5:18 pm

I was running into the same issue (Error: An error occurred while consolidating disks: msg.snapshot.error-FAILED. The maximum consolidate retries was exceeded for scsix:x) and your steps fixed me right up. Thanks munklarsen.
teknomage
Service Provider
 
Posts: 23
Liked: 2 times
Joined: Wed Jul 21, 2010 8:55 pm
Location: Fargo, ND
Full Name: Mike

Re: ESX 5.5: An error occurred while consolidating disks

by cdickerson » Mon May 05, 2014 12:21 am 3 people like this post

Had this same problem. At first the VMware engineer said there was no change to the way snapshots were removed in vSphere 5.5; I quickly told him he was wrong. Here is the fix VMware provided to me.

Pick the VM(s) most affected. You need to add a parameter to the virtual machine configuration.

You can do this by:

• Shut down the virtual machine
• Right-click the virtual machine and click Edit Settings.
• Click the Options tab.
• Under Advanced, click General.
• Click Configuration Parameters and add snapshot.maxConsolidateTime = 30
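Configuration parameters added this way are stored as `key = "value"` lines in the VM's .vmx file, so the end result of the steps above can be illustrated on a copy of that text. The sketch below is only an illustration: the function name and the sample .vmx content are hypothetical, the value 30 is the one from the post (presumably seconds, like the "max = 12 secs" in the log earlier in the thread), and the Edit Settings dialog described above remains the route VMware gave.

```python
def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with `key` set to `value`, replacing an existing entry or appending one."""
    new_line = f'{key} = "{value}"'
    lines = []
    replaced = False
    for line in vmx_text.splitlines():
        # .vmx entries look like: name = "value"
        name = line.split("=", 1)[0].strip()
        if name == key:
            if not replaced:
                lines.append(new_line)
                replaced = True
            # Any further duplicates of the key are dropped.
        else:
            lines.append(line)
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Hypothetical .vmx fragment, for illustration only.
sample_vmx = 'displayName = "example-vm"\nscsi0.virtualDev = "pvscsi"\n'

updated = set_vmx_option(sample_vmx, "snapshot.maxConsolidateTime", 30)
print(updated)
```

Calling the function a second time with a different value replaces the existing line rather than adding a duplicate. As in the steps above, the VM should be shut down before its configuration is changed.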
cdickerson
Influencer
 
Posts: 23
Liked: 4 times
Joined: Tue Nov 23, 2010 2:39 am
Full Name: Craig Dickerson

Re: ESX 5.5: An error occurred while consolidating disks

by award@kahnlitwin.com » Thu Jun 19, 2014 10:09 am 1 person likes this post

I had this problem and found that the Veeam server still had the affected VM's disk mounted. I removed the disk from the Veeam server and consolidated without problems.
award@kahnlitwin.com
Lurker
 
Posts: 2
Liked: 1 time
Joined: Thu Jun 19, 2014 10:07 am
Full Name: Andrew Ward

Re: ESX 5.5: An error occurred while consolidating disks

by Vitaliy S. » Thu Jun 19, 2014 8:09 pm

Yes, that could be one of the reasons for this issue as well. On top of that, if you happen to see disks not being unmounted from the proxy server again, contact our technical team so they can investigate this behavior. Thanks!
Vitaliy S.
Veeam Software
 
Posts: 19562
Liked: 1102 times
Joined: Mon Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov

Re: ESX 5.5: An error occurred while consolidating disks

by TBone » Mon Jun 23, 2014 1:42 pm

I will add that we have been having a similar problem with ESX 5.1. After the backup of the server completes, Veeam gets a notice that the snapshot has been removed. However, vCenter starts throwing errors that consolidation is required. Examining the config of the server in question, it is still running off a snapshot, despite the fact that Snapshot Manager doesn't show anything.

I have just opened a ticket with VMware so we'll see what they make of it. This problem does not happen every day, but it is generally always the same server (a SQL 2008R2). It did not start happening until we moved from a Lefthand SAN to Nimble Storage.
TBone
Novice
 
Posts: 7
Liked: never
Joined: Wed Feb 12, 2014 4:14 pm
Full Name: Brian

Re: ESX 5.5: An error occurred while consolidating disks

by Vitaliy S. » Mon Jun 23, 2014 3:30 pm

Could it be an issue with datastore performance, with not enough IOPS to commit the snapshot? When you see this issue happening, can you please check the performance graphs of the source datastore?
Vitaliy S.
Veeam Software
 
Posts: 19562
Liked: 1102 times
Joined: Mon Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov

Re: ESX 5.5: An error occurred while consolidating disks

by Peejay62 » Tue Sep 16, 2014 8:29 am

TBone wrote: I will add that we have been having a similar problem with ESX 5.1. After the backup of the server completes, Veeam gets a notice that the snapshot has been removed. However, vCenter starts throwing errors that consolidation is required. Examining the config of the server in question, it is still running off a snapshot, despite the fact that Snapshot Manager doesn't show anything.


I am curious, did you ever solve this or find out the cause? I am seeing some similar behaviour.

Thanks, Peter
Peejay62
Expert
 
Posts: 171
Liked: 21 times
Joined: Tue Aug 06, 2013 10:40 am
Full Name: Peter Jansen

Re: ESX 5.5: An error occurred while consolidating disks

by v.Eremin » Tue Sep 16, 2014 9:03 am

Hi Peter, we've provided some explanation on that issue previously; it should clarify the nature of the behavior you're experiencing. Thanks.
v.Eremin
Veeam Software
 
Posts: 13266
Liked: 969 times
Joined: Fri Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin

Re: ESX 5.5: An error occurred while consolidating disks

by Peejay62 » Tue Sep 16, 2014 9:25 am

Vladimir, thank you. More and more I am convinced that the issue does indeed lie within vSphere. Just to illustrate: I defined a new job yesterday containing 17 VMs. They had never been snapshotted before. I ended up with 6 VMs needing consolidation afterwards. The common factor is that within this job all VMs in need of consolidation ran on the same host (not the same datastore). VMware reports that snapshot removal is complete, but browsing the datastore shows it is not. Seems like the actual running of the cleanup gets lost, the host is overloaded, or there is some kind of command queue loss??
Anyway, I am trying to find a good way to avoid the consolidations being needed; it's getting kind of annoying.

Thanks, Peter
Peejay62
Expert
 
Posts: 171
Liked: 21 times
Joined: Tue Aug 06, 2013 10:40 am
Full Name: Peter Jansen

Re: ESX 5.5: An error occurred while consolidating disks

by Vitaliy S. » Mon Sep 22, 2014 9:48 am

Peejay62 wrote: Vmware reports that snapshot removal is complete, browsing datastore shows it not. Seems like the actual running of the cleanup gets lost, host overloaded or some kind of a command queue loss??

Do you run your jobs using a vCenter Server connection or a direct ESXi host connection? Try switching between the two and see whether you get the same snapshot consolidation error message.
Vitaliy S.
Veeam Software
 
Posts: 19562
Liked: 1102 times
Joined: Mon Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
