ESX 5.5: An error occurred while consolidating disks

pgitdept · Feb 07, 2014 11:48 am

Hi All,

Case: #00510773

I thought I'd share some information regarding an issue we've been experiencing since moving to ESX 5.5 and Veeam 7 R2a. Background: We've been running Veeam for just shy of 4 years. Our Veeam server was a 2008R2 box and it backed up a 3 host ESX 5.1 cluster. Now, some of our VMs are rather large... in the region of 8TB+ large (although using a bunch of 2TB VMDKs). The size of these VMs has never been an issue and things have been running really well.

So, we now have some servers that just cannot fit within the 2TB limit. In fact, we have to allow for an 11TB drive on one of our servers (this is the lowest granularity we can achieve). So, we built an ESXi 5.5 cluster on new hosts as soon as it was available, built and tested some large VMs (Windows and RHEL) along with two new Veeam 7 servers (Windows 2012R2). Initial testing was good: backups and restores from both disk and tape, good. We're good to go!

In addition to these new servers we needed to accommodate our existing VMs and so needed to upgrade our 5.1 hosts. To do this we moved them into the new 5.5 vCenter and upgraded the hosts without much incident. This new cluster is served by the new Veeam servers and so we knew/accepted that we'd have to do full backups to begin with. This wasn't a concern as we probably don't run active fulls as often as we'd like because our VMs are so big.

On the whole this has been going well, but we did find that two of our older servers had the vSphere yellow warning triangle on them - one after the full seed, the other a few days after its initial seed. Each of these servers is about 4TB in size and contains 3 or 4 disks. That size is usual here and many other servers of that size or larger worked correctly. The yellow triangle was the 'VM needs its disks consolidating' message with no snaps present in Snapshot Manager. When trying to consolidate we got the following message:

Error: An error occurred while consolidating disks: msg.snapshot.error-FAILED. The maximum consolidate retries was exceeded for scsix:x.

We have spoken to VMware, had the logs dug into and they couldn't understand why this was happening. We could take and remove snaps afterwards, but couldn't stop one disk on each server from running on a snapshot. Next, on one of the servers, which has little change, we shut down, manually deleted the snap and connected to the flat disk. Once we took and removed another snapshot... consolidation was needed again. VMware seemed to indicate that we might need to shut down and clone the disk to remedy this. This wasn't something we could entertain, as this issue may present itself on any of our disks and these are production servers that cannot afford days of cloning time. Imagine cloning an 11TB disk.

So it was starting to look fairly hopeless, until Veeam identified the following in the log (VMware did highlight this, but didn't spend too much time on it):

2014-01-16T04:17:27.495Z| vcpu-0| I120: DISKLIB-LIB : Free disk space is less than imprecise space neeeded for combine (0x96a3b800 < 0x9b351000, in sectors). Getting precise space needed for combine...
2014-01-16T04:17:40.173Z| vcpu-0| I120: SnapshotVMXConsolidateHelperProgress: Stunned for 13 secs (max = 12 secs). Aborting consolidate.
2014-01-16T04:17:40.173Z| vcpu-0| I120: DISKLIB-LIB :DiskLibSpaceNeededForCombineInt: Cancelling space needed for combine calculation
2014-01-16T04:17:40.174Z| vcpu-0| I120: DISKLIB-LIB : DiskLib_SpaceNeededForCombine: failed to get space for combine operation: Operation was canceled (33).
2014-01-16T04:17:40.174Z| vcpu-0| I120: DISKLIB-LIB : Combine: Failed to get (precise) space requirements.
2014-01-16T04:17:40.174Z| vcpu-0| I120: DISKLIB-LIB : Failed to combine : Operation was canceled (33).
2014-01-16T04:17:40.174Z| vcpu-0| I120: SNAPSHOT: SnapshotCombineDisks: Failed to combine: Operation was canceled (33).
2014-01-16T04:17:40.178Z| vcpu-0| I120: DISKLIB-CBT : Shutting down change tracking for untracked fid 9428050.
2014-01-16T04:17:40.178Z| vcpu-0| I120: DISKLIB-CBT : Successfully disconnected CBT node.
2014-01-16T04:17:40.211Z| vcpu-0| I120: DISKLIB-VMFS : "/vmfs/volumes/50939781-d365a9b0-5523-001b21badd94/<SERVERNAME>/<SERVERNAME>-000002-delta.vmdk" : closed.
2014-01-16T04:17:40.213Z| vcpu-0| I120: DISKLIB-VMFS : "/vmfs/volumes/50939781-d365a9b0-5523-001b21badd94/<SERVERNAME>/<SERVERNAME>-flat.vmdk" : closed.
2014-01-16T04:17:40.213Z| vcpu-0| I120: SNAPSHOT: Snapshot_ConsolidateWorkItem failed: Operation was canceled (5)
2014-01-16T04:17:40.213Z| vcpu-0| I120: SnapshotVMXConsolidateOnlineCB: Synchronous consolidate failed for disk node: scsi0:2. Adding it to skip list.

We can see that there is a process that calculates whether there is enough free space for consolidation, and if this process does not complete in 12 seconds or less, the consolidate operation is aborted. After speaking to VMware we found that we couldn't extend this timer - it's hardcoded. Veeam suggested that this precise calculation is only needed if there is less free space available in the datastore than the size of the disk that needs consolidation. So we extended the LUN and datastore so we had enough free space and ran the consolidate task again. This time it worked instantly and has continued to work for the last few days.
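
To put numbers on the log excerpt above, here is a minimal Python sketch that pulls the two hex values out of the "imprecise space" line and the stun/limit pair out of the abort line. It assumes 512-byte sectors (the log only says "in sectors"), and the function names are made up for this illustration; they are not part of any VMware or Veeam tool.

```python
import re

SECTOR_BYTES = 512  # assumption: the log only says "in sectors"

# Matches "(0x96a3b800 < 0x9b351000, in sectors)" in the DISKLIB-LIB line.
SPACE_RE = re.compile(r"\((0x[0-9a-fA-F]+) < (0x[0-9a-fA-F]+), in sectors\)")
# Matches "Stunned for 13 secs (max = 12 secs)" in the abort line.
STUN_RE = re.compile(r"Stunned for (\d+) secs \(max = (\d+) secs\)")


def sectors_to_gib(sectors: int) -> float:
    """Convert a sector count to GiB under the 512-byte assumption."""
    return sectors * SECTOR_BYTES / 2**30


def parse_precheck(line: str):
    """Return (free_sectors, needed_sectors), or None if the line does not match."""
    match = SPACE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def parse_stun(line: str):
    """Return (stunned_secs, max_secs), or None if the line does not match."""
    match = STUN_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The two relevant lines from the log excerpt in this post.
SPACE_LINE = (
    "2014-01-16T04:17:27.495Z| vcpu-0| I120: DISKLIB-LIB : Free disk space is "
    "less than imprecise space neeeded for combine (0x96a3b800 < 0x9b351000, "
    "in sectors). Getting precise space needed for combine..."
)
STUN_LINE = (
    "2014-01-16T04:17:40.173Z| vcpu-0| I120: SnapshotVMXConsolidateHelperProgress: "
    "Stunned for 13 secs (max = 12 secs). Aborting consolidate."
)

free_sectors, needed_sectors = parse_precheck(SPACE_LINE)
stunned_secs, max_secs = parse_stun(STUN_LINE)
```

Under that sector-size assumption the datastore had about 1205 GiB free against an imprecise estimate of about 1242 GiB, which is why the slower precise calculation was triggered and then ran past the 12 second limit.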

So I guess we have a work-around for the issue - to extend the datastore. Obviously this isn't ideal when we might be speaking about 11TB VMDK's (22TB datastore!!).

We still don't know why this happened, or should I say why it didn't happen before... The VM hasn't changed in size for months and we never had this issue on 5.1 or Veeam 6.5. It's on the same backend storage and the datastore latency hasn't really risen. I'm guessing something must have changed in connection with vSphere 5.5 snapshotting and this has made it less tolerant of our environment.

Anyway, I just thought I'd post this in case anyone else experiences this or similar issues or has any further insight into this issue.

Thanks
Adrian

m1m1n0 · Feb 10, 2014 7:55 am

Hello!

Fire a support request to VMware. They consolidate deltas differently in 5.5, in a way that is supposed to be more gentle on VMs. Your VM is too write-intensive and the consolidation process cannot keep up when the -delta file grows faster than changes can be committed to the base. It is not Veeam's problem, it's the way ESXi 5.5 behaves compared to 5.1.

VMware will most likely recommend increasing the time for which the VM is allowed to be paused during delta consolidation. Don't be afraid that you will have longer downtime now when you remove a snapshot; 5.1 was doing that silently anyway.

And do this ASAP. The process of removing snapshots for VMs like yours is very stressful on disk subsystem.

Source: gone through this myself

EDIT: in ESX 5.1 consolidation could keep failing the same way some 30 times, but after that the host would pause your VM for up to 30 minutes to consolidate the delta. ESX 5.5 does not do that: there is no fallback to this mechanism and the allowed pause time is 5 seconds IIRC.

dahdco · Feb 10, 2014 4:55 pm

We had the exact same error. It didn't occur until we upgraded our host to 5.5 (vcenter was upgraded a ~week prior).

In our first instance we changed the scsi controller from Paravirtual to LSI Logic SAS and were then able to consolidate and the problem didn't return on that VM.

On our second occurrence we weren't able to take downtime to switch the controller from Paravirtual to LSI. We were able to consolidate by first taking a snapshot, then consolidating, then removing all snapshots. The consolidation was failing every night on this VM and this fix would work every time in the morning. We ended up moving the VM config files (VMX etc) to a new LUN (off of 3PAR to Compellent). Since doing this we haven't had the consolidation fail on this VM. Both the old and new LUN locations had less free space than the actual disk size (1.95TB) so the precise check is still occurring. This VM has five 1.9TB disks. The consolidation was always failing on the last one. I just checked the logs for this VM and see the following events. They almost all take around 11 seconds. Looks like I'm barely missing the 12 second failure - repeatedly. Kind of scary.

2014-02-08T05:03:57.988Z| SnapshotVMXCombiner| I120: DISKLIB-LIB : Free disk space is less than imprecise space neeeded for combine (0xa0ffb800 < 0xe7db5800, in sectors). Getting precise space needed for combine...

2014-02-08T05:04:08.604Z| SnapshotVMXCombiner| I120: DISKLIB-LIB : Upward Combine 2 links at 1. Need 4 MB of free space (1318903 MB available)
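
For anyone wanting to check how close their own VMs run to the limit, a small Python sketch along these lines can compute the gap between the "Getting precise space" line and the following "Upward Combine" line. It assumes every vmware.log line starts with a UTC timestamp followed by "|", as in the two lines above; the helper names are hypothetical.

```python
from datetime import datetime


def parse_vmware_ts(line: str) -> datetime:
    """Parse the leading timestamp of a vmware.log line (text before the first '|')."""
    stamp = line.split("|", 1)[0].strip()
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Seconds between the timestamps of two log lines."""
    delta = parse_vmware_ts(end_line) - parse_vmware_ts(start_line)
    return delta.total_seconds()


# The two lines quoted in this post.
START_LINE = (
    "2014-02-08T05:03:57.988Z| SnapshotVMXCombiner| I120: DISKLIB-LIB : Free disk "
    "space is less than imprecise space neeeded for combine (0xa0ffb800 < "
    "0xe7db5800, in sectors). Getting precise space needed for combine..."
)
END_LINE = (
    "2014-02-08T05:04:08.604Z| SnapshotVMXCombiner| I120: DISKLIB-LIB : Upward "
    "Combine 2 links at 1. Need 4 MB of free space (1318903 MB available)"
)

precise_check_secs = elapsed_seconds(START_LINE, END_LINE)
```

For the pair above this works out to 10.616 seconds, so under the 12 second limit but not by much.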

dahdco · Feb 12, 2014 5:05 pm

Spoke too soon. The VM where we moved the config file to faster disk had another occurrence. VMware decided it was a locking issue and wanted me to reboot the backup server and try again. I did and it worked, but I think it's unrelated and things were just "faster" at the time and didn't trigger the 12 second timeout. I have logs where consolidation fails due to locking which look very different from these failures, which I uploaded to VMware today. Hopefully they'll actually look at the logs this time (pretty sure they didn't before contacting me).

So far I've had the error on 3 different VMs, across two different backup servers - we have 4. Two of the VMs had a single occurrence and the other has had 5+.

munklarsen · Mar 09, 2014 3:04 pm

Just wanted to mention this so that people don't blame veeam for this

We run TSM for VE in our environment and we have the same issues. Sometimes we can consolidate just fine, other times it doesn't work. What always works, however, is:

1) make snapshot (without snapping the memory)
2) consolidate
3) remove snap
4) consolidate
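
The four steps above could be scripted roughly as follows. This is only a sketch: it assumes a `vm` object that exposes the vSphere API task methods under the names pyVmomi uses (`CreateSnapshot_Task`, `ConsolidateVMDisks_Task`, `RemoveAllSnapshots_Task`) and a caller-supplied `wait` function such as pyVmomi's `pyVim.task.WaitForTask`. The snapshot name is made up, and step 3 here removes all snapshots rather than only the temporary one. The `RecordingVM` class is a stand-in so the sequence can be dry-run without a vCenter.

```python
def snapshot_then_consolidate(vm, wait):
    """Run the snapshot / consolidate / remove / consolidate sequence on `vm`.

    `vm` is assumed to behave like a pyVmomi vim.VirtualMachine;
    `wait` blocks until the task it is given has finished.
    """
    # 1) make a snapshot without the memory state
    wait(vm.CreateSnapshot_Task(
        name="consolidate-helper",  # hypothetical name
        description="temporary snapshot before consolidation",
        memory=False,
        quiesce=False,
    ))
    # 2) consolidate
    wait(vm.ConsolidateVMDisks_Task())
    # 3) remove the snapshot (assumption: removing all snapshots is acceptable)
    wait(vm.RemoveAllSnapshots_Task())
    # 4) consolidate again
    wait(vm.ConsolidateVMDisks_Task())


class RecordingVM:
    """Stand-in that only records which calls were made, for a dry run."""

    def __init__(self):
        self.calls = []

    def CreateSnapshot_Task(self, name, description, memory, quiesce):
        self.calls.append(("CreateSnapshot_Task", memory))
        return "task"

    def ConsolidateVMDisks_Task(self):
        self.calls.append("ConsolidateVMDisks_Task")
        return "task"

    def RemoveAllSnapshots_Task(self):
        self.calls.append("RemoveAllSnapshots_Task")
        return "task"


dry_run = RecordingVM()
snapshot_then_consolidate(dry_run, wait=lambda task: None)
```

Against a real environment the `wait` argument matters: each step has to finish before the next one starts, otherwise the second consolidate can race the snapshot removal.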

teknomage · Apr 30, 2014 5:18 pm

I was running into the same issue (Error: An error occurred while consolidating disks: msg.snapshot.error-FAILED. The maximum consolidate retries was exceeded for scsix:x) and your steps fixed me right up. Thanks munklarsen.

cdickerson · May 05, 2014 12:21 am

Had this same problem. At first the VMware engineer said there was no change to the way snapshots were removed in vSphere 5.5; I quickly told him he was wrong. Here is the fix VMware provided to me.

Pick the VM(s) most affected. You need to add a parameter to the virtual machine configuration.

You can do this by:

• Shut down the virtual machine
• Right-click the virtual machine and click Edit Settings.
• Click the Options tab.
• Under Advanced, click General.
• Click Configuration Parameters and add snapshot.maxConsolidateTime = 30
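
Configuration parameters added this way end up as `key = "value"` lines in the VM's .vmx file. As an illustration only, the Python sketch below sets such a line in a block of .vmx text. It works on a string and does not touch a host; the sample .vmx content is invented, and the client steps above remain what VMware actually described. The VM should be powered off either way.

```python
def set_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with `key = "value"` set, replacing any existing entry."""
    new_line = f'{key} = "{value}"'
    out = []
    replaced = False
    for line in vmx_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if name == key:
            # Keep a single entry for the key and drop any duplicates.
            if not replaced:
                out.append(new_line)
                replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Invented sample content, not taken from a real VM.
SAMPLE_VMX = 'displayName = "example-vm"\nscsi0.virtualDev = "pvscsi"\n'

UPDATED_VMX = set_vmx_option(SAMPLE_VMX, "snapshot.maxConsolidateTime", "30")
```

Calling it again with a different value replaces the existing line instead of appending a second one.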

award@kahnlitwin.com · Jun 19, 2014 10:09 am

I had this problem and found that the Veeam server still had the affected VM's disk mounted. I removed the disk from the Veeam server and consolidated without problems.

Vitaliy S. · Jun 19, 2014 8:09 pm

Yes, that could be one of the reasons for this issue as well. On top of that, if you happen to see disks not being unmounted from the proxy server again, contact our technical team so they can investigate this behavior. Thanks!

TBone · Jun 23, 2014 1:42 pm

I will add that we have been having a similar problem with ESX 5.1. After the backup of the server completes, Veeam gets a notice that the snapshot has been removed. However, vCenter starts throwing errors that consolidation is required. Examining the config of the server in question, it is still running off a snapshot, despite the fact that Snapshot Manager doesn't show anything.

I have just opened a ticket with VMware so we'll see what they make of it. This problem does not happen every day, but it is generally the same server (a SQL 2008R2). It did not start happening until we moved from a Lefthand SAN to Nimble Storage.
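
One rough way to spot a VM that is "still running off a snapshot" is to look at the disk file names in its config. The Python sketch below flags paths that end in a six-digit suffix such as `-000002.vmdk` or `-000002-delta.vmdk`, the naming seen in the log in the first post. That pattern is an assumption about naming only, so a match is a hint rather than proof, and the example paths are invented.

```python
import re

# Assumption: snapshot delta disks carry a six-digit suffix, as in
# "<SERVERNAME>-000002-delta.vmdk" from the log in the first post.
DELTA_RE = re.compile(r"-\d{6}(?:-delta)?\.vmdk$")


def snapshot_backed_disks(disk_paths):
    """Return the disk paths whose names look like snapshot deltas."""
    return [path for path in disk_paths if DELTA_RE.search(path)]


# Invented example paths in the "[datastore] folder/file" style.
EXAMPLE_PATHS = [
    "[datastore1] sqlvm/sqlvm.vmdk",
    "[datastore1] sqlvm/sqlvm_1-000002.vmdk",
    "[datastore1] sqlvm/sqlvm-flat.vmdk",
]

suspect_disks = snapshot_backed_disks(EXAMPLE_PATHS)
```

If Snapshot Manager shows nothing but this list is not empty, the VM is in the state described above and needs consolidation.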

Vitaliy S. · Jun 23, 2014 3:30 pm

Could it be the issue with datastore performance not having enough IOPs to commit the snapshot? When you see this issue happening, can you please check the performance graphs of the source datastore?

Peejay62 · Sep 16, 2014 8:29 am

TBone wrote:I will add that we have been having a similar problem with ESX 5.1. After the backup of the server completes, Veeam gets a notice that the snapshot has been removed. However, vCenter starts throwing errors that consolidation is required. Examining the config of the server in question, it is still running off a snapshot, despite the fact that Snapshot Manager doesn't show anything.

I am curious, did you ever solve this or find out the cause? I am seeing some similar behaviour.

Thanks, Peter

veremin · Sep 16, 2014 9:03 am

Hi Peter, we've provided some explanation on that issue previously; should clarify the nature of the experienced behavior. Thanks.

Peejay62 · Sep 16, 2014 9:25 am

Vladimir, thank you. More and more I am convinced that the issue indeed lies within vSphere. Just to illustrate: I defined a new job yesterday containing 17 VMs. They had never been snapshotted before. I ended up with 6 VMs needing consolidation afterwards. The common factor is that within this job all VMs in need of consolidation ran on the same host (not the same datastore). VMware reports that snapshot removal is complete, but browsing the datastore shows it is not. Seems like the actual running of the cleanup gets lost, host overloaded or some kind of a command queue loss??
Anyway, I am trying to find a good way to avoid the consolidations, it's getting kind of annoying.

Thanks, Peter

Vitaliy S. · Sep 22, 2014 9:48 am

Peejay62 wrote:Vmware reports that snapshot removal is complete, browsing datastore shows it not. Seems like the actual running of the cleanup gets lost, host overloaded or some kind of a command queue loss??

Do you run your jobs using vCenter Server connection or direct ESXi host? Try to switch between these two and see if you observe the same snapshot consolidation error message or not.

Peejay62 · Sep 22, 2014 11:53 am

We run using a vCenter connection. I am afraid it will be too much effort to connect to the ESXi host directly because I don't know where, when or on what host it will happen... Probably I can make a new job, picking a specific host and selecting all the VMs running on that one. Then I have to sit and wait to see if the consolidation issue occurs. That might just be the case. You gave me a good idea to dig in and find the root cause for this, thanks

Peter

Vitaliy S. · Sep 22, 2014 12:02 pm

Yes, I was referring to adding the "problematic" host as a standalone one via IP address and then run a couple of test jobs to see if this issue can be reproduced or not.

seadave · Jan 17, 2015 6:58 am

I've also had this error. We've been running Veeam 8.0.0.917 for about 3 months without issue. This week we used a mail archiving tool to export a large number of email messages, ~1.5M (150G), to a file share. A few days after this, when Veeam snapshotted our mailbox server, it filled up the LUN it was on. The LUN had quite a bit of free space, ~500G, but the VM is 2.2TB so it's possible it needed more than that. Awoke to find the VM hung wanting to retry a file operation. I first expanded free space and then selected "Retry" in vCenter. That wouldn't work so I chose Continue. The VM was still hung so it needed a reboot. It came back up fine. Realized later in the day the VM still needed snapshot consolidation. I attempted one and I got the error:

An error occurred while consolidating disks: 9 (Bad file descriptor).

Realized later that the consolidation process kept cycling during the day. I've never seen that happen before. It appeared like it knew there was a problem and just kept running sometimes failing and sometimes succeeding until all deltas were processed. It was weird. Finally resolved itself right after I had paid for a $1200 afterhours VMware support credit. I couldn't believe it.

Later that night I manually ran Veeam against that VM again and again I got the error after the backup had completed successfully and vCenter was attempting to consolidate the snapshot. This time it only took four attempts before it was resolved. I think I might be getting hit by the 12s bug. Not sure. Wondering about what kind of risk this involves moving forward. I think it might be time to build a clean mailserver and migrate mailboxes to free up the whitespace in the old one.

foggy · Jan 19, 2015 2:09 pm

Looks like Snapshot Hunter tries to consolidate snapshots. You can ask support for logs review to identify whether everything works as expected.

dellock6 · Jan 19, 2015 3:58 pm

You can quickly look in history as explained in the blog post linked by Alexander and search for snapshot hunter activities, just to be sure.

ken.wilson · May 27, 2015 1:57 pm

award@kahnlitwin.com wrote:I had this problem and found that the Veeam server still had the affected VM's disk mounted. I removed the disk from the Veeam server and consolidated without problems.

This was my issue as well with one of my VMs. One of my proxies still had the disk attached, but strangely the job log didn't show an issue and the job was successful. Either way, once the disk was removed from the proxy I was able to consolidate the disk.

rafael.gonzalez · Jun 01, 2015 5:53 pm

Ok, so I was hitting the same issue - a particularly busy test Exchange VM was successfully backing up, but would bomb out on the consolidation step - so I always had the yellow bang on the VM. The only way to remedy that was to power off the VM and delete snap/consolidate, then power back on. Not good.

I spoke to veeam (support case ID 00835299) and vmware support over several days, and the word I got was that in our scenario, the situation just could not be dealt with with vanilla settings - so - they sent me this article:

http://kb.vmware.com/selfservice/micros ... Id=2082886

(Article number is 2082886 in case forum hoses the link.)

The short version is that I had to change a setting for our problem VM that allowed it to remain in an "unresponsive state" for the duration needed to consolidate the disks. (I used the PowerCLI option to target the one VM but didn't alter the time value under "Additional Info").

Well - I'm happy to report that the initial results look good, and the backups appear to be completing.

It's not a perfect solution, but until then - it increased joy.

iwik · Sep 29, 2015 8:00 am

Hi, I hit the same problem. My solution was to shut down the VM and run consolidation. It was successful when the VM was not running.

dharmon · Nov 19, 2015 4:10 pm

My fix: Even though none of my proxy servers, including the backup server, showed that they had a drive mounted for a backup or replication, once I stopped the backup server and proxy servers, the consolidation proceeded without error. I turned the backup server and proxy servers back on afterward.
