Host-based backup of VMware vSphere VMs.
Seve CH
Enthusiast
Posts: 69
Liked: 32 times
Joined: May 09, 2016 2:34 pm
Full Name: JM Severino
Location: Switzerland
Contact:

Slow snapshot consolidation on big VMs

Post by Seve CH »

Hello

I am investigating why we occasionally get snapshot consolidation errors ("Snapshot consolidation needed"): after the consolidation starts, it simply times out. Veeam detects these snapshots and retries, but with so many retries the VMs stay stunned for a long time, which causes other problems.
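For reference, a minimal PowerCLI sketch (an existing Connect-VIServer session is assumed) to list the VMs currently flagged for consolidation and trigger the consolidation manually, in case anyone wants to check outside of Veeam:

Code:

# Sketch: find VMs flagged "Consolidation needed" and consolidate their disks one by one.
# Assumes PowerCLI and an active vCenter connection.
Get-VM | Where-Object { $_.ExtensionData.Runtime.ConsolidationNeeded } | ForEach-Object {
    Write-Host "Consolidating disks of $($_.Name)"
    $_.ExtensionData.ConsolidateVMDisks()   # synchronous vSphere API call, can take a while
}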

Our environment:
  • Pure Storage arrays: 2 active clusters of different models (2x C60, 2x X20), different firmware per cluster.
  • ESXi 6.7 and ESXi 6.5 clusters. Different vCenters. Different VMware Tools versions.
  • Multi-terabyte VMs (thin provisioned, 1.5 TB or more used). Guest OS can be W2016, W2019, etc. Big datastores (15 TB or more), VMFS-6.
  • CBT enabled, hot-add backups via proxies (no Direct SAN). Veeam B&R 11a P20220302.
If I take a snapshot manually via the vSphere web client, consolidation is quick (1 minute or less). If I do a quick backup, snapshot consolidation may take 10 to 30 minutes... or sometimes (once every 20-30 backups) it fails and the VM is left requiring consolidation.
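For anyone who wants to reproduce the comparison, a rough PowerCLI sketch (the VM name is a placeholder; a vCenter connection is assumed) that times the removal of a manually created snapshot, which is what triggers the consolidation:

Code:

# Sketch: create a crash-consistent snapshot and time its removal/consolidation.
# 'TESTFS01' is a placeholder VM name; requires PowerCLI and an active vCenter connection.
$vm   = Get-VM -Name 'TESTFS01'
$snap = New-Snapshot -VM $vm -Name 'manual-consolidation-test' -Quiesce:$false -Memory:$false
Measure-Command { Remove-Snapshot -Snapshot $snap -Confirm:$false }   # elapsed time = consolidation time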

So I have a test file server, W2019 + IIS, with 2 disks (100 GB and 4 TB), both thin provisioned, 1.65 TB used in total. No writes that I am aware of. I make a quick backup and Veeam reads whatever it has to read. I do a second one immediately afterwards and Veeam reads 50 GB but transfers only 48 MB. That second quick backup still needs 19 minutes to complete (whole process), which is not so quick ;-).

In both cases, the snapshot consolidation takes 12 or more minutes. The snapshot itself is small (the sesparse file is 16 GB, but du -h shows only 120 MB actually used), yet during consolidation ESXi writes the whole time (10+ minutes) at 80-120 MB/s to the storage array. The process involved in the I/O is the VM world process.

I will open a case with VMware (and maybe with Veeam), but I'm asking here just in case you have any suggestions ;-)

Best regards
vmtech123
Veeam Legend
Posts: 235
Liked: 134 times
Joined: Mar 28, 2019 2:01 pm
Full Name: SP
Contact:

Re: Slow snapshot consolidation on big VMs

Post by vmtech123 »

Interested to know the resolution as well. Unfortunately I back up directly from SAN, so my setup is a bit different. I have many 30 TB+ VMs, and by the time they were done backing up, the snapshots were huge. Direct SAN was the only way to get rid of the VMware snapshot quickly.
PetrM
Veeam Software
Posts: 3264
Liked: 528 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Slow snapshot consolidation on big VMs

Post by PetrM »

Hello,

This looks like an infrastructure issue; I believe VMware support should carry out the root cause analysis. The only idea that comes to mind is to reduce the number of concurrent tasks processing VMs from the same datastore and see whether the issue persists. Also, Direct SAN mode or the built-in integration with Pure Storage FlashArray is worth testing.

Thanks!
Seve CH
Enthusiast
Posts: 69
Liked: 32 times
Joined: May 09, 2016 2:34 pm
Full Name: JM Severino
Location: Switzerland
Contact:

Re: Slow snapshot consolidation on big VMs

Post by Seve CH »

Hi

Well, VMware support did a couple of tests and said that vSphere was working fine.
Snapshots taken via the vSphere web client consolidate fast and without any problem. Snapshots created for Veeam backups take very long to consolidate.

Example of test:
Web server, 1.5 TB used (4 TB thin provisioned). No I/O on the guest.
vSphere snapshot consolidation: 14s
Veeam snapshot consolidation after a Quick Backup: 11 minutes 17 seconds.

Their answer: Veeam must have a look at it. I've opened Veeam case #05649648.

In the meantime, we were also able to investigate the problem further:
  • All VMs can be impacted, regardless of the kind of backup (application-aware or crash-consistent) or size (multi-TB or <100 GB).
  • VMs without I/O are also impacted.
  • Snapshots taken with the vSphere Client consolidate normally (quickly). Snapshots after a Veeam backup can take up to 40 minutes to consolidate.

Code:

Example vmware.log entries from a consolidation failure:
2022-09-27T21:55:27.479Z| vmx| I125: Mirror_DiskCopy: Starting disk copy.
2022-09-27T22:07:33.355Z| vmx| W115: Mirror: scsi0:2: Failed to copy disk: Timeout
2022-09-27T22:07:33.355Z| vmx| W115: MirrorDiskCopyGetCopyProgress: Failed to get disk copy progress for source disk: '/vmfs/volumes/6034cce3-dd739998-3780-8030e0327f98/ECHDXPV1PAN0001/ECHDXPV1PAN0001_2-000001.vmdk' and destination disk: '/vmfs/volumes/6034cce3-dd739998-3780-8030e0327f98/ECHDXPV1PAN0001/ECHDXPV1PAN0001_2.vmdk'
2022-09-27T22:07:33.355Z| vmx| I125: ConsolidateDiskCopyCB: Mirror Disk copy failed on src disk: /vmfs/volumes/6034cce3-dd739998-3780-8030e0327f98/ECHDXPV1PAN0001/ECHDXPV1PAN0001_2.vmdk and destination disk: /vmfs/volumes/6034cce3-dd739998-3780-8030e0327f98/ECHDXPV1PAN0001/ECHDXPV1PAN0001_2.vmdk.
...
2022-09-27T22:07:33.890Z| vcpu-0| I125: ConsolidateItemComplete: Online consolidation failed for disk node 'scsi0:2': The operation failed (36).
As we have only been receiving the on-call alerts for the last month or so, I will check whether it has something to do with our update to Veeam v11a.

Best regards
Seve
PetrM
Veeam Software
Posts: 3264
Liked: 528 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Slow snapshot consolidation on big VMs

Post by PetrM » 1 person likes this post

Hello,

Technically, the term "Veeam snapshot" is not correct because the backup application just sends a request to the hypervisor to create or consolidate a snapshot. Also, I'm not sure the test with a simple snapshot creation via the vSphere client is relevant: I believe you should create a snapshot manually and perform a .vmdk read in the corresponding transport mode before checking the snapshot consolidation time. Our support engineers can help simulate .vmdk data reads over the native VADP functions and move forward with the investigation of this infrastructure issue together with VMware support.

Thanks!
vmtech123
Veeam Legend
Posts: 235
Liked: 134 times
Joined: Mar 28, 2019 2:01 pm
Full Name: SP
Contact:

Re: Slow snapshot consolidation on big VMs

Post by vmtech123 » 1 person likes this post

Have you checked your firewall logs too? It sounds bizarre, but I recently had some weird Veeam and SRM issues where things worked, just not correctly, because one port was missed in the config. It would time out waiting for something, but the error was not intuitive at all.

If the snapshot works in VMware and Veeam just calls that function, it seems like it COULD be a communication breakdown somewhere.
Seve CH
Enthusiast
Posts: 69
Liked: 32 times
Joined: May 09, 2016 2:34 pm
Full Name: JM Severino
Location: Switzerland
Contact:

Re: Slow snapshot consolidation on big VMs

Post by Seve CH » 1 person likes this post

Hi
Some updates.

We have analyzed the vCenter logs going back to May 2022 (Veeam Instant Recovery is great). The problem was already there (only 3 times in a month), so it has nothing to do specifically with Veeam 11a. What we experience now was also happening with v10.

Veeam support provided some interesting ideas for optimizing and debugging the hot-add setup, but we have moved to Direct SAN instead. This takes the hot-add proxies out of the equation.
After 2 days, Direct SAN seems OK. No more stuck snapshots. It is still too early to declare victory; we will see over the weekend.

We also analyzed our infrastructure and found this:
https://kb.vmware.com/s/article/86291 "smx-provider crashes due to memory allocation issues in ESXi 6.x/7.x on HPE hosts"
Some of our servers are HPE Gen10 servers that were installed with an ISO built for Gen9. This installed Gen9 monitoring agents, and one of them crashes on Gen10 servers after starving the memory pool (nice...). This could potentially disrupt networking (our arrays are iSCSI).

CAB meeting on Monday. Most probably we will:
- Remove the faulty agents from all Gen10 servers, as recommended by HPE
- Update the agents (if needed) on the Gen9 servers
- Update the storage firmware (there is a bug with "write same" on one of the storage clusters)
- Stop debugging the problem. We are currently migrating everything to vSphere 7 on freshly installed servers.

Best regards
vmtech123
Veeam Legend
Posts: 235
Liked: 134 times
Joined: Mar 28, 2019 2:01 pm
Full Name: SP
Contact:

Re: Slow snapshot consolidation on big VMs

Post by vmtech123 »

Good luck. That is a good plan. v7 is required anyway, as v6 is EOL.

ALWAYS check the VMware-supported hardware, drivers, storage, etc. Once you get to a recent version, Veeam is compatible with VMware 7.0 and 7.3, so there are no issues there.
HostedBDR
Novice
Posts: 6
Liked: 2 times
Joined: Dec 01, 2021 8:44 pm
Full Name: Paul Huff
Contact:

Re: Slow snapshot consolidation on big VMs

Post by HostedBDR »

You mention:

Hot-Add backups

If you disable hot-add (or make it unavailable) and just use NBD, does the problem disappear? Hot-add is wonderful, but it can indeed lead to what you're describing (regardless of backup product). It happened back in ESXi 5, in 6, and it still does in 7. As always, make sure your hosts are patched and your vCenter has plenty of RAM; it also helps if you don't back up your VCSA while you're backing up the rest of the environment.

There's a plethora of tips, tricks and best practices, but yes, at times hot-add will leave a .vmdk attached or a snapshot open. The best practices really do mitigate most of this, but it still happens from time to time. The best fix is usually to make sure the .vmdks are no longer attached to the proxy, then open an additional snapshot on the affected VM and 'delete all' to clean it up.
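One rough way to check for that leftover state, as a PowerCLI sketch (the proxy name is a placeholder; this assumes each VM's files live in a folder named after the VM, which is the usual layout but not guaranteed):

Code:

# Sketch: list disks attached to a hot-add proxy whose backing .vmdk sits outside the proxy's
# own folder, i.e. disks most likely left over from a backup job. Placeholder proxy name.
$proxy = Get-VM -Name 'VEEAM-PROXY-01'
Get-HardDisk -VM $proxy |
    Where-Object { $_.Filename -notlike "*] $($proxy.Name)/*" } |
    Select-Object Name, Filename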

Typical culprits are job contention, patch levels, low vCenter RAM, and backing up the VC at the same time as other VMs. With NBD this conversation all but disappears. The others in this forum are right: v6 is quickly becoming EOL, I think it's official in 2 weeks. Regardless, this does happen in v7 too, but it is more of an environmental conversation than a vendor defect at this point.
Seve CH
Enthusiast
Posts: 69
Liked: 32 times
Joined: May 09, 2016 2:34 pm
Full Name: JM Severino
Location: Switzerland
Contact:

Re: Slow snapshot consolidation on big VMs

Post by Seve CH »

Hello,

We will debug the problem once the migration to vSphere 7 is 100% finished. Just in case somebody else is interested in the vCenter migration:

We have already hot-migrated one of the 6.5 vCenters to version 7 using the "lift" (or "elevator") approach: the lift arrives, the VMs get in, the lift goes up to the new vCenter, and the VMs leave the lift, all without interruption.

We did a fresh vCenter installation on v7 (the best option whenever possible, to avoid dragging old settings along). We also intend to consolidate several vCenters into this new one.
Some new ESXi hosts (new hardware, etc.) run v7; some old hosts will be reinstalled with ESXi 7.
Source and target clusters are configured with EVC. The new hosts are EVC-compatible with the old ones.
Source, lift and target hosts can all see the datastores where the VMs are currently running.

Preparation:
Create one cluster per vCenter to use for lifting VMs, with EVC enabled.
One host, the "lift", running the highest ESXi 6.x version compatible with both the old and new vCenter, and hardware-compatible (EVC) with the old and new hosts.
Confirm that the lift can vMotion with both worlds (old/new). Document the changes needed, if any, when switching environments (we required new VLANs for vMotion on the new one).
Make sure you have a recent, compatible PowerCLI PowerShell module to avoid delays updating it (we lost a lot of time on this).
Export all VM tag assignments using PowerCLI if you use tags (we map VMs + guest processing settings + credentials using tags, for instance); a small export sketch follows below.
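For the tag export, a minimal PowerCLI sketch (assumes an active connection to the old vCenter; the CSV file name is arbitrary):

Code:

# Sketch: dump every VM/tag pair to CSV so the assignments can be re-applied on the new vCenter.
Get-VM | ForEach-Object {
    $vm = $_
    Get-TagAssignment -Entity $vm | ForEach-Object {
        [PSCustomObject]@{
            VM       = $vm.Name
            Category = $_.Tag.Category.Name
            Tag      = $_.Tag.Name
        }
    }
} | Export-Csv -Path .\vm-tag-assignments.csv -NoTypeInformation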

Procedure:
Disable the concerned backup jobs.
Connect the lift to the old vCenter, cluster "Lift".
vMotion of some VMs to the lift.
Disconnect the lift. Connect the lift to the new vCenter, "lift" cluster.
vMotion the VMs to the new cluster/hosts, resource pools, etc.
Use Veeam's vCenter migration utility to remap the IDs (https://www.veeam.com/kb2136) (read the documentation to see when it is applicable and when it is not):
- Prepare the migration task.
- Review the mappings file and leave only the VMs you intend to migrate (plus the vc-old -> vc-new line).
- Use Get-VM lab1_GW | Select-Object Name,Id on the new and old vCenter to help you identify the VMs in case of doubt.
- Duplicated VMs are commented out and must be uncommented (remove the // before the ID). This is not clear in the tool's documentation.
- Unneeded lines can be deleted for clarity's sake instead of commenting them out (also not clear in the doc).
- Execute the migration task.
Test a "quick backup" of one VM to validate that it succeeds (DO NOT re-enable the backup jobs yet). It must succeed.
Disconnect the lift from the new vCenter.
Reconnect the lift to the old vCenter.
Remove the old, now orphaned VMs from the old vCenter.
Reassign the new backup tags (we scripted everything to avoid errors); a re-import sketch follows after this list.
Repeat the procedure with a new batch of VMs, or re-enable the backup jobs if you are done for the day (we did the migration in 2 days, during office hours).
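For the tag reassignment step, a companion sketch (assumes a connection to the new vCenter and that the same tag categories and tags already exist there; not our exact script, just the idea):

Code:

# Sketch: re-apply the exported tag assignments on the new vCenter.
Import-Csv .\vm-tag-assignments.csv | ForEach-Object {
    $vm  = Get-VM -Name $_.VM -ErrorAction SilentlyContinue
    $tag = Get-Tag -Name $_.Tag -Category $_.Category -ErrorAction SilentlyContinue
    if ($vm -and $tag) {
        New-TagAssignment -Tag $tag -Entity $vm | Out-Null
    }
}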

Advantages:
Greenfield (everything is freshly installed, tested, qualified and documented; no junk migrated).
No downtime.
The old and new vCenters stay functional at all times (no impact on DRS, VDI provisioning, etc.).
No loss of redundancy in any cluster (we aren't disconnecting production hosts).
No need for extra storage (no copy/import of VMs, no restores, no replica failovers, etc.).
No broken backup chains.
CBT keeps working; no need to fully read the VMs again (we are considering doing that in phases anyway to make sure CBT is clean).
Progressive: you set the pace of the migration. You can migrate a couple of VMs, test, and once you are confident, migrate more VMs on another day.

This is a bit of kung-fu, but every time I use this system, my colleagues and boss are quite happy with the results :-).

Best regards.
MaraDantuono
Lurker
Posts: 2
Liked: 1 time
Joined: Feb 18, 2022 8:18 am

Re: Slow snapshot consolidation on big VMs

Post by MaraDantuono »

Any news on this topic?
PetrM
Veeam Software
Posts: 3264
Liked: 528 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Slow snapshot consolidation on big VMs

Post by PetrM »

Hello and Welcome to Veeam R&D Forums!

What kind of update are you looking for? The issue does not come from our code and must be investigated at the infrastructure level. Please see the posts above for more details.

Thanks!
MaraDantuono
Lurker
Posts: 2
Liked: 1 time
Joined: Feb 18, 2022 8:18 am

Re: Slow snapshot consolidation on big VMs

Post by MaraDantuono » 1 person likes this post

I was wondering if the update to vSphere 7 resolved the problem.
Thanks for your answer!
Seve CH
Enthusiast
Posts: 69
Liked: 32 times
Joined: May 09, 2016 2:34 pm
Full Name: JM Severino
Location: Switzerland
Contact:

Re: Slow snapshot consolidation on big VMs

Post by Seve CH »

Hi MaraDantuono
No. The migration to vSphere 7 didn't fix the problem. It is still painfully slow, and I have had no time allotted to debug it.
Best regards
es-aelaan
Lurker
Posts: 1
Liked: never
Joined: Jan 28, 2020 8:35 pm
Full Name: Al van der Laan
Contact:

Re: Slow snapshot consolidation on big VMs

Post by es-aelaan »

I started seeing something similar and noticed that my vMotion was very aggressive. Could it be that your servers are relocating VMs at the same time Veeam creates a snapshot? I have trimmed back the settings in VMware and will monitor for a week to see if it improves.