Seeing greatly increased snapshot deletion times in ESX7 vs ESX6

Post by **mikeely** » Feb 05, 2021 7:40 pm

This probably isn't a Veeam-specific issue beyond the fact that Veeam is writing and removing snapshots all the time, but I thought I'd ask in here to see if anyone else has had the same experience. We stood up a new cluster on brand-new hardware and have been migrating VMs over to it. We're going from Intel to AMD with this change, so we have to shut down the VMs as part of the move. We're also upgrading the VMware hardware version on the VMs to get them current, although this doesn't seem to make any difference in the snapshot deletion time.

Anyhow what I'm seeing is VMs that previously took on average 4-6 seconds for snapshot deletion are now taking 40-60 seconds, consistently. It's really bad as the VMs are being stunned for long enough during snapshot consolidation that it's causing crashes, alerts, etc. Is this anything somebody else here has seen before, and if so were you able to resolve the issue?

Post by **foggy** » Feb 05, 2021 7:53 pm

Hi Mike, what about the underlying storage where VMs reside - has it also changed?

Post by **mikeely** » Feb 05, 2021 7:55 pm

Nope, same storage: Tintri with the VAAI plugin.

Post by **Gostev** » Feb 05, 2021 8:21 pm

As far as I remember, Tintri was truly one of a kind in how it integrates with VMware: some sort of very special storage virtualization technique that allowed them to improve the performance of VM snapshot management operations specifically. Sounds like this logic just does not play well with ESXi 7?

Post by **soncscy** » Feb 05, 2021 9:48 pm

I'm curious, what isolation tests have you done? As long as a backup uses vADP, it's a standard call to the same type of API, you can even reproduce with PowerCLI if you're interested.

For your storage backing the v7 environment, how do normal snapshot deletions for snapshots that run for the same amount of time fare? Both from the host itself and from remote servers using PowerCLI?

Post by **mikeely** » Feb 06, 2021 12:10 am

I don't have great data for manual snapshot removal either via PowerCLI or the vSphere web console because we just don't do that many of those - one-offs for big software updates are about it. I did a test just now both on web and PowerCLI, and in both cases the VMs I was snapshotting took a long time to snapshot but consolidated almost instantly.

Is there a specific powershell way to use vADP or does something like

Code:

 Get-Vm vmname | New-Snapshot -name "foo" -Quiesce -Memory

use the API?

Post by **soncscy** » Feb 06, 2021 7:24 am

Unless VMware has sneakily changed something (like Microsoft with Hyper-V...) it should be the exact same call as here: https://vdc-download.vmware.com/vmwb-re ... -guide.pdf (pg 69)

Code:

    // At this point we assume the virtual machine is identified as ManagedObjectReference vmMoRef.
    String SnapshotName = "Backup";
    String SnapshotDescription = "Temporary Snapshot for Backup";
    boolean memory_files = false;
    boolean quiesce_filesystem = true;
    ManagedObjectReference taskRef = serviceConnection.getservice().CreateSnapshot_Task(vmMoRef,
        SnapshotName, SnapshotDescription, memory_files, quiesce_filesystem);

That should be exactly what PowerCLI calls.

For your tests, just to be sure though: were the length of time on snapshot and the time since the last backup roughly equivalent to the backup situation? Time on snapshot tends to be a factor that gets overlooked, I think, since you have redo log growth even for allegedly inactive machines. A few years back I had a client with long snapshot consolidation on their Lotus Notes machines, and it turned out the backup window was overlapping with some replication/garbage collection procedure, which churned tons of data and made the snapshot redo logs bloat.

I'm not saying that's specifically your issue; it's more to illustrate the effect that time on snapshot can have.
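
To put rough numbers on the time-on-snapshot effect, here is a back-of-the-envelope sketch in Python. The function name, the write rate, and the `rewrite_fraction` parameter are all illustrative assumptions, not measurements from this environment; the only point is that delta (redo log) size scales with how long the snapshot stays open.

```python
def estimate_delta_mib(write_rate_mib_s: float,
                       hold_seconds: float,
                       rewrite_fraction: float = 0.0) -> float:
    """Rough estimate of snapshot delta growth, in MiB.

    rewrite_fraction is a hypothetical knob: the share of writes that land on
    blocks already present in the delta and so do not grow it further.
    """
    if not 0.0 <= rewrite_fraction <= 1.0:
        raise ValueError("rewrite_fraction must be between 0 and 1")
    if write_rate_mib_s < 0 or hold_seconds < 0:
        raise ValueError("rate and hold time must be non-negative")
    return write_rate_mib_s * hold_seconds * (1.0 - rewrite_fraction)


# Illustrative inputs only: same VM, same write rate, different hold times.
quick_manual_test = estimate_delta_mib(5.0, 2 * 60)
backup_window = estimate_delta_mib(5.0, 45 * 60)
print(f"2-minute snapshot:  ~{quick_manual_test:.0f} MiB of delta")
print(f"45-minute snapshot: ~{backup_window:.0f} MiB of delta")
```

With these made-up inputs, a VM writing 5 MiB/s accumulates about 600 MiB of delta over a 2-minute manual test but about 13,500 MiB over a 45-minute backup window, which is why a quick manual test may not reproduce what happens at consolidation time during a backup.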

Feb 08, 2021 10:23 am

As far as I know, VMware did not change the snapshot commit process.

As usual, the key factors are datastore performance, the snapshot location (if it changed), the amount of data in the snapshot (changes during the snapshot lifetime) and the overall I/O load at the time the snapshot gets committed.

It does not matter whether the snapshot was triggered by Veeam, the API, the CLI or the UI; the process is always the same.

Post by **mikeely** » Feb 08, 2021 6:01 pm

I was able to prove the issue with VMware support, and they've escalated it to engineering. Something's different, but I'm unsure whether it has to do with something VMware did, differences in behavior between Intel (our old infra) and AMD (new), or something else entirely. It's pretty ugly though. I'll report back here once I have news from VMware in case another Veeam user runs into this.

Post by **mikeely** » Feb 08, 2021 11:54 pm

Yeah, they're trying to blame Veeam:

I see that the previous engineer was able to identify that the issue is residing on your 7.0 vCenter and the snapshots are slow during the Veeam backups?

Have you gotten in touch with Veeam backupd?

No mention of the 2+ gigs of support packages I uploaded to the ticket, either.

Feb 09, 2021 12:10 am

But of course... it's an easy way out

Which is exactly why experienced users already recommended above that you try to reproduce the issue by creating, holding and removing VM snapshots without Veeam in the picture. On second thought, though, this may prove to be a challenge if your issue also requires significant concurrent load on Tintri, such as that from backup jobs reading data. In that case, running IOmeter in a few VMs while the snapshot is being removed should do.

Also, you really need to put Tintri in the loop with VMware, as honestly they would be the primary suspect for me, as opposed to VMware. By now, we can be confident that ESXi 7 itself does not have some major regression with snapshot deletion times; otherwise, the Veeam forums would have had a 10+ page topic devoted to this issue by now.
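
One way to run the create/hold/remove test described above without Veeam is to time each phase separately, so that a slow create is not mistaken for a slow removal. The sketch below is a generic Python timing harness. `create` and `remove` are hypothetical callables that you would wire to whatever actually drives the snapshot (a PowerCLI or SDK wrapper, for example); the stubs here only stand in for them and do nothing.

```python
import time
from typing import Callable, Dict


def time_snapshot_cycle(create: Callable[[], None],
                        remove: Callable[[], None],
                        hold_seconds: float,
                        clock: Callable[[], float] = time.perf_counter,
                        sleep: Callable[[float], None] = time.sleep) -> Dict[str, float]:
    """Time the create, hold and remove phases of one snapshot cycle.

    clock and sleep are injectable so the harness can be tested without waiting.
    """
    t0 = clock()
    create()                  # take the snapshot
    t1 = clock()
    sleep(hold_seconds)       # keep it open so the delta has time to grow
    t2 = clock()
    remove()                  # delete/consolidate the snapshot
    t3 = clock()
    return {"create_s": t1 - t0, "hold_s": t2 - t1, "remove_s": t3 - t2}


# Stub callables: replace these with real snapshot operations.
def fake_create() -> None:
    pass


def fake_remove() -> None:
    pass


timings = time_snapshot_cycle(fake_create, fake_remove, hold_seconds=0)
print(timings)
```

To mimic a backup more closely, `hold_seconds` would be set to roughly the duration of the backup job, with a load generator running in the guest during the hold.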

Post by **mikeely** » Feb 09, 2021 12:18 am

Yeah, we're working the Tintri angle as well. I'd open a Veeam ticket on this if I thought there was any reason to, but since Veeam is making the same API calls to both systems, there's no logic behind that.

+1 on the iometer suggestion, and I'll do something to dirty a lot of blocks on the VM I'm testing with.

Post by **mikeely** » Feb 12, 2021 6:05 pm

As we've been working through this, a new data point has emerged: the difference in snapshot consolidation times is very strongly correlated with switching over to hot-add versus NBD. I can't think of what mechanism might cause this, but we saw it when we moved all our VMs at a given location over to hot-add while changing nothing else on those VMs. They're on the same hardware with the same backing storage, on the same ESX servers and the same vSphere server, with the same versions of everything.

I've opened ticket 04643396 at Sev2 since the minute-plus stun times are crashing things in our environment.

Post by **mikeely** » Feb 12, 2021 9:55 pm

Had a great call with support, and it turns out the problem is basically this:
https://kb.vmware.com/s/article/2010953

VMs which are not on the same compute host as the proxy will be stunned for at minimum about 40 seconds. Moving to NFSv4 at the ESX datastore level is not an option for us.

One possible approach would be to have one proxy per ESX host, but here are the problems I can't see a way past so far (please feel free to contribute solutions to them):

1. With multiple proxies come multiple threads, and we would almost immediately overwhelm the Tintri with too many backup threads.
2. Setting "automatically select proxy" only chooses the proxy with the fewest threads being used - it does not AFAIK support ESX host affinity - but it would be nice if it did!

Our backup model is that any VM created gets pulled in by Veeam unless it's excluded either individually or by being placed in an excluded folder in the vSphere client. To set hard proxy-to-VM-to-host affinity we'd need to create one backup job per ESX host, set the proxy for that job AND never migrate the proxy off that ESX host AND never migrate the other VMs off that host. Might as well use some free virtualization platform that doesn't support live migration at that point.
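
For illustration, the kind of host-affinity selection wished for above could look like the following Python sketch. This is a hypothetical model, not how Veeam's scheduler is implemented: the `Proxy` class, its fields, and `pick_proxy` are all invented for the example. It prefers a proxy on the same ESX host as the VM and falls back to the least-busy proxy elsewhere, while a per-proxy task cap keeps the total thread count bounded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proxy:
    name: str
    esx_host: str       # host the proxy VM currently runs on
    active_tasks: int   # backup tasks currently assigned
    max_tasks: int      # per-proxy cap, to avoid flooding the storage


def pick_proxy(vm_host: str,
               proxies: List[Proxy],
               prefer_same_host: bool = True) -> Optional[Proxy]:
    """Pick a proxy for a VM, preferring one on the VM's own ESX host."""
    available = [p for p in proxies if p.active_tasks < p.max_tasks]
    if not available:
        return None  # everything is at its cap; the task has to wait
    if prefer_same_host:
        local = [p for p in available if p.esx_host == vm_host]
        if local:
            return min(local, key=lambda p: p.active_tasks)
    # No usable local proxy: fall back to plain least-busy selection.
    return min(available, key=lambda p: p.active_tasks)


proxies = [
    Proxy("proxy-a", "esx1", active_tasks=3, max_tasks=4),
    Proxy("proxy-b", "esx2", active_tasks=0, max_tasks=4),
]
chosen = pick_proxy("esx1", proxies)
print(chosen.name if chosen else "no proxy available")
```

Because the choice is made per task from each VM's current host, this kind of soft affinity would survive live migration of either the VM or the proxy, unlike one pinned job per ESX host.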

Post by **Gostev** » Feb 13, 2021 1:13 am

Can you not just use Direct NFS transport mode instead of hot add? This issue above was the very reason why we added this transport mode in the first place...

Feb 14, 2021 10:21 pm

Correct, by far the best way to address this is to use DirectNFS mode instead of hotadd and it works amazingly well (it is one of my favorite Veeam features).

However, just to let you know, ESX affinity with hotadd mode actually is supported as well, but it must be enabled via a registry key EnableSameHostHotaddMode. This regkey is documented in Veeam KB1681 in the "Known issues with NFS 3.0 Datastores" section. However, I would only do this if, for some reason, Direct NFS can't be used in your case.

Post by **mikeely** » Feb 16, 2021 11:32 pm

Our VBR server is a physical host so hotadd isn't an option using it. What's the Linux equivalent to that registry key?

Questions about DirectNFS mode:
1. Does it still require a physical host with an HBA or is that no longer a requirement?
2. Linux support?

Post by **Gostev** » Feb 16, 2021 11:47 pm

It's the same key.

1. DirectNFS never required a physical host or an HBA.
2. If you mean Linux proxy, then in v11 Linux proxies support DirectNFS too.

Post by **mikeely** » Feb 16, 2021 11:51 pm

Thanks. Looking forward to 11 for a lot of reasons. Going to go shuffle through getting DirectNFS working.

Post by **Gostev** » Feb 17, 2021 12:05 am

You should be able to just use DirectNFS from your physical backup server host?

Post by **mikeely** » Feb 17, 2021 12:07 am

Yeah that's the plan, and we still hadn't decommissioned the Windows proxy at our other datacenter. It's a VM but as I read the docs that shouldn't be a problem for DirectNFS right?

Post by **mikeely** » Feb 17, 2021 12:56 am

Ah, that's nice. I had everything set up for DirectNFS and didn't even know it - all I had to do was change the radio button on the proxy settings. Getting about 300MB/s performance on a largish backup job right now.

Thanks y'all.
