olafurh
Service Provider
Posts: 31
Liked: 18 times
Joined: Oct 29, 2014 9:41 am
Full Name: Olafur Helgi Haraldsson
Location: Iceland
Contact:

VBR v13 slow Instant Recovery boot from S3-compatible object storage

Post by olafurh » 1 person likes this post

Hi all,

We are troubleshooting a significant Instant Recovery performance regression after upgrading from VBR v12 to v13 (13.0.2.27).

Support case: #08108902

I would like to hear from others who are using Instant Recovery directly from S3-compatible object storage after upgrading to v13. Specifically: have you seen any change in instant boot performance, vPower NFS behavior, or VMware Tools/SureBackup stabilization time after the upgrade?

Environment summary:
- Backups are stored on a large local on-prem S3-compatible object storage installation
- No WAN, internet, VPN, or cloud-provider path involved
- S3 has high aggregate read capability; we are not seeing general backup or normal restore performance issues
- Multiple SureBackup/Instant Recovery hosts, all large quad-CPU servers with ~1.5 TB RAM
- Instant Recovery redo/write cache is redirected to NVMe-backed VMware datastores

The important point: normal backup and full restore performance to/from object storage looks fine over exactly the same path. The issue seems specific to live Instant Recovery, where the VM is running directly from backup data through the vPower NFS / Instant Recovery path.

Observed behavior:
After the v13 upgrade, SureBackup jobs started failing because VMs boot very slowly and Windows services / VMware Tools do not become ready in time.

For Domain Controllers we see NTDS-related warnings such as “NTDS write not started yet,” but this is not limited to DCs. Non-DC Windows servers also show service startup timeout behavior. So this looks like a general slow boot / slow disk access problem, not an AD-specific issue.

Manual Instant Recovery comparison:
We ran controlled manual Instant Recovery tests outside the full SureBackup job using two affected VMs.

The Instant Recovery setup itself completed successfully. vPower NFS check, publishing backup files, change storage preparation, VM registration, and snapshot creation all completed. Setup took roughly 4 minutes.

After powering on the instant-recovered VMs from vCenter:
- vm1 reached the Windows logon screen after ~6–7 minutes, but VMware Tools had still not started / no heartbeat
- vm2 reached the Windows logon screen after ~15 minutes, but VMware Tools had still not finished starting / no heartbeat

We could not log into the guest at that stage except with built-in credentials, so we are not treating this as a guest OS investigation yet. From the outside, this reproduces the SureBackup stabilization problem.

Full restore comparison:
We then restored the same VMs using Entire VM Restore from the same S3-compatible object storage to the same VMware environment and the same NVMe-backed datastore used for the IR write cache.

Note that this was done during a high backup-load window.

Restore performance was acceptable:
vm1:
- Restore size shown by Veeam: 110 GB
- Processing time: 0:06:07
- Disk restore rates: 311 MB/s and 384 MB/s over NBD
- Restore completed successfully

vm2:
- Restore size shown by Veeam: 105 GB
- Processing time: 0:05:32
- Disk restore rates: 292 MB/s and 368 MB/s over NBD
- Restore completed successfully
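
As a rough cross-check of the figures above, dividing the restore size by the processing time gives an effective end-to-end rate. This is only a sketch: it assumes the "GB" shown by Veeam means 10^9 bytes and that the processing time covers the whole restore including setup overhead, so the result will not match the per-disk rates exactly. The function name is made up for illustration.

```python
def effective_rate_mb_s(size_gb: float, duration: str) -> float:
    """Return size / duration in MB/s (decimal units); duration is H:MM:SS."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    # Assumption: 1 GB = 1000 MB (decimal), which may not be how Veeam reports it.
    return size_gb * 1000 / total_seconds

# Restore size and processing time as reported in the post.
restores = {"vm1": (110, "0:06:07"), "vm2": (105, "0:05:32")}
for name, (size_gb, duration) in restores.items():
    print(f"{name}: ~{effective_rate_mb_s(size_gb, duration):.0f} MB/s end-to-end")
```

Under those assumptions both restores land at roughly 300 MB/s end-to-end, which is in the same range as the reported per-disk rates.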

After full restore to NVMe:
- VM booted in ~15–20 seconds
- VMware Tools reached full running status and reported IP in ~30 seconds
- AD/SureBackup recovery of the fully restored machines, including reboots, completed end-to-end in ~3–4 minutes

That seems to rule out the guest OS, VMware Tools installation, ESXi host, and NVMe datastore as the root cause.

Additional test:
We also backed up VMs that had been successfully instant-booted through SureBackup and were running from the Instant Recovery path.

That backup completed, but Veeam reported:
Load: Source 99% > Proxy 1% > Network 0% > Target 0%
Primary bottleneck: Source

So the disk source path of the instant-booted VMs appears to be the bottleneck.
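
For anyone collecting these load lines from several job sessions, a minimal sketch for turning one into numbers is shown below. It assumes the line has exactly the shape quoted above; the helper name `parse_load` is made up for illustration and is not part of any Veeam tooling.

```python
import re

def parse_load(line: str) -> dict[str, int]:
    """Parse 'Load: Source 99% > Proxy 1% > ...' into {'Source': 99, ...}."""
    # Each stage is a word followed by whitespace and an integer percentage.
    return {name: int(pct) for name, pct in re.findall(r"(\w+)\s+(\d+)%", line)}

line = "Load: Source 99% > Proxy 1% > Network 0% > Target 0%"
load = parse_load(line)
bottleneck = max(load, key=load.get)  # stage with the highest busy percentage
print(load, "->", bottleneck)
```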

vSphere metrics during Instant Recovery boot:
During manual Instant Recovery boot, PowerCLI/vSphere metrics showed very high virtual disk read latency and very low throughput.

For the two instant-booted VMs:
vm1:
- Average virtual disk read latency: ~142 ms
- P95 read latency: ~506 ms
- Max read latency: ~546 ms
- Average read throughput: ~1.47 MiB/s
- Average CPU usage: ~5.3%

vm2:
- Average virtual disk read latency: ~145 ms
- P95 read latency: ~501 ms
- Max read latency: ~547 ms
- Average read throughput: ~1.44 MiB/s
- Average CPU usage: ~3.4%
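
One way to read latency and throughput together is Little's law: average data in flight is roughly throughput multiplied by latency. The sketch below applies it to the averages above. It assumes the average read latency is representative of the reads that make up the average throughput, which these metrics do not guarantee, so treat the result as an order-of-magnitude hint only.

```python
def bytes_in_flight_kib(throughput_mib_s: float, latency_ms: float) -> float:
    """Little's law: data in flight = throughput x latency, returned in KiB."""
    return throughput_mib_s * 1024 * (latency_ms / 1000)

# (average read throughput in MiB/s, average read latency in ms) from the post.
observed = {"vm1": (1.47, 142), "vm2": (1.44, 145)}
for name, (throughput, latency) in observed.items():
    kib = bytes_in_flight_kib(throughput, latency)
    print(f"{name}: ~{kib:.0f} KiB in flight on average")
```

Both VMs come out at roughly 214 KiB outstanding on average. That would be consistent with very little read data being in flight at any moment, but on its own it does not show where in the path the reads are being serialized.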

Host-level metrics did not indicate CPU pressure, CPU ready pressure, host network saturation, or physical storage adapter/path latency. Host CPU readiness was near zero, and the host was not under meaningful load.

So from our side the pattern looks like this:

- Entire VM Restore from object storage: fast enough
- Boot from restored NVMe VMDK: fast
- VMware Tools from restored VM: fast
- AD recovery from fully restored VM: 3–4 minutes
- Instant Recovery live boot from the same restore point: slow
- vSphere sees high read latency and very low read throughput on the instant-booted VM disks
- Backup of instant-booted VMs reports Source 99% bottleneck

This seems to isolate the problem to the live Instant Recovery / vPower NFS read path rather than general object storage throughput, host capacity, target datastore, or guest OS.

vPower NFS observations:
Looking at Svc.VeeamNFS.log on one of the mount/NFS servers, we see repeated sequences like:

nfstcps | ERR | Error occurred while trying to read. Code: system:10054
nfsxdr | WARN | Failed to retrieve next request from queue

These appear around vPower NFS activity and share handling. The sessions appear to come from IPv6 link-local addresses such as fe80::... with ports 300/301/302.
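
For context, Windows socket error 10054 is WSAECONNRESET (connection reset by peer). To get a feel for how often the pair occurs, a minimal sketch for counting it is shown below. It matches on substrings only, because the excerpt above is shortened and real log lines presumably carry more fields such as timestamps; the function name is made up for illustration.

```python
def count_reset_sequences(lines):
    """Count 10054 read errors immediately followed by the queue warning."""
    count = 0
    previous_was_reset = False
    for line in lines:
        if "system:10054" in line:
            previous_was_reset = True
            continue
        if previous_was_reset and "Failed to retrieve next request from queue" in line:
            count += 1
        previous_was_reset = False
    return count

# Sample lines copied from the excerpt above, repeated twice.
sample = [
    "nfstcps | ERR | Error occurred while trying to read. Code: system:10054",
    "nfsxdr | WARN | Failed to retrieve next request from queue",
    "nfstcps | ERR | Error occurred while trying to read. Code: system:10054",
    "nfsxdr | WARN | Failed to retrieve next request from queue",
]
print(count_reset_sequences(sample))
```

To run it against a real file, pass an open file object instead of the sample list, for example `open(path, errors="replace")` with the path to Svc.VeeamNFS.log on the mount server.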

We also noticed vPower NFS settings such as:
- AsyncTcp : false
- BackupMountOperationsTimeoutMs : 2000
- BackupMountConnectionsNumber : 1

I do not know whether these are expected/default values in v13, or whether they are relevant for Instant Recovery from object storage. I am not assuming they are wrong; I am asking because they look like the kind of settings that could matter if vPower NFS is serving multiple live-booting VMs with high random-read pressure.

Questions for the community/R&D:
1. Has anyone else seen slower Instant Recovery boot performance from S3-compatible object storage after upgrading to VBR v13?
2. For those using on-prem S3-compatible storage, how does Instant Recovery boot performance compare between v12 and v13?
3. Are there known v13 changes in Instant Recovery read behavior, object storage read concurrency, prefetching, caching, or vPower NFS behavior?
4. Which logs are best for debugging the live read path?
- We have Svc.VeeamNFS.log.
- We are also collecting Agent.Mount.Client.log / Agent.Mount.Server.log.
- We are specifically looking for VfsStatistics / RC / Q / MQ counters and backend read timing.
5. Are the system:10054 / “Failed to retrieve next request from queue” sequences expected during vPower NFS mount/setup, or do they indicate ESXi/NFS client connection resets that could affect the live read path?
6. Is it expected for vPower NFS sessions to use IPv6 link-local addresses in this context, or should we try to force/validate IPv4-only access for the temporary VeeamBackup datastore?
7. Are there any Veeam-supported registry keys or advanced settings worth testing for:
- vPower NFS performance
- mount server queueing
- Instant Recovery cache behavior
- object-storage read concurrency
- read-ahead / prefetch behavior
- NFS TCP behavior
- S3-compatible repository Instant Recovery performance
8. Are settings like AsyncTcp or BackupMountConnectionsNumber relevant/tunable for this use case, or should we leave those alone unless Support/R&D explicitly directs otherwise?
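
Related to question 6, a small sketch for classifying peer addresses pulled from the NFS log, using Python's standard `ipaddress` module. The addresses in the example are placeholders from documentation ranges, not values from our logs, and the function name is made up for illustration.

```python
import ipaddress

def describe_peer(address: str) -> str:
    """Classify a peer address as IPv4, IPv6 link-local, or other IPv6."""
    # Strip a zone index such as '%12', which Windows appends to link-local addresses.
    ip = ipaddress.ip_address(address.split("%", 1)[0])
    if ip.version == 4:
        return "IPv4"
    return "IPv6 link-local" if ip.is_link_local else "IPv6"

# Placeholder addresses for illustration only.
for peer in ["fe80::1%12", "192.0.2.10", "2001:db8::5"]:
    print(peer, "->", describe_peer(peer))
```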

We have an active support case open: #08108902.

I am mainly looking for field experience and R&D guidance on where to focus next. Full restore works; boot from restored NVMe works; the problem is live Instant Recovery boot from object-storage-backed restore points through vPower NFS.

Oli
haslund
Veeam Software
Posts: 913
Liked: 167 times
Joined: Feb 16, 2012 7:35 am
Full Name: Rasmus Haslund
Location: Denmark
Contact:

Re: VBR v13 slow Instant Recovery boot from S3-compatible object storage

Post by haslund »

Hey Oli,

Can you confirm whether this is a Backup Server running Windows or Linux?
If Windows, can you check if the following key exists and if yes, which value it has? Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Veeam\Veeam NFS\PerfLogPath
Rasmus Haslund | Twitter: @haslund | Blog: https://rasmushaslund.com

Re: VBR v13 slow Instant Recovery boot from S3-compatible object storage

Post by olafurh »

Hi Rasmus!

All servers involved here are Windows-based.

I checked that registry path in our environment; the key/value does not exist on the servers we checked.

Oli

Re: VBR v13 slow Instant Recovery boot from S3-compatible object storage

Post by haslund » 1 person likes this post

Hi Oli,

In this case, please keep working the support case with the customer support team.
Rasmus Haslund | Twitter: @haslund | Blog: https://rasmushaslund.com
mv@cloudio.dk
Service Provider
Posts: 13
Liked: 1 time
Joined: Sep 06, 2019 11:30 am
Full Name: Martin Veng
Location: Denmark
Contact:

Re: VBR v13 slow Instant Recovery boot from S3-compatible object storage

Post by mv@cloudio.dk »

Hi Oli,

Confirming the same behavior: a significant S3 SureBackup regression after the v13 upgrade.

We see it on multiple environments after the upgrade.
Windows VBR
S3 (VAST) over LAN.

Have you had any feedback from Veeam support yet? I was also planning to open a similar support case, and I will do that.

Re: VBR v13 slow Instant Recovery boot from S3-compatible object storage

Post by haslund »

@mv@cloudio.dk Please do go ahead and open a support case if you are experiencing the same situation.
Rasmus Haslund | Twitter: @haslund | Blog: https://rasmushaslund.com

Re: VBR v13 slow Instant Recovery boot from S3-compatible object storage

Post by olafurh » 1 person likes this post

I recommend that you open a support case as well.

I already have an ongoing case with Veeam, but so far there has been no conclusion. At this stage they are mostly gathering facts, logs, and performance data from our side.

Opening a separate case from your environment would help confirm that this is not isolated to our setup. It may also add more weight if Veeam sees the same behavior reported by multiple customers/environments.