dasfliege
Service Provider
Posts: 330
Liked: 69 times
Joined: Nov 17, 2014 1:48 pm
Full Name: Florin
Location: Switzerland

Failover Cluster crashing during backup after 2025 update

Post by dasfliege »

We performed an in-place upgrade of our 6-node Hyper-V 2022 cluster to 2025. Everything worked fine until we raised the cluster functional level to 2025.
Since then, whenever we start the Veeam backups (on-host), some of the nodes immediately go to 99% CPU, stop responding, and lose several CSVs to timeouts.
We are already troubleshooting the issue with MS, but I wondered whether anyone else is facing the same problem, since it seems to be directly related to backup operations or other heavy-load operations on the hosts.
Veeam Case 07912704
david.domask
Veeam Software
Posts: 3155
Liked: 727 times
Joined: Jun 28, 2016 12:12 pm

Re: Failover Cluster crashing during backup after 2025 update

Post by david.domask »

Hi Florin,

Thank you for sharing the case number and sorry to hear about the difficulties.

I can see Support has already begun the investigation and requested additional information, which you have provided, so please continue working with them; they will be able to comment better after reviewing the debug logs. At first blush, I'm not aware of any issues related specifically to raising the cluster functional level.
David Domask | Product Management: Principal Analyst
Frosty
Expert
Posts: 212
Liked: 46 times
Joined: Dec 22, 2009 9:00 pm
Full Name: Stephen Frost

Re: Failover Cluster crashing during backup after 2025 update

Post by Frosty »

This probably doesn't help, but we run a 2-node Hyper-V cluster that was built fresh on Windows Server 2025 (not upgraded from 2022), and we have not noticed any problems with Veeam backups.
dasfliege

Re: Failover Cluster crashing during backup after 2025 update

Post by dasfliege »

Just wanted to give a quick overview of what we've found out so far, in case some others run into the same issues:

We are seeing severe instability in a 6-node Hyper-V cluster after upgrading the Cluster Functional Level from 2022 to 2025.

Environment:
- 6-node Hyper-V cluster, Windows Server 2025
- CSVs on shared SAN (no SAN / MPIO errors)
- Veeam Backup & Replication (on-host backups)
- Backups ran stable for a long time on FL 2022 with the same load

Symptoms:
- Immediately after backup jobs start (before significant data transfer):
  - One cluster node suddenly goes to ~99% CPU
  - Host becomes almost unresponsive
  - No single process shows constant high CPU, but the Failover Cluster Service and WMI-related activity spike intermittently
  - CSVs may later enter paused / redirected state (STATUS_IO_TIMEOUT)
- The affected node varies between runs (not host-specific)

Key Observation:
- On the affected node, we consistently see a very high number of VeeamHVWMIProxy processes (e.g. 20+), while healthy nodes show only 2–5 instances.
- During the incident, WMI becomes extremely slow or unresponsive on the affected node.
- Once WMI responsiveness recovers, the node stabilizes.
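
For anyone who wants to check for the same pattern, here is a minimal Python sketch that counts VeeamHVWMIProxy instances per node from process-name listings and flags nodes at or above a threshold. The node names, the sample listings, and the threshold of 20 are assumptions for illustration (the threshold simply mirrors the "20+" we observed); how the process lists are collected from each host is left out.

```python
# Hypothetical helper: count VeeamHVWMIProxy instances per node and flag outliers.
PROXY_NAME = "veeamhvwmiproxy"  # compared case-insensitively, with or without ".exe"


def normalize(name):
    """Lower-case a process name and drop a trailing .exe, if present."""
    name = name.strip().lower()
    return name[:-4] if name.endswith(".exe") else name


def count_proxies(process_names):
    """Return how many entries in the listing are VeeamHVWMIProxy processes."""
    return sum(1 for name in process_names if normalize(name) == PROXY_NAME)


def flag_nodes(processes_by_node, threshold=20):
    """Return (counts per node, sorted list of nodes at or above the threshold)."""
    counts = {node: count_proxies(names) for node, names in processes_by_node.items()}
    flagged = sorted(node for node, count in counts.items() if count >= threshold)
    return counts, flagged


if __name__ == "__main__":
    # Made-up sample data; real listings would come from each cluster node.
    sample = {
        "hv-node-1": ["VeeamHVWMIProxy.exe"] * 22 + ["clussvc.exe", "WmiPrvSE.exe"],
        "hv-node-2": ["VeeamHVWMIProxy.exe"] * 3 + ["clussvc.exe"],
    }
    counts, flagged = flag_nodes(sample)
    print(counts)
    print(flagged)
```

A high count is only an indicator of WMI congestion on that node, so the threshold should be tuned against what healthy nodes show in a given environment.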

Mitigations tested:
- Sophos XDR completely removed from all nodes
- Windows Defender exclusions for:
  - All Veeam folders & processes
  - C:\ClusterStorage\*
- Issue still reproducible

Notable behavior:
- Disabling Windows Defender on the affected node during the incident leads to:
  - CPU dropping
  - Gradual host recovery
  - Backups continuing without stopping

Microsoft feedback:
- MS confirmed that Cluster FL 2025 introduces stricter control-path and resiliency behavior (faster hang detection, CSV auto-pause).
- Workloads that were stable under FL 2022 may now trigger protective behavior under FL 2025 when backup/WMI activity and filter stacks are involved.
- No official regression article yet.
_tcpip_
Novice
Posts: 3
Liked: never
Joined: Sep 28, 2023 1:33 pm

Re: Failover Cluster crashing during backup after 2025 update

Post by _tcpip_ »

Hi,

Which file system are you using?

That sounds similar to the ReFS bug under 2025.

veeam-backup-replication-f2/server-2022 ... 96912.html

If it's ReFS, was the ReFS version upgraded as part of the OS upgrade?

Could it be?
dasfliege

Re: Failover Cluster crashing during backup after 2025 update

Post by dasfliege »

I've seen that thread as well, but I don't think it has anything to do with our case, since our problems appear on the Hyper-V hosts and not on the backup server.
On our Hyper-V nodes we are using NTFS for the boot partition and CSVFS for the CSVs.
_tcpip_

Re: Failover Cluster crashing during backup after 2025 update

Post by _tcpip_ »

I think CSVFS sits on top of an underlying file system (NTFS or ReFS).

If that is ReFS, it doesn't matter whether it's a backup server or a Hyper-V host, because the error relates to ReFS itself.

If it is based on NTFS, the bug is irrelevant.

An NTFS boot partition is fine. Boot partitions can't use ReFS yet; that should come in the next version.
dasfliege

Re: Failover Cluster crashing during backup after 2025 update

Post by dasfliege »

Underlying FS is NTFS in our case:


FriendlyName FileSystemType DriveType HealthStatus OperationalStatus SizeRemaining    Size
------------ -------------- --------- ------------ ----------------- -------------    ----
csv-xxxxxx   CSVFS_NTFS     Fixed     Healthy      OK                      2.04 TB 5.96 TB
_tcpip_

Re: Failover Cluster crashing during backup after 2025 update

Post by _tcpip_ »

OK, thanks for clarifying that.

So it has nothing to do with ReFS :-)
dasfliege

Re: Failover Cluster crashing during backup after 2025 update

Post by dasfliege »

We would like to share findings from this ongoing case with Veeam and MS Support and raise a feature request that we believe will become increasingly relevant with the adoption of Cluster Functional Level (FL) 2025.

Environment:
  • Windows Server 2025 Hyper-V cluster (Cluster FL 2025)
  • 6 nodes, ~800 VMs, 70+ tenants
  • Each tenant has its own CSV and its own backup job
  • VM placement is dynamic; no strict host-to-CSV or host-to-tenant affinity
  • All jobs start at the same time for operational simplicity
Important context:
The same hardware, SAN, network, and backup configuration ran stable for a long time under Cluster FL 2022. The described issues only started to appear after upgrading the cluster functional level to FL 2025.

Key findings (confirmed with Veeam and MS Support):
  • The observed instability is not caused by data transfer or VeeamFCT.
  • The pressure occurs during the initial orchestration phase (WMI enumeration, validation, and snapshot preparation).
  • Existing limits (task limits per host, concurrent snapshots per CSV, repository limits) primarily apply to the execution / data transfer phase and do not meaningfully limit the orchestration phase.
  • VeeamHVWMIProxy processes are not the root cause, but a symptom of WMI congestion when orchestration calls take longer or are retried.
  • In large and dynamic environments, parallel orchestration of many jobs starting at the same time can create a short but intense WMI burst.
  • With Cluster FL 2025, the cluster control path and hang detection logic are more aggressive, making this orchestration burst far more likely to result in CSV auto-pause or VM pause events than under FL 2022.
  • Currently, there is no supported Veeam setting to throttle or smooth only the orchestration phase itself.
Operational impact:
  • The only effective mitigation today is staggering job start times.
  • This does not scale well in MSP or large enterprise environments with many tenants, dynamic VM placement, and operational requirements to keep job management simple.
  • Task limits alone are insufficient, as they do not regulate pre-backup tasks.
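
To make the staggering workaround concrete, here is a minimal Python sketch that spreads job start times into batches. The job names, the batch size of 5, and the 10-minute interval are assumptions chosen for illustration, not values recommended by Veeam or MS; the resulting times would still have to be applied to the job schedules by hand or by whatever tooling is in use.

```python
from datetime import datetime, timedelta


def staggered_starts(job_names, first_start, batch_size, interval_minutes):
    """Assign each job a start time; jobs are released in batches of batch_size,
    with interval_minutes between consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    schedule = {}
    for index, job in enumerate(job_names):
        batch = index // batch_size  # 0 for the first batch, 1 for the second, ...
        schedule[job] = first_start + timedelta(minutes=batch * interval_minutes)
    return schedule


if __name__ == "__main__":
    # Hypothetical job names, one per tenant.
    jobs = [f"tenant-{n:02d}" for n in range(1, 71)]
    plan = staggered_starts(jobs, datetime(2025, 1, 1, 20, 0),
                            batch_size=5, interval_minutes=10)
    for job in (jobs[0], jobs[5], jobs[-1]):
        print(job, plan[job].strftime("%H:%M"))
```

With 70 jobs in batches of 5 at 10-minute intervals, the last batch starts 130 minutes after the first, which shows how quickly the backup window stretches as the tenant count grows.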
Why this matters going forward:
Cluster FL 2025 is the current and future functional level for Hyper-V clusters. As more customers upgrade to Windows Server 2025 and FL 2025, we expect this behavior to surface more frequently, especially in larger and more dynamic environments. What was previously a “near the edge” but stable workload under FL 2022 can now trigger protective cluster behavior.

Feature request / suggestion:
It would be highly beneficial if Veeam could extend effective load regulation to also cover pre-backup tasks such as WMI orchestration, VM / host enumeration, validation, and snapshot preparation. Possible approaches could include throttling or serializing orchestration calls per Hyper-V host, making orchestration respect existing task limits, or introducing a dedicated orchestration concurrency limit.

This would allow predictable and scalable load regulation on Hyper-V nodes, especially in large and highly dynamic environments, without requiring manual job staggering. We fully understand that the current behavior is by design, but given the increasing adoption of Cluster FL 2025, we believe this area deserves attention to avoid operational issues for customers at scale.
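
To illustrate what a dedicated orchestration concurrency limit per Hyper-V host could look like conceptually, here is a small Python sketch using one semaphore per host. It is not how Veeam works internally; the class name, the host names, and the simulated orchestration call are all hypothetical, and the sleep merely stands in for a WMI request.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class PerHostLimiter:
    """Allow at most max_per_host concurrent orchestration calls per host."""

    def __init__(self, max_per_host):
        self._max = max_per_host
        self._lock = threading.Lock()
        self._semaphores = {}
        self._active = defaultdict(int)
        self.peak = defaultdict(int)  # highest concurrency seen per host

    def _semaphore_for(self, host):
        # Create the host's semaphore lazily, guarded against races.
        with self._lock:
            if host not in self._semaphores:
                self._semaphores[host] = threading.BoundedSemaphore(self._max)
            return self._semaphores[host]

    def run(self, host, call):
        """Run call() once a slot for this host is free; other hosts are unaffected."""
        with self._semaphore_for(host):
            with self._lock:
                self._active[host] += 1
                self.peak[host] = max(self.peak[host], self._active[host])
            try:
                return call()
            finally:
                with self._lock:
                    self._active[host] -= 1


def simulate(hosts, jobs_per_host, max_per_host, call_seconds=0.01):
    """Start all jobs at once and return the limiter plus the call results."""
    limiter = PerHostLimiter(max_per_host)

    def fake_orchestration_call():
        time.sleep(call_seconds)  # stand-in for a WMI enumeration request
        return "ok"

    work = [host for host in hosts for _ in range(jobs_per_host)]
    with ThreadPoolExecutor(max_workers=len(work)) as pool:
        results = list(pool.map(
            lambda host: limiter.run(host, fake_orchestration_call), work))
    return limiter, results


if __name__ == "__main__":
    limiter, results = simulate(["hv-node-1", "hv-node-2"],
                                jobs_per_host=6, max_per_host=2)
    print(len(results), dict(limiter.peak))
```

Even though all simulated jobs start at the same moment, no host ever sees more than the configured number of concurrent calls, which is the smoothing effect we are asking for without having to stagger the job schedules manually.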