Windows Server 2019 Hyper-V VM I/O Performance Problem

mkeating44 · Post by **mkeating44** » Jun 13, 2024 3:12 am this post

josues wrote: ↑Jun 06, 2024 6:58 pm A possible solution that has been helping us a lot is converting the VHDX disks from dynamic to fixed. Surprisingly, the performance of the VMs has improved significantly.

Unfortunately the CSVs that have the issue in our environment are caused by disks that are fixed.

gman42 · Post by **gman42** » Jun 13, 2024 3:48 pm this post

I'm pretty certain that recommendation was from the days when there was a relatively huge performance penalty to allocating more space to a dynamic VHDX, which has been reduced to virtually zero now.

Post by **SnakeSK** » Jun 13, 2024 8:07 pm this post

For spinning rust that applies, for ssds, not so much

svensurfs · Jun 14, 2024 12:15 pm

Hi all, just want to react to this topic, to help someone out. We don't have veeam but had the same problem with event id 9 and these problems happened all day long.

For use the problem was crc errors on the fiber switch. So if you have fiber switches to the storage, check for crc errors (porterrshow) and replace the cables and sfp if needed. On the storage no problems were visible in the logs.

In our situation we have 5 node cluster. If one off the nodes has crc errors the whole cluster is affected by this. So we have event id 9 on all the nodes. I shutted down the node with crc errors and all the event id 9 are gone on all our nodes.

Hope this helps for someone!

GabesVirtualWorld · Jun 19, 2024 6:24 am

Anyone with the CBT bug already testing with CSV volumes on Windows Server 2025 and see a different behavior?

@stephc_msft maybe you got some info to share?

Jun 20, 2024 10:20 am

Internal investigations are continuing.
Some buffered access to the affected live data vhdx (eg as part of the veeam backup) that is affecting the internal OS cache on the host and that isnt getting cleaned up fully in some situations seems to be related to the issue.
Live Migration of the VM re-establishes a 'fresh/clean/fast'' connection to the vhdx on the new host.
(Note vhdx io from the host normally runs unbuffered, and can be adversely affected if there is other buffered access or 'left-over' buffered access present)

Re: The new Windows Server 2025
There are no significant changes in this area, at present, that I know of, and so its highly likely that WS2025 would also show the issue. Although timing differences etc might impact whether it is affected. If someone gets the chance to try, it would be useful.

I might also be looking for people who can consistently repro and who are agreeable to test a new private (ws2019 and ws2022) in due course
-- private message me if interested.

Jun 21, 2024 9:24 am

Correction/update:
Re: The new Windows Server 2025
There might be a change already in there that should help (in the 2406 June version).
So if someone gets the chance to try, it would be useful.

Knuppel · Jun 23, 2024 5:01 pm

Please keep in mind that upgrading to 2025 is not possible for everyone, and this bug still needs fixing.

joelg · Post by **joelg** » Jun 25, 2024 4:23 pm this post

Followup - I'm mostly done converting all my dynamic disks to fixed. I totalled up the Event ID 9 errors and I'm not seeing any reduction in their frequency. So this has not really been the solution for us.

Joel

GabesVirtualWorld · Jul 04, 2024 6:54 pm

@stephc_msft we have an environment on which we quite often have the issue. Depending on what you'll be asking, we might be able to show you the issue

I will send you a PM

mirtelo · Jul 09, 2024 7:43 am

@stephc_msft same situation here. we have about 700 Hyper-V Hosts. Running in this issue every day.

PGPulp · Post by **PGPulp** » Jul 10, 2024 9:29 am this post

Hi,
recently discovering this issue ; what i can tell is that we are suffering "from time to time" EventID8
Failed to map guest I/O buffer for write access with status 0xC0000044 in Hyper-V\StorageVSP
Using Hyper-V 2019 under HPe SYNERGY combined with HPe SAN & 3PARP8K storage

Already doing lots of troubleshooting on our side ; while not being able to reproduce the issue.
HW / drivers / firmwares are for sure out-of-the cause.
We are not using VEEAM but CommVault as backup solution ; while CV is also out of the game as recent new VM stagings not yet backuped are suffering from the issue.
Plus, these recent VM stagings are not yet used in production ; so not heavy IO.loaded

Anyway, could partipate in sharing more informations if could help in resolution.

Rgds,
Laurent.

Post by **RobMiller86** » Jul 11, 2024 1:40 pm this post

We have 4 different clusters. 3 of them are running Server 2022, and 1 is running Server 2019. All are connected to Nimbles over FC. It seems that the 2022 clusters can go a month or more with a completely clean StorageVSP log. Occasionally we will see some Event ID 9 warnings on them, but few and far between.

Our last 2019 cluster is loaded with Event ID 8 errors and Event ID 9 warnings all day every day.

We are experiencing random latency issues that I've felt for a while were due to array IO overload as most of our arrays are not all flash yet. But now I'm wondering if it's not a combo of both of these issues.

I've read here that people are still experiencing this problem with server 2022. But looking in our logs, it appears that 2022 hosts have much cleaner storage logs than our 2019 hosts. Is this problem improved at all by going to 2022? Wondering if it's worth the effort to push this last cluster to 2022, or just wait for 2025.

eponerine · Jul 16, 2024 11:59 am

@stephc_msft -

First, I want to thank you for all the continued effort to support fixing this problem. I'm sure everyone here agrees that you've done an amazing job triaging and relaying info between customers and PG. Truly appreciated.

Before I commit to "testing" WS2025 for you, I need some basic info:

Do we have any info on how what conditions are required to trigger the repro? (VHDX size? Average IO size? RAM in system?) In other words, what does the Hyper-V and backup scenario need to be to get this to happen?

When would VHDX IO with host ever NOT be unbuffered? - this was surprisng to read. I assumed all VHDX IO was by default unbuffered. What is causing it to be buffered instead? Is this because the RO handle to the VHDX during backup has to pass thru a 3rd party filter/driver? Is it because you can instruct VHDMP how you want the VHDX opened (buff vs unbuff)? Like... what is causing the buffered IO to occur and where in the filter stack is that happening?

joelg · Post by **joelg** » Jul 16, 2024 6:43 pm this post

joelg wrote: ↑Jun 25, 2024 4:23 pm Followup - I'm mostly done converting all my dynamic disks to fixed. I totalled up the Event ID 9 errors and I'm not seeing any reduction in their frequency. So this has not really been the solution for us.

Further followup, I think the converting from Dynamic to Fixed has had a positive improvement on our issue. It seems we're having the issue at a very targeted time now (for about 30min per day). I'm not sure what the cause is at that particular time, but Stephen and another MS rep have been in contact and reviewed logs that suggest it is a disk latency issue during that time.

Joel

Aug 01, 2024 12:51 pm

"Do we have any info on how what conditions are required to trigger the repro?" No, and thats one of the biggest challenges in investigating this.
Only some VM's with particular IO patterns seem to get affected, and the exact trigger is unclear (although a veeam backup seems to be able to trigger it for people here). There could conceivably be other ways to trigger it, but the various activity would have to occur in a particular order and at particular times to trigger it?

"When would VHDX IO with host ever NOT be unbuffered?" Exactly. The Hyper-V host, on behalf of the running VM, is doing it unbuffered, and continues to do it unbuffered. But there is some indication that something else (veeam backup of the vhdx maybe) was accessing it buffered [which is ok-ish for the period of the backup] but that somehow, sometimes, it then stays stuck in some state where it thinks there is still some buffered activity potentially occurring, which in turn has a detrimental effect on the ongoing unbuffered IO.

phil_h · Post by **phil_h** » Aug 01, 2024 10:19 pm this post

joelg wrote: ↑Jul 16, 2024 6:43 pm Further followup, I think the converting from Dynamic to Fixed has had a positive improvement on our issue. It seems we're having the issue at a very targeted time now (for about 30min per day). I'm not sure what the cause is at that particular time, but Stephen and another MS rep have been in contact and reviewed logs that suggest it is a disk latency issue during that time.

Sadly this hasn't been the case in our environment (Windows2019 Hyper-V/FC direct attached storage). I converted one of our troublesome virtual servers (it would trigger 153 randomly a few times a day, then go silent for a week) from a dynamic disk to a fixed disk. It was quiet for 1 day before it started alerting again.

The interesting thing from my point of view is that the event only seem to trigger on the C: drive (The IO operation at logical block address 0x11e32aeb for Disk 0 (PDO name: \Device\00000035) was retried). On all the VMs that are affected it is just the C: drive (I have a database server with Tb of disk - no issue, but it's C: drive causes the alert).

joelg · Sep 10, 2024 1:42 pm

Just for clarification, we converted all ~150 dynamic disks to fixed, not just one for testing. I beleive the overhead of that many dynamic disks were causing the bulk of our issues.

After converting all the disks, our errors dropped dramatically, and I was able to narrow down one VM that was still causing errors during a specific timeframe and moved that VM to local storage on one of our hosts.

We've gone from tens of thousands of errors daily to most days not having any.

Joel

WingDog · Post by **WingDog** » Sep 20, 2024 1:20 pm this post

@stephc_msft
hi there!
any updates about WS2025 or backport fix for WS2019?

Oct 01, 2024 7:54 am

Investigations and work on a likely fix are ongoing, as are internal discussions with Veeam. More news in due course.

Nick-SAC · Oct 01, 2024 1:45 pm

For the record: It has now been 5 Years – FIVE FULL YEARS – since I first reported this BUG... and the best we’ve gotten is, “Investigations and work on a likely fix are ongoing, as are internal discussions with Veeam. More news in due course.”

joloo · Post by **joloo** » Oct 01, 2024 3:02 pm this post

Half way there!

Oct 03, 2024 11:11 am

Good news, looks like Microsoft has finally found the root cause guys... from the description of the bug they found in the file system components, it seems they have finally nailed it. We will help them to test a fix for Server 2025 and if it addresses the issue then as I understand they plan to backport it.

Post by **SnakeSK** » Oct 04, 2024 11:28 pm this post

5 years...wow

dm_ch · Post by **dm_ch** » Oct 08, 2024 7:30 am this post

stephc_msft wrote: ↑Oct 01, 2024 7:54 am Investigations and work on a likely fix are ongoing, as are internal discussions with Veeam. More news in due course.

Good news, but I think we have been at this point several times. Is there any timeline when the fix is maybe ready? Or are we still at the point "we think they found something, which is maybe fix the issue"

BR DM

Oct 08, 2024 9:08 am

It seems legit this time. As far as I understood, they found a bug in file system components causing VHDX to get "stuck" in buffered mode once it is opened for read in buffered mode (which is what backup applications typically do). So all processes including Hyper-V process were "forced" to use buffered mode due to this bug. But for some guest workloads, this mode reduces IO performance in a few times.

And you may recall some pages ago there were already talks about creating a mod to have Veeam operate in unbuffered more for testing purposes, so connecting these dots it seems they already had a good idea of what might be causing the performance degradation following backup.

Post by **johan.h** » Oct 08, 2024 1:03 pm this post

As soon as we're done validating the fix we'll post the outcome here, feeling good about it. As for a repro case, this was never simple - as evidenced by the length of this thread.

If you want access to a Server 2022 test fix, I believe we've been given an older one to review, and if you have a test environment available, feel free to DM me for more information. Actual backport of course depends on testing success.

phil_h · Post by **phil_h** » Oct 11, 2024 2:59 am this post

I do hope that the backport will include Windows server 2019 as I have many customers that are still on this version (and haven't got the budget yet to uplift)

briani · Oct 14, 2024 2:26 pm

Could someone please explain how to actually open a Hyper-V case with MS and have them actively work on it for an extended period like this? Usually, we can't ever get past the third-party support partners and reach actual Microsoft teams and employees.

cptkommin · Oct 14, 2024 9:40 pm

We are experiencing an issue with our WS2022 HCI S2D Cluster where our storage/CSV performance degrades rapidly once a node is taken offline, ie for Windows Patching. We see massive (Seconds latency) on our CSVs, causing the hosted VMs to start to fail, and causing data corruption. We also see random high write latencies during the day. In Event Viewer, the only event ID we find that claims anything to high latency is an EventID 9 under Hyper-V-StorageVSP Channel:

"An I/O request for device 'C:\ClusterStorage\3WM_CSV03\Virtual Machine Name\Virtual Hard Disks\Virtual Machine Name - C_Drive.vhdx' took 24040 milliseconds to complete. Operation code = SYNCHRONIZE CACHE, Data transfer length = 0, Status = SRB_STATUS_SUCCESS."

Our Setup:
5x Dell R740xd

Each node has the following:

1.5TB DDR4 3200 RAM

2 x 6.4TB MU Dell Enterprise NVMe (Samsung)

10x 8TB SAS 12Gps 7.2k Dell Enterprise spindle disks

2x Intel Gold 5220R CPUs

2x Intel 25G 2P E810-XXV NICs

All 5 nodes are set up in an S2D cluster. The NVMe serves as the cache and the spindles as the storage. The cluster is set with an in-memory cache value of 24GB per server. Storage repair speed is set to high, dropping this to the recommended medium speed does not make any difference. Cache mode is set to Read/Write for both SSD and HDD in the config. The cache page size is 16kb and the cache metadata reserve is 32GB. On a Hyper-V level, we have enabled NUMA Spanning. We have five 3-way mirror CSVs in the storage pool. Networking consists of a SET Switch, 5 virtual networks (1x management, 1x backups, 4xRMDA ). We have 2x Dell S5248F switches servicing the core/physical network. Adaptors are set up with Jumbo packets enabled, VMQ and VRSS, iWARP, and no SR-IOV.

Firmware/Drivers are mostly up to date, but this has not proven to be of any help. In fact, we are running v22.0.0 (Dell versioning) firmware/drivers for the NICs as it has proven to be stable, ie not causing the host OS to BSOD.

We were running Server 2019 when we first encountered this issue. After months of back and forth with MS Premier Support, the solution was to upgrade to Server 2022 due to the improvements in S2D/ReFS. We complied and started the upgrade process. Initially, two nodes were removed and reloaded with WS22, and everything was configured as stated above, with one exception: the CSVs were a 2-way mirror since only 2 nodes were present in the cluster. We started migrating VMs, added the 3rd node, and created the first 3-way mirror CSV; all is still well and dandy. We continued with this until we had a full 5 node '22 HCI S2D Cluster, and then give or take 3-4 months in, we started experiencing the exact same issue. I must add, not as high latencies as in WS19, but they are still high enough to cause a VM to crash and corrupt data. And if staying in maintenance long enough, it will bring down the cluster.

We have another MS Premier Support ticket open, and as you can imagine, they have no clue what the issue could be. We have uploaded probably close to 1TB worth of TSS logs/Cluster Event logs etc, and still no step closer to a cause or some sort of solution. Dell Support is of no help since none of the Dell Support TSR logs show anything. No failed hardware, no warnings, or errors, ie a failed drive, etc.

This effectively prohibits us from doing any "live" maintenance as anything could potentially cause high IO/latency, and when we want to schedule maintenance for patching, we need to shut down all clustered services, which is a nightmare to try and schedule with clients every month.

Then, to add fuel to the fire, it seems as if our average latency, outside of maintenance, is increasing over time, causing the overall performance of the cluster and VMs to slow down.

Yes, we have Veeam, no, the issue isn't related to backup times, yes we have CBT enabled on the jobs, yes we have dynamic VHDX files, and yes we have hyperthreading enabled.

What we can see, however, is there are massive spikes in Guest CPU % Total Run Time (Tracking via perfmon on the host) for the host CPU. We can see the logical cores light up like it's xmas or something.

I am definitely very interested in seeing what the solution for this is. From what I can see from recent posts, we might be getting an answer soon!? If anyone has any tips or tricks in the meantime I can use to improve overall performance/stability, please share. From what I can gather, workarounds include, Converting to fixed VHDX, Disabling CBT in the jobs (additionally the host/volume object), Disabling hyperthreading in BIOS, and if I understand correctly, live migrating the resources around seem to make a difference as well.

Thank You!

R&D Forums

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Who is online