Discussions specific to the Microsoft Hyper-V hypervisor
ChristineAlexa
Influencer
Posts: 18
Liked: 4 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

For the "Failed to map guest I/O buffer for write access with status 0xC0000044"
I see those occasionally as well... Double check your permissions on the VHDX
(see url) https://redmondmag.com/articles/2017/08 ... blems.aspx
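A quick sketch of what to check (the VM name and VHDX path here are placeholders): each VHDX needs an ACL entry for the VM's virtual machine account, which you can inspect and re-grant with icacls.

# Get the VM's ID (the ACL entry is keyed on it)
$vmId = (Get-VM -Name "MyVM").Id

# Inspect the current ACL on the VHDX
icacls "C:\VMs\MyVM\disk0.vhdx"

# Re-grant full access to the VM's virtual machine account if the entry is missing
icacls "C:\VMs\MyVM\disk0.vhdx" /grant "NT VIRTUAL MACHINE\$($vmId):(F)"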

johan.h
Veeam Software
Posts: 559
Liked: 115 times
Joined: Jun 05, 2013 9:45 am
Full Name: Johan Huttenga
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by johan.h »

Christine, I would venture to say that permissions would only affect whether a file is accessible at all. Permissions would not be a factor in situations with variable I/O performance (as in, the VHD is accessible, but reads and writes are performing poorly).

ChristineAlexa
Influencer
Posts: 18
Liked: 4 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

Johan.h,
You're right; I was thinking it could help with some odd inherited-permissions issue.

pterpumpkin
Enthusiast
Posts: 31
Liked: 3 times
Joined: Jun 14, 2016 9:36 am
Full Name: Pter Pumpkin
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by pterpumpkin »

If anyone has any Microsoft cases open, could you please PM the ticket numbers to me? The MS tech we're working with has acknowledged this thread and has asked for as many case numbers as possible.

ChristineAlexa
Influencer
Posts: 18
Liked: 4 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

No case number.

However, some more data points.

Some additional things that helped (a rough PowerShell sketch follows at the end of this post):
- Disable deduplication (helped some: the built-in jobs seemed not to be the trigger, but even with the built-in jobs disabled, a job on low background settings, e.g. 10% cores, 10% memory, StopIfBusy, low priority, could still hit the issue while deduplication ran)
- NTFS instead of ReFS (helped some)
- Lower the number of columns from 8 (the auto value) to 2 (equivalent of 20 x 1 TB NVMe / node) - this helped quite a bit
- Limit the number of virtual disks from 4 to 2 (on CSV, one assigned as a file server) - this helped moderately

The above has made the issue minimal until there is a patch.
It still occurs occasionally during planned node Pause/Drain operations
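For anyone wanting to script the deduplication and column changes, a rough sketch (the pool name, volume name, and size are placeholders; recreating a volume is destructive, so migrate data off first):

# Turn off deduplication on the volume (existing optimized files stay readable)
Disable-DedupVolume -Volume "C:\ClusterStorage\Volume1"

# Recreate the virtual disk pinned to 2 columns instead of the auto value of 8
New-Volume -StoragePoolFriendlyName "S2D on Cluster1" -FriendlyName "Volume1" `
    -FileSystem CSVFS_NTFS -Size 2TB -NumberOfColumns 2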

JSKAAHS
Lurker
Posts: 1
Liked: never
Joined: Oct 17, 2018 11:32 am
Full Name: Jesper Skovdal
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by JSKAAHS »

Hi, we have the same problem with disk latency, solved by live-migrating the VM once in a while.
There is a Microsoft support case number; I will post it later.
The environment is a Hyper-V 2019 cluster on UCS blades and Infinidat storage.
I hope that by posting here we can collect as many cases as possible.

Regards
Jesper

Nick-SAC
Enthusiast
Posts: 59
Liked: 7 times
Joined: Oct 27, 2017 5:42 pm
Full Name: Nick
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Nick-SAC »

Wishr,

I did open a case with Microsoft Support on this last October, but after 2 months of interacting with them (scores of conversations, reports, logs, remote sessions, etc.) they just stopped responding, without any explanation of why! The last I heard from them was in January, when they said they would 'get back to me', but they never did and never even replied to my follow-up messages!

FWIW, I also had a case open with Dell Support (who were pretty accommodating), but that was essentially put 'on hold' when they concluded that it wasn't a hardware-related issue.

Nick

Nick-SAC
Enthusiast
Posts: 59
Liked: 7 times
Joined: Oct 27, 2017 5:42 pm
Full Name: Nick
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Nick-SAC »

DG-MC,

You referenced "Version 10 VMs". Was that a typo? AFAIK, 9.1 is currently the highest VM configuration version (on Win10/Server 1903).

Thanks,
Nick
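For reference, a quick way to check what you have (standard Hyper-V module cmdlets):

# Configuration version of each VM on this host
Get-VM | Select-Object Name, Version

# Configuration versions this host can run
Get-VMHostSupportedVersion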

wishr
Veeam Software
Posts: 1818
Liked: 206 times
Joined: Aug 07, 2018 3:11 pm
Full Name: Fedor Maslov
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by wishr »

Hi Nick,

Sad to hear that.

I recently spoke to @pterpumpkin in private and provided him with the number of the ticket we had opened with Microsoft some time ago. Also, as far as I can tell from his post above, he has an ongoing conversation with a Microsoft engineer, and Microsoft needs as many support cases as possible to push a fix for this issue.

Thanks

pterpumpkin
Enthusiast
Posts: 31
Liked: 3 times
Joined: Jun 14, 2016 9:36 am
Full Name: Pter Pumpkin
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by pterpumpkin » 1 person likes this post

Thank you for all the PMs with case numbers! Please keep them coming. I have a total of 6 now (including my own).

bkowalczyk
Lurker
Posts: 1
Liked: 1 time
Joined: Sep 07, 2020 7:12 am
Full Name: Bartłomiej Kowalczyk
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by bkowalczyk » 1 person likes this post

Hello,

A very interesting topic.
We use a backup system other than Veeam, but the problem is the same.
I have registered a case with Microsoft (Service Request 120052425000074).
Unfortunately, it remains unanswered.

Eluich
Lurker
Posts: 2
Liked: 1 time
Joined: Jul 06, 2020 2:31 pm
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Eluich » 1 person likes this post

Hi

For information, I spoke with the person following our case at Microsoft, and he told me that he had grouped together 7 or 8 cases with the same problem.
I think they're trying to reproduce the problem in the lab.

Best Regards

ChristineAlexa
Influencer
Posts: 18
Liked: 4 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

[Possibly Solved, LONG] Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa » 4 people like this post

OK, I have another update. I believe I see what the combination of issues is on a Dell 12g/13g server; the causes are listed below.
- Note this is an ALL PCIe NVMe cluster, so results may vary. I still need to test a non-clustered machine, but I expect similar results
- I will be doing additional testing later on a non-clustered machine and will add another post with those results


Causes (all conditions together):
- Meltdown/Spectre microcode BIOS + OS patches (hinted at in some posts I found, but with no real solution short of leaving the server vulnerable)
- The new Hyper-V core scheduler vs. the classic scheduler (a consequence of the above)
- VMs spanning NUMA boundaries across processors/logical cores (this affected ReFS guests (integrity on or off) more than NTFS guests)


The following got me the performance I expected: guest I/O no more than 5-10% slower than host I/O, with host and guest latency tracking each other, instead of guest latency spiking and/or guests affecting each other.

How to diagnose the problem
To track down the error, you can set all the VMs to use any StorageQosPolicy (e.g. minimum IOPS 50, maximum IOPS 0, max bandwidth 0). This is purely for monitoring: it DOES NOT solve the problem before the changes detailed below.
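As an example of that setup, something like this should work (the policy name and VM name are placeholders; Storage QoS requires the disks to live on a CSV or Scale-Out File Server):

# Monitoring-only policy: a small minimum, no maximums enforced
$policy = New-StorageQosPolicy -Name "MonitorOnly" -MinimumIops 50 -MaximumIops 0 -MaximumIOBandwidth 0

# Attach the policy to every virtual disk of a VM
Get-VM -Name "MyVM" | Get-VMHardDiskDrive |
    Set-VMHardDiskDrive -QoSPolicyID $policy.PolicyId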
- Run this command to track the latency of the VMs and compare it with the host hardware I/O latency; the two should track each other (one VM shouldn't kill the latency of the others, causing the storage latency bugs in the original topic of this thread). Note that because of averaging, this command will lag a few seconds behind, depending on the number of I/Os issued, so expect it to 'track' host latency plus a few percent for overhead:

Get-StorageQoSFlow | Sort-Object InitiatorLatency | Select -Last 10

(The examples below are real numbers, on a CSVFS_NTFS host volume; the guest volume tested is NTFS. Both use default NTFS settings.)

I can't get into all the details of how to read the diskspd results here, but there are some good articles if you google it. The important highlights:

When running the test below with diskspd, I could see huge differences between host latency and guest latency. The command I used to put load on the host or guest uses diskspd from Microsoft (a free download), in a horrible worst case to simulate SQL:
- It leaves the hardware cache enabled and disables software caching (needed for SSD/NVMe to behave correctly)
- It writes a 20 GB file to iotest.dat, so make sure that path points to your test volume (the example uses the F: drive)
- You can read the full breakdown of the parameters with diskspd /?
- It runs for ~35 seconds (a 30-second test plus warm-up and cool-down)
- If you run the command a second time (output to a different file), the values should be similar between run 1 and run 2; if they aren't, check for other I/O on the volume and/or run a third time and average as appropriate (a small wrapper for this follows after the command)

diskspd -b8K -d30 -o4 -t8 -Su -r -w25 -L -Z1G -c20G F:\iotest.dat > testResults.txt
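To follow the run-twice-and-compare advice mechanically, a small wrapper like this works (assumes diskspd.exe is on the PATH and F: is your test volume):

# Run the identical worst-case test twice; compare the two output files afterwards
1..2 | ForEach-Object {
    & diskspd -b8K -d30 -o4 -t8 -Su -r -w25 -L -Z1G -c20G F:\iotest.dat > "testResults_run$_.txt"
    Start-Sleep -Seconds 10   # let the volume settle between runs
}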

Comparing the CPU usage between runs:
(CPU load, bad run; notice the low CPU load)

CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 11.82%| 0.73%| 11.09%| 88.18%
1| 11.56%| 0.63%| 10.94%| 88.44%
2| 11.35%| 0.94%| 10.42%| 88.65%
3| 11.09%| 0.73%| 10.36%| 88.91%
4| 11.25%| 0.47%| 10.78%| 88.75%
5| 11.25%| 0.63%| 10.63%| 88.75%
6| 10.99%| 0.57%| 10.42%| 89.01%
7| 10.47%| 0.47%| 10.00%| 89.53%
8| 7.19%| 0.94%| 6.25%| 92.81%
9| 6.36%| 0.94%| 5.42%| 93.64%
10| 5.63%| 0.68%| 4.95%| 94.37%
11| 5.05%| 0.05%| 5.00%| 94.95%
12| 5.21%| 0.31%| 4.90%| 94.79%
13| 7.61%| 1.36%| 6.26%| 92.39%
14| 5.84%| 0.52%| 5.32%| 94.16%
15| 5.99%| 0.94%| 5.05%| 94.01%
16| 6.36%| 1.09%| 5.27%| 93.64%
17| 5.73%| 0.89%| 4.84%| 94.27%
18| 5.16%| 0.26%| 4.90%| 94.84%
19| 5.01%| 0.42%| 4.59%| 94.99%
20| 5.52%| 0.52%| 5.00%| 94.48%
21| 5.06%| 0.31%| 4.74%| 94.94%
22| 4.79%| 0.36%| 4.43%| 95.21%
23| 6.26%| 0.26%| 6.00%| 93.74%
-------------------------------------------
avg.| 7.61%| 0.63%| 6.98%| 92.39%

(CPU load, good run; notice it actually makes some of the CPUs busy now)
CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 95.42%| 3.07%| 92.35%| 4.58%
1| 96.46%| 2.92%| 93.55%| 3.54%
2| 96.56%| 2.55%| 94.01%| 3.44%
3| 95.52%| 3.12%| 92.40%| 4.48%
4| 95.26%| 2.86%| 92.40%| 4.74%
5| 95.00%| 3.49%| 91.51%| 5.00%
6| 95.16%| 2.71%| 92.45%| 4.84%
7| 95.58%| 2.86%| 92.71%| 4.42%
8| 35.64%| 0.47%| 35.17%| 64.36%
9| 35.17%| 0.31%| 34.86%| 64.83%
10| 32.43%| 0.42%| 32.01%| 67.57%
11| 31.86%| 0.47%| 31.39%| 68.14%
12| 30.14%| 0.26%| 29.88%| 69.86%
13| 31.39%| 0.21%| 31.18%| 68.61%
14| 27.11%| 0.16%| 26.95%| 72.89%
15| 27.33%| 0.26%| 27.07%| 72.67%
16| 25.51%| 0.47%| 25.04%| 74.49%
17| 27.54%| 0.21%| 27.33%| 72.46%
18| 28.32%| 0.21%| 28.11%| 71.68%
19| 25.86%| 0.10%| 25.75%| 74.14%
20| 28.58%| 0.05%| 28.53%| 71.42%
21| 26.69%| 0.21%| 26.48%| 73.31%
22| 25.98%| 0.16%| 25.82%| 74.02%
23| 26.65%| 0.16%| 26.50%| 73.35%
-------------------------------------------
avg.| 51.30%| 1.15%| 50.14%| 48.70%


Comparing the summaries at the end of each run, you should see the values at the 95th/99th percentiles and below 'tracking' the latency reported by the Get-StorageQosFlow command above.


(summary example, bad VM guest, before the changes; see how the write latency goes horribly bad. During this time, Get-StorageQosFlow showed horrible latency, which affected other volumes, yet the host I/O latency (in Windows Admin Center, for example) stayed low during the same period, proving the overhead was induced somewhere in the hypervisor)

total:
%-ile | Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
min | 0.033 | 0.466 | 0.033
25th | 0.137 | 83.024 | 0.196
50th | 0.331 | 91.370 | 0.498
75th | 0.606 | 100.709 | 4.642
90th | 0.962 | 113.450 | 94.638
95th | 1.504 | 170.363 | 103.408
99th | 4.117 | 226.077 | 178.741
3-nines | 78.729 | 1024.335 | 405.798
4-nines | 309.404 | 1901.500 | 1512.306
5-nines | 311.810 | 2077.842 | 2063.667
6-nines | 312.010 | 2077.842 | 2077.842
7-nines | 312.010 | 2077.842 | 2077.842
8-nines | 312.010 | 2077.842 | 2077.842
9-nines | 312.010 | 2077.842 | 2077.842
max | 312.010 | 2077.842 | 2077.842


(summary example, good run; notice the latency)
total:
%-ile | Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
min | 0.124 | 0.375 | 0.124
25th | 1.418 | 1.601 | 1.462
50th | 2.038 | 2.232 | 2.088
75th | 2.944 | 3.139 | 2.996
90th | 4.298 | 4.549 | 4.361
95th | 5.492 | 5.911 | 5.598
99th | 9.346 | 9.946 | 9.506
3-nines | 23.907 | 26.661 | 24.948
4-nines | 69.912 | 96.569 | 80.284
5-nines | 99.576 | 100.424 | 99.909
6-nines | 101.410 | 106.454 | 106.129
7-nines | 106.453 | 106.454 | 106.454
8-nines | 106.453 | 106.454 | 106.454
9-nines | 106.453 | 106.454 | 106.454
max | 106.453 | 106.454 | 106.454



So now to the solution in my environment:

- Disable SMT/HyperThreading in the BIOS
- this effectively returns you to classic-scheduler behavior
- ensure all your VMs get the command below, setting HwThreadCountPerCore to 0 on a Windows Server 2019 host and 1 on a Windows Server 2016 host, where <VMName> is the VM's name (mine was previously set to 0, which means "follow the host SMT setting"); see https://docs.microsoft.com/en-us/window ... nistrator.
Set-VMProcessor -VMName <VMName> -HwThreadCountPerCore <0, 1, 2>
- Ensure no VM spans NUMA (per-VM logical processor count <= the physical core count of the smallest physical processor); a quick check follows below
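A sketch for that last check (it assumes SMT is already off, so logical processors equal physical cores):

# Smallest physical core count across all sockets
$minCores = (Get-CimInstance Win32_Processor |
    Measure-Object -Property NumberOfCores -Minimum).Minimum

# Any VM whose vCPU count exceeds that would span a NUMA node
Get-VM | Get-VMProcessor |
    Where-Object { $_.Count -gt $minCores } |
    Select-Object VMName, Count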


*NOW*, CBT only occasionally causes some storage latency, for just the guest volume involved, and only if the guest volume is ReFS. With an NTFS guest partition this was not observed during the backup, even after large data changes (I added 1.2 TB of partitions to CBT). Subsequent CBT backups did not cause latency issues.


This fix also worked for a CSVFS_ReFS host (with/without integrity) with NTFS guests, and for ReFS guests (no integrity tested) as well (this needs more testing on my end).

Obviously CSVFS_ReFS is slower on my insane test (25-40% slower), but with no I/O latency spiking issues; it is just "not as fast" in absolute numbers.

I still have more testing to do, but I'm hoping the above helps M$ track this down, and helps others solve or work around the issue in their environments.
Thanks,
Christine

nmdange
Expert
Posts: 497
Liked: 127 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by nmdange »

If NUMA spanning is actually part of the issue, you can disable/block NUMA spanning at the host level by doing "Set-VMHost -NumaSpanningEnabled $false" in PowerShell. I've always done this on my Hyper-V servers to improve performance. It would be interesting to see what it looks like with hyperthreading enabled but NUMA spanning disabled.

ChristineAlexa
Influencer
Posts: 18
Liked: 4 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

Also, the more I look at this, I think we are all chasing two partially overlapping problems...
#1: the guest VM performance issue, which is what I've documented. This seems to be resolved for the guest VM, for many hours now; the steps to return to the classic scheduler, with SMT/HyperThreading disabled in the BIOS, solve the performance problem until #2 happens.
#2: the I/O scheduler just seeming to get confused. When #1 is resolved, the improved performance seems to reduce the frequency of #2 because guest I/Os complete more quickly, not because the cause of #2 has been resolved.

- NUMA spanning only seemed to upset CSVFS_ReFS and/or ReFS *directly*, and may be a red herring

- When on CSVFS_ReFS, the storage subsystem could bring itself to a near halt when migrating a 400 GB or larger ReFS guest with CBT (whether the guest was running or not), even after ensuring I had done a backup of the offline VM (so there shouldn't be any "changes" after that, right?!)
- When things slow down enough to produce the originally posted error ("Source: Microsoft-Windows-Hyper-V-StorageVSP"), I've seen it show up for target VMs irrespective of whether the path is DAS or CSV storage, and irrespective of clustering. Many times the path is that of a file that no longer exists on the volume, and this holds irrespective of whether I've issued Flush-Volume against the volume. I can also see (at least on a cluster) that the QoS flow is sometimes duplicated after moving a volume between storage, and the flow for the "old" location/path isn't dropped until you stop and restart the VM (even hours after the move completed); a quick check for this follows below
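A rough way to spot those duplicated/stale flows (a sketch):

# Flows grouped by initiator and file path; a count above 1 suggests a stale duplicate
Get-StorageQosFlow |
    Group-Object InitiatorName, FilePath |
    Where-Object Count -gt 1 |
    Select-Object Count, Name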

Some hypotheses:
I've also noticed that once you get the I/O "quiet" on the Hyper-V host/cluster *and* shut down most/all VMs, the storage subsystem catches back up; it seems to clear out whatever was hanging it, and it can stay that way even under load afterwards.
It is almost as if the I/O scheduler and/or CBT gets "confused" and thinks an "old" I/O hasn't completed, and that starts hanging subsequent I/Os dependent on that read/write (even though it actually completed).

So for now, the host CSV is CSVFS_NTFS, and all but one partition (due to timing; I will convert it tonight) are NTFS. There were no I/O issues during any of that, moving everything at ~1.2 GB/s or more with no delay. It was moved from CSVFS_ReFS, and the entire time the latency on the CSVFS_NTFS destination was lower than the latency on the CSVFS_ReFS source.

So after tonight, in any scenario where there is a guest VM, I am avoiding ReFS on the host and the guest. I will leave CBT on for a few days, and if there are any issues, I'll disable it completely to see whether that finishes eliminating them.
I will continue to use ReFS for my backup storage, as that seems to work without any issues as long as guest VMs are not on the partition.

I'll give some more reports back on this, but to summarize:
Problem 1 - a guest VM has poor I/O performance on a Windows Server 2019 host
- Turn off SMT/HyperThreading and let the classic scheduler work the way things used to; this suggests the Spectre/Meltdown patches aren't the problem by themselves
- Don't use ReFS/CSVFS_ReFS for the host or the guest (this overlaps with helping Problem 2)

Problem 2 - the originally reported ("Source: Microsoft-Windows-Hyper-V-StorageVSP") error, and I/Os hanging as a result
- Perform the steps for Problem 1 to improve performance, which reduces the occurrence
- TBD: after I run with this configuration for a few days, disable CBT completely in the backups

I think it's the combination of these overlapping problems that makes it so difficult to track down and reproduce Problem 2 with 100% reliability.

Again, I hope all of this helps us as a community diagnose the underlying bugs causing this; worst case, these are additional data points that may solve the problems for you.
Thanks,
Christine

ChristineAlexa
Influencer
Posts: 18
Liked: 4 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

OK, I've done more testing, now on NON-clustered servers.

All servers were Dell 11g and 12g machines.
All use DAS, against both SAS SFF HDDs (10K and 15K RPM) and SATA SSDs (prosumer and consumer grade) on hardware RAID (PERC H700p and H710p internal controllers, and PERC H800 external).
That eliminates Storage Spaces, S2D, and networking (40GbE) as factors in the performance.

This solution (disabling SMT/HyperThreading) has dramatically improved disk I/O for VMs in all scenarios I've tested.

Additionally (second test), converting the hosts' volumes back to NTFS had a minimal effect on performance, but enough to be worth converting them back.
I also did some NUMA testing and saw little/no change in performance (within the margin of error).

Can someone else (the original poster?) try turning off SMT/HyperThreading on their host to see if this improves their performance as well?
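To confirm from the OS that SMT is actually off after the BIOS change:

# If NumberOfLogicalProcessors equals NumberOfCores, SMT is disabled
Get-CimInstance Win32_Processor |
    Select-Object DeviceID, NumberOfCores, NumberOfLogicalProcessors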
Thanks,
Christine

gmsugaree
Lurker
Posts: 1
Liked: never
Joined: Apr 03, 2020 5:13 pm
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by gmsugaree »

pterpumpkin wrote: Sep 02, 2020 8:20 pm Thank you for all the PM's with case numbers! Please keep them coming. I have a total of 6 now (including my own).
Peter, can you please PM me and I'll reply with my similar case number? The forum is not allowing me to send PMs yet because I haven't participated in discussions. Thanks!

Gostev
SVP, Product Management
Posts: 26699
Liked: 4274 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Gostev »

Well, now you can ;)

ChristineAlexa
Influencer
Posts: 18
Liked: 4 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

nmdange wrote: Sep 09, 2020 2:33 am If NUMA spanning is actually part of the issue, you can disable/block NUMA spanning at the host level by doing "Set-VMHost -NumaSpanningEnabled $false" in PowerShell. I've always done this on my Hyper-V servers to improve performance. It would be interesting to see what it looks like with hyperthreading enabled but NUMA spanning disabled.
Just to directly follow up, NUMA was not part of the performance issue.

@nickthebennett
Lurker
Posts: 1
Liked: never
Joined: Sep 16, 2020 12:36 pm
Full Name: Nick Bennett
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by @nickthebennett »

Hi,

I'm seeing event ID 9 on a single VM in a customer environment. It's very intermittent in its occurrence: sometimes we run fine for days without an issue, and then we get it twice in 2 days. The period of time the issue lasts also varies greatly.

From reviewing this thread, I get the impression that disabling CBT in Veeam has no effect and that the issue lies within the Microsoft RCT driver on the Hyper-V hosts, something we can't actually disable; the workaround is to migrate the VM to another host, interrupting the RCT process that is causing the issue. I'll try this the next time it occurs.
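For what it's worth, that workaround can be scripted on a cluster (the VM and node names here are placeholders; requires the FailoverClusters module):

# Live-migrate the VM to another node, interrupting the in-flight RCT I/O
Move-ClusterVirtualMachineRole -Name "MyVM" -Node "HV-NODE2" -MigrationType Live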

Is anyone with MS tickets logged getting any sensible feedback?

Thanks

ChristineAlexa
Influencer
Posts: 18
Liked: 4 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

@nickthebennett
Look at my previous few replies for some details that may help (essentially, turning off SMT/HyperThreading/logical processors in the BIOS, and using NTFS instead of ReFS for both the host volume and the VM guest volume).

Above is also how I have been testing to show the large difference in performance (before and after). On all the servers I've tested (specs above), this solved the issue WITHOUT disabling CBT.

Let us know your machine specs and config, and whether your environment gets positive results when trying my suggestions.
Thanks,
Christine
