Host-based backup of Microsoft Hyper-V VMs.
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

For the "Failed to map guest I/O buffer for write access with status 0xC0000044"
I see those occasionally as well... Double check your permissions on the VHDX
(see url) https://redmondmag.com/articles/2017/08 ... blems.aspx
johan.h
Veeam Software
Posts: 723
Liked: 185 times
Joined: Jun 05, 2013 9:45 am
Full Name: Johan Huttenga
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by johan.h »

Christine, I would venture to say that permissions would only affect whether a file is accessible at all. Permissions would not be a factor in situations with variable I/O performance (i.e. the VHD is accessible, but reads and writes perform poorly).
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

Johan.h,
You are right; I was thinking it could help with some odd inherited-permissions issue.
pterpumpkin
Enthusiast
Posts: 36
Liked: 4 times
Joined: Jun 14, 2016 9:36 am
Full Name: Pter Pumpkin
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by pterpumpkin »

If anyone has any Microsoft cases open, could you please PM the ticket numbers to me? The MS tech we're working with has acknowledged this thread and has asked for as many case numbers as possible.
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

No case number.

However, here are some more data points.

Some additional things that helped (a rough PowerShell sketch for checking these follows at the end of this post):
- Disable deduplication (helped some; the built-in jobs did not seem to be the whole story, because even with the built-in jobs disabled, a job on low background settings (roughly 10% cores, 10% memory, StopIfBusy, Low priority) could still hit the issue while deduplication ran)
- NTFS instead of ReFS (helped some)
- Lower the number of columns from 8 (the auto value) to 2 (on the equivalent of 20 x 1 TB NVMe per node) - this helped quite a bit
- Limit the number of virtual disks from 4 to 2 (on the CSV, one assigned as a file server) - this helped moderately

The above has made the issue minimal until there is a patch.
It still occurs occasionally during planned node Pause/Drain operations.
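
For anyone who wants to check the same settings on their own hosts, here is a rough PowerShell sketch (my paraphrase, not the exact commands from my environment; 'Pool01', 'CSV01', the drive letter and the size are placeholders):

# Deduplication: see what schedules exist and what state each volume is in
Get-DedupSchedule | Format-List Name, Type, Enabled
Get-DedupStatus | Format-List Volume, OptimizedFilesCount, LastOptimizationTime
# Disable-DedupVolume -Volume 'D:'   # turns deduplication off for a volume (placeholder letter)

# Storage Spaces: check the column count of existing virtual disks (I lowered mine from 8 to 2)
Get-VirtualDisk | Format-Table FriendlyName, NumberOfColumns, ResiliencySettingName -AutoSize

# Creating a new virtual disk with an explicit column count (placeholder pool/name/resiliency/size)
New-VirtualDisk -StoragePoolFriendlyName 'Pool01' -FriendlyName 'CSV01' -ResiliencySettingName Mirror -NumberOfColumns 2 -Size 4TB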
JSKAAHS
Lurker
Posts: 1
Liked: never
Joined: Oct 17, 2018 11:32 am
Full Name: Jesper Skovdal
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by JSKAAHS »

Hi, we have the same problem with disk latency, worked around by live-migrating the VM once in a while.
There is a Microsoft support case number; I will post it later.
The environment is a Hyper-V 2019 cluster on UCS blades and Infinidat storage.
I hope that by posting here we can collect as many cases as possible.

Regards
Jesper
Nick-SAC
Enthusiast
Posts: 76
Liked: 16 times
Joined: Oct 27, 2017 5:42 pm
Full Name: Nick
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Nick-SAC »

Wishr,

I did open a case with Microsoft Support on this last October but after 2 months of interacting with them (scores of Conversations, Reports, Logs & Remote Sessions, etc.) they – without any explanation of why – just stopped responding! The last I heard from them was in January when they said they would 'get back to me' ... but they never did and never even replied to my follow-up messages!

FWIW, I also had a case open with Dell Support (who was pretty accommodating) but that was essentially put 'on hold' when they concluded that it wasn’t a Hardware related issue.

Nick
Nick-SAC
Enthusiast
Posts: 76
Liked: 16 times
Joined: Oct 27, 2017 5:42 pm
Full Name: Nick
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Nick-SAC »

DG-MC,

You referenced “Version 10 VMs”. Was that a typo? AFAIK 9.1 is currently the highest Version (on Win10/Server 1903).

Thanks,
Nick
wishr
Veteran
Posts: 3077
Liked: 455 times
Joined: Aug 07, 2018 3:11 pm
Full Name: Fedor Maslov
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by wishr »

Hi Nick,

Sad to hear that.

I recently spoke to @pterpumpkin in private and provided him with the number of the ticket we had opened with Microsoft some time ago. Also, as far as I understand from his post above, he has an ongoing conversation with a Microsoft engineer, and Microsoft needs as many support cases as possible to push a fix for this issue.

Thanks
pterpumpkin
Enthusiast
Posts: 36
Liked: 4 times
Joined: Jun 14, 2016 9:36 am
Full Name: Pter Pumpkin
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by pterpumpkin » 1 person likes this post

Thank you for all the PM's with case numbers! Please keep them coming. I have a total of 6 now (including my own).
bkowalczyk
Lurker
Posts: 1
Liked: 1 time
Joined: Sep 07, 2020 7:12 am
Full Name: Bartłomiej Kowalczyk
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by bkowalczyk » 1 person likes this post

Hello,

A very interesting topic.
We use a backup system other than Veeam, but the problem is the same.
I have registered a case with Microsoft: Service Request 120052425000074.
Unfortunately, it remains unanswered.
Eluich
Influencer
Posts: 19
Liked: 18 times
Joined: Jul 06, 2020 2:31 pm
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Eluich » 1 person likes this post

Hi

For information: I spoke with the person following our case at Microsoft, and he told me that he has grouped together 7 or 8 cases with the same problem.
I think they're trying to reproduce the problem in the lab.

Best Regards
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

[Possibly Solved, LONG] Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa » 5 people like this post

OK, I have another update. I believe I see what the combination of issues is on a Dell 12g/13g server; it's the combination listed below.
- Note: this is an ALL-PCIe-NVMe cluster, so results may vary. I still need to test a non-clustered machine, but I expect similar results.
- I will be doing that additional testing later on a non-clustered machine and will add another post with those results.


Causes (all conditions together):
- Meltdown/Spectre microcode (BIOS) + OS patches (hinted at in some posts I found, but with no real solution short of leaving the server vulnerable)
- The new Hyper-V core scheduler vs. the classic scheduler (as a result of the above)
- The VM spanning NUMA boundaries on processors/logical cores (this affected ReFS guests (integrity on or off) more than NTFS guests)


The following seems to have gotten me the performance I expected: guest I/O no more than 5-10% slower than host I/O, with host and guest latency tracking each other, instead of the guest latency spiking and/or guests affecting each other.

How to diagnose the problem
To track down the error, you can set all the VMs to use any StorageQosPolicy (e.g. minimum IOPS 50, maximum IOPS 0, maximum bandwidth 0); a sketch of creating and assigning such a policy follows right after the command below. This is mostly for monitoring, as it DOES NOT solve the problem before the changes detailed below.
- Run this command to track the latency of the VMs and compare it with the host hardware I/O latency; they should follow/track each other (one VM shouldn't just kill the latency of the others, causing the storage latency issues from the original topic of this thread). Note that, because of averaging, this command will lag a few seconds behind depending on the number of I/Os issued; it should "track/follow" the host latency plus a few percent of overhead.

Get-StorageQoSFlow | Sort-Object InitiatorLatency | Select -Last 10
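
If it helps, here is a rough sketch of creating and assigning such a monitoring-only policy (my own sketch; 'Monitor50' is a placeholder name, and Storage QoS requires a 2016/2019 cluster):

# Minimal policy (minimum 50 IOPS, no maximums) purely so Get-StorageQoSFlow reports per-VHD latency
$policy = New-StorageQosPolicy -Name 'Monitor50' -MinimumIops 50 -MaximumIops 0 -MaximumIOBandwidth 0
# Attach it to every virtual hard disk of every VM on this host
Get-VM | Get-VMHardDiskDrive | Set-VMHardDiskDrive -QoSPolicyID $policy.PolicyId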

(The examples below are real numbers, on a CSVFS_NTFS host volume; the guest volume tested is NTFS. Both use default NTFS settings.)

I can't get into all the details of how to read the diskspd results here, but there are some good articles if you search for them. The important highlights:

When running the testing below with diskspd, I could see huge differences between host latency and guest latency. The command I use to put load on the host or guest (to compare the latency) uses diskspd from Microsoft (a free download), in a horrible worst case that simulates SQL, where:
- It leaves the hardware cache enabled and software caching disabled (needed for SSD/NVMe to behave correctly)
- It writes a 20 GB file to iotest.dat, so make sure that path points to your test volume (the example here uses the F: drive)
- You can read the full breakdown of the command parameters with diskspd /?
- It runs for roughly 35 seconds (a 30-second test plus warm-up and cool-down)
- If you run the command a second time (output to a different file), the values should be similar between run 1 and run 2. If they aren't, check for other I/O on the volume and/or run a third time and average as appropriate

diskspd -b8K -d30 -o4 -t8 -Su -r -w25 -L -Z1G -c20G F:\iotest.dat > testResults.txt
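
If you want to script the two runs, a trivial sketch (assuming diskspd.exe is on the PATH and F: is the test volume; adjust both as needed):

# Run the same diskspd test twice and keep both result files for comparison
$cmd = 'diskspd -b8K -d30 -o4 -t8 -Su -r -w25 -L -Z1G -c20G F:\iotest.dat'
foreach ($run in 1, 2) {
    Invoke-Expression "$cmd > testResults_run$run.txt"
    Start-Sleep -Seconds 10   # let the volume settle between runs
}
# Then compare the CPU table and latency percentiles between testResults_run1.txt and testResults_run2.txt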

For the CPU usage, compare the following:
(CPU load, bad case; notice the low CPU load)

CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 11.82%| 0.73%| 11.09%| 88.18%
1| 11.56%| 0.63%| 10.94%| 88.44%
2| 11.35%| 0.94%| 10.42%| 88.65%
3| 11.09%| 0.73%| 10.36%| 88.91%
4| 11.25%| 0.47%| 10.78%| 88.75%
5| 11.25%| 0.63%| 10.63%| 88.75%
6| 10.99%| 0.57%| 10.42%| 89.01%
7| 10.47%| 0.47%| 10.00%| 89.53%
8| 7.19%| 0.94%| 6.25%| 92.81%
9| 6.36%| 0.94%| 5.42%| 93.64%
10| 5.63%| 0.68%| 4.95%| 94.37%
11| 5.05%| 0.05%| 5.00%| 94.95%
12| 5.21%| 0.31%| 4.90%| 94.79%
13| 7.61%| 1.36%| 6.26%| 92.39%
14| 5.84%| 0.52%| 5.32%| 94.16%
15| 5.99%| 0.94%| 5.05%| 94.01%
16| 6.36%| 1.09%| 5.27%| 93.64%
17| 5.73%| 0.89%| 4.84%| 94.27%
18| 5.16%| 0.26%| 4.90%| 94.84%
19| 5.01%| 0.42%| 4.59%| 94.99%
20| 5.52%| 0.52%| 5.00%| 94.48%
21| 5.06%| 0.31%| 4.74%| 94.94%
22| 4.79%| 0.36%| 4.43%| 95.21%
23| 6.26%| 0.26%| 6.00%| 93.74%
-------------------------------------------
avg.| 7.61%| 0.63%| 6.98%| 92.39%

(CPU load, good case; notice the test now actually keeps some of the CPUs busy)
CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 95.42%| 3.07%| 92.35%| 4.58%
1| 96.46%| 2.92%| 93.55%| 3.54%
2| 96.56%| 2.55%| 94.01%| 3.44%
3| 95.52%| 3.12%| 92.40%| 4.48%
4| 95.26%| 2.86%| 92.40%| 4.74%
5| 95.00%| 3.49%| 91.51%| 5.00%
6| 95.16%| 2.71%| 92.45%| 4.84%
7| 95.58%| 2.86%| 92.71%| 4.42%
8| 35.64%| 0.47%| 35.17%| 64.36%
9| 35.17%| 0.31%| 34.86%| 64.83%
10| 32.43%| 0.42%| 32.01%| 67.57%
11| 31.86%| 0.47%| 31.39%| 68.14%
12| 30.14%| 0.26%| 29.88%| 69.86%
13| 31.39%| 0.21%| 31.18%| 68.61%
14| 27.11%| 0.16%| 26.95%| 72.89%
15| 27.33%| 0.26%| 27.07%| 72.67%
16| 25.51%| 0.47%| 25.04%| 74.49%
17| 27.54%| 0.21%| 27.33%| 72.46%
18| 28.32%| 0.21%| 28.11%| 71.68%
19| 25.86%| 0.10%| 25.75%| 74.14%
20| 28.58%| 0.05%| 28.53%| 71.42%
21| 26.69%| 0.21%| 26.48%| 73.31%
22| 25.98%| 0.16%| 25.82%| 74.02%
23| 26.65%| 0.16%| 26.50%| 73.35%
-------------------------------------------
avg.| 51.30%| 1.15%| 50.14%| 48.70%


Comparing the summaries at the end, you should see the values at the 95th/99th percentile and below "tracking" the latency reported by the Get-StorageQoSFlow command above.


(Summary example, bad VM guest, before the changes. See how the write latency goes horribly bad; during this time Get-StorageQoSFlow showed horrible latency, which affected other volumes, yet the host I/O latency (in Windows Admin Center, for example) stayed low over the same period, proving the overhead was being introduced somewhere in the hypervisor.)

total:
%-ile | Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
min | 0.033 | 0.466 | 0.033
25th | 0.137 | 83.024 | 0.196
50th | 0.331 | 91.370 | 0.498
75th | 0.606 | 100.709 | 4.642
90th | 0.962 | 113.450 | 94.638
95th | 1.504 | 170.363 | 103.408
99th | 4.117 | 226.077 | 178.741
3-nines | 78.729 | 1024.335 | 405.798
4-nines | 309.404 | 1901.500 | 1512.306
5-nines | 311.810 | 2077.842 | 2063.667
6-nines | 312.010 | 2077.842 | 2077.842
7-nines | 312.010 | 2077.842 | 2077.842
8-nines | 312.010 | 2077.842 | 2077.842
9-nines | 312.010 | 2077.842 | 2077.842
max | 312.010 | 2077.842 | 2077.842


(Summary example, good case; notice the latency.)
total:
%-ile | Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
min | 0.124 | 0.375 | 0.124
25th | 1.418 | 1.601 | 1.462
50th | 2.038 | 2.232 | 2.088
75th | 2.944 | 3.139 | 2.996
90th | 4.298 | 4.549 | 4.361
95th | 5.492 | 5.911 | 5.598
99th | 9.346 | 9.946 | 9.506
3-nines | 23.907 | 26.661 | 24.948
4-nines | 69.912 | 96.569 | 80.284
5-nines | 99.576 | 100.424 | 99.909
6-nines | 101.410 | 106.454 | 106.129
7-nines | 106.453 | 106.454 | 106.454
8-nines | 106.453 | 106.454 | 106.454
9-nines | 106.453 | 106.454 | 106.454
max | 106.453 | 106.454 | 106.454



So, now to the solution in my environment:

- Disable SMT/HyperThreading in the BIOS
  - this forces fallback to the core scheduler
- Ensure the following is run against all your VMs, setting HwThreadCountPerCore to 0 on a Windows Server 2019 host or 1 on a Windows Server 2016 host, where VMName is the VM (mine was previously set to 0, which means "follow the SMT setting"); see https://docs.microsoft.com/en-us/window ... nistrator.
Set-VMProcessor -VMName <VMName> -HwThreadCountPerCore <0, 1, 2>
- Ensure no VM spans NUMA (per-VM logical processor count <= physical core count of the smallest physical processor)
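
If it helps anyone applying this across a whole host, here is a rough sketch (my own; it assumes a Windows Server 2019 host, VMs powered off so Set-VMProcessor can apply, and uses the smallest socket's core count as the NUMA guide):

# Set HwThreadCountPerCore on every VM and flag VMs whose vCPU count exceeds the
# smallest physical socket's core count (i.e. VMs that could span NUMA nodes)
$minCores = (Get-CimInstance Win32_Processor | Measure-Object NumberOfCores -Minimum).Minimum
foreach ($vm in Get-VM) {
    Set-VMProcessor -VMName $vm.Name -HwThreadCountPerCore 0   # 0 on a 2019 host (1 on 2016), per the note above
    $vCPUs = (Get-VMProcessor -VMName $vm.Name).Count
    if ($vCPUs -gt $minCores) {
        Write-Warning "$($vm.Name): $vCPUs vCPUs may span NUMA (smallest socket has $minCores cores)"
    }
}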


*NOW*, CBT only occasionally causes some storage latency, and only for the guest volume involved, and only if that guest volume is ReFS. With an NTFS guest partition this was not observed during the backup, even after large data changes (I added 1.2 TB of partitions to CBT). Subsequent CBT backups did not cause latency issues.


This fix also worked for a CSVFS_ReFS host (with/without integrity) with NTFS guests, and for ReFS guests (only tested without integrity) as well, though that needs more testing on my end.

Obviously CSVFS_ReFS is slower on my insane test (25-40% slower), but with no I/O latency spiking issues; it is just "not as fast" in absolute numbers.

I still have more testing to do, but I am hoping the above helps Microsoft track this down, and helps others solve or work around the issue in their environments.
Thanks,
Christine
nmdange
Veteran
Posts: 528
Liked: 144 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by nmdange »

If NUMA spanning is actually part of the issue, you can disable/block NUMA spanning at the host level by doing "Set-VMHost -NumaSpanningEnabled $false" in PowerShell. I've always done this on my Hyper-V servers to improve performance. It would be interesting to see what it looks like with hyperthreading enabled but NUMA spanning disabled.
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

Also, the more I look at this, the more I think we are all chasing two partially overlapping problems...
#1: the guest VM performance issue, which is what I've documented. It appears to be resolved for the guest VM, for many hours at a time: returning to the classic scheduler, with SMT/HyperThreading disabled in the BIOS, fixes the performance until #2 happens.
#2: the I/O scheduler just seems to get confused. When #1 is resolved, the improved performance seems to reduce the frequency of #2 because the guest I/Os complete more quickly, not because the cause of #2 has been resolved.

- NUMA spanning only seemed to upset CSVFS_ReFS and/or ReFS *directly*, and may be a red herring

- On CSVFS_ReFS, the storage subsystem could bring itself to a near halt when migrating a 400 GB or larger ReFS guest with CBT (whether the guest was running or not), even after I had made a backup of the offline VM (so there shouldn't be any "changes" after that, right?!)
- When things slow down enough to produce the error from the original post ("Source: Microsoft-Windows-Hyper-V-StorageVSP"), I've seen it reported for target VMs regardless of whether the path is DAS or CSV storage, and regardless of clustering. MANY times the path it reports is a file that no longer exists on the volume, and this is irrespective of whether I've issued Flush-Volume commands against the volume. I can also see (at least on a cluster) that the QoS flow is sometimes duplicated after moving a volume between storage, and the flow for the "old" location/path is not dropped until you stop and restart the VM (even hours after the move finished); see the sketch right below for spotting these stale flows.
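
A quick way to eyeball those duplicated/stale flows (rough sketch; these are the properties Get-StorageQoSFlow exposes on my cluster):

# List current QoS flows with the backing file path, so flows pointing at an old or
# no-longer-existing VHDX path stand out
Get-StorageQosFlow | Sort-Object InitiatorName, FilePath | Format-Table InitiatorName, FilePath, InitiatorLatency, InitiatorIOPS -AutoSize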

A hypothesis:
I've also noticed that once you get the I/O "quiet" on the Hyper-V host/cluster *and* shut down most or all VMs, the storage subsystem will catch back up... it seems to clear out whatever was hanging it, and it can stay that way even under load afterwards.
It is almost as if the I/O scheduler and/or CBT gets "confused" and thinks an "old" I/O hasn't completed, and that starts hanging subsequent I/O dependent on that read/write (even though it did complete).

So for now, the host CSV is CSVFS_NTFS, and all but one partition (due to timing; I will convert it tonight) are NTFS. There were no I/O issues during any of that, moving everything at ~1.2 GB/s or more with no delay. It was moved from CSVFS_ReFS, and the entire time the latency on the CSVFS_NTFS destination was lower than the latency on the CSVFS_ReFS source.

So after tonight, in any scenario where there is a guest VM, I am avoiding ReFS on both the host and the guest. I will leave CBT on for a few days and, if there are any issues, disable it completely to see whether that finishes eliminating the issues.
I will continue to use ReFS for my backup storage, as that seems to work without any issues as long as guest VMs are not on the partition.

I'll report back with more on this, but to summarize:
Problem 1 - The guest VM has poor I/O performance on a Windows Server 2019 host
- Turn off SMT/HyperThreading and let the classic scheduler work the way things used to (which suggests the Spectre/Meltdown patches themselves aren't the direct problem)
- Don't use ReFS/CSVFS_ReFS for the host or the guest (this overlaps with helping Problem 2)

Problem 2 - The originally reported error ("Source: Microsoft-Windows-Hyper-V-StorageVSP") and the I/Os hanging as a result
- Perform the steps for Problem 1 to improve performance, which reduces the occurrence
- TBD: after I run with this configuration for a few days, disable CBT completely in the backups

I think it's the combination of these overlapping problems that makes Problem 2 so difficult to track down and reproduce 100% reliably.

Again, I hope all of this helps us diagnose as a community what the underlying bugs are, and at worst these are additional data points that may solve the problems for your environment.
Thanks,
Christine
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

Ok.. I've done more testing, now on NON-Clustered servers.

All servers were Dell 11g and 12g servers.
All use DAS, against both SAS SFF HDDs (10K and 15K RPM) and SATA SSDs (prosumer and consumer grade), on hardware RAID (PERC H700p and H710p internal controllers, and PERC H800 external).
That eliminates Storage Spaces, S2D, and networking (40 GbE) as factors in the performance.

This solution (disabling SMT/HyperThreading) has dramatically improved disk I/O for VMs in every scenario I've tested.

Additionally (second test), converting the hosts' volumes back to NTFS had a minimal effect on performance, but enough to be worth converting them back.
I also did some NUMA testing and saw little to no change in performance (within the margin of error).

Could someone else (the original poster?) try turning off SMT/HyperThreading on their host to see whether it improves their performance as well?
Thanks,
Christine
gmsugaree
Lurker
Posts: 1
Liked: never
Joined: Apr 03, 2020 5:13 pm
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by gmsugaree »

pterpumpkin wrote: Sep 02, 2020 8:20 pm Thank you for all the PM's with case numbers! Please keep them coming. I have a total of 6 now (including my own).
Peter can you please PM me and I'll reply with my similar case number. The forum is not allowing me to send PMs yet because I have not been in discussions. Thanks!
Gostev
Chief Product Officer
Posts: 31816
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Gostev »

Well, now you can ;)
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

nmdange wrote: Sep 09, 2020 2:33 am If NUMA spanning is actually part of the issue, you can disable/block NUMA spanning at the host level by doing "Set-VMHost -NumaSpanningEnabled $false" in PowerShell. I've always done this on my Hyper-V servers to improve performance. It would be interesting to see what it looks like with hyperthreading enabled but NUMA spanning disabled.
Just to directly follow up, NUMA was not part of the performance issue.
@nickthebennett
Lurker
Posts: 1
Liked: never
Joined: Sep 16, 2020 12:36 pm
Full Name: Nick Bennett
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by @nickthebennett »

Hi,

I'm seeing event ID 9 on a single VM in a customer environment. It's very intermittent in terms of occurrence: sometimes we run fine for days without an issue and then we get it twice in two days. The length of time the issue lasts also varies greatly.

From reviewing this thread I get the impression that disabling CBT in Veeam has no effect and that the issue lies within the Microsoft RCT driver on the Hyper-V hosts, something we can't actually disable; the workaround is to migrate the VM to another host, interrupting the RCT process that is causing the issue. I'll try this next time it occurs.

Is anyone with MS tickets logged getting any sensible feedback?

Thanks
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

@nickthebennett
Look at my previous few replies for some details that may help (essentially, turning off SMT/HyperThreading/logical processors in the BIOS, and using NTFS instead of ReFS for both the host volume and the VM guest volume).

Above is also how I have been testing it, to show the large difference in performance before and after. On all the servers I've tested (specs above), this solved the issue WITHOUT disabling CBT.

Let us know your machine specs and config, and whether your environment sees positive results when trying my suggestions.
Thanks,
Christine
giomoda
Novice
Posts: 3
Liked: 1 time
Joined: Jun 20, 2020 6:41 pm
Full Name: Giovani Moda
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by giomoda » 1 person likes this post

Hello.

First of all, Christine, absolutely amazing work. I really hope this dedicated work of yours helps MS track down this issue.

You know, I always thought that guest VM performance on Server 2019, especially when running on 14th-generation Dell servers, was just somewhat "off" compared to the same setup on Server 2016. I think you know what I'm saying: laggy screens, a few extra seconds to open an application or to list the contents of a folder, etc. But I could never really put my finger on it and, as long as everything was running, I just brushed it off and kept going. Now, though, it has become an issue.

Well, to the point: a few weeks ago I got a call from an MS engineer who gave me two registry keys that supposedly address the CBT issue:

Code: Select all

HKLM\System\CurrentControlSet\Services\vhdmp\Parameters\DisableResiliency = 1 (REG_DWORD)
HKLM\software\Microsoft\Windows nt\CurrentVersion\Virtualization\Worker\DisableResiliency = 1 (REG_DWORD)
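If you want to script these, a minimal elevated-PowerShell sketch (my own wrapper, not something Microsoft provided; I'm assuming a host reboot is needed afterwards):

# Create the two DisableResiliency values described above (run elevated on the Hyper-V host)
$keys = @(
    'HKLM:\SYSTEM\CurrentControlSet\Services\vhdmp\Parameters',
    'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization\Worker'
)
foreach ($key in $keys) {
    if (-not (Test-Path $key)) { New-Item -Path $key -Force | Out-Null }
    New-ItemProperty -Path $key -Name 'DisableResiliency' -Value 1 -PropertyType DWord -Force | Out-Null
}
# Reboot the host afterwards (my assumption) before re-testing
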
I've been testing this in a very reduced lab environment for a week now, as I don't have access to high-end servers in my lab, and I haven't noticed anything wicked going on. But since the issue, at least for me, is very hard to reproduce, I cannot say that it has indeed fixed anything. Backups are running, the guest VMs seem to be responding normally, and no alerts have been generated so far. I could not find any documentation about these keys, though, so that is something that really bugs me.

Anyway, I'm sharing this so that those of you who can reproduce the issue more easily can test it on a larger scale. Who knows, right?

Regards,
Giovani
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

Giovani,
If you could, please run the test I gave above (repeated below, where F: is the volume), both outside a VM on the host's physical volume and from within a VM whose VHDX sits on that same physical volume.

Run the tests before the registry changes you just posted (and see whether host and guest results are drastically different, as above), and then run them again afterwards and compare. It's much quicker than waiting for the intermittent errors, since those only start showing at 10-second-plus I/O delays.

diskspd -b8K -d30 -o4 -t8 -Su -r -w25 -L -Z1G -c20G F:\iotest.dat > testResults.txt

I'll have to test your changes when I get some free time next week, but so far, since I turned off HyperThreading and used NTFS on every host volume that holds guest VHDXs, as well as for the volumes inside the VHDXs, I haven't had any issues :)

HTH,
Christine
pterpumpkin
Enthusiast
Posts: 36
Liked: 4 times
Joined: Jun 14, 2016 9:36 am
Full Name: Pter Pumpkin
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by pterpumpkin » 1 person likes this post

Great news!!!! I just got off a call with Microsoft. They have confirmed that this has now been submitted as a bug. The product group is currently investigating the issue and working on a fix. They will review the severity of the issue and possibly release a fix in a round of Windows Updates. If they determine that the issue is not that severe or widespread and there is a sufficient workaround, they may not release a fix at all :(

They are also 90% sure that the issue only occurs on Server 2016 VMs that have been migrated from Hyper-V 2016 to Hyper-V 2019. They're confident that if you build a fresh VM with Server 2016 or Server 2019 on a Hyper-V 2019 host/cluster, the issue will not reoccur.

They believe that deleting all the "RCT reference points" (which are just the RCT files associated with the VM) may also resolve the issue. They were not 100% confident about this, but it is possibly worth a try.
Nick-SAC
Enthusiast
Posts: 76
Liked: 16 times
Joined: Oct 27, 2017 5:42 pm
Full Name: Nick
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Nick-SAC » 1 person likes this post

pterpumpkin wrote: ...They are also 90% sure that the issue only occurs on Server 2016 VM's that have been migrated from Hyper-V 2016 to Hyper-V 2019. They're confident that if you build a fresh VM with Server 2016 or Server 2019 on a Hyper-V 2019 host/cluster, the issue will not reoccur.

Our case was/is on a fresh Server 2019 Hyper-V Host with a fresh Server 2019 VM and a fresh Server 2016 VM – and I personally made that absolutely clear to MS Support... each & every time the case was escalated to the next tier support.

pterpumpkin wrote: They believe that deleting all the "RCT reference points" which is just the RCT files associated with the VM may also resolve the issue. They were not 100% confident on this though, but possibly worth a try.

Tried that... No help...
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

Nick-SAC, have you tried my solution yet?
Nick-SAC
Enthusiast
Posts: 76
Liked: 16 times
Joined: Oct 27, 2017 5:42 pm
Full Name: Nick
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Nick-SAC »

No, I'm sorry to say that I haven't had a chance to try it yet Christine. I'm booked solid with other jobs right now...
CasperGN
Novice
Posts: 6
Liked: never
Joined: Sep 29, 2020 11:59 am
Full Name: Casper Glasius-Nyborg
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by CasperGN »

We are experiencing the same as you all are. The performance improved greatly when we disabled HT on a few hosts, but we learned that a live migration was still needed to get the performance back. To me that sounds like the RCT needs to be reset.
Like many of you, we also have an MS ticket on this and are not really getting anywhere. @pterpumpkin, could you help out with your own MS ticket number, which I could reference? Anybody else who has an MS ticket on this issue, please PM me as well.
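
For anyone scripting that live-migration workaround on a cluster, a minimal sketch (placeholder VM and node names; assumes the clustered VM role has the same name as the VM):

# Live-migrate a VM to another node to get performance back (workaround only, per the above)
Move-ClusterVirtualMachineRole -Name 'SQL01' -Node 'HV-NODE2' -MigrationType Live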

Best regards
Casper
ChristineAlexa
Enthusiast
Posts: 47
Liked: 10 times
Joined: Aug 26, 2019 7:04 am
Full Name: Christine Boersen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChristineAlexa »

Casper,
OK, this confirms my observations as well that there are two issues:
- The HT/Meltdown/Spectre performance problem
- The RCT getting messed up

Have you tried the registry keys from above (copied here) that Giovani posted?

Code: Select all

HKLM\System\CurrentControlSet\Services\vhdmp\Parameters\DisableResiliency = 1 (REG_DWORD)
HKLM\software\Microsoft\Windows nt\CurrentVersion\Virtualization\Worker\DisableResiliency = 1 (REG_DWORD)
Hope that helps. (I have *NOT* had a chance to try the registry keys yet; I'm in the middle of a deadline right now, 22 hours into this work day so far.)
Christine
WesleyUC
Influencer
Posts: 12
Liked: 3 times
Joined: Aug 27, 2019 8:55 am
Full Name: LeslieUC
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by WesleyUC » 2 people like this post

Disabling CBT and HT/Meltdown/Spectre is not the solution. The problem is Microsoft RCT.
When doing an on-host backup with Veeam, an RCT file is created. Once the RCT file is on the disk and you test with diskspd on that disk, the I/O performance problem is there and remains.
But if you change (or delete) the backup job from on-host to an agent job and delete the RCT files from the disks, the I/O performance loss does not occur in diskspd testing. (We have our SQL Server virtualized, but we are doing an agent backup job, which solves our problems for now.)
Also, this is not a Veeam problem; all backup vendors that use RCT have the same issue and I/O performance loss. See the post from May 16, 2020 in this thread.
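
To see whether RCT files are present next to a VM's disks, a rough sketch (placeholder path; I'm assuming the usual .rct/.mrt files that RCT keeps next to each VHDX):

# List RCT/MRT change-tracking files alongside the VHDX files on a volume.
# Only delete them while the VM is off, and expect the next backup to read the full disk.
Get-ChildItem -Path 'C:\ClusterStorage' -Recurse -Include *.rct, *.mrt | Select-Object FullName, Length, LastWriteTime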