ChriFue wrote: Apr 06, 2022 8:51 am
Hello,
I am also struggling with this issue.
The customer has a software-defined storage infrastructure.
We set up the 2-node system (Hyper-V 2019, a DataCore SDS VM on each node), and for a few weeks everything was OK.
All Flash! No spinning rust. 10Gbit LAN. iSCSI connections for CSVs.
Daily Veeam backup job, on-host mode, no problems. Insane backup speeds.
Then suddenly we got Event 9 issues on the VHDX files of the VMs.
AND the issues also sent our local SDS VM to hell and froze it. Which is bad, because it serves the iSCSI targets for the Hyper-V hosts.
The failover cluster crashed and went into a loop of trying to bring VMs up on another node, because the CSVs were gone.
I/O latency for the VMs rose to over one minute ... after a few hours everything went back to normal, but the VMs needed a hard reset because they were unresponsive.
Interesting: these software-defined storage VMs sit on a separate RAID controller on their own volume with their own SSDs ...
But they also sometimes crash when Event 9 is happening on the CSVs and the VMs.
They also sometimes think they "lose" their local Hyper-V disk (according to the event log). It always happens during the backup window.
And that is the point I don't understand.
Why is my local VM on my local RAID also struggling with I/O problems?
It is not on the CSVs, it is not on the cluster, it is just a little Windows VM hosted locally. And this VM is not backed up!
So, perhaps a problem in the MS Hyper-V storage stack?
Maybe it says, "Hey, something is wrong, I will slow down ALL I/O on my hypervisor, no matter whether the VMs are on CSVs or local storage."
BUT: To investigate, we evacuated all VMs to a non-clustered system.
One VM after another, and the Veeam replication job did its job perfectly.
Now on single server hardware, local RAID, no cluster, no CSVs. Just a "single Hyper-V host".
Again, daily Veeam backups in on-host mode.
And ... we also got Event 9 errors during the backup windows with Veeam.
I/O requests taking 39969 ms and longer. Yes, that is 40 seconds ...
I was surprised that the VMs survived this latency, maybe because of Hyper-V I/O caching and looong timeouts.
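(For anyone who wants to see how bad their own spikes are: this is roughly how I pull those events off a host. It is only a minimal sketch, assuming the "took N milliseconds" messages land as Event ID 9 in the Microsoft-Windows-Hyper-V-StorageVSP-Admin channel - check the channel name in your own Event Viewer first - and that Python is available on the host; wevtutil itself ships with Windows.)

```python
# dump_event9.py - minimal sketch, not tested in your environment.
# Assumption: the slow-I/O messages are Event ID 9 in the channel below;
# adjust CHANNEL if your events live in a different log.
import subprocess

CHANNEL = "Microsoft-Windows-Hyper-V-StorageVSP-Admin"

# wevtutil: query the last 50 events with EventID 9, rendered as text, newest first.
result = subprocess.run(
    ["wevtutil", "qe", CHANNEL,
     "/q:*[System[(EventID=9)]]",
     "/f:text", "/c:50", "/rd:true"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```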
In the meantime we did a complete fresh setup of our software-defined storage cluster; the server hardware vendor and the storage software vendor were both on the team. We also changed the RAID controllers on both nodes ... who knows!
Again, for some days everything was perfect. After 5 days of runtime, Event 9 came back.
It did not crash our system again, because I activated Veeam storage I/O control. The backup also processes one VM (VHDX) after another sequentially, to keep the impact on the storage low.
But again, massive Event 9 entries on the Hyper-V host. The event logs of the VMs also say "heeey, I think this I/O took too long, but the retry was successful". But the VMs survive.
And now I am back here, sitting on expensive hardware with expensive software, and crashing it when I do backups the way I want to (more than 1 VM simultaneously).
Thank you all for sharing your experiences with this problem; it helped me get focused.
Besides my story, here is my question:
As some of you wrote, is it true that a daily live migration to another host and back helps a lot?
Then I would try to put together a script that does the job, something like the sketch below.
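In case it helps anyone else, this is roughly what I have in mind. It is only a rough sketch under a few assumptions: a 2-node cluster with the FailoverClusters PowerShell module on the node running it, Python installed there, and placeholder node/VM names (HV-NODE1, HV-NODE2, VM01, VM02) that you would replace with your own. On a standalone host, Move-VM would be the rough equivalent of Move-ClusterVirtualMachineRole.

```python
# daily_roundtrip.py - rough sketch, untested.
# Assumptions: a 2-node Hyper-V failover cluster and the FailoverClusters
# PowerShell module on the node this runs on; all names below are placeholders.
import subprocess
import time

NODE_A = "HV-NODE1"          # placeholder: node the VMs normally run on
NODE_B = "HV-NODE2"          # placeholder: node to migrate to and back from
VMS = ["VM01", "VM02"]       # placeholder VM names

def ps(command: str) -> None:
    """Run a single PowerShell command and fail loudly if it errors."""
    subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", command],
        check=True,
    )

def live_migrate(vm: str, node: str) -> None:
    # Live-migrate one clustered VM to the given node.
    ps(f"Move-ClusterVirtualMachineRole -Name '{vm}' -Node '{node}' -MigrationType Live")

if __name__ == "__main__":
    for vm in VMS:
        live_migrate(vm, NODE_B)   # move away ...
        time.sleep(60)             # ... let things settle (arbitrary pause)
        live_migrate(vm, NODE_A)   # ... and move back
```

Scheduled via Task Scheduler outside the backup window, that would at least automate the round trip; whether it really keeps Event 9 away is exactly what I would like to know.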
Chris