-
- Veteran
- Posts: 465
- Liked: 136 times
- Joined: Jul 16, 2015 1:31 pm
- Full Name: Marc K
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
If I remember correctly, Veeam B&R gets upset when it sees an old VM version like that on Windows Server 2016 or 2019. When I upgraded the 2012 R2 Hyper-V hosts to 2016, I had planned to leave the VM version at the old value for a while to provide an easy way to go back if things didn't work out. But Veeam B&R complained about it, so I changed the plan and upgraded the VM version immediately after the host upgrade.
-
- Enthusiast
- Posts: 76
- Liked: 16 times
- Joined: Oct 27, 2017 5:42 pm
- Full Name: Nick
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Thanks NMDANGE & Marc K,
I wasn’t really planning on trying to downgrade the existing/production VMs but I am toying with the idea of creating an older version Test VM just to see what happens... because I just don’t have enough things to waste my life on right now!
Nick
-
- Lurker
- Posts: 2
- Liked: never
- Joined: May 17, 2020 1:08 pm
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Hey everyone
Can you confirm whether your CSV file systems are running in redirected or direct I/O mode?
Execute the command Get-ClusterSharedVolumeState. I found out that we had set up ours as ReFS, which is not a good idea. We converted to NTFS and now our state is direct, and most if not all of our issues are resolved. I can post more info with source links when I get to a computer.
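For anyone who wants to check this themselves, a minimal sketch (run in an elevated PowerShell session on one of the cluster nodes; it assumes the FailoverClusters module is installed):
# Show each CSV, the owning node, and whether I/O is Direct or FileSystemRedirected
Get-ClusterSharedVolumeState |
    Select-Object Name, Node, StateInfo, FileSystemRedirectedIOReason |
    Format-Table -AutoSize
ReFS-formatted CSVs show up as FileSystemRedirected, which matches what is described above.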
-
- Enthusiast
- Posts: 36
- Liked: 4 times
- Joined: Jun 14, 2016 9:36 am
- Full Name: Pter Pumpkin
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
We are experiencing this too.
Virtual Host hardware: Lenovo x240 M5
Virtual Host OS: Windows Server 2019 Datacentre
Storage: Nimble CS3000
VM OS: Windows Server 2016 Standard
VM Size: ~2.5TB
VM File System: NTFS (no dedup)
One thing we noticed is that the RCT file for the large ~2.34TB disk is 500MB, which seems quite large?
The issue does not happen for us while backing up, but instead around the morning logon time (the server hosts user profiles).
We tested doing a Live Migration to another host and the issue seems to have disappeared, although it has only been one day since we did this.
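Regarding the RCT file size mentioned above, a rough way to compare on other hosts (a sketch; the path is a placeholder for wherever the virtual hard disks live):
# RCT keeps .rct and .mrt change-tracking files next to each .vhdx
Get-ChildItem 'D:\Hyper-V\Virtual Hard Disks\*' -Recurse -Include *.rct, *.mrt |
    Select-Object FullName, @{ n = 'SizeMB'; e = { [math]::Round($_.Length / 1MB, 1) } }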
-
- Novice
- Posts: 3
- Liked: 1 time
- Joined: Jun 20, 2020 6:41 pm
- Full Name: Giovani Moda
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Hey Mats,
MatsL wrote: ↑Feb 07, 2020 8:51 am Hi.
We have the same problem in 4 different clusters after Veeam Backup, two different i-SCSI SAN (Dell Compellent & Hitachi VSP G370).
Hosts: DELL R640 (latest patches)
OS: Windows Server 2019 (latest patches)
Storage: CSV, I-SCSI, Dell Compellent & Hitachi VSP G370
Backup: Veeam 9.5 u4
We have found a workaround: if we move the VM that has problems (usually the SQL data disk, but no other disks are affected, weird) to another host, we get normal performance again.
We have a script that checks disk performance every two hours; if it finds a server (VM) with poor performance, the script moves that server to another host in the cluster.
We suffer quite a lot from this.
/Mats
Would you mind sharing that script? I have a setup very similar to yours and I'm suffering from constant performance issues, to the point where the VM is rebooted by the cluster. Maybe this can help me work around it until Microsoft decides to consider this an issue worth addressing.
Regards,
Giovani
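While waiting for Mats, here is a rough sketch of that kind of workaround (not his actual script): it counts recent StorageVSP warnings on the host and, above an arbitrary threshold, live-migrates a problem VM. The VM name, threshold, and target-node selection are placeholders, and it assumes the FailoverClusters module on a cluster node.
# Look for Hyper-V StorageVSP I/O delay warnings (Event ID 9) from the last two hours
$events = Get-WinEvent -FilterHashtable @{
    LogName   = 'Microsoft-Windows-Hyper-V-StorageVSP-Admin'
    Id        = 9
    StartTime = (Get-Date).AddHours(-2)
} -ErrorAction SilentlyContinue

if (@($events).Count -gt 10) {
    # Pick another node that is up and live-migrate the affected VM role to it
    $target = Get-ClusterNode |
        Where-Object { $_.State -eq 'Up' -and $_.Name -ne $env:COMPUTERNAME } |
        Select-Object -First 1
    Move-ClusterVirtualMachineRole -Name 'ProblemVM' -Node $target.Name -MigrationType Live
}
Scheduled via Task Scheduler every two hours this would roughly match what Mats describes, though his script apparently keys off disk performance rather than event counts.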
-
- Enthusiast
- Posts: 47
- Liked: 10 times
- Joined: Aug 26, 2019 7:04 am
- Full Name: Christine Boersen
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
I'm having the same issue. I can give some more details as well:
- All servers experiencing it are upgrade installs from Windows Server 2016 GA (fully patched) to Windows Server 2019.
- It IMMEDIATELY happened after the upgrade.
- Disabling using RCT/CBT for all backup jobs on the volumes eliminates the issue.
- I've tried deleting/recreating volumes (and storage pools) for some of the affected machines, with no effect on the issue
- Setting the throttling to low values didn't seem to have an effect (not surprised since it looks like a FS bug with MS)
- Servers are
- 2 x Dell Poweredge R720 with H710p controllers
- 1 x Dell Poweredge R930 with H730p controllers (in 12x12 mode with dual cards)
Hope that helps narrow the issue
-
- Enthusiast
- Posts: 76
- Liked: 16 times
- Joined: Oct 27, 2017 5:42 pm
- Full Name: Nick
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Christine,
I just tried disabling CBT on all of the configured jobs, but I'm still getting the same I/O delays as before, both during and outside of the backup jobs; in fact, most of them were not concurrent with running jobs.
Following your 2016-to-2019 upgrade, did you bring your VMs up to v9 or are they still on v8?
Thanks,
Nick
-
- Enthusiast
- Posts: 47
- Liked: 10 times
- Joined: Aug 26, 2019 7:04 am
- Full Name: Christine Boersen
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
I'm on V10 already, actually, and have been since V10's first patch. The upgrade was from 9.5b (whatever the latest patch was a few months ago).
The Windows upgrade was 1-2 months after the Veeam upgrade
Also, these are *NOT* clustered (yet.... about to convert to S2D), so things like CSV aren't part of the equation either.
-
- Enthusiast
- Posts: 76
- Liked: 16 times
- Joined: Oct 27, 2017 5:42 pm
- Full Name: Nick
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Thanks Christine but I wasn’t questioning the Veeam VBR version (although that’s potentially helpful to know too).
I’d like to know what VM Machine/Configuration version you’re on (as seen in the Hyper-V Manager Console). Did you upgrade that from 8 to 9 when you did the Server 2016-2019 upgrade and if so, did you do so before or after you saw the I/O Delays?
Thanks again,
Nick
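For reference, checking (and one-way upgrading) the configuration version from PowerShell on the host looks roughly like this (a sketch; 'SomeVM' is a placeholder, and note the upgrade cannot be reversed):
# List the configuration version of every VM on this host
Get-VM | Select-Object Name, Version, State

# Upgrade a single VM; it must be powered off first
Stop-VM -Name 'SomeVM'
Update-VMVersion -Name 'SomeVM' -Confirm:$false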
-
- Influencer
- Posts: 19
- Liked: 18 times
- Joined: Jul 06, 2020 2:31 pm
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Hello, Nick.
We are currently experiencing the same latency issues, with event ID 9 from Hyper-V-StorageVSP in the hypervisor event log.
We have several 3-node failover clusters connected to a SAN over SAS.
OS: Windows Server 2019 Datacenter Version 1809
Node: Dell PowerEdge R640
SAN: Dell PowerVault ME4024
These clusters host a number of virtual machines, some of which contain vhdx profiles for RDSH farms.
The issue occurs around the morning logon time (the server hosts user profiles).
We tried several things but without success:
- Disabling CBT
- Upgrading the VMs from version 8 to 9
When the problem occurs, usually a simple dynamic migration of the server to another host will bring it back to normal.
We have opened a ticket with Microsoft, but for the moment we don't have a solution yet.
Have you been able to solve the problem or set up a workaround strategy?
Thanks
Eluich
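A quick way to correlate those Event ID 9 warnings with the morning logon window is to bucket them per hour (a sketch; the 24-hour window is arbitrary):
# Count Hyper-V StorageVSP I/O delay warnings (Event ID 9) per hour over the last day
Get-WinEvent -FilterHashtable @{
    LogName   = 'Microsoft-Windows-Hyper-V-StorageVSP-Admin'
    Id        = 9
    StartTime = (Get-Date).AddDays(-1)
} -ErrorAction SilentlyContinue |
    Group-Object { $_.TimeCreated.ToString('yyyy-MM-dd HH:00') } |
    Sort-Object Name |
    Select-Object Name, Count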
-
- Enthusiast
- Posts: 76
- Liked: 16 times
- Joined: Oct 27, 2017 5:42 pm
- Full Name: Nick
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Hello Eluich,
We have not made any progress on this issue and “Microsoft Support” won’t even respond to the ticket anymore!
I’m currently in the process of putting together another Server 2019 Hyper-V box for the specific purpose of troubleshooting this issue in my shop. I will of course post up anything of value that I find.
Thanks for the feedback. The more info the better...
Nick
-
- Enthusiast
- Posts: 36
- Liked: 4 times
- Joined: Jun 14, 2016 9:36 am
- Full Name: Pter Pumpkin
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
We are still having this issue. We have a case open with Microsoft that is getting nowhere fast.
-
- Enthusiast
- Posts: 36
- Liked: 4 times
- Joined: Jun 14, 2016 9:36 am
- Full Name: Pter Pumpkin
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Small update.
We have figured out that the Veeam backup of the server "triggers" the issue. For the example below, the VM backup kicked off at 9:58pm. We then saw an immediate jump in read latency and read queue length within the guest VM (expected when a backup kicks off). However, when the backup finished at 11:54pm, the latency and queue did not drop back down to near-zero or "recover" to where they were before the backup. Note - this is a user profile VM and users typically only log in between 7:00am and 5:00pm, so the server would be close to idle at 11:54pm.
And below is what the server looks like in the middle of the day, with hundreds of users logged in. This was also a few hours after we did a live migration to resolve the issue. You can see the disk I/O (read throughput) is higher, but the read latency and queue length are lower, which would indicate that the disk I/O itself is not the issue.
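For anyone who wants to reproduce this kind of measurement inside the guest, a minimal sketch with Get-Counter (counter set, interval, and sample count are just examples):
# Sample read latency and read queue length inside the guest every 15 seconds
$counters = '\LogicalDisk(*)\Avg. Disk sec/Read',
            '\LogicalDisk(*)\Avg. Disk Read Queue Length'
Get-Counter -Counter $counters -SampleInterval 15 -MaxSamples 4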
-
- Enthusiast
- Posts: 36
- Liked: 4 times
- Joined: Jun 14, 2016 9:36 am
- Full Name: Pter Pumpkin
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
I can't seem to edit my last post, so sorry for the spam.
I have just been reading about Veeam and CBT here. Perhaps the in-memory bitmap is the issue, which would explain why the issue gets resolved when a Live Migration is done?
-
- Veteran
- Posts: 3077
- Liked: 455 times
- Joined: Aug 07, 2018 3:11 pm
- Full Name: Fedor Maslov
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Hi Pter,
May I ask you to share all this information with our support engineers via a support case? Or maybe you already have an open case on this topic? If so, please share the number.
Thanks in advance!
-
- Enthusiast
- Posts: 36
- Liked: 4 times
- Joined: Jun 14, 2016 9:36 am
- Full Name: Pter Pumpkin
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Hi,
I opened a case with Veeam but they said it was a known MS issue and closed it. Case number was #04219162.
Pter
-
- Enthusiast
- Posts: 47
- Liked: 10 times
- Joined: Aug 26, 2019 7:04 am
- Full Name: Christine Boersen
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Sorry, I just saw this. I had upgraded the VM version as soon as I upgraded the server OS, so I can't comment on the old VM / new OS combination.
Nick-SAC wrote: ↑Jul 06, 2020 1:11 pm Thanks Christine but I wasn’t questioning the Veeam VBR version (although that’s potentially helpful to know too).
I’d like to know what VM Machine/Configuration version you’re on (as seen in the Hyper-V Manager Console). Did you upgrade that from 8 to 9 when you did the Server 2016-2019 upgrade and if so, did you do so before or after you saw the I/O Delays?
Thanks again,
Nick
More updates on the RCT issue:
Just to add to the fun: I've now upgraded the same servers from standalone (SCVMM-managed) hosts with local SSD drives to an S2D cluster, all NVMe (3-way mirror, 4 x 6.4TB AIC NVMe per server). I'm running sub-1 ms latency under moderate load now.
*YET* if I re-enable RCT in Veeam, the delay issue comes back: storage latency (as seen via cluster performance) is still under 1ms during this time, yet occasionally operations in a Hyper-V VM hit the 15-45 second delay issue again.
Turning off RCT solves it within a couple of minutes.
So something is definitely broken in the interaction between Veeam's use of RCT and ReFS. And in my case, I've proved it has nothing to do with:
- the drives
- the RAID /HBA controllers
- BusType (SAS vs SATA vs NVMe on PCIe)
- Whether or not it was standalone or clustered.
Hope all of that empirical testing helps in the debugging of the issue.
Christine
-
- Enthusiast
- Posts: 76
- Liked: 16 times
- Joined: Oct 27, 2017 5:42 pm
- Full Name: Nick
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Thanks Christine, great info there!
Are you using the term RCT interchangeably with CBT and if not, where/how are you turning off RCT?
I ask because I only find the one option in:
Backup Job > Storage > Hyper-V : Use Changed Block Tracking (CBT)
So, are you perhaps finding & disabling RCT somewhere else?
FWIW (as I noted in a prior post in this thread) I was told by a Veeam Tech that (emphasis mine):
--------------------------------------------------------->snip<-----------------------------------------------
... RCT basically takes over for Veeam's proprietary CBT mechanism that was developed pre-HV2016.
RCT now combines with VM reference points to determine the changes in the VM in what Microsoft calls the "most efficient manner". Starting in the VM hardware version 8 and above, RCT is enabled by default and as far as I understand it, it cannot be disabled. You can certainly disable Veeam CBT using the GUI but this would only affect the original Veeam CBT file system driver for pre-HV2016 hosts/cluster.
There's some additional information in the user guide about it:
https://helpcenter.veeam.com/archive/ba ... g.html#rct
It's possible that it could be causing the issue but I do not think there is a way to disable it aside from down leveling either the Hosts/Cluster or VM HW versions, which don't sound like good solutions. You may be able to set up a new VM with HW7 or below then try testing with that but it writes a lot easier than in practice I'm sure.
--------------------------------------------------------->snip<-----------------------------------------------
Thanks again!
Nick
-
- Enthusiast
- Posts: 47
- Liked: 10 times
- Joined: Aug 26, 2019 7:04 am
- Full Name: Christine Boersen
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Pardon, it is still labeled CBT in the interface. Yes, I'm using the terms interchangeably and probably shouldn't.
That is what I am referring to as disabling RCT.
When the setting is disabled, it moves the entire VM's data (very obvious when watching the backup progress, time, etc., even on a 40Gbps network), so I am assuming the interface just doesn't have updated text for the entry, and that for 2016+ it is really disabling RCT instead of Veeam's CBT driver.
Either way, it worked great in 2016 LTSC with all patches, and has been horrible in 2019 LTSC (all patches)
If I'm incorrect about it just being a naming issue in the GUI, please Veeam Tech Support correct me on this so we all learn from it.
Thanks,
Christine
-
- Enthusiast
- Posts: 47
- Liked: 10 times
- Joined: Aug 26, 2019 7:04 am
- Full Name: Christine Boersen
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
(furthermore)
I suspect they still call it CBT, since that is "what is occurring", i.e. it is a form of changed block tracking,
whereas the method used to do it (RCT vs. Veeam's CBT driver, pre-2016) is hidden behind the interface.
-
- Veteran
- Posts: 643
- Liked: 312 times
- Joined: Aug 04, 2019 2:57 pm
- Full Name: Harvey
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Interesting topic and I really appreciate people posting their findings.
@Christine, I particularly appreciate your update there, but I suppose I have a simple question about the conclusion though, as I think it's more accurate to say that there is a relationship between RCT and the latency spikes.
I think a Veeam -> RCT connection would be more strongly supported if 2016 was also affected, but from my read, I would understand the result of the test as just being that the root cause of this comes down to RCT.
I suppose a simple test, if you're eager enough, would be to spin up a temporary server with a trial or something of any other solution that uses RCT as well and just see what happens with the cluster during backups. I realize you might not have space for it, but I do think that it would be pretty conclusive then and Microsoft would really have to answer about what's going on.
Just my $.02 though, as I really have trouble understanding how a backup application can cause such a situation -- if it is related to some invalid RCT state, I really need to wonder why RCT doesn't prevent such a set up in the first place.
-
- Enthusiast
- Posts: 47
- Liked: 10 times
- Joined: Aug 26, 2019 7:04 am
- Full Name: Christine Boersen
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
soncscy - Good point.
(And to be clear, all my criticism is about ReFS, as I believe that is the real culprit, even if Veeam may or may not make the problem occur more frequently.)
Unfortunately, ReFS continued to give intermittent slow I/O (I set a monitor on events from the Microsoft-Windows-Hyper-V-StorageVSP-Admin source to watch for them).
It was crazy enough that even after moving data off a volume, hours later it would intermittently throw them for the moved data. And I had done a volume flush after the moves, so that is a bit scary: slow I/O against a file that doesn't exist on the volume anymore, hours after the cache was flushed successfully.
I've moved all my data off to CFS_NTFS for now on the cluster, as I can't beta test ReFS at this point. It has cost me too much time, and I don't have faith in it, in this environment.
I'm still using it on my Backup server (where it performs GREAT!!!). It seems to be something about Hyper-V + ReFS + RCT...
I'm just out of time or patience to debug it, or risk my data on it until there is a known solution.
Have not had a single event (and have even re-enabled the old CBT setting) since migrating to CFS_NTFS.
I will keep the post updated in case there is some other underlying issue, but I suspect the problem is solved for now.
Given how well ReFS had performed in 2016, I was rather surprised and disappointed to see performance degrade this drastically with 2019.
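The event monitor mentioned above can be as simple as a polling loop (a sketch; the interval and plain-text output are arbitrary):
# Poll the StorageVSP Admin log every 5 minutes and print any new slow-I/O events
$since = Get-Date
while ($true) {
    Get-WinEvent -FilterHashtable @{
        LogName   = 'Microsoft-Windows-Hyper-V-StorageVSP-Admin'
        StartTime = $since
    } -ErrorAction SilentlyContinue |
        ForEach-Object { '{0}  {1}' -f $_.TimeCreated, $_.Message }
    $since = Get-Date
    Start-Sleep -Seconds 300
}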
-
- Enthusiast
- Posts: 76
- Liked: 16 times
- Joined: Oct 27, 2017 5:42 pm
- Full Name: Nick
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
I think we may be dealing with different underlying problems that just happen to have the same symptom...
All of my I/O Delays have been on the NTFS Volumes where the VM’s VHD’s reside.
FWIW, we are using ReFS on the VBR Backup Storage Repos (both an iSCSI NAS and USB external HDDs) and have never experienced an I/O Delay issue there – and at one point I even spun up a test VM right on the iSCSI NAS Volume and still didn't encounter the problem there (although the problem is so maddeningly intermittent (5 times in 1 day and then not again for weeks...) that it's possible I just didn't run the test long enough).
-
- Enthusiast
- Posts: 47
- Liked: 10 times
- Joined: Aug 26, 2019 7:04 am
- Full Name: Christine Boersen
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Nick-SAC :
Interesting... I had not heard of anyone having that issue (the delays on NTFS or CFS_NTFS). Which is your case, clustered or standalone?
My backup repositories are still ReFS and have no issues... it seems to be an interaction of Hyper-V + ReFS in my case (ReFS and CFS_ReFS, as it happened on both, with different drives and controllers, and now on NVMe).
Now that I've made the change, I'm seeing 3+ GB/sec throughput and under 1ms latency, with no slips (I/O delays).
Will continue to update the post as I gain more empirical evidence and "age" on the volume.
-
- Enthusiast
- Posts: 47
- Liked: 10 times
- Joined: Aug 26, 2019 7:04 am
- Full Name: Christine Boersen
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
And just one more data point,
All of our equipment is on the S2D 'Certified Plus/Pro' list, or whatever it is called, for supporting all the features, just to eliminate that question.
-
- Enthusiast
- Posts: 76
- Liked: 16 times
- Joined: Oct 27, 2017 5:42 pm
- Full Name: Nick
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Christine,
No Cluster on mine... It’s an All-In-One box (see the first post in this thread for the details).
I didn't previously mention that we're seeing it on NTFS because until I saw your last post I didn't realize anyone was seeing it as a ReFS-specific issue.
Thanks,
Nick
-
- Enthusiast
- Posts: 47
- Liked: 10 times
- Joined: Aug 26, 2019 7:04 am
- Full Name: Christine Boersen
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Thanks Nick,
Yes, it seems (from my current testing), at least in my environment, to be 100% ReFS related.
I was seeing it before we went to a cluster, and we replaced our entire storage subsystem over it (for many reasons, this being one) with NVMe drives with sub-ms guaranteed latency. Each of the three servers has FOUR 6.4TB NVMe AIC SSDs for physical storage, which eliminated our storage controller and the previous SSD drives from the equation, if they were the problem.
Each NVMe AIC can sustain over 2GB/s writes (or more) natively.
So, with our workload, there is absolutely no reason for I/O latency.
All the networking is showing no loss on the lossless RoCEv2 segments: dual 40Gbps links to redundant switches, with 6 x 40Gbps stacking between the switches, and 3 x 10Gbps links (per switch) to the three stacked 1Gbps switches.
So we are kinda overkill everywhere
-
- Veteran
- Posts: 3077
- Liked: 455 times
- Joined: Aug 07, 2018 3:11 pm
- Full Name: Fedor Maslov
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Hi guys,
Have any of you tried reaching Microsoft with this issue? We had a ticket open with them on a very similar issue that was related to a bug in RCT, which they acknowledged.
Thanks
-
- Novice
- Posts: 3
- Liked: 1 time
- Joined: Jun 20, 2020 6:41 pm
- Full Name: Giovani Moda
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Just to chime in here: I'm following this thread because I have a very similar issue, but not while using Veeam. This particular customer is using Arcserve UDP, which leverages RCT as well, and we are seeing constant Hyper-V-StorageVSP warnings, mostly for read operations, to the point where one of our VMs, a file server, is rebooted by the cluster service because it becomes unresponsive.
soncscy wrote: ↑Aug 09, 2020 8:57 am Interesting topic and I really appreciate people posting their findings.
@Christine, I particularly appreciate your update there, but I suppose I have a simple question about the conclusion though, as I think it's more accurate to say that there is a relationship between RCT and the latency spikes.
I think a Veeam -> RCT connection would be more strongly supported if 2016 was also affected, but from my read, I would understand the result of the test as just being that the root cause of this comes down to RCT.
I suppose a simple test, if you're eager enough, would be to spin up a temporary server with a trial or something of any other solution that uses RCT as well and just see what happens with the cluster during backups. I realize you might not have space for it, but I do think that it would be pretty conclusive then and Microsoft would really have to answer about what's going on.
Just my $.02 though, as I really have trouble understanding how a backup application can cause such a situation -- if it is related to some invalid RCT state, I really need to wonder why RCT doesn't prevent such a set up in the first place.
This setup is:
2 R640 with Windows Server 2019 Datacenter
Hyper-V Failover Cluster
CSV with NTFS on iSCSI
SCV3020 storage with SAS 10k disks
10Gb connection to the storage
We have two sites with identical setups and we are seeing the issue on both of them.
We've been looking at this for a while now; we see spikes in read operations from the hosts to the storage when the issue manifests itself. It has gotten to the point where Dell SC Support suggested adding faster disks to the array because we are maxing out IOPS on the mechanical 10k disks. But the thing is, I don't have enough demand on these sites to explain such an IOPS spike, and this thread got me thinking that RCT might be causing this. It does not happen during the backup window; it's always random, and we are seeing it at least once every 15 days.
Given what I'm seeing, I don't think it has to do exclusively with ReFS or Veeam; it seems to be a bug in how RCT handles read operations inside the virtual disk.
That's my 2 cents at least. Just thought I'd tell you guys what I'm seeing because, although the issue is exactly the same, we are using another backup solution.
Regards,
Giovani
-
- Lurker
- Posts: 1
- Liked: never
- Joined: Aug 10, 2020 3:29 pm
- Full Name: DG
- Contact:
Re: Windows Server 2019 Hyper-V VM I/O Performance Problem
Hey all,
Jumping in here because we're seeing the same issues, despite not being a Veeam shop, and in this case the issue is independent of backups since we're not (yet) taking any backups of this newly provisioned environment.
The setup is as follows:
2x Dell R640 - Windows Server 2019 Standard (17763.339) - StarWind VSAN 2-Node Cluster
2x Intel Xeon Silver 4208 / 256GB RAM (per node)
Hyper-V Failover Cluster in Primary Site
CSV / NTFS / StarWind iSCSI
RAID10 of 6x Intel DC S4600 960GB SSDs for CSVFS Storage (per node)
1x HP DL380 G10 - Windows Server 2019 Standard (17763.339)
1x Xeon Silver 4210 / 192GB RAM
Standalone Server in DR Site
NTFS / Separate volumes for OS/Hyper-V
RAID5 of 6x HPE Mixed-Use 480GB SSDs for Hyper-V Storage
We are seeing Event ID 8 & 9 in the Hyper-V StorageVSP Admin logs for I/O operations inside VHDX files only.
Disk performance on all 3 hosts' underlying storage is stellar as shown by DiskSpd & FIO benchmarks.
The delayed reads and writes manifest themselves in very odd DiskSpd & FIO benchmark results, generally with inconsistently poor read performance (20-3000+MBps) and consistently poor write performance (0-20MBps). Random 4K & 64K writes seem to be the most adversely affected, while sequential 4K & 64K writes are better but still sub-optimal by about an order of magnitude.
This occurs inside running Version 5 & 9 VMs as well as empty VHDX files mounted on the hypervisor, and also occurs whether the VHDX lives on CSVFS or directly on the host NTFS underlying storage.
We have completely removed A/V including Windows Defender with no change in behavior. As mentioned above, no backups are being taken because this environment is greenfield and being built as I type this, so there are no VSS filters or backup agents on any of the hosts or VMs.
(Sample errors for search metadata)
Occurring on version 5 & 10 VMs:
An I/O request for device 'C:\ClusterStorage\csv1\VM1\Virtual Hard Disks\VM1.vhdx' took 18859 milliseconds to complete. Operation code = READ16, Data transfer length = 45056, Status = SRB_STATUS_SUCCESS.
An I/O request for device 'C:\ClusterStorage\CSV2\VM2\VM2.vhdx' took 11729 milliseconds to complete. Operation code = WRITE16, Data transfer length = 4096, Status = SRB_STATUS_SUCCESS.
Occurring on version 5 VMs only:
Failed to map guest I/O buffer for write access with status 0xC0000044. Device name = C:\ClusterStorage\csv1\VM1\Virtual Hard Disks\VM1.vhdx
If anyone would like to see more data, please let me know. Otherwise, this thread seems to be the most (and really, only) relevant one out there so far. I will be taking a backup of the version 5 VM and upgrading it to version 10 to see if there are any changes in behavior on that and/or other VMs.
Thanks!
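For anyone comparing numbers, a DiskSpd run of the kind described above looks roughly like this (a sketch, not the exact parameters used here; it assumes diskspd.exe is in the current folder, and the test file path and size are placeholders):
# 60-second run: 64K random I/O, 30% writes, 8 outstanding I/Os x 4 threads,
# software/hardware caching disabled (-Sh), latency statistics captured (-L)
.\diskspd.exe -c10G -b64K -d60 -o8 -t4 -r -w30 -Sh -L D:\diskspd-test.dat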