Host-based backup of Microsoft Hyper-V VMs.
DarrenD
Service Provider
Posts: 14
Liked: 3 times
Joined: Feb 19, 2015 5:10 pm
Full Name: Darren Durbin
Location: Hampshire, UK
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by DarrenD »

At the risk of dragging the thread off topic: we're defragging the actual VHDX file held on the CSV, not the contents of the VHDX file itself, so the defrag would be invisible to the Exchange VM. We just put the disk into redirected access mode, then on the owning node ran defrag.exe <path to CSV junction> /u /a /v. And waited.
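If you want to script it, here is a minimal sketch of that sequence, assuming the FailoverClusters module; the CSV resource name and junction path are examples, so adjust both to your environment.

Code: Select all

# Put the CSV into redirected access mode on the owning node, run
# defrag against the CSV junction, then restore direct I/O.
# Resource name and path are examples; FailoverClusters module assumed.
Import-Module FailoverClusters

$csv  = "Cluster Disk 1"               # example CSV resource name
$path = "C:\ClusterStorage\Volume1"    # CSV junction on the owning node

Suspend-ClusterResource -Name $csv -RedirectedAccess -Force

# Same switches as in the post above: /u progress, /a analysis, /v verbose
defrag.exe $path /u /a /v

Resume-ClusterResource -Name $csv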
eengland09
Influencer
Posts: 17
Liked: 1 time
Joined: Oct 07, 2021 5:38 pm
Full Name: Eric England
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by eengland09 »

I'd say we are still on topic :) Thanks for the info. I will discuss that internally. It's hard to say if we will defrag those or not.
dasfliege
Service Provider
Posts: 238
Liked: 53 times
Joined: Nov 17, 2014 1:48 pm
Full Name: Florin
Location: Switzerland
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by dasfliege »

Is there any news? I guess there is still no fix available from Microsoft?
JKSAAHS
Lurker
Posts: 1
Liked: 1 time
Joined: Oct 11, 2021 7:03 am
Full Name: Jesper Kaasen Skovdal
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by JKSAAHS » 1 person likes this post

Just following.
GabesVirtualWorld
Expert
Posts: 244
Liked: 38 times
Joined: Jun 15, 2009 10:49 am
Full Name: Gabrie van Zanten
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by GabesVirtualWorld »

johan.h wrote: Sep 30, 2021 8:06 pm Just a short update. I've been given a new private fix for RCT to test, targeting Windows Server 2019 RS5. I'll let you know as soon as I'm done testing. Feel free to DM me if you want to try it as well.
Hi Johan,
Any updates you can share?

We're having the exact same issue. Since our VMs moved to 2019 hosts, there has been a performance issue. After a live migration the issue is almost completely gone, but it comes back. We just can't figure out what triggers it.

Regards
Gabrie
dasfliege
Service Provider
Posts: 238
Liked: 53 times
Joined: Nov 17, 2014 1:48 pm
Full Name: Florin
Location: Switzerland
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by dasfliege »

@GabesVirtualWorld
You got your answer here. The RCT issue is constantly slowing down all the VMs. I had to live-migrate all our VMs yesterday because the backup window stretched from a few hours to almost 24 hours.
amir-geu
Novice
Posts: 4
Liked: never
Joined: Oct 17, 2019 8:46 pm
Full Name: amir nazary
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by amir-geu »

DarrenD wrote: Oct 21, 2021 8:24 pm At the risk of dragging the thread off topic: we're defragging the actual VHDX file held on the CSV, not the contents of the VHDX file itself, so the defrag would be invisible to the Exchange VM. We just put the disk into redirected access mode, then on the owning node ran defrag.exe <path to CSV junction> /u /a /v. And waited.
Doing a storage migration is an alternative if you don't want to put your CSV in redirected mode. It also defragments (by virtue of creating a fresh new disk) and makes the problem go away.
But for how long?
amir-geu
Novice
Posts: 4
Liked: never
Joined: Oct 17, 2019 8:46 pm
Full Name: amir nazary
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by amir-geu »

The storage migration didn't work for long; the fix was short-lived and the problem came back the next day.
skf
Service Provider
Posts: 2
Liked: never
Joined: Nov 22, 2017 2:21 pm
Full Name: Sven Kjartan Figved
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by skf »

Starting to think the upgrade path will be 2016 -> 2022 at some point.

Does anyone know if this RCT bug is still present in 2022?
johan.h
Veeam Software
Posts: 712
Liked: 182 times
Joined: Jun 05, 2013 9:45 am
Full Name: Johan Huttenga
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by johan.h »

Yes. It is still present in Server 2022 as far as I'm aware.
Henk de Langen
Lurker
Posts: 2
Liked: 1 time
Joined: May 16, 2018 6:01 am
Full Name: Henk de Langen
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Henk de Langen »

Hi Johan,

We migrated all file servers to Windows 2019 Hyper-V hosts with Windows 2019 VMs, and now we also have the problem of file access randomly freezing.
I just turned off CBT on the host; will that solve the problem?
Any update from Microsoft on this point?

Regards, Henk
GabesVirtualWorld
Expert
Posts: 244
Liked: 38 times
Joined: Jun 15, 2009 10:49 am
Full Name: Gabrie van Zanten
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by GabesVirtualWorld »

To be sure, disable it on the host and in the backup job, and remove the files from the VM. For this the VM needs to be powered off.

My procedure is (see the sketch below):
- shut down the VM
- remove the CBT files
- migrate the VM to a different host
- power on the VM
- live migrate the VM to a different host
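A minimal PowerShell sketch of that procedure for non-clustered hosts, assuming the Hyper-V module; the VM and host names are examples, and the .rct/.mrt files are assumed to sit next to each VHDX.

Code: Select all

# Reset RCT state for one VM: stop it, delete the change tracking files
# next to each virtual disk, move it to another host, start it, and
# live migrate it back. Names are examples; Hyper-V module assumed.
$vmName    = "FS01"
$otherHost = "HV02"

Stop-VM -Name $vmName

# Remove the .rct/.mrt files that live beside each VHDX
foreach ($disk in Get-VMHardDiskDrive -VMName $vmName) {
    Remove-Item -Path "$($disk.Path).rct", "$($disk.Path).mrt" -ErrorAction SilentlyContinue
}

# Move the stopped VM to another host (add -IncludeStorage and
# -DestinationStoragePath if the disks are on local storage)
Move-VM -Name $vmName -DestinationHost $otherHost
Start-VM -Name $vmName -ComputerName $otherHost
Move-VM -ComputerName $otherHost -Name $vmName -DestinationHost $env:COMPUTERNAME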
GabesVirtualWorld
Expert
Posts: 244
Liked: 38 times
Joined: Jun 15, 2009 10:49 am
Full Name: Gabrie van Zanten
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by GabesVirtualWorld »

Like many of you, we have a long-running ticket with Microsoft Support, and today we received the following answer:



Hi Gabrie

There is no specific code defect identified in Windows 2019 that is specifically triggered and causing the issues you are noticing. Our "bug" database records possible issues reported by multiple customers, and we work on identifying code defects or optimizations. There are some optimization mechanisms used in Windows 2019 for memory management that might not work as expected in certain conditions compared with Windows 2016.

To better understand the issue, I'll quote from the Veeam source:
https://helpcenter.veeam.com/docs/backu ... ml?ver=110
<snipped the VEEAM article>

The performance issue appears when the chain of AVHDX files grows and there are multiple RCT and MRT files; the requestor should trigger the merge of the AVHDX files after the backup completes. Sometimes intensive VM activity prevents the merge of the AVHDX files from completing in time.
Stopping the IO to the disks would allow the merge to complete in time.

Resilient Change Tracking: .RCT files are a new addition to Windows Server 2016 Hyper-V that allow the tracking of changes between backup operations. Instead of having to back up the entire VHD(X) file or traverse the whole file, the .RCT file tracks changes and directs backup software to only the blocks that have changed. This provides much quicker backups than previous versions of Hyper-V and puts the technology on par with VMware's Changed Block Tracking (CBT).

Modifiable Region Table: Like the .RCT file, the .MRT file aids in tracking changes between backup operations, but its function is to provide resiliency in the event of a host crash, BSOD, or even a power failure. It makes sure data is not missed should something catastrophic happen to one of your hosts during a backup procedure. When a snapshot is created there is an AVHD file, often used for checkpoints and backup tasks. .AVHD or .AVHDX files are differencing disks where all newly written data is stored after a checkpoint is created, either by the administrator or as a result of a backup procedure.

The bug that Emanuel mentioned describes possible performance issues when there are multiple RCT files created for the VM, as the backup requestor creates them every time. The bug also investigates possible unbuffered write performance regressions on Server 2019 compared with 2016. This would only happen if you copy a large amount of data.

The MRT file is issuing buffered write-through IOs. This causes the file system to post the IO to a worker thread, but it will use at most 2 workers per volume. IOs that need to wait for MRT/RCT writes end up getting posted to the environment passive completion worker thread. This means we only have one thread issuing IOs on the host. The lazy writer flush thread has high CPU time in the buffered read/write IO paths, and when we flush the file we issue synchronous writes for each dirty region (cache manager behavior).

The bug is meant to identify how we can optimize the use of memory-mapped IO to eliminate the read/write IO paths, and how to update the file flushing to flush using multiple worker threads (lazy writer thread + up to 7 workers using the environment passive completion workers). This is quite challenging to achieve as it might cause other host-related performance issues.

Conclusion:
We are still investigating why the requestor and host integration services used by the backup application, when it is using its own CBT in relation to the RCT/MRT files, are not doing the cleanup properly. The private fix that will be released by Microsoft will address the memory management related to the lazy writer, which will improve writes to disk.

Source: Veeam
microsoft-hyper-v-f25/large-amounts-of- ... 59426.html
https://helpcenter.veeam.com/docs/backu ... ml?ver=110

Workaround:
Temporarily disable the vendor's CBT as previously suggested, or apply the workaround to remove the RCT/MRT files, merge the AVHDX files manually, and live migrate the VM to a new host.
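For the "merge the AVHDX files manually" step, a minimal sketch assuming the Hyper-V PowerShell module, the VM powered off, and a single lingering differencing disk; the paths and VM name are examples, and the VM has to be re-pointed at the merged parent afterwards.

Code: Select all

# Merge a leftover differencing disk back into its parent, then
# re-attach the parent to the VM. Paths and names are examples.
$child  = "D:\VMs\FS01\Virtual Hard Disks\FS01_ABC123.avhdx"
$parent = "D:\VMs\FS01\Virtual Hard Disks\FS01.vhdx"

# Merges the child's writes into the specified parent in the chain
Merge-VHD -Path $child -DestinationPath $parent

# Point the VM's disk back at the merged parent (controller numbers are examples)
Set-VMHardDiskDrive -VMName "FS01" -ControllerType SCSI `
    -ControllerNumber 0 -ControllerLocation 0 -Path $parent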
Gostev
Chief Product Officer
Posts: 31513
Liked: 6691 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Gostev » 2 people like this post

If only there actually was a "vendor's CBT" to disable in the case of Microsoft Hyper-V 2016 onwards. Veeam provided a "vendor's CBT" only for earlier Hyper-V versions, which did not have native CBT; starting from Hyper-V 2016, we use native Hyper-V CBT, aka RCT, exclusively. They even quoted the "Veeam source" which specifically explains this:
Resilient Change Tracking
For VMs running on Microsoft Hyper-V Server 2016 or later, Veeam Backup & Replication uses Resilient Change Tracking, or RCT. RCT is a native Microsoft Hyper-V mechanism for changed block tracking in virtual hard disks of VMs.
@johan.h something to fix when you touch base with Microsoft on this issue next time. Their support should stop confusing customers by talking about "vendor's CBT" and "own CBT" on Hyper-V 2016/2019/2022 support cases with Veeam customers. It simply does not exist for these Hyper-V versions, as per the very source they are referring to.
johan.h
Veeam Software
Posts: 712
Liked: 182 times
Joined: Jun 05, 2013 9:45 am
Full Name: Johan Huttenga
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by johan.h »

@gostev noted. Will reach out to the Microsoft Support team involved.
GabesVirtualWorld
Expert
Posts: 244
Liked: 38 times
Joined: Jun 15, 2009 10:49 am
Full Name: Gabrie van Zanten
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by GabesVirtualWorld » 1 person likes this post

Gostev wrote: Feb 03, 2022 1:07 pm If only there actually was a "vendor's CBT" to disable in the case of Microsoft Hyper-V 2016 onwards. Veeam provided a "vendor's CBT" only for earlier Hyper-V versions, which did not have native CBT; starting from Hyper-V 2016, we use native Hyper-V CBT, aka RCT, exclusively. They even quoted the "Veeam source" which specifically explains this:

@johan.h something to fix when you touch base with Microsoft on this issue next time. Their support should stop confusing customers by talking about "vendor's CBT" and "own CBT" on Hyper-V 2016/2019/2022 support cases with Veeam customers. It simply does not exist for these Hyper-V versions, as per the very source they are referring to.
Actually, I have been explaining that to MS Support as well, but they don't seem to get it :-)
Also, I never got a real answer on how to disable CBT, as the only answer was to disable it on the job. From my VMware background I knew this wasn't enough, as VMware ESXi would still keep CBT running. Luckily @johan.h just explained to me in a call that there is no Hyper-V option to disable CBT. The only way is to disable it in the Veeam job and manually remove the RCT and MRT files from the powered-off VM. I usually also migrate the VM to a different host, then power it on and live migrate it back again to make sure the memory bitmap is gone too.

Thanks to Veeam for always being willing to explain what is happening deep down in the stack!
eengland09
Influencer
Posts: 17
Liked: 1 time
Joined: Oct 07, 2021 5:38 pm
Full Name: Eric England
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by eengland09 »

Man, we are just getting HAMMERED with these I/O warnings, still to this day, and it is causing our Exchange DAG to almost crash. I am a bit lost on what to do. Disabling CBT/RCT and the whole migration process seems (I'm sorry) a bit nuts. How can one have a decent backup environment and a functioning Exchange environment if the backups are always running in the background? Is there any hope in this mess of I/O warnings that never seem to stop?

Shutting down Exchange VMs during the day would be back-breaking, since we have to have both up because ADFS authentication has to happen (switching ADFS to the replication server is not enticing). Granted, I understand that Exchange is heavily demanding, but our newer hardware is more than capable of handling the I/O. Is everyone in agreement that CBT/RCT is indeed the source of this warning?
alexpt
Lurker
Posts: 1
Liked: never
Joined: Feb 14, 2022 2:57 pm
Full Name: Paulo Pinho
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by alexpt »

Has anyone tried emptying the working sets using the RAMMap utility?
eengland09
Influencer
Posts: 17
Liked: 1 time
Joined: Oct 07, 2021 5:38 pm
Full Name: Eric England
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by eengland09 »

Alexpt,

I have not tried that. Our latest findings so far for this I/O warning: we found (I can't believe it has taken us months to discover this) that when an Exchange backup is running, our Exchange performance is excellent. We see NO I/O warnings of any kind on the nodes during the backup! As soon as the backups complete, we are hit with the warnings once again. We only noticed this because we had a backup running quite late into the next business day, which corresponded with empty event logs where we would previously have seen warnings continuously every day.

Our current status: we took our Exchange VMs down last night so we could remove the .rct and .mrt files from all disks connected to Exchange. We then quick migrated them to different nodes, booted up the VMs, and live migrated them to different nodes again in an attempt to fully release the CBT mechanism, which we believe to be the culprit behind these warnings, since while the backup is running no additional changes are being sent back and forth (as best as we can understand it; please correct me if I'm wrong).

However, as of our maintenance last night we had only disabled CBT at the JOB level in Veeam, and NOT at the node level. I didn't want to do that if we didn't have to, as we would LIKE to use CBT for less read/write-intensive VMs. It looks like our next step is to disable it from the Backup Infrastructure tab in Veeam and see if that ultimately helps. We are hoping that is the light at the end of the tunnel, because after that we are running extremely thin on options. I've seen maybe one or two others in this thread have success with disabling at the job and node level, but few have the on-premises Exchange environment or the performance issues we are seeing with Exchange. Will update as soon as we proceed!
dbaddorf
Novice
Posts: 4
Liked: never
Joined: Jan 19, 2022 6:18 pm
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by dbaddorf »

Just to add my 2 cents (as if 8 pages of postings aren't enough): I've been fighting this bug (if that's what it is) for a year. Every time I have Veeam back up our SQL server, the disk latency on the SQL server increases within days of the backup. At first we had RCT enabled in the Veeam backup and found that we would need to live migrate the SQL VM to get the performance back (even after shutting down the server and removing the .mrt and .rct files). I had a post on the subject here: https://docs.microsoft.com/en-us/answer ... -spec.html.
But just recently I tried backing up this SQL VM without enabling the Veeam backup flag "Use changed block tracking data" (CBT). Even without RCT becoming enabled on this VM (no evidence of .mrt and .rct files being created), we *still* had high latency creep up days after the backup had completed. It wasn't until the VM was live migrated that the disk write latency came back down.
I can't explain this, but it seems that when Veeam does a backup, EVEN WITHOUT CBT ENABLED, it can cause high latency issues.
PetrM
Veeam Software
Posts: 3258
Liked: 525 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by PetrM »

Hello,

Maybe you could run a test without Veeam, for example creating and deleting checkpoints, to see how disk latency is affected? By the way, I'd suggest trying this on a test VM first, and not stressing production workloads with this type of testing.
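For example, a minimal sketch with the standard Hyper-V cmdlets; note that these create the default checkpoint type, which may not be exactly what backup software uses, and the VM name is an example.

Code: Select all

# Create and later remove a checkpoint on a *test* VM while watching disk
# latency (e.g. in perfmon). Uses standard Hyper-V cmdlets, which create
# the default checkpoint type - not necessarily what backup software uses.
$vmName = "TestVM"   # use a non-production VM

Checkpoint-VM -Name $vmName -SnapshotName "latency-test"

# ... run your I/O workload here and record latency ...
Start-Sleep -Seconds 300

# Removing the checkpoint triggers the AVHDX merge - the interesting part
Get-VMSnapshot -VMName $vmName -Name "latency-test" | Remove-VMSnapshot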

Thanks!
Gostev
Chief Product Officer
Posts: 31513
Liked: 6691 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Gostev »

I very vaguely remember that there might be two different types of checkpoints, and the one created by default may not be the one that backup vendors are supposed to use. In other words, it may not be as simple as doing the same test as with VMware. Just something to keep in mind.
eengland09
Influencer
Posts: 17
Liked: 1 time
Joined: Oct 07, 2021 5:38 pm
Full Name: Eric England
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by eengland09 »

dbaddorf wrote: Feb 17, 2022 7:27 pm Just to add my 2 cents (as if 8 pages of postings aren't enough): I've been fighting this bug (if that's what it is) for a year. Every time I have Veeam back up our SQL server, the disk latency on the SQL server increases within days of the backup. At first we had RCT enabled in the Veeam backup and found that we would need to live migrate the SQL VM to get the performance back (even after shutting down the server and removing the .mrt and .rct files). I had a post on the subject here: https://docs.microsoft.com/en-us/answer ... -spec.html.
But just recently I tried backing up this SQL VM without enabling the Veeam backup flag "Use changed block tracking data" (CBT). Even without RCT becoming enabled on this VM (no evidence of .mrt and .rct files being created), we *still* had high latency creep up days after the backup had completed. It wasn't until the VM was live migrated that the disk write latency came back down.
I can't explain this, but it seems that when Veeam does a backup, EVEN WITHOUT CBT ENABLED, it can cause high latency issues.
Very interesting. This does have me worried about other highly active VMs after we migrate to this cluster. Do you have CBT disabled at the VM and host level in Veeam? We have not tried the host level just yet. Backups have been running all day, and our Exchange performance has been flawless as far as we can tell. No I/O warnings. Running full backups may be our failsafe if the host-level CBT disabling doesn't work. Hope to reply back when we do that.
jwmb224
Lurker
Posts: 2
Liked: never
Joined: Feb 18, 2022 4:07 pm
Full Name: Jared Waldner
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by jwmb224 »

Man, you guys have really dug into this. In trying to improve the I/O of my Hyper-V disks, would anyone know how to increase the channel count from the VHDX file to the host storage drive? Microsoft mentions it in this article:
https://docs.microsoft.com/en-us/window ... erformance
but the mentioned registry keys are nowhere to be found on my system. There is a direct correlation between channel count and virtual processor count, but I don't have enough cores to get to the channel count I need. All I'm trying to achieve is I/O performance on the VHDX files that's the same as the host drive. Is that unreasonable with virtualization?
johan.h
Veeam Software
Posts: 712
Liked: 182 times
Joined: Jun 05, 2013 9:45 am
Full Name: Johan Huttenga
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by johan.h »

(That key shouldn't be relevant to this discussion. I realize it's a slightly different path, but I could find HKLM\System\CurrentControlSet\Enum\Root\VMBUS on my system. The article indicates there is a default, so it could be that there simply is nothing there.)

However, the problem we're describing doesn't have anything to do with the I/O throughput of your VHDX files; instead it has to do with how Hyper-V writes to the files used for changed block tracking, and how these writes are queued and threaded. This is what appears to be causing the performance issues affecting the VM the change tracking is for.

There are different types of checkpoints used in Hyper-V. To reproduce the RCT issues I normally either create a backup, or use New-VmBackupCheckpoint, part of https://www.powershellgallery.com/packa ... ckup/1.0.4
jwmb224
Lurker
Posts: 2
Liked: never
Joined: Feb 18, 2022 4:07 pm
Full Name: Jared Waldner
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by jwmb224 »

I searched the entire registry, and the key Microsoft mentions is nowhere to be found. I wonder why the registry is different for almost everybody. In any event, thank you for the clarification. I figured we're all in a similar boat when it comes to the I/O issues with Hyper-V.
dbaddorf
Novice
Posts: 4
Liked: never
Joined: Jan 19, 2022 6:18 pm
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by dbaddorf »

eengland09 wrote: Feb 17, 2022 9:23 pm Very interesting. This does have me worried about other highly active VMs after we migrate to this cluster. Do you have CBT disabled at the VM and host level in Veeam? We have not tried the host level just yet. Backups have been running all day, and our Exchange performance has been flawless as far as we can tell. No I/O warnings. Running full backups may be our failsafe if the host-level CBT disabling doesn't work. Hope to reply back when we do that.
I didn't disable CBT at the host level; I have other VMs on the host that I do want to use RCT/CBT with. I just created a separate backup job in Veeam, backing up this one SQL server, and deselected the use of CBT in that job. I don't know of a "proper" way to remove CBT/RCT from a VM in Windows (see https://docs.microsoft.com/en-us/answer ... -spec.html). (The "improper" way to remove CBT/RCT is to shut down the VM, remove the .rct and .mrt files, start up the VM, and live migrate it to another host.)
Brunok
Enthusiast
Posts: 36
Liked: 6 times
Joined: Sep 02, 2014 7:16 am
Full Name: Bruno
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by Brunok »

Hi,
Did you solve the problem by disabling RCT/CBT? We have a single Hyper-V host with the same errors. I found a thread about a similar error (also pointing to this article) with Event ID 8 on a cluster; in the end it was a problem with the network adapter. Maybe this helps others find this error: https://www.mcseboard.de/topic/219908-f ... io-buffer/. After searching for a month or more, we still don't have a solution.
eengland09
Influencer
Posts: 17
Liked: 1 time
Joined: Oct 07, 2021 5:38 pm
Full Name: Eric England
Contact:

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by eengland09 »

Update on our I/O warnings: last week we decided to do some reconfiguring of how and where our VMs are stored on our Dell SC series SAN. I am quite frustrated with Dell for recommending their "RECOMMENDED (ALL TIERS)" storage profile in the Storage Manager where you manage your volumes. With a mix of 7200 RPM drives and SSDs, the "Recommended" tier, THEY CLAIM, will sort read/write-intensive blocks onto the SSDs, and anything not used within 14 days is moved to the slower storage (in our case the 7200 RPM spinners). I don't believe that is happening, or my understanding of exactly how the tiered storage works is a bit off; the Dell storage "expert" I spoke with didn't even hint at what we were to do next.

What we did during our maintenance was create new Hyper-V machines and volumes in our SSD performance tier and migrate JUST the Exchange C: drives over to these new volumes. GUYS... THIS FIXED ALL OF OUR CRIPPLING EXCHANGE PERFORMANCE ISSUES. The VM is now able to talk to the DB and log volumes with little to no latency. We have not received a SINGLE I/O warning since making these changes. REMINDER: we received I/O warnings constantly! Seven months after we migrated from 2013 (a single Exchange VM) to a 2019 DAG (2 VMs), we are now performance tuned!!

I believe these warnings are directly storage related. If you are seeing the I/O warnings, our "workaround" was to run backups against the machines (in our case they ran constantly, with CBT turned off at only the VM level; we never turned it off at the host level), because scanning the entire disk somehow "freed up" performance for us. This might help some of you until you can get your VM storage sorted out. Go SSD for intensive SQL instances and especially Exchange. Don't look back. Hope this helps...
ChriFue
Service Provider
Posts: 10
Liked: 1 time
Joined: Dec 09, 2015 3:34 pm
Full Name: Chris

Re: Windows Server 2019 Hyper-V VM I/O Performance Problem

Post by ChriFue »

Hello,
I am also struggling with these issues.
The customer has a software-defined storage infrastructure.
We set up the two-node system (Hyper-V 2019, DataCore SDS VM on each node) and for a few weeks everything was OK.
All flash! No spinning rust. 10 Gbit LAN. iSCSI connections for the CSVs.

Daily Veeam backup job, on-host mode, no problems. Insane backup speeds.

Then suddenly we got Event 9 issues on the VHDX files of the VMs.
AND the issues also sent our local SDS VM to hell and froze it. Which is bad, because it serves the iSCSI targets for the Hyper-V hosts.

The failover cluster crashed and went into a loop of trying to get VMs up on another node, because the CSVs were gone.
I/O latency for the VMs rose to over one minute... after a few hours everything went back to normal, but the VMs needed a hard reset because they were unresponsive.

Interesting: these software-defined storage VMs sit on a separate RAID controller, on their own volume with their own SSDs...
But they also crash sometimes when Event 9 is happening on the CSVs and the VMs.
They also think they "lose" their local Hyper-V disk sometimes (per the event log). It always happens during the backup window.

And that is the point I don't understand.
Why is my local VM on my local RAID also struggling with I/O problems?
It is not on the CSVs, it is not in the cluster, it is just a little Windows VM hosted locally. And this VM is not backed up!

So, a problem in the Microsoft Hyper-V storage stack, perhaps?
Maybe it says: "Hey, something is wrong, I will slow down ALL I/Os on my hypervisor, no matter whether the VMs are on CSVs or local."

BUT: for investigation, we evacuated all VMs to a non-clustered system.
One VM after another; the Veeam replication job did its work perfectly.

Now we are on single server hardware, local RAID, no cluster, no CSVs. Just a "single Hyper-V host".
Again, daily Veeam backup in on-host mode.

And... we also got Event 9 errors during the backup windows with Veeam.
I/O requests taking 39969 ms and longer. Yes, that is 40 seconds...

I was surprised that the VMs survived this latency, maybe because of Hyper-V I/O caching and looong timeouts.

In the meanwhile we did a complete fresh setup of our software-defined storage cluster, with the server hardware vendor and the storage software vendor on the team. We also changed the RAID controllers on both nodes... who knows!
Again, for some days everything was perfect. After 5 days of runtime, Event 9 came back.
It did not crash our system again, because I activated Veeam storage I/O control, and the backup now processes one VM (VHDX) after another sequentially, to keep the impact on storage low.
But again, massive Event 9 entries on the Hyper-V host. The event logs of the VMs also say "hey, I think this I/O took too long, but the retry was successful". But the VMs survive.

And now I am back here, sitting on expensive hardware with expensive software, and crashing it when doing backups the way I want to (more than one VM simultaneously).

Thank you all for sharing your experiences with this problem; it helped me get focused.


Besides my story, here is my question:
As some of you wrote, is it true that a daily live migration to another host and back helps a lot?
Then I would try to get a script which does the job (see the sketch below).
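A minimal sketch of such a script for a failover cluster, assuming the FailoverClusters module; the node names are examples and it would run as a daily scheduled task.

Code: Select all

# Round-trip live migration: move every clustered VM to the partner node
# and back again. Node names are examples; FailoverClusters module assumed.
Import-Module FailoverClusters

$nodeA = "HV01"
$nodeB = "HV02"

foreach ($group in Get-ClusterGroup | Where-Object { $_.GroupType -eq "VirtualMachine" }) {
    $origin = $group.OwnerNode.Name
    $target = if ($origin -eq $nodeA) { $nodeB } else { $nodeA }

    # Live migrate away, then back to the original owner
    Move-ClusterVirtualMachineRole -Name $group.Name -Node $target -MigrationType Live
    Move-ClusterVirtualMachineRole -Name $group.Name -Node $origin -MigrationType Live
}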

Chris