ChriFue wrote: Apr 06, 2022 8:51 am
Hello,
I am also struggling with this issue.
The customer has a software-defined storage infrastructure.
We set up the 2-node system (Hyper-V 2019, a DataCore SDS VM on each node), and for a few weeks everything was OK.
All Flash! No spinning rust. 10Gbit LAN. iSCSI connections for CSVs.
Daily Veeam backup job, on-host mode, no problems. Insane backup speeds.
Then suddenly we got Event 9 issues on the VHDX files of the VMs.
AND the issues also sent our local SDS VM to hell and froze it. Which is bad, because it serves the iSCSI targets for the Hyper-V hosts.
The failover cluster crashed and went into a loop of trying to bring VMs up on another node, because the CSVs were gone.
I/O latency for the VMs rose to over one minute ... after a few hours everything went back to normal, but the VMs needed a hard reset because they were unresponsive.
Interesting: these software-defined storage VMs sit on a separate RAID controller on their own volume with their own SSDs ...
But they also sometimes crash when Event 9 is happening on the CSVs and the VMs.
They also sometimes think they "lose" their local Hyper-V disk (according to the event log). It always happens during the backup window.
And that is the point I don't understand.
Why is my local VM on my local RAID also struggling with I/O problems?
It is not on the CSVs, it is not on the cluster, it is just a little Windows VM hosted locally. And this VM is not backed up!
So, perhaps a problem in the MS Hyper-V storage stack?
Maybe it says, "Hey, something is wrong, I will slow down ALL I/O on my hypervisor, no matter whether the VMs are on CSVs or local storage."
BUT: To investigate, we evacuated all VMs to a non-clustered system.
One VM after another, and the Veeam replication job did its job perfectly.
Now on single server hardware, local RAID, no cluster, no CSVs. Just a "single Hyper-V host".
Again, daily Veeam backups in on-host mode.
And ... we also got Event 9 errors during the backup windows with Veeam.
I/O requests taking 39969 ms and longer. Yes, that is 40 seconds ...
I was surprised that the VMs survived this latency, maybe because of Hyper-V I/O caching and looong timeouts.
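(For anyone who wants to see how bad their own spikes are: this is roughly how I pull those events off a host. It is only a minimal sketch, assuming the "took N milliseconds" messages land as Event ID 9 in the Microsoft-Windows-Hyper-V-StorageVSP-Admin channel - check the channel name in your own Event Viewer first - and that Python is available on the host; wevtutil itself ships with Windows.)

```python
# dump_event9.py - minimal sketch, not tested in your environment.
# Assumption: the slow-I/O messages are Event ID 9 in the channel below;
# adjust CHANNEL if your events live in a different log.
import subprocess

CHANNEL = "Microsoft-Windows-Hyper-V-StorageVSP-Admin"

# wevtutil: query the last 50 events with EventID 9, rendered as text, newest first.
result = subprocess.run(
    ["wevtutil", "qe", CHANNEL,
     "/q:*[System[(EventID=9)]]",
     "/f:text", "/c:50", "/rd:true"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```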
In the meantime we did a complete fresh setup of our software-defined storage cluster; the server hardware vendor and the storage software vendor were both on the team. We also changed the RAID controllers on both nodes ... who knows!
Again, for some days everything was perfect. After 5 days of runtime, Event 9 came back.
It did not crash our system again, because I activated Veeam storage I/O control. The backup also processes one VM (VHDX) after another sequentially, to keep the impact on the storage low.
But again, massive Event 9 entries on the Hyper-V host. The event logs of the VMs also say "heeey, I think this I/O took too long, but the retry was successful". But the VMs survive.
And now I am back here, sitting on expensive hardware with expensive software, and crashing it when I do backups the way I want to (more than 1 VM simultaneously).
Thank you all for sharing your experiences with this problem; it helped me get focused.
Besides my story, here is my question:
As some of you wrote, is it true that a daily live migration to another host and back helps a lot?
Then I would try to put together a script that does the job, something like the sketch below.
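In case it helps anyone else, this is roughly what I have in mind. It is only a rough sketch under a few assumptions: a 2-node cluster with the FailoverClusters PowerShell module on the node running it, Python installed there, and placeholder node/VM names (HV-NODE1, HV-NODE2, VM01, VM02) that you would replace with your own. On a standalone host, Move-VM would be the rough equivalent of Move-ClusterVirtualMachineRole.

```python
# daily_roundtrip.py - rough sketch, untested.
# Assumptions: a 2-node Hyper-V failover cluster and the FailoverClusters
# PowerShell module on the node this runs on; all names below are placeholders.
import subprocess
import time

NODE_A = "HV-NODE1"          # placeholder: node the VMs normally run on
NODE_B = "HV-NODE2"          # placeholder: node to migrate to and back from
VMS = ["VM01", "VM02"]       # placeholder VM names

def ps(command: str) -> None:
    """Run a single PowerShell command and fail loudly if it errors."""
    subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", command],
        check=True,
    )

def live_migrate(vm: str, node: str) -> None:
    # Live-migrate one clustered VM to the given node.
    ps(f"Move-ClusterVirtualMachineRole -Name '{vm}' -Node '{node}' -MigrationType Live")

if __name__ == "__main__":
    for vm in VMS:
        live_migrate(vm, NODE_B)   # move away ...
        time.sleep(60)             # ... let things settle (arbitrary pause)
        live_migrate(vm, NODE_A)   # ... and move back
```

Scheduled via Task Scheduler outside the backup window, that would at least automate the round trip; whether it really keeps Event 9 away is exactly what I would like to know.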
Chris