- 100 Hyper-V hosts (on-host proxies, 2 threads each = 200 theoretical proxy threads, i.e. virtual disks to be backed up in parallel)
- ~900 VMs to be backed up daily and copied to a second fire zone; in addition, some SQL/Oracle log shipping
- 2 SOBRs, each consisting of 6 Windows extents with 4 threads each (24 repo threads per SOBR, i.e. 24 VMs to be backed up in parallel; see the PowerShell sketch below)
- ~40 primary backup jobs to one of the SOBRs
- 3 backup copy jobs from the primary SOBR to the other SOBR
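The repo slot math above can be checked against the actual configuration. Here is a minimal PowerShell sketch, assuming the VBR v11+ module and that the extent's concurrent task limit is exposed via Repository.Options.MaxTaskCount (that property path may differ between versions):

```powershell
# Sum the configured "Limit maximum concurrent tasks" values per SOBR
# to confirm the theoretical 24 repository task slots.
Import-Module Veeam.Backup.PowerShell

foreach ($sobr in Get-VBRBackupRepository -ScaleOut) {
    $extents = Get-VBRRepositoryExtent -Repository $sobr
    $slots = 0
    foreach ($extent in $extents) {
        # Assumption: the extent exposes its underlying Windows repository and
        # its task limit via Repository.Options.MaxTaskCount
        $slots += $extent.Repository.Options.MaxTaskCount
    }
    "{0}: {1} extents, {2} concurrent repository task slots" -f $sobr.Name, $extents.Count, $slots
}
```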
In December last year for the first time, and once again last week, we observed the following:
Backup performance suddenly became very bad. Only 3-5 proxy threads were handled at the same time, although in theory the repositories should be able to accommodate at least 24 VMs, depending of course on the number of virtual disks a VM carries.
During the issue, backup copy jobs were only slowly "dripping" to the other SOBR, and primary backup jobs could only back up 2-3 VMs at the same time, heavily violating our SLAs.
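For anyone who wants to quantify this symptom while it is happening, a rough sketch of counting the VM tasks actively being processed; the State/Status property names and their values ("Working", "InProgress") are assumptions and may vary between VBR versions:

```powershell
# Count VM task sessions that are actively being processed right now,
# to compare against the available proxy/repository slots.
$running = Get-VBRBackupSession | Where-Object { $_.State -eq "Working" }
$activeTasks = foreach ($session in $running) {
    Get-VBRTaskSession -Session $session | Where-Object { $_.Status -eq "InProgress" }
}
"{0} running job sessions, {1} VM tasks actively processed" -f $running.Count, @($activeTasks).Count
```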
For some reason it looked as if the resource scheduler was in a lockdown state, unable to distribute the threads in a timely manner.
Together with Veeam support (case #05594646), no solution other than rebooting the VBR server with all jobs stopped was found. After that, all resources were available again and the backups ran as fast as before.
The suspected root cause was the backup copy jobs not freeing up resources (repo threads) due to undefined scheduler issues with overlapping primary and backup copy jobs.
As a workaround, we suggested locking the backup copy jobs out of the estimated primary backup window via their scheduling times, to separate the resource consumption.
In theory, VBR should be able to handle them side by side, as backup copy has a lower priority than primary backup.
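For reference, the overlap between the two job types can be audited by dumping their schedule options; a rough sketch follows, assuming the "BackupSync" JobType string identifies backup copy jobs and that Get-VBRJobScheduleOptions is available in your VBR version:

```powershell
# List primary backup and backup copy jobs with their schedule options
# so overlaps with the primary backup window become visible.
$jobs = Get-VBRJob | Where-Object { $_.JobType.ToString() -in @("Backup", "BackupSync") }
foreach ($job in $jobs) {
    "=== {0} ({1}) ===" -f $job.Name, $job.JobType
    Get-VBRJobScheduleOptions -Job $job | Format-List
}
```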
Support was not able to determine the root cause, nor could they provide measures to prevent the issue from happening.
Has anyone else seen something similar?
Thanks.