We are struggling with implementing our first SureBackup job in a small cluster. We tried to verify 3-4 VMs in parallel, but this leads to timeout failures. Most of the time the same VMs fail, not different ones in each run. When those VMs are verified with the troubleshooting option, or if only 1 VM is allowed to run at a time, the test is successful. We already extended the timeout to 1200 seconds (20 minutes).
05.11.2021 08:29:41 Waiting for OS to boot for up to 1200 seconds (stable IP algorithm)...
05.11.2021 08:29:41 Note: Will proceed to the next step at 05.11.2021 08:49:41 or earlier
05.11.2021 08:49:41 Error Results: Cannot detect VM starting because of timeout
05.11.2021 08:49:41 Error: Results: OS did not boot in the allotted time
We have a SOBR on Linux XFS with Apollos, we have physical mount hosts, and we have 10GbE. No other jobs are running at that time. On Linux I used historical data to check extent latency, and it does not show more than ~8 ms. I'll try to check this in real time next, but I can't imagine that the repository performance is too low to start 3-4 VMs in a decent time.
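For the real-time check, a rough sketch along these lines could sample the average I/O latency per block device on the repository server. It takes two snapshots of `/proc/diskstats` and divides the time spent on I/O by the number of I/Os completed in between, which is roughly what `iostat` reports as await. The function names and the 5-second interval are my own choices for illustration, and which devices actually back the extents is something you would have to map yourself.

```python
import time


def parse_diskstats(text):
    """Map device name -> (I/Os completed, ms spent on I/O) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip short or malformed lines
        reads, read_ms = int(fields[3]), int(fields[6])
        writes, write_ms = int(fields[7]), int(fields[10])
        stats[fields[2]] = (reads + writes, read_ms + write_ms)
    return stats


def avg_latency_ms(before, after):
    """Average ms per completed I/O between two snapshots, per device."""
    result = {}
    for dev, (ios_after, ms_after) in after.items():
        if dev not in before:
            continue
        ios_before, ms_before = before[dev]
        delta_ios = ios_after - ios_before
        if delta_ios > 0:  # idle devices are left out
            result[dev] = (ms_after - ms_before) / delta_ios
    return result


def sample(interval_s=5.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and return the latency per device."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval_s)
    with open(path) as fh:
        after = parse_diskstats(fh.read())
    return avg_latency_ms(before, after)
```

Calling `sample()` in a loop while the SureBackup job is starting its VMs would show whether latency on the extents climbs well above the historical ~8 ms during parallel boots.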
Yeah you can just open the VM console from SureBackup session and see that the VM booted up correctly (and likely very fast with the storage you have). This will limit required troubleshooting to detection issues (VMware Tools, networking etc.)
That's the thing, we see in the console that the VM is not booting, with no real progress. Let's see what support can find out, it's just strange that this depends so much on the number of VMs. When it runs as the only VM, it boots pretty quickly and the test is successful.
I doubt that, the server has 128 GB RAM and the VMs that fail are rather small. With longer timeouts we now have 2 VMs that still fail: one that only fails with parallel processing, and one (vROps) that always fails. In troubleshooting mode I can see that the vROps VM boots without any problems and very quickly. I can log in and ping the gateway (helper appliance), but the job still failed with "not reachable". SureBackup is really a tough one.
There's nothing tough about SureBackup. But it does not carry magic that allows for establishing network connections to unreachable hosts. You just need to understand why this particular machine is unreachable when others are reachable. Should not be hard to troubleshoot, I hope...
After some weeks of debugging... We were trying to find the root cause of why the vROps VM always fails. The problem is that the SureBackup job starts the ping test immediately once VMware Tools are running and an IP is displayed in the vSphere Client. At that point this VM is still not reachable for a couple of minutes (maybe iptables rules, I don't know). Anyhow, we set a larger timeout of 20 min, but this timeout is not honored because it only applies to the time until VMware Tools are up. So once VMware Tools are up, there is no way to tell Veeam that it should wait a little longer. The only option here is to disable the ping test. The VM is reachable after ~8-10 minutes.
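To illustrate the behaviour we were missing, here is a minimal sketch of a reachability check that keeps retrying until a deadline instead of failing on the first attempt. This is not a Veeam API or setting: the function names, the port, and the timings are hypothetical, and it uses a TCP connect rather than an ICMP ping because that needs no raw-socket privileges.

```python
import socket
import time


def tcp_probe(host, port, connect_timeout_s=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout_s):
            return True
    except OSError:
        return False


def wait_until_reachable(host, port, timeout_s=1200.0, interval_s=10.0,
                         probe=tcp_probe, clock=time.monotonic, sleep=time.sleep):
    """Retry the probe every interval_s seconds until it succeeds or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe(host, port):
            return True
        if clock() + interval_s > deadline:
            return False  # the next attempt would start after the deadline
        sleep(interval_s)
```

With a VM that only answers after ~8-10 minutes, a call such as `wait_until_reachable("vrops.example.local", 443)` (hypothetical host name) would return True on the first successful attempt inside the 20-minute window. Whether something like this can be hooked into the job as a custom test is an assumption, not something confirmed in this thread.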
And we still have 2 VMs that fail, and with them the whole job, even though we have disabled the heartbeat and ping tests because we know both VMs have issues. And I'm pretty sure that all our backup jobs contain some VMs like those (the jobs have 100-150 VMs).
So for me, SureBackup is a tough one where only magic can help prevent running into errors.