Comprehensive data protection for all workloads
pirx
Veteran
Posts: 599
Liked: 87 times
Joined: Dec 20, 2015 6:24 pm
Contact:

possible SureBackup bottlenecks

Post by pirx »

We are struggling to implement our first SureBackup job in a small cluster. We tried to verify 3-4 VMs in parallel, but this leads to timeout failures. Mostly the same VMs fail in each run rather than different ones. When those VMs are verified with the troubleshooting option, or if only 1 VM is allowed to run at a time, the test is successful. We already extended the timeout to 1200 sec = 20 min.

Code:

05.11.2021 08:29:41          Waiting for OS to boot for up to 1200 seconds (stable IP algorithm)...
05.11.2021 08:29:41          Note: Will proceed to the next step at 05.11.2021 08:49:41 or earlier
05.11.2021 08:49:41 Error    Results: Cannot detect VM starting because of timeout
05.11.2021 08:49:41          Error: Results: OS did not boot in the allotted time

We have a SOBR on Linux XFS with Apollos, physical mount hosts, and 10GbE. No other jobs are running at that time. On Linux I used historical data to check extent latency; it does not show more than ~8 ms. I'll try to check this in real time next, but I can't imagine that the repository performance is too low to start 3-4 VMs in a decent time.
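For the real-time check mentioned above, a minimal sketch of one way to watch per-device latency on a Linux repository: it samples /proc/diskstats twice and divides the time spent on I/O by the number of completed I/Os. The function names, the 5-second interval, and the sample count are assumptions for illustration; tools like iostat report a comparable figure.

```python
import time


def parse_diskstats(text):
    """Return {device: (ios_completed, ms_spent)} from /proc/diskstats content."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip short or malformed lines
        name = fields[2]
        ios = int(fields[3]) + int(fields[7])   # reads + writes completed
        ms = int(fields[6]) + int(fields[10])   # ms spent reading + writing
        stats[name] = (ios, ms)
    return stats


def avg_latency_ms(before, after):
    """Average milliseconds per completed I/O between two snapshots."""
    result = {}
    for dev, (ios1, ms1) in after.items():
        ios0, ms0 = before.get(dev, (ios1, ms1))
        delta_ios = ios1 - ios0
        if delta_ios > 0:  # devices with no I/O in the interval are omitted
            result[dev] = (ms1 - ms0) / delta_ios
    return result


def read_diskstats(path="/proc/diskstats"):
    with open(path) as fh:
        return parse_diskstats(fh.read())


def watch(interval_s=5, samples=12):
    """Print per-device average latency every interval_s seconds."""
    prev = read_diskstats()
    for _ in range(samples):
        time.sleep(interval_s)
        cur = read_diskstats()
        for dev, latency in sorted(avg_latency_ms(prev, cur).items()):
            print(f"{dev}: {latency:.1f} ms")
        prev = cur
```

Calling watch() on the repository server while the SureBackup job is starting its VMs would show whether latency rises well above the ~8 ms seen in the historical data.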

A colleague opened case 05102692 for this.
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: possible SureBackup bottlenecks

Post by Gostev »

Yeah you can just open the VM console from SureBackup session and see that the VM booted up correctly (and likely very fast with the storage you have). This will limit required troubleshooting to detection issues (VMware Tools, networking etc.)

Re: possible SureBackup bottlenecks

Post by pirx »

That's the thing: in the console we see that the VM is not booting, with no real progress. Let's see what support can find out. It's just strange that this depends so much on the number of VMs. When it is the only VM, it boots pretty quickly and the test is successful.

Re: possible SureBackup bottlenecks

Post by Gostev »

This would point to lack of RAM on the mount server.

Re: possible SureBackup bottlenecks

Post by pirx »

I doubt that; the server has 128 GB RAM and the VMs that fail are rather small. With longer timeouts we now have 2 VMs that still fail: the one that only fails with parallel processing, and one (vROps) that always fails. In troubleshooting mode I can see that the vROps VM boots without any problems and very quickly. I can log in and ping the gateway (helper appliance), but the job still failed because the VM was reported as not reachable. SureBackup is really a tough one.

Re: possible SureBackup bottlenecks

Post by Gostev »

There's nothing tough about SureBackup. But it does not carry magic that allows for establishing network connections to unreachable hosts :D You just need to understand why this particular machine is unreachable when others are reachable. Should not be hard to troubleshoot, I hope...

Re: possible SureBackup bottlenecks

Post by pirx »

After some weeks of debugging... We were trying to find the root cause of why the vROps VM always fails. The problem is that the SureBackup job starts the ping test immediately once VMware Tools are running and an IP is displayed in the vSphere Client. However, this VM is still not reachable for a couple of minutes after that (maybe iptables rules, I don't know). We added a larger timeout of 20 min, but it is not honored because it only applies to the time until VMware Tools are up. So once VMware Tools are up, there is no way to tell Veeam to wait a little longer. The only option here is to disable the ping test. The VM is reachable after ~8-10 minutes.
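To make the missing behaviour concrete, here is a minimal sketch of a reachability check that keeps retrying for a grace period after the VM has reported its IP, instead of testing once and failing. This is not a Veeam feature or API. The function names, the TCP probe on port 443 (used here instead of ICMP ping so the sketch runs unprivileged on any OS), and the 15-minute window are all assumptions for illustration. Whether such a check can be hooked into the job, for example as a custom test script, is not confirmed anywhere in this thread.

```python
import socket
import time


def tcp_probe(host, port=443, timeout_s=3):
    """Return True if a TCP connection to host:port succeeds (port is an assumption)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_until_reachable(host, grace_s=900, interval_s=15,
                         probe=tcp_probe, sleep=time.sleep,
                         clock=time.monotonic):
    """Retry probe(host) until it succeeds or grace_s seconds have passed."""
    deadline = clock() + grace_s
    while True:
        if probe(host):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def main(argv):
    """Exit code 0 if the host became reachable in time, 1 otherwise."""
    if len(argv) != 2:
        print("usage: wait_reachable.py <host-or-ip>")
        return 2
    return 0 if wait_until_reachable(argv[1]) else 1
```

Run as a script, this would end with sys.exit(main(sys.argv)). With a 900-second grace period, a VM that only becomes reachable after ~8-10 minutes would pass instead of failing on the first attempt.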

And we still have 2 VMs that fail, and with them the whole job, even though we have disabled the heartbeat and ping tests because we know both VMs have issues. And I'm pretty sure that all our backup jobs contain some VMs like these (jobs have 100-150 VMs).

So for me, SureBackup is a tough one where only magic can help prevent running into errors.