Health check on large backups

Post by **DonZoomik** » Feb 15, 2022 4:33 pm

The real world is almost always more complex than best practices... The easy solution would be migrating repositories to SSD (effectively unlimited IO), but that's way too expensive for multi-hundred-TB repositories for most customers.
In this particular case, the blocking health check was running on an inbound backup copy job (total ~50TB), so it's less time-critical than backup jobs. Under normal circumstances backup jobs complete in about an hour and backup copies in 2-3h (partially parallel to backup; the intersite bandwidth cap is mostly the bottleneck). But as datasets are large, at some point it gets too hard to cut them down into smaller jobs.

If the health check runs asynchronously from main processing in v12, some slowdown might be acceptable. After all, the main objective is to not block primary backup jobs. Once that goal is accomplished, health checks can run for much longer; even the old sync read behavior might be acceptable in most cases.

Post by **Gostev** » Feb 15, 2022 4:46 pm

We will test lowering the I/O priority of the health check process and go from there based on the results. The change itself looks pretty simple and we might be able to add that as an option. Apparently we already have such an option in the Veeam Agent for Windows (I always thought it only affected the process priority, but apparently we set both process and I/O priorities to Low there).

Thanks a lot for bringing this up!
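For readers wondering what "setting both process and I/O priorities to Low" can look like on Windows, here is a minimal sketch. It uses the documented `SetPriorityClass` background-mode flags, which lower the CPU and I/O scheduling priority of the calling process together. This is only an assumption about one possible mechanism, not a statement of how Veeam Agent actually implements its option; the helper name `set_background_mode` is made up for illustration.

```python
import ctypes
import sys

# Documented SetPriorityClass flags (Windows Vista and later).
PROCESS_MODE_BACKGROUND_BEGIN = 0x00100000
PROCESS_MODE_BACKGROUND_END = 0x00200000


def set_background_mode(enabled: bool) -> bool:
    """Enter or leave background mode for the current process.

    Hypothetical helper. Returns True on success, False on failure
    or when not running on Windows.
    """
    if sys.platform != "win32":
        return False  # background mode is a Windows-only concept

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # Declare prototypes so 64-bit handles are not truncated.
    kernel32.GetCurrentProcess.restype = ctypes.c_void_p
    kernel32.SetPriorityClass.argtypes = [ctypes.c_void_p, ctypes.c_uint32]
    kernel32.SetPriorityClass.restype = ctypes.c_int

    flag = PROCESS_MODE_BACKGROUND_BEGIN if enabled else PROCESS_MODE_BACKGROUND_END
    # Background mode can only be applied to the calling process itself.
    return bool(kernel32.SetPriorityClass(kernel32.GetCurrentProcess(), flag))
```

Note that this affects only the process that calls it, so a health check worker would have to opt in itself.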

Post by **DonZoomik** » Feb 15, 2022 5:13 pm

If I had a nickel for every Veeam bug/feature deficiency/obscure edge-case, I'd have quite a few dollars by now.

Simply lowering the IO priority is not that simple, as it presumes that the storage system has nothing else to do (e.g. the current queue depth for normal-priority processes is 0). If you have some other relatively low-IO stuff running, the health check is still heavily choked.
Coming back to this particular case, I eventually had to revert the IO priority to normal because the background S3 offload was generating enough IO to keep the health check limping along at such low throughput that it would likely have taken weeks to complete. That's why I suggested occasionally switching between two priorities: changing priorities would likely have no effect if storage has nothing else to do, but the health check would back down regularly to give other tasks a chance to make progress. Naive and not very elegant, but maybe your engineers can come up with something better.
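The priority-switching idea could be sketched roughly like this. It is purely illustrative: `work_step` and `set_priority` are hypothetical callbacks (one unit of health check work, and whatever mechanism actually changes the I/O priority), and the phase lengths are arbitrary. Nothing here reflects how Veeam schedules health checks.

```python
import itertools
import time
from typing import Callable


def run_with_alternating_priority(
    work_step: Callable[[], bool],
    set_priority: Callable[[str], None],
    normal_seconds: float,
    low_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run work_step() until it returns False, flipping priority on a timer.

    The job starts at "normal" priority, drops to "low" after
    normal_seconds, returns to "normal" after low_seconds, and so on.
    Returns the number of completed work steps.
    """
    phases = itertools.cycle([("normal", normal_seconds), ("low", low_seconds)])
    priority, duration = next(phases)
    set_priority(priority)
    deadline = clock() + duration

    steps = 0
    while work_step():
        steps += 1
        if clock() >= deadline:
            # Current phase has expired: move to the other priority.
            priority, duration = next(phases)
            set_priority(priority)
            deadline = clock() + duration
    return steps
```

With idle storage the low phases cost next to nothing, and under contention the normal phases guarantee some forward progress, which is the whole point of the (admittedly naive) scheme.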

Post by **Gostev** » Feb 15, 2022 5:24 pm

Anything more complex than just setting a lower I/O priority would have to wait until after V12 for sure at this point.

Post by **DonZoomik** » Feb 15, 2022 5:37 pm

Then it sounds like something to put behind an undocumented registry key, like many edge cases (whichever the default would be).
