-
- Enthusiast
- Posts: 42
- Liked: 7 times
- Joined: Mar 24, 2016 2:23 pm
- Full Name: Frederick Cooper V
- Contact:
Re: Health check on large backups
It seems that health check performance is now hardware limited, in a good way. My health checks used to take about 15 days, and I ran them once a month, so you can do the math. After updating to Veeam 11 it took 3 days, and really might have been done in 2, but other factors were at play. My 1 TB of RAM was maxed, my two 18-core processors (36c/72t total, 3.38 GHz) were maxed, and my 7 repositories/volumes totaling 2 PB of used space, which I have set up across two arrays, were showing between 1.2 GB/s and 2.4 GB/s in read speeds (yes, that is gigabytes, not bits). I finally have justification for my over-spec hardware, and am impressed by this much improvement. I have made multiple requests on Veeam support calls over the years to address the slow health checks, and dev finally came through. Thanks!
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
-
- Enthusiast
- Posts: 50
- Liked: 4 times
- Joined: Jun 03, 2015 8:32 am
- Full Name: Stephan
- Contact:
Re: Health check on large backups
40x4TB in one disk pool
Maybe. Have to take a closer look.
Shouldn't it be possible now to trigger a health check independently of the backup job? That would make troubleshooting so much easier.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
-
- Product Manager
- Posts: 9848
- Liked: 2607 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
-
- Veteran
- Posts: 527
- Liked: 58 times
- Joined: Jun 06, 2018 5:41 am
- Full Name: Per Jonsson
- Location: Sweden
- Contact:
Re: Health check on large backups
Folks,
I can now compare a health check between B&R v11 and v11a. The results are in.
Approx. 18 TB data.
v11: 25h 9m
v11a: 9h 25m
That is what you might call a significant improvement.
PJ
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
Today the storage system for our immutable backups greeted me with a blinking warning LED...
The backup health check produced so much load (concurrent tasks set to 64) that multipathing lost a lot of paths and the volume nearly went down.
We had to reduce our concurrency to 8 - I just hope our merges won't get slower because of this...
Still, I love that 11a is finally able to "kill" our storage systems!
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
OK, our copy jobs are *much* slower now with a concurrency of 8...
As before, a concurrency setting per task type would help immensely!!
Case 05147793
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
"Concurrency per task" does not make much sense to me when talking about a single task (meaning there's no concurrency in principle). It's a single read stream, just reads are asynchronous now. May be you can clarify what you mean.
But anyway, why not start from finding some sweet spot between 64 tasks and 8 tasks... that's a change almost by an order of magnitude, may be even 32 would have been enough?
And I hope you're having your storage system vendor looking at that failure, as no enterprise storage system should go tits-up from mere 64 async read streams. For example, this would mean it would not be able to support even a tiny VMware cluster running a mere 64 VMs?
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
Sorry, I meant concurrent tasks per repo. The repo was set to 64 tasks. With that setting, immediate-mode copy jobs run nearly in real time, even when backing up thousands of VMs in 3-5 hours.
Even 16 tasks now lead to the storage reporting overloaded spindles ("Response late Drive"). This is strange; I have never seen this message on the G-Series Hitachi arrays in our production environment, not even on the heavily loaded non-SSD models we use. The firmware is also the same. The kind of I/O Veeam now does seems to be somewhat special.
The only thing that changed was 11a. Before, we could do 64 streams without any issues, even doing active fulls. We still can, as long as we do not run health checks.
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Before 11a, health check reads were synchronous (the next data block was requested only after the previous one was returned by the storage), while in 11a they became asynchronous (no waiting for the previous block before requesting the next one). This is nothing special or unique, though; for example, a running VM also reads different parts of the same VMDK file asynchronously, as the guest OS and applications all require data from different parts of the disk image at the same time.
Such asynchronous reading creates an outstanding I/O queue that allows enterprise-grade RAID controllers to execute media reads in the most optimal manner, for example by retrieving the data of a few adjacent outstanding blocks with a single read operation.
This reduces the number of IOPS required to read the same amount of data by a few times. And all the freed-up IOPS capacity in turn translates into a severalfold increase in health check performance, as reported by virtually everyone above.
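For illustration only (this is not how VeeamAgent actually implements it), here is a minimal Python sketch of the difference between waiting on each block and keeping a queue of outstanding reads; the file path, block size and queue depth are arbitrary assumptions:

```python
# Illustrative sketch of synchronous vs. asynchronous (outstanding) reads.
# Not Veeam's implementation; path, block size and queue depth are assumptions.
# os.pread is POSIX-only.
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1024 * 1024       # 1 MiB per read (assumption)
QUEUE_DEPTH = 64          # number of reads kept in flight (assumption)

def read_sync(path):
    """Classic pattern: request the next block only after the previous one arrived."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset, total = 0, 0
        while True:
            buf = os.pread(fd, BLOCK, offset)   # blocks until the data is returned
            if not buf:
                return total
            total += len(buf)
            offset += BLOCK
    finally:
        os.close(fd)

def read_async(path):
    """Keep QUEUE_DEPTH reads outstanding, so the storage sees an I/O queue
    it can coalesce and reorder (adjacent blocks served by one media read)."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offsets = range(0, size, BLOCK)
        with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
            chunks = pool.map(lambda off: os.pread(fd, BLOCK, off), offsets)
            return sum(len(c) for c in chunks)
    finally:
        os.close(fd)

if __name__ == "__main__":
    path = "/backups/job1.vbk"   # hypothetical backup file path
    print(read_sync(path), read_async(path))
```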
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
I know.
It just seems that some "enterprise-grade RAID controllers" do not "like" this type of I/O.
We only use Hitachi systems, and I have never seen something like this in production... Let's see what Veeam and Hitachi support find...
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
Funny, I run with unlimited tasks, and other than the bug requiring the hotfix, a ton of parallel tasks is not a problem. Sure, latency is high (dozens of ms), but that is not a problem in itself, as it's backup only.
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Thanks for confirming. As I've said, I have never in my life seen enterprise storage go tits-up from massive I/O load. Sure, the I/O latency will go through the roof; that is expected. But at no point is any decent storage allowed to say "OK, I've had enough" and just give up completely, no matter the type of load.
And a storage device is not in a position to dictate to applications which types of I/O are OK and which are not, just because it does not "like" certain I/O patterns. I bet all the storage folks reading this are smiling right now
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
Exactly. Lost paths would, in my mind, rather point to an L2/transport error than a SCSI/storage-layer failure.
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
The XFS never went offline - we just see a path go down sometimes and a lot of DEVICE RESET/abort commands issued in the kernel log. The storage is basically telling us "there is way too much load on the spindles, I cannot guarantee normal service" - that's why it goes into warning mode.
Hitachi says that the health check seems to have pushed the array's drives (128 disks in RAID 60) above their limits... They are analyzing the logs now.
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
I mean, the data rate from the array is really good. The health check is going at 1.6-1.8 GB/s. iostat shows 446,585 reads/s on the LVM (4 KB blocks * 446,585 = 1,786,340 KB/s). These translate into about 8,500 physical reads per second.
I guess that is OK for 128 nearline disks in a RAID 6. BTW, health check speed stays about the same with 12 vs. 64 streams. I only wish we could use 12 streams for health checks and 64 for everything else - we could check our job with 1,600 VMs in it much more often!
Perhaps the issue lies in our special LVM setup, which stripes the data over 32 logical devices created from the array... In a normal VMware environment there would never be a situation where the load is distributed over so many of the storage's LUNs. It was a nice setup for Veeam 10, but it might not be so optimal for 11a...
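A quick back-of-the-envelope check of the iostat figures above (plain arithmetic only; treating the "4k" blocks as 4 KiB is my assumption):

```python
# Rough sanity check of the iostat figures quoted above.
logical_reads_per_s = 446_585      # 4 KiB reads seen at the LVM layer
block_kib = 4
physical_reads_per_s = 8_500       # reads actually hitting the array

throughput_kib_s = logical_reads_per_s * block_kib           # ~1,786,340 KiB/s
print(f"~{throughput_kib_s / 1024 / 1024:.2f} GiB/s")         # ~1.70 GiB/s, matching the quoted 1.6-1.8 GB/s

# Average number of 4 KiB logical reads coalesced into one physical read:
print(f"~{logical_reads_per_s / physical_reads_per_s:.0f} logical reads per physical read")
# ~53, i.e. the outstanding I/O queue lets the controller merge adjacent blocks.
```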
-
- Expert
- Posts: 245
- Liked: 58 times
- Joined: Apr 28, 2009 8:33 am
- Location: Strasbourg, FRANCE
- Contact:
Re: Health check on large backups
v11 vs v11a
Server with 24x 6 TB NL-SAS drives in RAID 60, Windows 2016, ReFS
Approx. 4.3 TB data
Approx. time:
v11: 10h 20m
v11a: 1h 40m
Awesome!!!
And other smaller jobs run their health checks the same day, in parallel.
All-flash backup repo (10x 3.84 TB SATA SSD, RAID 6).
Approx. 3.04 TB data
v11: 1h 01m
v11a: 0h 16m !!!
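Just for context, the rough speed-up and effective read rate those timings work out to (simple arithmetic, assuming decimal TB/GB):

```python
# Back-of-the-envelope throughput/speed-up from the timings above.
def rate_gb_s(data_tb, hours, minutes):
    seconds = hours * 3600 + minutes * 60
    return data_tb * 1000 / seconds  # decimal TB -> GB

# ReFS repo, ~4.3 TB
print(f"speed-up: {(10*60 + 20) / (1*60 + 40):.1f}x, "
      f"v11a rate: ~{rate_gb_s(4.3, 1, 40):.2f} GB/s")
# -> speed-up: 6.2x, v11a rate: ~0.72 GB/s

# All-flash repo, ~3.04 TB
print(f"speed-up: {61 / 16:.1f}x, "
      f"v11a rate: ~{rate_gb_s(3.04, 0, 16):.2f} GB/s")
# -> speed-up: 3.8x, v11a rate: ~3.17 GB/s
```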
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Wow, that's a fast repo there! Must be nice to do instant recoveries from it
-
- Enthusiast
- Posts: 57
- Liked: 5 times
- Joined: Jun 25, 2018 3:41 am
- Contact:
Re: Health check on large backups
I couldn't have health checks enabled on my backup copy repo before; they would just take too long. But it's churning through around 93 TB in just over 10 hours now, so thanks!
-
- Expert
- Posts: 245
- Liked: 58 times
- Joined: Apr 28, 2009 8:33 am
- Location: Strasbourg, FRANCE
- Contact:
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
An idea for further improvement.
While async reads have greatly improved health check performance, IMHO the check now generates so much I/O that it can easily starve other tasks of any disk I/O (especially on spinning media). Other running jobs are slowed down a lot (sometimes taking up to 10-20x as long to complete), despite the health check only taking up a few repository tasks.
One option would be to lower the I/O priority of the VeeamAgent process to Low when it switches to health check mode. This would leave the majority of the bandwidth to normal tasks like other backups, backup copies, etc. I've played around with tools to manually lower process I/O priority, and almost immediately the other jobs start making progress, while the health checks still make progress too...
However, that progress is much slower (sometimes 100x less), as low-priority processes only get I/O time when the storage has nothing else to do, plus a little I/O time to prevent total starvation (if I remember the docs correctly). One somewhat dumb way around this would be to switch all health check tasks on a server (as there can be multiple logical repositories on one physical storage system) between Normal and Low, for example once a minute. When there are no other tasks, it doesn't have much effect; when there are parallel tasks, they get a better chance to make progress for some of the time.
This would be just as important with v12, which is supposed to run health checks asynchronously as a separate task.
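As a rough illustration of the kind of priority drop I experimented with (a minimal sketch, assuming a Linux repository and the psutil package; matching processes by the name "veeamagent" is my own guess, not an official Veeam mechanism):

```python
# Sketch: drop the I/O priority of running VeeamAgent processes to idle,
# and restore it later. Assumes a Linux repository with psutil installed;
# the process-name match is an assumption, not an official API.
import psutil

def set_health_check_io_priority(idle=True):
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] and "veeamagent" in proc.info["name"].lower():
            try:
                if idle:
                    # IOPRIO_CLASS_IDLE: only gets disk time when nothing else needs it
                    proc.ionice(psutil.IOPRIO_CLASS_IDLE)
                else:
                    # back to the default best-effort class
                    proc.ionice(psutil.IOPRIO_CLASS_BE, value=4)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass

# e.g. flip between the two states once a minute while other jobs are running:
# set_health_check_io_priority(idle=True)   # let backups/copies through
# set_health_check_io_priority(idle=False)  # let the health check catch up
```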
-
- Product Manager
- Posts: 14840
- Liked: 3086 times
- Joined: Sep 01, 2014 11:46 am
- Full Name: Hannes Kasparick
- Location: Austria
- Contact:
Re: Health check on large backups
@DonZoomik: estimating a good I/O load is a complex task... Most requests I have heard are for a separate scheduling option for health checks, so they can run outside the backup window... Would that also solve your needs?
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Since health checks scheduled outside of the backup window would have no other activities to clash with, it seems that would remove the very issue.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
For this particular customer, they run backups every 6 hours (plus a few things every 10 minutes), plus immediate inter-DC backup copies (multiple DCs, copies in various directions). Many backup (copy) sets are dozens of TB and still take 10+ hours to check, so the check is effectively never outside the backup window for other jobs.
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
But I imagine there is much less changed data for all these jobs to process on a Sunday, for example, so scheduling a health check there should not impact the jobs' ability to move that little data? I mean, it's hard to imagine many workloads that generate the same consistently high writes regardless of the time and day of the week. So there should almost always be a perfect gap in which to run a health check.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
On the scale of things, the change rate is not that big - maybe roughly 500 GB every 6 hours (backup + backup copy) per run, regardless of the weekday. It's relatively beefy storage as well: 24x 16 TB SAS in RAID 60, LSI 9361, SSD caching. Whenever a health check is running, the queue depth hovers around 100 while the system remains responsive.
In fact, when I checked in on Saturday for unrelated reasons, jobs that usually complete within an hour or so had been running for about 10 hours - not hung, but making very slow progress, with all bottlenecks pointing to the physical repository performing the health check. When I set the health check processes (7 VMs, about 30 TB together) to low I/O priority, all the slow jobs completed within about 15 minutes. However, the health check then slowed down from roughly 600 MB/s to 6 MB/s, and the queue depth dropped to the low dozens.
It's not a world-ending problem, but something that could possibly be looked at.
When the health check is moved to a separate task, it's good that it will not block primary functions, but it could still saturate I/O on the system.
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
OK, thanks for the clarification and these additional details.
-
- Product Manager
- Posts: 14840
- Liked: 3086 times
- Joined: Sep 01, 2014 11:46 am
- Full Name: Hannes Kasparick
- Location: Austria
- Contact:
Re: Health check on large backups
Then there seems to be a sizing issue: the recommendation is 8 hours for backups, 8 hours for copies, and 8 hours for maintenance jobs like health checks. I mean, in the past almost nobody with more than a few VMs ran health checks because they were too slow. Now asking to make them slower sounds wrong to me. If we reduced the speed, they would probably also never finish.
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
If the I/O priority setting allows them to run at full speed when no other I/O is present, then I'm not worried about them never finishing. Only if they start suffering permanently would this change make no sense.
Otherwise, it would be really nice for a health check to throttle itself back when, for example, a restore starts.