FECV
Enthusiast
Posts: 41
Liked: 7 times
Joined: Mar 24, 2016 2:23 pm
Full Name: Frederick Cooper V

Re: Health check on large backups

Post by FECV » 1 person likes this post

It seems that health check performance is now hardware limited, in a good way. My health checks used to take about 15 days, and I ran them once a month, so you can do the math. After updating to Veeam 11 they took 3 days, and might really have been done in 2, but other factors were at play. My 1 TB of RAM was maxed, my two 18-core processors (36 cores / 72 threads total, 3.38 GHz) were maxed, and my 7 repositories/volumes totaling 2 PB of used space, set up across two arrays, were showing between 1.2 and 2.4 GB/s in read speeds (yes, that is gigabytes, not bits). I finally have justification for my over-spec hardware, and I am impressed by this much improvement. I have made multiple requests on Veeam support calls over the years to address the slow health checks, and dev finally came through. Thanks!
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere

Re: Health check on large backups

Post by DonZoomik »

DonZoomik wrote: Oct 18, 2021 7:47 pm #05083160
On the other hand, my case of parallel jobs failing got escalated to RnD.
Stephan23
Enthusiast
Posts: 50
Liked: 4 times
Joined: Jun 03, 2015 8:32 am
Full Name: Stephan

Re: Health check on large backups

Post by Stephan23 »

agrob wrote: Nov 15, 2021 12:08 pm Stephan, how is the volume/array configured on which the backup files are stored? We have a volume with 50x 6TB disks.
40x4TB in one disk pool
HannesK wrote: Nov 15, 2021 3:31 pm Maybe there are also other bottlenecks involved (task configuration, other limits applied to the repository, or some compute resource shortage)
Maybe. I'll have to take a closer look.

Shouldn't it be possible now to trigger a health check independently of the backup job? That would make troubleshooting so much easier.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere

Re: Health check on large backups

Post by DonZoomik » 1 person likes this post

DonZoomik wrote: Nov 15, 2021 8:29 pm On the other hand, my case of parallel jobs failing got escalated to RnD.
Got a hotfix today that seems to fix the issue. So if you're having problems with other jobs failing during health checks, reference my case number when contacting support.
Mildur
Product Manager
Posts: 8735
Liked: 2294 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland

Re: Health check on large backups

Post by Mildur »

Thanks for sharing :)
Product Management Analyst @ Veeam Software
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden

Re: Health check on large backups

Post by perjonsson1960 » 2 people like this post

Folks,

I can now compare a health check between B&R v11 and v11a. The results are in. :-)

Approx. 18 TB data.

v11: 25h 9m
v11a: 9h 25m

That is what you might call a significant improvement. ;-)

PJ
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am

Re: Health check on large backups

Post by mkretzer » 1 person likes this post

Today the storage system behind our immutable backup greeted me with a blinking warning LED...

The backup health check produced so much load (concurrent tasks set to 64) that multipathing lost a lot of paths and the volume nearly went down.

We had to reduce our concurrency to 8 - I just hope our merges won't get slower because of this...
Still, I love that 11a is finally able to "kill" our storage systems!
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am

Re: Health check on large backups

Post by mkretzer »

Ok, our copy jobs are *much* slower now with a concurrency of 8...

As before, a separate concurrency setting per task type would help immensely!!


Case 05147793
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Health check on large backups

Post by Gostev »

"Concurrency per task" does not make much sense to me when talking about a single task (meaning there's no concurrency in principle). It's a single read stream, just reads are asynchronous now. May be you can clarify what you mean.

But anyway, why not start from finding some sweet spot between 64 tasks and 8 tasks... that's a change almost by an order of magnitude, may be even 32 would have been enough?

And I hope you're having your storage system vendor looking at that failure, as no enterprise storage system should go tits-up from mere 64 async read streams. For example, this would mean it would not be able to support even a tiny VMware cluster running a mere 64 VMs?
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am

Re: Health check on large backups

Post by mkretzer »

Sorry, concurrent tasks per repo. The repo was set to 64 tasks. With that setting, immediate-mode copy jobs copy nearly in real time, even when backing up thousands of VMs in 3-5 hours.

Even 16 tasks now lead to the storage reporting overloaded spindles ("Response late Drive"). This is strange; I have never seen this message on the G-Series Hitachi arrays in our production - not even on the non-SSD models we use, which are quite heavily loaded. The firmware is also the same. The kind of IO Veeam now does seems to be somewhat special.
The only thing that changed was 11a. Before, we could do 64 streams without any issues, even during active fulls. We still can, as long as we do not run health checks.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Health check on large backups

Post by Gostev »

Before 11a, health check reads were synchronous (the next data block was requested only after the previous one was returned by the storage), while in 11a they became asynchronous (no waiting for the previous block to return before requesting the next one). This is nothing special or unique, though; for example, a running VM also reads different parts of the same VMDK file asynchronously, as the guest OS and applications all require data from different parts of the disk image at the same time.

Such asynchronous reading creates an outstanding I/O queue that allows enterprise-grade RAID controllers to execute media reads in the most optimal manner, for example by retrieving the data of a few outstanding adjacent blocks at once with a single read operation.

This reduces the number of IOPS required to read the same amount of data by a few times. And all the freed-up IOPS capacity in turn translates into a severalfold increase in backup health check performance, as reported by virtually everyone above.
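To illustrate the difference, here is a minimal sketch (Python with POSIX pread; the 1 MiB block size and queue depth of 64 are illustrative only, and this is of course not our actual engine code):

Code: Select all
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1024 * 1024   # illustrative 1 MiB read unit
DEPTH = 64            # number of reads kept in flight

def read_sync(path):
    """Pre-11a style: request the next block only after the previous one returned."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(BLOCK):
            total += len(chunk)   # checksum verification would go here
    return total

def read_async(path):
    """11a style: keep up to DEPTH reads outstanding, so the RAID controller
    sees a queue of adjacent requests it can coalesce into fewer media reads."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=DEPTH) as pool:
            chunks = pool.map(lambda off: os.pread(fd, BLOCK, off),
                              range(0, size, BLOCK))
            return sum(len(c) for c in chunks)
    finally:
        os.close(fd)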
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am

Re: Health check on large backups

Post by mkretzer »

I know.
It just seems some "enterprise-grade RAID controllers" do not "like" this type of IO.

We only use Hitachi systems, and I have never seen anything like this in production... Let's see what Veeam and Hitachi support find...
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere

Re: Health check on large backups

Post by DonZoomik » 1 person likes this post

Funny, I run with unlimited tasks, and other than the bug requiring the hotfix, a ton of parallel tasks is not a problem. Sure, latency is high (dozens of ms), but that's not a problem in itself, as it's backup only.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Health check on large backups

Post by Gostev »

Thanks for confirming. As I've said, I have never in my life seen enterprise storage go tits-up from massive I/O load. Sure, the I/O latency will go through the roof; this is expected. But at no point is any decent storage allowed to say "OK, I've had enough" and just give up completely, no matter the type of load.
mkretzer wrote: Nov 24, 2021 5:13 am It just seems some "enterprise-grade RAID controllers" do not "like" this type of IO.
But a storage device is not in a position to dictate to applications which types of I/O are OK and which are not, just because it does not "like" certain I/O patterns? I bet all the storage folks reading this are smiling right now :lol:
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere

Re: Health check on large backups

Post by DonZoomik »

Exactly. Lost paths would, in my mind, point to an L2/transport error rather than a SCSI/storage-layer failure.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am

Re: Health check on large backups

Post by mkretzer »

The XFS never went offline - we just see a path go down sometimes, and a lot of DEVICE RESET/Abort commands issued in the kernel log. The storage is basically telling us "there is way too much load on the spindles, I cannot guarantee normal service" - that's why it goes into warning mode.

Hitachi says the health check seems to have pushed the array's drives (128 disks in RAID 60) beyond their limits... They are analyzing the logs now.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am

Re: Health check on large backups

Post by mkretzer »

I mean - the data rate from the array is really good. The health check is running at 1.6-1.8 GB/s. iostat shows 446585 reads/s on the LVM (4k blocks * 446585 = 1786340 KB/s). These translate to about 8500 physical reads per second.
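A quick back-of-envelope on those figures (my own sketch; the coalescing factor is inferred from the two read rates above):

Code: Select all
# Sanity check of the iostat figures quoted above
logical_reads_per_s = 446585    # 4 KiB reads seen at the LVM layer
physical_reads_per_s = 8500     # reads actually hitting the array
block_kib = 4

throughput_kib_s = logical_reads_per_s * block_kib     # 1786340 KiB/s, ~1.7 GiB/s
coalesce = logical_reads_per_s / physical_reads_per_s  # ~53 logical reads per physical read
print(throughput_kib_s, coalesce, coalesce * block_kib)  # last value: ~210 KiB per physical I/O

So each physical read apparently services about 50 adjacent 4k requests at once, which fits Gostev's coalescing explanation above.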

I guess that is OK for 128 nearline disks in a RAID 6. BTW, health check speed stays about the same with 12 vs. 64 streams. I only wish we could use 12 streams for health checks and 64 for everything else - we could check our job with 1600 VMs in it much more often!

Perhaps the issue lies in our special LVM setup, which stripes the data over 32 logical devices created from the array... In a normal VMware environment there would never be a situation where the load is distributed over so many of the storage's LUNs. It was a nice setup for Veeam 10, but it might not be so optimal for 11a...
NightBird
Expert
Posts: 242
Liked: 57 times
Joined: Apr 28, 2009 8:33 am
Location: Strasbourg, FRANCE

Re: Health check on large backups

Post by NightBird » 3 people like this post

v11 vs v11a
Server with 24x 6TB NL-SAS drives, RAID 60, Windows 2016, ReFS
Approx. 4.3 TB data
Approx. time:
v11: 10h 20m
v11a: 1h 40m
Awesome!!!
And other, smaller jobs now run their health checks in parallel on the same day.

Full-flash backup repo (10x 3.84TB SATA SSDs, RAID 6).
Approx. 3.04 TB data
v11: 1h 01m
v11a: 0h 16m !!!
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Health check on large backups

Post by Gostev » 1 person likes this post

Wow, that's a fast repo there! Must be nice to do instant recoveries from it ;)
popjls
Enthusiast
Posts: 55
Liked: 5 times
Joined: Jun 25, 2018 3:41 am

Re: Health check on large backups

Post by popjls »

I couldn't have health checks enabled on my backup copy repo before, as they would just take too long, but now they're churning through around 93 TB in just over 10 hours, so thanks!
NightBird
Expert
Posts: 242
Liked: 57 times
Joined: Apr 28, 2009 8:33 am
Location: Strasbourg, FRANCE

Re: Health check on large backups

Post by NightBird » 1 person likes this post

Gostev wrote: Jan 11, 2022 6:26 pm Wow, that's a fast repo there! Must be nice to do instant recoveries from it ;)
Yup, instant recoveries are flawless and seamless 👍👌
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere

Re: Health check on large backups

Post by DonZoomik » 1 person likes this post

An idea for further improvement.

While async reads have greatly improved health check performance, IMHO the check now generates so much IO that it can easily starve other tasks of any disk IO (especially on spinning media). Other running jobs are slowed down a lot (sometimes taking 10-20x as long to complete), despite the health check only taking up a few repository tasks.

One option would be to lower the IO priority of the VeeamAgent process to Low when it switches to health check mode. This would leave the majority of the bandwidth to normal tasks like other backups, backup copies, etc. I've played around with tools to manually lower process IO priority, and almost immediately other jobs start making progress, while the health checks still make progress too...
However, that progress is much slower (sometimes 100x less), as they only get IO time when the storage has nothing else to do, plus a little IO time to prevent total starvation (if I remember the docs correctly). One somewhat dumb way around that would be to switch all health check tasks on a server (as there can be multiple logical repositories on one physical storage system) between Normal and Low, for example once a minute; see the sketch below. When there are no other tasks, this doesn't have much effect. When there are parallel tasks, they get a better chance to make progress for some time.
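Here's a rough sketch of that toggle using psutil and the Linux I/O scheduling classes. Matching by process name is naive (it can't tell a health-check agent from any other VeeamAgent process, and it needs root to adjust other users' processes), so treat it purely as an illustration:

Code: Select all
import time
import psutil

LOW_PHASE_S = 30      # time spent at idle I/O priority
NORMAL_PHASE_S = 30   # time spent back at normal priority

def veeam_agents():
    # Naive match: every process whose name contains "veeamagent"
    return [p for p in psutil.process_iter(["name"])
            if p.info["name"] and "veeamagent" in p.info["name"].lower()]

while True:
    for p in veeam_agents():
        try:
            p.ionice(psutil.IOPRIO_CLASS_IDLE)  # gets I/O only when the disk is otherwise idle
        except psutil.NoSuchProcess:
            pass  # process exited mid-loop
    time.sleep(LOW_PHASE_S)
    for p in veeam_agents():
        try:
            p.ionice(psutil.IOPRIO_CLASS_BE)    # back to the normal best-effort class
        except psutil.NoSuchProcess:
            pass
    time.sleep(NORMAL_PHASE_S)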

This will be just as important in v12, which is supposed to have async health checks as a separate task.
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria

Re: Health check on large backups

Post by HannesK »

@DonZoomik: estimating a good IO load is a complex task... most of the requests I've heard are for a separate scheduling option to run health checks outside the backup window... would that also cover your needs?
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Health check on large backups

Post by Gostev »

Since health checks scheduled outside of the backup window would have no other activities to clash with, that seems to remove the very issue.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere

Re: Health check on large backups

Post by DonZoomik »

For this particular customer, backups run every 6 hours (plus a few things every 10 minutes), on top of immediate inter-DC backup copies (multiple DCs, copies in various directions). Many backup (copy) sets are dozens of TB and still take 10+ hours to check, so the check is effectively never outside the backup window of other jobs.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Health check on large backups

Post by Gostev »

But I imagine there's much less data change for all these jobs to process on, say, a Sunday, so scheduling a health check there should not impact the jobs' ability to finish moving that little data? I mean, it's hard to imagine many workloads that generate the same consistently high write load regardless of the time and day of the week. So there should almost always be a perfect gap in which to run a health check.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere

Re: Health check on large backups

Post by DonZoomik »

On the scale of things, the change rate is not that big: maybe roughly 500 GB every 6 hours (backup + backup copy) per run, regardless of the weekday. It's relatively beefy storage as well: 24x 16TB SAS in RAID 60, LSI 9361, SSD caching. Whenever a health check is running, the queue depth hovers around 100 while the system remains responsive.

In fact, when I checked in on Saturday for unrelated reasons, jobs that usually complete within an hour or so had been running for about 10 hours - not hung, but making very slow progress, with all bottlenecks pointing to the physical repository performing a health check. When I set the health check processes (7 VMs, about 30 TB together) to low IO priority, all the slow jobs completed within about 15 minutes. However, the health check then slowed down from roughly 600 MB/s to 6 MB/s, and the queue depth dropped to the low dozens.

It's not a world-ending problem, but it is something that could be looked at.
Once the health check is moved to a separate task, it's good that it will no longer block primary functions, but it could still saturate IO on the system.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Health check on large backups

Post by Gostev »

OK, thanks for the clarification and these additional details.
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria

Re: Health check on large backups

Post by HannesK »

so the check is effectively never outside the backup window of other jobs.
Then there seems to be a sizing issue: the recommendation is 8h for backups, 8h for copies, and 8h for maintenance jobs like health checks. I mean, in the past almost nobody with more than a few VMs ran health checks because they were too slow. Now asking to make them slower sounds wrong to me :-) If we reduced the speed, they would probably never finish at all.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Health check on large backups

Post by Gostev »

If the I/O priority setting allows them to run at full speed when no other I/O is present, then I'm not worried about them never finishing. Only if they started suffering permanently would this change make no sense.

Other than that, it would be really nice for the health check to throttle itself when, for example, a restore starts.