-
- Enthusiast
- Posts: 42
- Liked: 7 times
- Joined: Mar 24, 2016 2:23 pm
- Full Name: Frederick Cooper V
- Contact:
Re: Health check on large backups
It seems that health check performance is now hardware limited, in a good way. My health checks used to take about 15 days, and I ran them once a month, so you can do the math. After updating to Veeam 11 it took 3 days, and really might have been done in 2, but other factors were at play. My 1 TB of RAM was maxed, my two 18-core processors (36c/72t total, 3.38 GHz) were maxed, and my 7 repositories/volumes totaling 2 PB of used space, which I have set up across two arrays, were showing between 1.2 GB/s and 2.4 GB/s in read speeds (yes, that is gigabytes, not bits). I finally have justification for my over-spec hardware, and am impressed by this much improvement. I have made multiple requests on Veeam support calls over the years to address the slow health checks, and dev finally came through. Thanks!
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
-
- Enthusiast
- Posts: 50
- Liked: 4 times
- Joined: Jun 03, 2015 8:32 am
- Full Name: Stephan
- Contact:
Re: Health check on large backups
40x4TB in one disk pool
Maybe. Have to take a closer look.
Shouldn't it be possible now to trigger a health check independently of the backup job? That would make troubleshooting so much easier.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
-
- Product Manager
- Posts: 9848
- Liked: 2607 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
-
- Veteran
- Posts: 527
- Liked: 58 times
- Joined: Jun 06, 2018 5:41 am
- Full Name: Per Jonsson
- Location: Sweden
- Contact:
Re: Health check on large backups
Folks,
I can now compare a health check between B&R v11 and v11a. The results are in.
Approx. 18 TB data.
v11: 25h 9m
v11a: 9h 25m
That is what you might call a significant improvement.
PJ
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
Today the storage system for our immutable backups greeted me with a blinking warning LED...
The backup health check produced so much load (concurrent tasks set to 64) that multipathing lost a lot of paths and the volume nearly went down.
We had to reduce our concurrency to 8 - I just hope our merges won't get slower because of this...
Still, I love that 11a is finally able to "kill" our storage systems!
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
OK, our copy jobs are *much* slower now with a concurrency of 8...
As before, a concurrency setting per task type would help immensely!!
Case 05147793
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
"Concurrency per task" does not make much sense to me when talking about a single task (meaning there's no concurrency in principle). It's a single read stream, just reads are asynchronous now. May be you can clarify what you mean.
But anyway, why not start from finding some sweet spot between 64 tasks and 8 tasks... that's a change almost by an order of magnitude, may be even 32 would have been enough?
And I hope you're having your storage system vendor looking at that failure, as no enterprise storage system should go tits-up from mere 64 async read streams. For example, this would mean it would not be able to support even a tiny VMware cluster running a mere 64 VMs?
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
Sorry, I meant concurrent tasks per repo. The repo was set to 64 tasks. With that setting, immediate-mode copy jobs run nearly in real time, even when backing up thousands of VMs in 3-5 hours.
Even 16 tasks now lead to the storage reporting overloaded spindles ("Response late Drive"). This is strange; I have never seen this message on the G-Series Hitachi arrays in our production environment, not even on the heavily loaded non-SSD models we use. The firmware is also the same. The kind of I/O Veeam now does seems to be somewhat special.
The only thing that changed was 11a. Before, we could do 64 streams without any issues, even doing active fulls. We still can, as long as we do not run health checks.
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Before 11a, health check reads were synchronous (the next data block was requested only after the previous one was returned by the storage), while in 11a they became asynchronous (no waiting for the previous block before requesting the next one). This is nothing special or unique, though; for example, a running VM also reads different parts of the same VMDK file asynchronously, as the guest OS and applications all require data from different parts of the disk image at the same time.
Such asynchronous reading creates an outstanding I/O queue that allows enterprise-grade RAID controllers to execute media reads in the most optimal manner, for example by retrieving the data of a few adjacent outstanding blocks with a single read operation.
This reduces the number of IOPS required to read the same amount of data by a few times. And all the freed-up IOPS capacity in turn translates into a severalfold increase in health check performance, as reported by virtually everyone above.
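For illustration only (this is not how VeeamAgent actually implements it), here is a minimal Python sketch of the difference between waiting on each block and keeping a queue of outstanding reads; the file path, block size and queue depth are arbitrary assumptions:

```python
# Illustrative sketch of synchronous vs. asynchronous (outstanding) reads.
# Not Veeam's implementation; path, block size and queue depth are assumptions.
# os.pread is POSIX-only.
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1024 * 1024       # 1 MiB per read (assumption)
QUEUE_DEPTH = 64          # number of reads kept in flight (assumption)

def read_sync(path):
    """Classic pattern: request the next block only after the previous one arrived."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset, total = 0, 0
        while True:
            buf = os.pread(fd, BLOCK, offset)   # blocks until the data is returned
            if not buf:
                return total
            total += len(buf)
            offset += BLOCK
    finally:
        os.close(fd)

def read_async(path):
    """Keep QUEUE_DEPTH reads outstanding, so the storage sees an I/O queue
    it can coalesce and reorder (adjacent blocks served by one media read)."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offsets = range(0, size, BLOCK)
        with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
            chunks = pool.map(lambda off: os.pread(fd, BLOCK, off), offsets)
            return sum(len(c) for c in chunks)
    finally:
        os.close(fd)

if __name__ == "__main__":
    path = "/backups/job1.vbk"   # hypothetical backup file path
    print(read_sync(path), read_async(path))
```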
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
I know.
It just seems that some "enterprise-grade RAID controllers" do not "like" this type of I/O.
We only use Hitachi systems, and I have never seen something like this in production... Let's see what Veeam and Hitachi support find...
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
Funny, I run with unlimited tasks, and other than the bug requiring the hotfix, a ton of parallel tasks is not a problem. Sure, latency is high (dozens of ms), but that is not a problem in itself, as it's backup only.
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Thanks for confirming. As I've said, I have never in my life seen enterprise storage go tits-up from massive I/O load. Sure, the I/O latency will go through the roof; that is expected. But at no point is any decent storage allowed to say "OK, I've had enough" and just give up completely, no matter the type of load.
And a storage device is not in a position to dictate to applications which types of I/O are OK and which are not, just because it does not "like" certain I/O patterns. I bet all the storage folks reading this are smiling right now
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
Exactly. Lost paths would, in my mind, rather point to an L2/transport error than a SCSI/storage-layer failure.
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
The XFS never went offline - we just see a path go down sometimes and a lot of DEVICE RESET/abort commands issued in the kernel log. The storage is basically telling us "there is way too much load on the spindles, I cannot guarantee normal service" - that's why it goes into warning mode.
Hitachi says that the health check seems to have pushed the array's drives (128 disks in RAID 60) above their limits... They are analyzing the logs now.
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: Health check on large backups
I mean, the data rate from the array is really good. The health check is going at 1.6-1.8 GB/s. iostat shows 446,585 reads/s on the LVM (4 KB blocks * 446,585 = 1,786,340 KB/s). These translate into about 8,500 physical reads per second.
I guess that is OK for 128 nearline disks in a RAID 6. BTW, health check speed stays about the same with 12 vs. 64 streams. I only wish we could use 12 streams for health checks and 64 for everything else - we could check our job with 1,600 VMs in it much more often!
Perhaps the issue lies in our special LVM setup, which stripes the data over 32 logical devices created from the array... In a normal VMware environment there would never be a situation where the load is distributed over so many of the storage's LUNs. It was a nice setup for Veeam 10, but it might not be so optimal for 11a...
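A quick back-of-the-envelope check of the iostat figures above (plain arithmetic only; treating the "4k" blocks as 4 KiB is my assumption):

```python
# Rough sanity check of the iostat figures quoted above.
logical_reads_per_s = 446_585      # 4 KiB reads seen at the LVM layer
block_kib = 4
physical_reads_per_s = 8_500       # reads actually hitting the array

throughput_kib_s = logical_reads_per_s * block_kib           # ~1,786,340 KiB/s
print(f"~{throughput_kib_s / 1024 / 1024:.2f} GiB/s")         # ~1.70 GiB/s, matching the quoted 1.6-1.8 GB/s

# Average number of 4 KiB logical reads coalesced into one physical read:
print(f"~{logical_reads_per_s / physical_reads_per_s:.0f} logical reads per physical read")
# ~53, i.e. the outstanding I/O queue lets the controller merge adjacent blocks.
```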
-
- Expert
- Posts: 245
- Liked: 58 times
- Joined: Apr 28, 2009 8:33 am
- Location: Strasbourg, FRANCE
- Contact:
Re: Health check on large backups
v11 vs v11a
Server with 24x 6 TB NL-SAS drives in RAID 60, Windows 2016, ReFS
Approx. 4.3 TB data
Approx. time:
v11: 10h 20m
v11a: 1h 40m
Awesome!!!
And other smaller jobs run their health checks the same day, in parallel.
All-flash backup repo (10x 3.84 TB SATA SSD, RAID 6).
Approx. 3.04 TB data
v11: 1h 01m
v11a: 0h 16m !!!
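Just for context, the rough speed-up and effective read rate those timings work out to (simple arithmetic, assuming decimal TB/GB):

```python
# Back-of-the-envelope throughput/speed-up from the timings above.
def rate_gb_s(data_tb, hours, minutes):
    seconds = hours * 3600 + minutes * 60
    return data_tb * 1000 / seconds  # decimal TB -> GB

# ReFS repo, ~4.3 TB
print(f"speed-up: {(10*60 + 20) / (1*60 + 40):.1f}x, "
      f"v11a rate: ~{rate_gb_s(4.3, 1, 40):.2f} GB/s")
# -> speed-up: 6.2x, v11a rate: ~0.72 GB/s

# All-flash repo, ~3.04 TB
print(f"speed-up: {61 / 16:.1f}x, "
      f"v11a rate: ~{rate_gb_s(3.04, 0, 16):.2f} GB/s")
# -> speed-up: 3.8x, v11a rate: ~3.17 GB/s
```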
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Wow, that's a fast repo there! Must be nice to do instant recoveries from it
-
- Enthusiast
- Posts: 57
- Liked: 5 times
- Joined: Jun 25, 2018 3:41 am
- Contact:
Re: Health check on large backups
I couldn't have health checks enabled on my backup copy repo before; they would just take too long. But it's churning through around 93 TB in just over 10 hours now, so thanks!
-
- Expert
- Posts: 245
- Liked: 58 times
- Joined: Apr 28, 2009 8:33 am
- Location: Strasbourg, FRANCE
- Contact:
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
An idea for further improvement.
While async reads have greatly improved health check performance, IMHO the check now generates so much I/O that it can easily starve other tasks of any disk I/O (especially on spinning media). Other running jobs are slowed down a lot (sometimes taking up to 10-20x as long to complete), despite the health check only taking up a few repository tasks.
One option would be to lower the I/O priority of the VeeamAgent process to Low when it switches to health check mode. This would leave the majority of the bandwidth to normal tasks like other backups, backup copies, etc. I've played around with tools to manually lower process I/O priority, and almost immediately the other jobs start making progress, while the health checks still make progress too...
However, that progress is much slower (sometimes 100x less), as low-priority processes only get I/O time when the storage has nothing else to do, plus a little I/O time to prevent total starvation (if I remember the docs correctly). One somewhat dumb way around this would be to switch all health check tasks on a server (as there can be multiple logical repositories on one physical storage system) between Normal and Low, for example once a minute. When there are no other tasks, it doesn't have much effect; when there are parallel tasks, they get a better chance to make progress for some of the time.
This would be just as important with v12, which is supposed to run health checks asynchronously as a separate task.
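As a rough illustration of the kind of priority drop I experimented with (a minimal sketch, assuming a Linux repository and the psutil package; matching processes by the name "veeamagent" is my own guess, not an official Veeam mechanism):

```python
# Sketch: drop the I/O priority of running VeeamAgent processes to idle,
# and restore it later. Assumes a Linux repository with psutil installed;
# the process-name match is an assumption, not an official API.
import psutil

def set_health_check_io_priority(idle=True):
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] and "veeamagent" in proc.info["name"].lower():
            try:
                if idle:
                    # IOPRIO_CLASS_IDLE: only gets disk time when nothing else needs it
                    proc.ionice(psutil.IOPRIO_CLASS_IDLE)
                else:
                    # back to the default best-effort class
                    proc.ionice(psutil.IOPRIO_CLASS_BE, value=4)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass

# e.g. flip between the two states once a minute while other jobs are running:
# set_health_check_io_priority(idle=True)   # let backups/copies through
# set_health_check_io_priority(idle=False)  # let the health check catch up
```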
-
- Product Manager
- Posts: 14840
- Liked: 3086 times
- Joined: Sep 01, 2014 11:46 am
- Full Name: Hannes Kasparick
- Location: Austria
- Contact:
Re: Health check on large backups
@DonZoomik: estimating a good I/O load is a complex task... Most requests I have heard are for a separate scheduling option for health checks, so they can run outside the backup window... Would that also solve your needs?
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Since health checks scheduled outside of the backup window would have no other activities to clash with, it seems that would remove the very issue.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
For this particular customer, they run backups every 6 hours (plus a few things every 10 minutes), plus immediate inter-DC backup copies (multiple DCs, copies in various directions). Many backup (copy) sets are dozens of TB and still take 10+ hours to check, so the check is effectively never outside the backup window for other jobs.
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
But I imagine there is much less changed data for all these jobs to process on a Sunday, for example, so scheduling a health check there should not impact the jobs' ability to move that little data? I mean, it's hard to imagine many workloads that generate the same consistently high writes regardless of the time and day of the week. So there should almost always be a perfect gap in which to run a health check.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
On the scale of things, the change rate is not that big - maybe roughly 500 GB every 6 hours (backup + backup copy) per run, regardless of the weekday. It's relatively beefy storage as well: 24x 16 TB SAS in RAID 60, LSI 9361, SSD caching. Whenever a health check is running, the queue depth hovers around 100 while the system remains responsive.
In fact, when I checked in on Saturday for unrelated reasons, jobs that usually complete within an hour or so had been running for about 10 hours - not hung, but making very slow progress, with all bottlenecks pointing to the physical repository performing the health check. When I set the health check processes (7 VMs, about 30 TB together) to low I/O priority, all the slow jobs completed within about 15 minutes. However, the health check then slowed down from roughly 600 MB/s to 6 MB/s, and the queue depth dropped to the low dozens.
It's not a world-ending problem, but something that could possibly be looked at.
When the health check is moved to a separate task, it's good that it will not block primary functions, but it could still saturate I/O on the system.
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
OK, thanks for the clarification and these additional details.
-
- Product Manager
- Posts: 14840
- Liked: 3086 times
- Joined: Sep 01, 2014 11:46 am
- Full Name: Hannes Kasparick
- Location: Austria
- Contact:
Re: Health check on large backups
Then there seems to be a sizing issue: the recommendation is 8 hours for backups, 8 hours for copies, and 8 hours for maintenance jobs like health checks. I mean, in the past almost nobody with more than a few VMs ran health checks because they were too slow. Now asking to make them slower sounds wrong to me. If we reduced the speed, they would probably also never finish.
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
If the I/O priority setting allows them to run at full speed when no other I/O is present, then I'm not worried about them never finishing. Only if they start suffering permanently would this change make no sense.
Otherwise, it would be really nice for a health check to throttle itself back when, for example, a restore starts.