-
- Service Provider
- Posts: 454
- Liked: 86 times
- Joined: Jun 09, 2015 7:08 pm
- Full Name: JaySt
- Contact:
Backup File health checks slow
We've got some pretty awesome repository servers (HPE DL380 Gen10) running Windows Server 2019 ReFS with 12 local 12Gbps NL-SAS disks in a RAID 6 config. Great backup performance, great restore performance. However, the health check on the backup files takes more than 3 days and stalls backup jobs that come up for the health check schedule.
I know a health check can take a while, but I'm trying to understand why it takes THIS long. Info:
- chain is forever incremental, 31 restore points max
- per-VM backup files enabled
- multiple jobs
- 20TB+ total data
The health check is doing something, but it is nowhere near saturating CPU, memory or disk resources. It currently does just 20-40 MB/s to the local disks, CPU is happy in the <30% region, and there is plenty of RAM available.
The system is capable of doing hundreds of MB/s of reads and writes, but I don't understand why it does not show that during the health check.
So how does the health check operate exactly? Is it capable of doing things in parallel?
Is it likely that something is not right when the health check only does 20-40 MB/s?
Is ReFS making things extra difficult for health checks?
Veeam Certified Engineer
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Backup File health checks slow
Please, don't forget to ALWAYS include a support case ID when posting about ANY technical issue whatsoever, as requested when you click New Topic.
I would try testing your storage performance for random I/O with the typical Veeam block size. You should do the "Worst case scenario" test from the "Slow restore" chapter, as this should best represent the health check I/O pattern on ReFS.
Thanks!
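For reference, that worst-case test boils down to single-threaded random reads with a 512KB block size against an existing full backup file. A minimal sketch of the diskspd invocation, assuming the standard Microsoft diskspd tool and a placeholder file path:
Code: Select all
rem 512KB blocks, random reads aligned to 4KB, caching disabled, 10-minute run
diskspd.exe -b512K -r4K -Sh -d600 D:\Backups\somefullbackup.vbk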
-
- Enthusiast
- Posts: 57
- Liked: 5 times
- Joined: Jun 25, 2018 3:41 am
- Contact:
Re: Backup File health checks slow
I also have this exact issue. Backup file checking has become incredibly slow with this new release. I don't believe it's a storage or a networking issue, as both are barely touched relative to their capability. I'm going to test a few more ideas before opening a case, but I'll say you are not alone.
-
- Product Manager
- Posts: 20415
- Liked: 2302 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Backup File health checks slow
Test your storage performance using the utility above and open a ticket if you don't manage to confirm an issue with the storage system. Thanks!
-
- Service Provider
- Posts: 454
- Liked: 86 times
- Joined: Jun 09, 2015 7:08 pm
- Full Name: JaySt
- Contact:
Re: Backup File health checks slow
I'll do a performance test as mentioned above. If it shows significantly better numbers than what I see during the health check, I'll create a case. Good suggestion.
But to understand better:
The Slow restore / Worst case scenario disk test is suggested, and it does random I/O. Is the health check doing random I/O as a result of ReFS being used as the filesystem, or is the health check mostly random regardless of the filesystem used? In other words, what aspect makes the health check heavy on random I/O versus sequential?
I don't have problems with restore speeds for example.
Veeam Certified Engineer
-
- Veeam Software
- Posts: 21139
- Liked: 2141 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Backup File health checks slow
Health check is always random - it reads all the blocks required to build the latest restore point, which could be scattered across multiple files. Restore might be either random or sequential depending on the particular restore option - e.g. a full VM restore from a single full backup file is mostly sequential, while Instant Recovery is purely random.
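One way to see the difference on your own repository is to compare a sequential read against a random read of the same backup file with diskspd. These are only illustrative runs, and the file path is a placeholder:
Code: Select all
rem sequential 512KB reads - roughly the pattern of a full restore from a single VBK
diskspd.exe -b512K -Sh -w0 -d120 D:\Backups\somefullbackup.vbk

rem random 512KB reads aligned to 4KB - closer to the health check / Instant Recovery pattern
diskspd.exe -b512K -r4K -Sh -w0 -d120 D:\Backups\somefullbackup.vbk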
-
- Service Provider
- Posts: 454
- Liked: 86 times
- Joined: Jun 09, 2015 7:08 pm
- Full Name: JaySt
- Contact:
Re: Backup File health checks slow
Did some tests with diskspd as suggested. The -r4K pattern showed a maximum of 20 MiB/s throughput, so that actually matched the throughput we saw during the health check. The test was performed against a single .vbk file.
This effectively means we'll probably disable the health checks. It's just too much data to process at those numbers; the jobs would end up running for far too long (multiple days). We have copy jobs in place to a second server, so that reads the data as well, just in a different way.
I executed this command:
Code: Select all
diskspd.exe -b512K -r4K -Sh -d600 D:\path\to\fullbackup.vbk
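For comparison, it might be worth re-running the same pattern with a few threads and more outstanding I/Os per thread, to see whether the array can go faster at a deeper queue. This is only a sketch, using the same placeholder path:
Code: Select all
rem 4 threads, 8 outstanding I/Os each, latency percentiles reported
diskspd.exe -b512K -r4K -Sh -d600 -t4 -o8 -L D:\path\to\fullbackup.vbk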
Veeam Certified Engineer
-
- Enthusiast
- Posts: 78
- Liked: 46 times
- Joined: Dec 10, 2019 3:59 pm
- Full Name: Ryan Walker
- Contact:
Re: Backup File health checks slow
Interesting.
Is there any way to move this to a multi-threaded operation?
Because I know it's not my repository server, when you consider my random IOPS are in the tens of thousands (first output below):
Even using DiskSpd worst case it's not an ugly thing - that run was done WHILE doing a health check on another backup file (second output below). It should be good for at least 168 MB/s, yet the health check only averages about 45-68 MB/s according to my real-time monitoring, with peaks around 110-140 MB/s from time to time.
Backend disk is 64KB ReFS on a RAID 5 SSD array with a 512KB stripe size.
Code: Select all
Random 4KiB (Q= 32, T=16): 618.499 MB/s [ 151000.7 IOPS] < 3387.40 us>
Code: Select all
Input parameters:
timespan: 1
-------------
duration: 10s
warm up time: 5s
cool down time: 0s
gathering IOPS at intervals of 600ms
random seed: 0
path: '****.vbk'
think time: 0ms
burst size: 0
software cache disabled
hardware write cache disabled, writethrough on
performing read test
block size: 524288
using random I/O (alignment: 4096)
number of outstanding I/O operations: 2
thread stride size: 0
threads per file: 1
using I/O Completion Ports
IO priority: normal
System information:
computer name: *******
start time: 2020/10/02 13:26:11 UTC
Results for timespan 1:
*******************************************************************************
actual test time: 10.01s
thread count: 1
proc count: 32
CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 77.03%| 1.25%| 75.78%| 22.97%
1| 0.47%| 0.00%| 0.47%| 99.53%
2| 18.00%| 9.55%| 8.45%| 82.00%
3| 0.00%| 0.00%| 0.00%| 100.00%
4| 4.69%| 2.19%| 2.50%| 95.31%
5| 16.88%| 0.00%| 16.88%| 83.13%
6| 2.03%| 1.10%| 0.94%| 97.97%
7| 71.88%| 0.00%| 71.88%| 28.13%
8| 11.41%| 0.47%| 10.94%| 88.59%
9| 7.19%| 2.34%| 4.84%| 92.81%
10| 15.63%| 0.31%| 15.31%| 84.38%
11| 0.00%| 0.00%| 0.00%| 100.00%
12| 4.69%| 0.47%| 4.22%| 95.31%
13| 11.08%| 11.08%| 0.00%| 88.92%
14| 8.91%| 0.00%| 8.91%| 91.09%
15| 0.47%| 0.00%| 0.47%| 99.53%
16| 1.09%| 0.00%| 1.09%| 98.91%
17| 0.31%| 0.00%| 0.31%| 99.69%
18| 0.00%| 0.00%| 0.00%| 100.00%
19| 1.72%| 0.31%| 1.41%| 98.28%
20| 0.00%| 0.00%| 0.00%| 100.00%
21| 0.16%| 0.00%| 0.16%| 99.84%
22| 0.16%| 0.00%| 0.16%| 99.84%
23| 0.00%| 0.00%| 0.00%| 100.00%
24| 0.00%| 0.00%| 0.00%| 100.00%
25| 0.00%| 0.00%| 0.00%| 100.00%
26| 0.00%| 0.00%| 0.00%| 100.00%
27| 0.00%| 0.00%| 0.00%| 100.00%
28| 0.00%| 0.00%| 0.00%| 100.00%
29| 1.56%| 0.00%| 1.56%| 98.44%
30| 0.00%| 0.00%| 0.00%| 100.00%
31| 0.16%| 0.00%| 0.16%| 99.84%
-------------------------------------------
avg.| 7.98%| 0.91%| 7.08%| 92.02%
Total IO
thread | bytes | I/Os | MiB/s | I/O per s | IopsStdDev | file
-------------------------------------------------------------------------------------------
0 | 1662517248 | 3171 | 158.46 | 316.92 | 64.70 | *****.vbk (1814GiB)
-------------------------------------------------------------------------------------------
total: 1662517248 | 3171 | 158.46 | 316.92 | 64.70
Read IO
thread | bytes | I/Os | MiB/s | I/O per s | IopsStdDev | file
-------------------------------------------------------------------------------------------
0 | 1662517248 | 3171 | 158.46 | 316.92 | 64.70 | *****.vbk (1814GiB)
-------------------------------------------------------------------------------------------
total: 1662517248 | 3171 | 158.46 | 316.92 | 64.70
And Latency is negligible:
Code: Select all
Disk 0
AVAGO SMC3108 SCSI Disk Device
Capacity: 161 TB
Formatted: 161 TB
System disk: No
Page file: No
Read speed 82.9 MB/s
Write speed 0 KB/s
Active time 5%
Average response time 0.3 ms
-
- Veeam Software
- Posts: 21139
- Liked: 2141 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Backup File health checks slow
In case per-VM chains are enabled on the repository, health check should run in parallel for multiple VMs, which will load the storage more.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Backup File health checks slow
Fragmentation makes it slower over time. Defrag helps somewhat, but it has problems of its own (blocking metadata operations for files/folders it is analyzing during the initial analysis phase, and rehydrating synthetic fulls).
Per-VM chains don't help with very large servers. I have jobs that include very large VMs, and it's infeasible to check them (it takes well over 36 hours). It seems to me that Veeam is checking data at QD1 (or something similar with sync I/O), which spinning disks don't handle well. Aggressive queuing (async read-ahead) or scanning, for example, 1TB extents in parallel could increase performance a lot, especially on larger RAID arrays with SAS backends that have no problem with deep queues. In my case, a check of ~30TB of data with 100+ VMs starts out at 1GB/s+ (at reasonable latency, with parallel writes on top), and the final 1-2 large VMs drop to <100MB/s in heavily fragmented areas, dragging the check out for far too long.
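A rough way to gauge how much queue depth matters on a given repository is to run the same random-read pattern at QD1 and again with more outstanding I/Os. These are only illustrative diskspd runs, and the backup file path is a placeholder:
Code: Select all
rem one outstanding I/O - roughly the synchronous, QD1 behaviour described above
diskspd.exe -b512K -r4K -Sh -w0 -d120 -t1 -o1 -L D:\Backups\largevm.vbk

rem same pattern with 16 outstanding I/Os - what an async read-ahead could get out of the array
diskspd.exe -b512K -r4K -Sh -w0 -d120 -t1 -o16 -L D:\Backups\largevm.vbk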
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Backup File health checks slow
Saw "async read everywhere" in v11 updates video under engine improvements. Could this be similar to the concept I described (async read-ahead)?
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Backup File health checks slow
Yes, you're talking about the same thing. But it only helps with enterprise-grade storage like the one you have... it will not do miracles for a low spindle count or the lack of an enterprise-grade RAID controller.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Backup File health checks slow
Great news!
Would it also apply to proxy operations (Direct SAN, HotAdd) as they also seem to suffer from low queue depth?
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Backup File health checks slow
Proxy operations have been like that for many years now... since v8 or something. But only in our proprietary transport modes, namely hot add, direct NFS and backup from storage snapshots. We cannot control how VMware VDDK-based transport modes read data, except for NBD mode with ESXi 6.7 or later (where we do async reads by default). But let's not hijack the current thread with this.
-
- Veteran
- Posts: 298
- Liked: 85 times
- Joined: Feb 16, 2017 8:05 pm
- Contact:
Re: Backup File health checks slow
IIRC, slow random I/O is intrinsic to RAID 5 & 6 - I believe it's due to parity.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Backup File health checks slow
That would explain why HotAdd is occasionally much faster than Direct SAN... but I digress.
RAID 5/6 is slower on writes due to parity, but reads are not affected.
-
- Enthusiast
- Posts: 78
- Liked: 46 times
- Joined: Dec 10, 2019 3:59 pm
- Full Name: Ryan Walker
- Contact:
Re: Backup File health checks slow
In point of fact, RAID 5/6 reads are often faster, as there are more disks to read from; writes, yes, will be impacted.
However - zomg, Gostev makes me happy - async reads with an all-SSD array will be yum yum. Spinning rust... could improve, but that is quasi-random I/O, so unless you have a ton of spindles it won't improve much, or it might even hurt depending on the system.
Curious how that'll impact health checks, but for now I've disabled them on my jobs; when we're talking 20-40 TiB VMs, it's just not doable even on an all-SSD server that can do 4-6 Gbps multi-threaded/queued reads.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Backup File health checks slow
Is it tunable? For example, by using a larger read-ahead window. I'm not currently seeing a lot of improvement with large files.