JaySt
Service Provider
Posts: 415
Liked: 75 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Backup File health checks slow

Post by JaySt »

We've got some pretty awesome repository servers (HPE DL380 Gen10) running Windows Server 2019 with ReFS on 12 local 12Gbps NL-SAS disks in a RAID 6 config. Great backup performance, great restore performance. However, the health check on the backup files takes more than 3 days and stalls the backup jobs that are due for their scheduled health check.
I know a health check can take a while, but I'm trying to understand why it takes THIS long. Info:
- chain is forever incremental, 31 restore points max
- per-vm backup files enabled
- multiple jobs
- 20TB+ total data

The health check is doing something, but it's nowhere close to saturating CPU, memory, or disk resources. It currently does just 20-40 MB/s to the local disks, the CPU is happy in the <30% region, and there's plenty of RAM available.
The system is capable of hundreds of MB/s of reads and writes, but I don't understand why it doesn't show that during the health check.

So how does the health check operate exactly? Is it capable of doing things in parallel?
Is it likely that something is wrong when the health check only does 20-40 MB/s?
Is ReFS making things extra difficult for health checks?
Veeam Certified Engineer
Gostev
Chief Product Officer
Posts: 31559
Liked: 6724 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Backup File health checks slow

Post by Gostev »

Please don't forget to ALWAYS include the support case ID when posting about ANY technical issue whatsoever, as requested when you click New Topic.

I would try testing your storage performance for random I/O with the typical Veeam block size. You should do the "Worst case scenario" test from the "Slow restore" chapter, as this should best represent the health check I/O pattern on ReFS.
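
For reference, a test along those lines might look like the sketch below, run from PowerShell (the exact command and parameters are in the KB article; the backup file path here is just a placeholder):

Code: Select all

# 512 KiB blocks, random I/O aligned to 4 KiB, caching disabled, 10-minute read test
diskspd.exe -b512K -r4K -Sh -d600 D:\Backups\sample.vbk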

Thanks!
popjls
Enthusiast
Posts: 55
Liked: 5 times
Joined: Jun 25, 2018 3:41 am
Contact:

Re: Backup File health checks slow

Post by popjls »

I also have this exact issue. Backup file checking has become incredibly slow with this new release. I don't believe it's a storage or a networking issue, as both are relatively untouched compared to their capabilities. I'm going to test a few more ideas before opening a case, but I'll say you are not alone.
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: Backup File health checks slow

Post by veremin »

Test your storage performance using the utility above and open a support ticket if you don't manage to confirm an issue with the storage system. Thanks!
JaySt
Service Provider
Posts: 415
Liked: 75 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Backup File health checks slow

Post by JaySt »

I'll do a performance test as mentioned above. If it shows significantly better numbers compared to what I see during the health check, I'll create a case. Good suggestion.
But to understand this better:
The "Slow restore" / "Worst case scenario" disk test is suggested, and it does random I/O. Is the health check doing random I/O because ReFS is used as the filesystem, or is the health check mostly random regardless of the filesystem used? What aspect makes the health check heavy on random I/O vs. sequential?
I don't have problems with restore speeds, for example.
Veeam Certified Engineer
foggy
Veeam Software
Posts: 21071
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Backup File health checks slow

Post by foggy »

Health check is always random: it reads all the blocks required to build the latest restore point, and those blocks can be scattered across multiple files. Restore can be either random or sequential depending on the particular restore option; for example, full VM restore from a single full backup file is mostly sequential, while Instant Recovery is purely random.
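
If you want to see how big that gap is on your own repository, you could compare a sequential read pass against a random one on the same backup file with diskspd, run from PowerShell (the path and duration below are just placeholders for illustration):

Code: Select all

# Sequential 512 KiB reads - roughly the restore-from-a-single-full pattern
diskspd.exe -b512K -Sh -d60 D:\Backups\sample.vbk

# Random 512 KiB reads aligned to 4 KiB - closer to the health check pattern
diskspd.exe -b512K -r4K -Sh -d60 D:\Backups\sample.vbk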
JaySt
Service Provider
Posts: 415
Liked: 75 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Backup File health checks slow

Post by JaySt » 1 person likes this post

I did some tests with diskspd as suggested. The random 4K pattern showed a maximum of 20 MiB/s throughput, so that actually matched the throughput we saw during the health check. The test was performed against a single .vbk file.
This effectively means we'll probably disable the health checks. It's just too much data to process at these speeds; the jobs would run for far too long (multiple days). We have copy jobs in place to a second server, so that reads the data as well, just in a different way.
I executed this command (512 KiB blocks, random I/O with 4 KiB alignment, caching disabled, 600-second read test):

Code: Select all

diskspd.exe -b512K -r4K -Sh -d600 D:\path\to\fullbackup.vbk 
Veeam Certified Engineer
JRRW
Enthusiast
Posts: 76
Liked: 45 times
Joined: Dec 10, 2019 3:59 pm
Full Name: Ryan Walker
Contact:

Re: Backup File health checks slow

Post by JRRW »

Interesting.

Is there any way to move this to a multi-threaded operation?

Because I know it's not my repository server, when you consider that my random IOPS are in the tens of thousands:

Code: Select all

 Random 4KiB (Q= 32, T=16):   618.499 MB/s [ 151000.7 IOPS] <  3387.40 us>
Even using the DiskSpd worst case, it's not an ugly result (this was run WHILE doing a health check on another backup file):

Code: Select all

Input parameters:

        timespan:   1
        -------------
        duration: 10s
        warm up time: 5s
        cool down time: 0s
        gathering IOPS at intervals of 600ms
        random seed: 0
        path: '****.vbk'
                think time: 0ms
                burst size: 0
                software cache disabled
                hardware write cache disabled, writethrough on
                performing read test
                block size: 524288
                using random I/O (alignment: 4096)
                number of outstanding I/O operations: 2
                thread stride size: 0
                threads per file: 1
                using I/O Completion Ports
                IO priority: normal

System information:

        computer name: *******
        start time: 2020/10/02 13:26:11 UTC

Results for timespan 1:
*******************************************************************************

actual test time:       10.01s
thread count:           1
proc count:             32

CPU |  Usage |  User  |  Kernel |  Idle
-------------------------------------------
   0|  77.03%|   1.25%|   75.78%|  22.97%
   1|   0.47%|   0.00%|    0.47%|  99.53%
   2|  18.00%|   9.55%|    8.45%|  82.00%
   3|   0.00%|   0.00%|    0.00%| 100.00%
   4|   4.69%|   2.19%|    2.50%|  95.31%
   5|  16.88%|   0.00%|   16.88%|  83.13%
   6|   2.03%|   1.10%|    0.94%|  97.97%
   7|  71.88%|   0.00%|   71.88%|  28.13%
   8|  11.41%|   0.47%|   10.94%|  88.59%
   9|   7.19%|   2.34%|    4.84%|  92.81%
  10|  15.63%|   0.31%|   15.31%|  84.38%
  11|   0.00%|   0.00%|    0.00%| 100.00%
  12|   4.69%|   0.47%|    4.22%|  95.31%
  13|  11.08%|  11.08%|    0.00%|  88.92%
  14|   8.91%|   0.00%|    8.91%|  91.09%
  15|   0.47%|   0.00%|    0.47%|  99.53%
  16|   1.09%|   0.00%|    1.09%|  98.91%
  17|   0.31%|   0.00%|    0.31%|  99.69%
  18|   0.00%|   0.00%|    0.00%| 100.00%
  19|   1.72%|   0.31%|    1.41%|  98.28%
  20|   0.00%|   0.00%|    0.00%| 100.00%
  21|   0.16%|   0.00%|    0.16%|  99.84%
  22|   0.16%|   0.00%|    0.16%|  99.84%
  23|   0.00%|   0.00%|    0.00%| 100.00%
  24|   0.00%|   0.00%|    0.00%| 100.00%
  25|   0.00%|   0.00%|    0.00%| 100.00%
  26|   0.00%|   0.00%|    0.00%| 100.00%
  27|   0.00%|   0.00%|    0.00%| 100.00%
  28|   0.00%|   0.00%|    0.00%| 100.00%
  29|   1.56%|   0.00%|    1.56%|  98.44%
  30|   0.00%|   0.00%|    0.00%| 100.00%
  31|   0.16%|   0.00%|    0.16%|  99.84%
-------------------------------------------
avg.|   7.98%|   0.91%|    7.08%|  92.02%

Total IO
thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s | IopsStdDev |  file
-------------------------------------------------------------------------------------------
     0 |      1662517248 |         3171 |     158.46 |     316.92 |      64.70 | *****.vbk (1814GiB)
-------------------------------------------------------------------------------------------
total:        1662517248 |         3171 |     158.46 |     316.92 |      64.70

Read IO
thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s | IopsStdDev |  file
-------------------------------------------------------------------------------------------
     0 |      1662517248 |         3171 |     158.46 |     316.92 |      64.70 | *****.vbk (1814GiB)
-------------------------------------------------------------------------------------------
total:        1662517248 |         3171 |     158.46 |     316.92 |      64.70
So that should be at least 168 MB/s, but the health check only averages about 45-68 MB/s according to my real-time monitoring, with peaks around 110-140 MB/s from time to time.

And Latency is negligible:

Code: Select all

Disk 0 

	AVAGO SMC3108 SCSI Disk Device

	Capacity:	161 TB
	Formatted:	161 TB
	System disk:	No
	Page file:	No

	Read speed	82.9 MB/s
	Write speed	0 KB/s
	Active time	5%
	Average response time	0.3 ms
The backend disk is ReFS with 64 KB clusters on a RAID 5 SSD array with a 512 KB stripe size.
foggy
Veeam Software
Posts: 21071
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Backup File health checks slow

Post by foggy » 2 people like this post

If per-VM chains are enabled on the repository, the health check should run in parallel for multiple VMs, which will load the storage more.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Backup File health checks slow

Post by DonZoomik »

Fragmentation makes it slower over time. Defrag helps somewhat, but it has problems of its own (blocking metadata operations for the files/folders it's analyzing during the initial analysis phase, rehydrating synthetic fulls).
Per-VM chains don't help with very large servers. I have jobs that include very large VMs, and it's not feasible to check them (it takes well over 36 hours). It seems to me that Veeam is checking data at QD1 (or something similar with sync I/O), which spinning disks don't handle well. Aggressive queueing (async read-ahead) or scanning in parallel (for example, 1 TB extents) could possibly increase performance a lot, especially with larger RAID arrays on SAS backends that have no problems with large queues. In my case, a check of ~30 TB of data with 100+ VMs starts out at 1 GB/s+ (at reasonable latency, plus parallel writes), and the final 1-2 large VMs drop to <100 MB/s in heavily fragmented areas, dragging out the check for far too long.
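
The queue depth effect is easy to demonstrate with diskspd by running the same random-read pattern at QD1 and at a deeper queue against one backup file, from PowerShell (the path is just a placeholder, and the numbers will obviously depend on the array):

Code: Select all

# Random 512 KiB reads, 1 outstanding I/O on one thread - roughly sync I/O behaviour
diskspd.exe -b512K -r4K -o1 -t1 -Sh -d60 D:\Backups\sample.vbk

# Same pattern with 16 outstanding I/Os - what async read-ahead could take advantage of
diskspd.exe -b512K -r4K -o16 -t1 -Sh -d60 D:\Backups\sample.vbk

On a SAS-backed array with plenty of spindles, the second run usually shows far higher throughput, which is roughly the headroom async read-ahead could reclaim.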
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Backup File health checks slow

Post by DonZoomik »

Saw "async read everywhere" in v11 updates video under engine improvements. Could this be similar to the concept I described (async read-ahead)?
Gostev
Chief Product Officer
Posts: 31559
Liked: 6724 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Backup File health checks slow

Post by Gostev » 3 people like this post

Yes, we're talking about the same thing. But it only helps with enterprise-grade storage like the one you have... it will not do miracles for a low spindle count or the lack of an enterprise-grade RAID controller.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Backup File health checks slow

Post by DonZoomik »

Great news!
Would it also apply to proxy operations (Direct SAN, HotAdd) as they also seem to suffer from low queue depth?
Gostev
Chief Product Officer
Posts: 31559
Liked: 6724 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Backup File health checks slow

Post by Gostev »

Proxy operations have been like that for many years now... since v8 or something. But only in our proprietary transport modes, namely hot add, Direct NFS, and backup from storage snapshots. We cannot control how VMware VDDK-based transport modes read data, except for NBD mode with ESXi 6.7 or later (where we do async reads by default). But let's not hijack the current thread with this :D
nitramd
Veteran
Posts: 297
Liked: 85 times
Joined: Feb 16, 2017 8:05 pm
Contact:

Re: Backup File health checks slow

Post by nitramd »

IIRC, slow random I/O is intrinsic to RAID 5 and 6 - I believe it's due to parity.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Backup File health checks slow

Post by DonZoomik »

That would explain why Hot Add is occasionally much faster than Direct SAN... but I digress.

RAID 5/6 is slower on writes due to parity, but reads are not affected.
JRRW
Enthusiast
Posts: 76
Liked: 45 times
Joined: Dec 10, 2019 3:59 pm
Full Name: Ryan Walker
Contact:

Re: Backup File health checks slow

Post by JRRW » 1 person likes this post

In point of fact, RAID 5/6 reads are often faster, as there are more disks to read from; writes, yes, will be impacted.

However - zomg, Gostev makes me happy - async reads with an all-SSD array will be yum yum. Spinning rust drives... could improve, but that's quasi-random I/O, so unless you have a ton of spindles it won't improve much, or might even hurt depending on the system.

Curious how that'll impact health checks, but for now I've disabled them on my jobs; when we're talking 20-40 TiB VMs, it's just not doable, even with an all-SSD server that can do 4-6 Gbps multi-threaded/queued reads.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Backup File health checks slow

Post by DonZoomik »

Gostev wrote: Oct 27, 2020 1:48 pm Yes, we're talking about the same thing. But it only helps with enterprise-grade storage like the one you have... it will not do miracles for a low spindle count or the lack of an enterprise-grade RAID controller.
Is it tunable? For example, by using a larger read-ahead window. I'm not currently seeing a lot of improvement with large files.