JaySt
Service Provider
Posts: 415
Liked: 75 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Backup File health checks slow

Post by JaySt »

We've got some pretty awesome repository servers (HPE DL380 Gen10) running Windows Server 2019 with ReFS on 12 local 12Gbps NL-SAS disks in a RAID 6 config. Great backup performance, great restore performance. However, the health check on the backup files takes more than 3 days and stalls the backup jobs that are due for their scheduled health check.
I know a health check can take a while, but I'm trying to understand why it takes THIS long. Info:
- chain is forever incremental, 31 restore points max
- per-vm backup files enabled
- multiple jobs
- 20TB+ total data

The health check is doing something, but it's nowhere close to saturating CPU, memory, or disk resources. It currently does just 20-40 MB/s to the local disks, the CPU is happy in the <30% region, and there's plenty of RAM available.
The system is capable of hundreds of MB/s of reads and writes, but I don't understand why it doesn't show that during the health check.

So how does the health check operate exactly? Is it capable of doing things in parallel?
Is it likely that something is wrong when the health check only does 20-40 MB/s?
Is ReFS making things extra difficult for health checks?
Veeam Certified Engineer
Gostev
Chief Product Officer
Posts: 31559
Liked: 6724 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Backup File health checks slow

Post by Gostev »

Please don't forget to ALWAYS include the support case ID when posting about ANY technical issue whatsoever, as requested when you click New Topic.

I would try testing your storage performance for random I/O with the typical Veeam block size. You should do the "Worst case scenario" test from the "Slow restore" chapter, as this should best represent the health check I/O pattern on ReFS.
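
For reference, a test along those lines might look like the sketch below, run from PowerShell (the exact command and parameters are in the KB article; the backup file path here is just a placeholder):

Code: Select all

# 512 KiB blocks, random I/O aligned to 4 KiB, caching disabled, 10-minute read test
diskspd.exe -b512K -r4K -Sh -d600 D:\Backups\sample.vbk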

Thanks!
popjls
Enthusiast
Posts: 55
Liked: 5 times
Joined: Jun 25, 2018 3:41 am
Contact:

Re: Backup File health checks slow

Post by popjls »

I also have this exact issue. Backup file checking has become incredibly slow with this new release. I don't believe it's a storage or a networking issue, as both are relatively untouched compared to their capabilities. I'm going to test a few more ideas before opening a case, but I'll say you are not alone.
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: Backup File health checks slow

Post by veremin »

Test your storage performance using the utility above and open a support ticket if you don't manage to confirm an issue with the storage system. Thanks!
JaySt
Service Provider
Posts: 415
Liked: 75 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Backup File health checks slow

Post by JaySt »

I'll do a performance test as mentioned above. If it shows significantly better numbers compared to what I see during the health check, I'll create a case. Good suggestion.
But to understand this better:
The "Slow restore" / "Worst case scenario" disk test is suggested, and it does random I/O. Is the health check doing random I/O because ReFS is used as the filesystem, or is the health check mostly random regardless of the filesystem used? What aspect makes the health check heavy on random I/O vs. sequential?
I don't have problems with restore speeds, for example.
Veeam Certified Engineer
foggy
Veeam Software
Posts: 21071
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Backup File health checks slow

Post by foggy »

Health check is always random: it reads all the blocks required to build the latest restore point, and those blocks can be scattered across multiple files. Restore can be either random or sequential depending on the particular restore option; for example, full VM restore from a single full backup file is mostly sequential, while Instant Recovery is purely random.
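
If you want to see how big that gap is on your own repository, you could compare a sequential read pass against a random one on the same backup file with diskspd, run from PowerShell (the path and duration below are just placeholders for illustration):

Code: Select all

# Sequential 512 KiB reads - roughly the restore-from-a-single-full pattern
diskspd.exe -b512K -Sh -d60 D:\Backups\sample.vbk

# Random 512 KiB reads aligned to 4 KiB - closer to the health check pattern
diskspd.exe -b512K -r4K -Sh -d60 D:\Backups\sample.vbk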
JaySt
Service Provider
Posts: 415
Liked: 75 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Backup File health checks slow

Post by JaySt » 1 person likes this post

I did some tests with diskspd as suggested. The random 4K pattern showed a maximum of 20 MiB/s throughput, so that actually matched the throughput we saw during the health check. The test was performed against a single .vbk file.
This effectively means we'll probably disable the health checks. It's just too much data to process at these speeds; the jobs would run for far too long (multiple days). We have copy jobs in place to a second server, so that reads the data as well, just in a different way.
I executed this command (512 KiB blocks, random I/O with 4 KiB alignment, caching disabled, 600-second read test):

Code: Select all

diskspd.exe -b512K -r4K -Sh -d600 D:\path\to\fullbackup.vbk 
Veeam Certified Engineer
JRRW
Enthusiast
Posts: 76
Liked: 45 times
Joined: Dec 10, 2019 3:59 pm
Full Name: Ryan Walker
Contact:

Re: Backup File health checks slow

Post by JRRW »

Interesting.

Is there any way to move this to a multi-threaded operation?

Because I know it's not my repository server, when you consider that my random IOPS are in the tens of thousands:

Code: Select all

 Random 4KiB (Q= 32, T=16):   618.499 MB/s [ 151000.7 IOPS] <  3387.40 us>
Even using the DiskSpd worst case, it's not an ugly result (this was run WHILE doing a health check on another backup file):

Code: Select all

Input parameters:

        timespan:   1
        -------------
        duration: 10s
        warm up time: 5s
        cool down time: 0s
        gathering IOPS at intervals of 600ms
        random seed: 0
        path: '****.vbk'
                think time: 0ms
                burst size: 0
                software cache disabled
                hardware write cache disabled, writethrough on
                performing read test
                block size: 524288
                using random I/O (alignment: 4096)
                number of outstanding I/O operations: 2
                thread stride size: 0
                threads per file: 1
                using I/O Completion Ports
                IO priority: normal

System information:

        computer name: *******
        start time: 2020/10/02 13:26:11 UTC

Results for timespan 1:
*******************************************************************************

actual test time:       10.01s
thread count:           1
proc count:             32

CPU |  Usage |  User  |  Kernel |  Idle
-------------------------------------------
   0|  77.03%|   1.25%|   75.78%|  22.97%
   1|   0.47%|   0.00%|    0.47%|  99.53%
   2|  18.00%|   9.55%|    8.45%|  82.00%
   3|   0.00%|   0.00%|    0.00%| 100.00%
   4|   4.69%|   2.19%|    2.50%|  95.31%
   5|  16.88%|   0.00%|   16.88%|  83.13%
   6|   2.03%|   1.10%|    0.94%|  97.97%
   7|  71.88%|   0.00%|   71.88%|  28.13%
   8|  11.41%|   0.47%|   10.94%|  88.59%
   9|   7.19%|   2.34%|    4.84%|  92.81%
  10|  15.63%|   0.31%|   15.31%|  84.38%
  11|   0.00%|   0.00%|    0.00%| 100.00%
  12|   4.69%|   0.47%|    4.22%|  95.31%
  13|  11.08%|  11.08%|    0.00%|  88.92%
  14|   8.91%|   0.00%|    8.91%|  91.09%
  15|   0.47%|   0.00%|    0.47%|  99.53%
  16|   1.09%|   0.00%|    1.09%|  98.91%
  17|   0.31%|   0.00%|    0.31%|  99.69%
  18|   0.00%|   0.00%|    0.00%| 100.00%
  19|   1.72%|   0.31%|    1.41%|  98.28%
  20|   0.00%|   0.00%|    0.00%| 100.00%
  21|   0.16%|   0.00%|    0.16%|  99.84%
  22|   0.16%|   0.00%|    0.16%|  99.84%
  23|   0.00%|   0.00%|    0.00%| 100.00%
  24|   0.00%|   0.00%|    0.00%| 100.00%
  25|   0.00%|   0.00%|    0.00%| 100.00%
  26|   0.00%|   0.00%|    0.00%| 100.00%
  27|   0.00%|   0.00%|    0.00%| 100.00%
  28|   0.00%|   0.00%|    0.00%| 100.00%
  29|   1.56%|   0.00%|    1.56%|  98.44%
  30|   0.00%|   0.00%|    0.00%| 100.00%
  31|   0.16%|   0.00%|    0.16%|  99.84%
-------------------------------------------
avg.|   7.98%|   0.91%|    7.08%|  92.02%

Total IO
thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s | IopsStdDev |  file
-------------------------------------------------------------------------------------------
     0 |      1662517248 |         3171 |     158.46 |     316.92 |      64.70 | *****.vbk (1814GiB)
-------------------------------------------------------------------------------------------
total:        1662517248 |         3171 |     158.46 |     316.92 |      64.70

Read IO
thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s | IopsStdDev |  file
-------------------------------------------------------------------------------------------
     0 |      1662517248 |         3171 |     158.46 |     316.92 |      64.70 | *****.vbk (1814GiB)
-------------------------------------------------------------------------------------------
total:        1662517248 |         3171 |     158.46 |     316.92 |      64.70
So that should be at least 168 MB/s, but the health check only averages about 45-68 MB/s according to my real-time monitoring, with peaks around 110-140 MB/s from time to time.

And Latency is negligible:

Code: Select all

Disk 0 

	AVAGO SMC3108 SCSI Disk Device

	Capacity:	161 TB
	Formatted:	161 TB
	System disk:	No
	Page file:	No

	Read speed	82.9 MB/s
	Write speed	0 KB/s
	Active time	5%
	Average response time	0.3 ms
The backend disk is ReFS with 64 KB clusters on a RAID 5 SSD array with a 512 KB stripe size.
foggy
Veeam Software
Posts: 21071
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Backup File health checks slow

Post by foggy » 2 people like this post

If per-VM chains are enabled on the repository, the health check should run in parallel for multiple VMs, which will load the storage more.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Backup File health checks slow

Post by DonZoomik »

Fragmentation makes it slower over time. Defrag helps somewhat, but it has problems of its own (blocking metadata operations for the files/folders it's analyzing during the initial analysis phase, rehydrating synthetic fulls).
Per-VM chains don't help with very large servers. I have jobs that include very large VMs, and it's not feasible to check them (it takes well over 36 hours). It seems to me that Veeam is checking data at QD1 (or something similar with sync I/O), which spinning disks don't handle well. Aggressive queueing (async read-ahead) or scanning in parallel (for example, 1 TB extents) could possibly increase performance a lot, especially with larger RAID arrays on SAS backends that have no problems with large queues. In my case, a check of ~30 TB of data with 100+ VMs starts out at 1 GB/s+ (at reasonable latency, plus parallel writes), and the final 1-2 large VMs drop to <100 MB/s in heavily fragmented areas, dragging out the check for far too long.
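
The queue depth effect is easy to demonstrate with diskspd by running the same random-read pattern at QD1 and at a deeper queue against one backup file, from PowerShell (the path is just a placeholder, and the numbers will obviously depend on the array):

Code: Select all

# Random 512 KiB reads, 1 outstanding I/O on one thread - roughly sync I/O behaviour
diskspd.exe -b512K -r4K -o1 -t1 -Sh -d60 D:\Backups\sample.vbk

# Same pattern with 16 outstanding I/Os - what async read-ahead could take advantage of
diskspd.exe -b512K -r4K -o16 -t1 -Sh -d60 D:\Backups\sample.vbk

On a SAS-backed array with plenty of spindles, the second run usually shows far higher throughput, which is roughly the headroom async read-ahead could reclaim.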
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Backup File health checks slow

Post by DonZoomik »

Saw "async read everywhere" in v11 updates video under engine improvements. Could this be similar to the concept I described (async read-ahead)?
Gostev
Chief Product Officer
Posts: 31559
Liked: 6724 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Backup File health checks slow

Post by Gostev » 3 people like this post

Yes, we're talking about the same thing. But it only helps with enterprise-grade storage like the one you have... it will not do miracles for a low spindle count or the lack of an enterprise-grade RAID controller.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Backup File health checks slow

Post by DonZoomik »

Great news!
Would it also apply to proxy operations (Direct SAN, HotAdd) as they also seem to suffer from low queue depth?
Gostev
Chief Product Officer
Posts: 31559
Liked: 6724 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Backup File health checks slow

Post by Gostev »

Proxy operations have been like that for many years now... since v8 or something. But only in our proprietary transport modes, namely hot add, Direct NFS, and backup from storage snapshots. We cannot control how VMware VDDK-based transport modes read data, except for NBD mode with ESXi 6.7 or later (where we do async reads by default). But let's not hijack the current thread with this :D
nitramd
Veteran
Posts: 297
Liked: 85 times
Joined: Feb 16, 2017 8:05 pm
Contact:

Re: Backup File health checks slow

Post by nitramd »

IIRC, slow random I/O is intrinsic to RAID 5 and 6 - I believe it's due to parity.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Backup File health checks slow

Post by DonZoomik »

That would explain why Hot Add is occasionally much faster than Direct SAN... but I digress.

RAID 5/6 is slower on writes due to parity, but reads are not affected.
JRRW
Enthusiast
Posts: 76
Liked: 45 times
Joined: Dec 10, 2019 3:59 pm
Full Name: Ryan Walker
Contact:

Re: Backup File health checks slow

Post by JRRW » 1 person likes this post

In point of fact, RAID 5/6 reads are often faster, as there are more disks to read from; writes, yes, will be impacted.

However - zomg, Gostev makes me happy - async reads with an all-SSD array will be yum yum. Spinning rust drives... could improve, but that's quasi-random I/O, so unless you have a ton of spindles it won't improve much, or might even hurt depending on the system.

Curious how that'll impact health checks, but for now I've disabled them on my jobs; when we're talking 20-40 TiB VMs, it's just not doable, even with an all-SSD server that can do 4-6 Gbps multi-threaded/queued reads.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Backup File health checks slow

Post by DonZoomik »

Gostev wrote: Oct 27, 2020 1:48 pm Yes, we're talking about the same thing. But it only helps with enterprise-grade storage like the one you have... it will not do miracles for a low spindle count or the lack of an enterprise-grade RAID controller.
Is it tunable? For example, by using a larger read-ahead window. I'm not currently seeing a lot of improvement with large files.