perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Health check on large backups

Post by perjonsson1960 »

Folks,

We have a file server cluster that we recently began backing up with B&R. The cluster consists of two physical servers, and is backed up as a failover cluster. The amount of data right now is approx. 17 TB, and the number of files is just over 18 million. I don't know if that is to be considered a large backup, or if it perhaps is small compared to some other installations.

The initial full backup took around seven hours, which is okay. An incremental backup takes about 15 minutes, which of course is also okay. And the Merge operation takes around 30 minutes, also okay. However, on Monday the job started a health check, and that process took 26 hours and 28 minutes. And that ruins the backup window on that day of the month when the health check is performed, even if we run the backup job only once every 24 hours. But we are considering running the job twice, once during the night and once mid-day at lunch time.

What is the health check doing that takes so long? Is the long time due to the amount of data or the large number of files?

Right now the job is forever forward incremental. If we change the job to forward incremental with synthetic fulls, can we then safely skip the health check since there are new fulls created regularly? Or are the errors in the latest full backup file, if any, copied to the new full backup file? I can imagine that creating a synthetic full is much faster than doing a health check... We have not done a defrag and compact, so I don't know how long that takes, but if we change to forward incremental with synthetic fulls, we don't need to do a defrag and compact.

Regards,
PJ
Mildur
Product Manager
Posts: 8735
Liked: 2296 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Health check on large backups

Post by Mildur »

I would recommend skipping the health check only if you are running SureBackup jobs.
A synthetic full helps if you are using Fast Clone with ReFS or XFS, but it reuses the existing blocks on disk. If those blocks are corrupted, your backups cannot be restored.

You can read more about the health check and what it does here. I think it's the amount of data in your case.
https://helpcenter.veeam.com/docs/backu ... ml?ver=110

What are you using as backup storage? If it's too slow, the health check can take a long time for your amount of data.
Product Management Analyst @ Veeam Software
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

Thanks for your reply! So, we cannot skip the health check. Right.

The backup storage: We have two identical HPE DL380 servers with a ridiculously large amount of RAM, and two disk enclosures each of about 90 TB. So, four backup repositories included in a scale-out rep. of approx. 360 TB. Each disk enclosure is a RAID 5 array with 11 "live" disks and one spare. Our supplier recommended RAID 5 over RAID 6 with the motivation that RAID 5 is faster. The file system is ReFS with dedup. and all the fancy stuff. ;-)
foggy
Veeam Software
Posts: 21073
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Health check on large backups

Post by foggy »

RAM doesn't really matter here; the storage's random I/O capability is what matters most. The health check is heavily random, as it reads all the blocks required to build the latest restore point, and those blocks are scattered across multiple files.
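As a rough back-of-envelope from the numbers in this thread: assuming the check reads the full ~17 TB and a ~1 MiB backup block size (an assumption; Veeam's default for local targets, and fragmented chains may read even smaller chunks), the observed 26 h 28 min run implies an effective random-read rate of only a couple hundred blocks per second across the whole array:

```python
# Back-of-envelope: what the observed 26h28m health check implies about
# effective random-read rate. The 1 MiB block size is an assumption;
# actual reads may be smaller due to fragmentation.
data_bytes = 17 * 10**12            # ~17 TB of backed-up data
block_bytes = 1 * 2**20             # assumed ~1 MiB backup block
elapsed_s = 26 * 3600 + 28 * 60     # 26 h 28 min observed

blocks = data_bytes / block_bytes               # ~16.2 million blocks
effective_iops = blocks / elapsed_s             # blocks read per second
throughput_mb_s = data_bytes / elapsed_s / 1e6  # average MB/s

print(f"{blocks / 1e6:.1f}M blocks, ~{effective_iops:.0f} blocks/s, "
      f"~{throughput_mb_s:.0f} MB/s average")
```

Roughly 170 random reads per second for the whole array is on the order of what one or two 7.2k spindles deliver, which supports the seek-bound diagnosis.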
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

Is there anything that can be done with the controller cache settings in order to speed up things in general, and health check in particular? For example, the cache ratio is set to 10% read and 90% write, which apparently is the default, because I don't remember ever changing it. There are also a number of other controller parameters that can be tweaked:

Selected Performance Profile - Default Settings
Parity RAID Degraded Mode Performance Optimization - Disabled
Physical Drive Request Elevator Sort - Enabled
Maximum Drive Request Queue Depth - Automatic
Monitor and Performance Analysis Delay - 60
HDD Flexible Latency Optimization - Disabled

The controllers are "HPE Smart Array P408e-p SR Gen10" and "P408i-a SR Gen10". For some reason the internal controller has only 2048 MB cache, while the external controller has 4096 MB. I don't know if that is normal.
PetrM
Veeam Software
Posts: 3264
Liked: 528 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Health check on large backups

Post by PetrM »

Hello,

Basically, you should follow the vendor's recommendations to improve random I/O. The only idea that comes to mind is to experiment with different block sizes and compression levels to see how health check speed depends on them. You may review this page on our help center.

Thanks!
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik »

veeam-backup-replication-f2/backup-file ... 62680.html This thread also covered the same problem (maybe merge?). Not sure what your version is, but v11 was supposed to have improvements; personally, I don't see the difference. Nobody answered my question about increasing the readahead window either...
The problem is real with spinning media (especially with very large backup files; 17 TB is quite big), and it gets worse over time with fragmentation. You really don't have many options other than improving latency, and the only way to do that is faster media like 10k/15k disks or SSD. Or maybe an option to increase the async readahead even further to keep the storage busy.
The file system is ReFS with dedup
Don't do dedupe, it makes it even worse with more random IO.
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

Don't do dedupe, you say. But it saves diskspace. Bigtime. There are pros and cons with pretty much everything. ;-)
Mildur
Product Manager
Posts: 8735
Liked: 2296 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Health check on large backups

Post by Mildur »

I'd prefer to do ReFS Fast Clone without dedupe. The space savings and synthetic full runs are very good.
Product Management Analyst @ Veeam Software
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

And how do I turn off dedupe? Is that the "Enable inline data deduplication" checkbox in the backup job? It says "recommended" there... It is probably even the default setting. Can I turn that off for a job that has an existing backup chain?
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik »

I thought you meant ReFS level deduplication (the one with experimental support). Veeam's integrated one is fine.
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

I really don't know what I mean, but I guess that the ReFS level dedupe is not something that you can turn off, and live to talk about it? ;-)
Mildur
Product Manager
Posts: 8735
Liked: 2296 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Health check on large backups

Post by Mildur »

ReFS level dedupe = Windows Server Data Deduplication
ReFS level dedupe = FAST Clone (ReFS Block Cloning API)

Which of these two are you referring to?

Disabling "Enable inline data deduplication" is only needed if you are saving your backup job or backup copy job to a deduplication appliance.
Product Management Analyst @ Veeam Software
foggy
Veeam Software
Posts: 21073
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Health check on large backups

Post by foggy »

(and not even to any deduplication appliance)
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

This is very confusing... So the "inline data deduplication" referred to in the backup job configuration is not the same as in the Windows ReFS filesystem? What I know is that Fast Clone is used, and that "inline data deduplication" is activated in the backup jobs. And I am pretty sure that the Windows Server (ReFS) dedupe function is also activated. I don't know how to check if it is, but if it is, then I suppose that it was activated back in the day when the servers and the RAID arrays were installed, right? I guess that the Windows (ReFS) dedupe cannot be turned off without losing all the data on the repositories?
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

I am looking in the Server Manager under File and Storage Services -> Volumes, and the columns "Deduplication Rate" and "Deduplication Savings" are both empty. Does that mean that deduplication is not used? I cannot find any reference to deduplication anywhere else.
foggy
Veeam Software
Posts: 21073
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Health check on large backups

Post by foggy »

Yes, this is the right place to check the status and enable/disable the Windows Data Deduplication feature. As for Veeam B&R inline deduplication and the Fast Clone functionality, please review the referenced user guide sections to better understand them.
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

I have checked now. The Data Deduplication feature is NOT installed in our repository servers. I guess that is a good thing? So, only the integrated dedup in B&R is used.
foggy
Veeam Software
Posts: 21073
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Health check on large backups

Post by foggy » 1 person likes this post

Yes, that's the recommended configuration Fabian was talking about.
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

Great!

Perhaps a semi-related question: what is generally the recommended read/write ratio in the RAID controllers? I have now changed it to 30% read and 70% write, to see if I can detect any difference in the speed of various operations. But I guess any differences will be minor at best. For example, I guess that a health check is mainly random reads?
foggy
Veeam Software
Posts: 21073
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Health check on large backups

Post by foggy »

But for example, I guess that a health check is mainly random reads?
That's correct.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik »

My recommendation would be to set it to 100% write or 10/90 read/write. With modern amounts of memory, it doesn't make much sense to cache reads on the controller. A small amount of read cache *may* enable controller-side read-ahead, but I haven't seen any noticeable difference.
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

Okay, thanks! It was set at 10/90 read/write before. I thought that maybe 30/70 would speed up a health check somewhat, since that is virtually only reads. But if the difference in performance is so slim, perhaps it will not show any significant change in the time needed, even if the health check took over 26 hours with the 10/90 setting?
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik » 2 people like this post

Read cache mostly exists to keep already-read data in cache so that repeated accesses are served from it. However, with modern amounts of memory and heavy OS-side caching, IMHO this effect is practically nonexistent. RAID controllers also do read-ahead, but I doubt those windows are very large, and it has no effect on random I/O. The OS (and Veeam itself) also does read-ahead, so there is little point in doing it on the controller side, especially as the OS/application actually knows what it needs next.

IMHO RAID caching only makes sense for writes, especially on parity RAIDs, where it helps a lot. Uncached parity RAID writes are painful.
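A sketch of why uncached parity writes hurt, using illustrative numbers (the per-spindle IOPS figure is a generic 7.2k-RPM assumption, not measured on this hardware): an uncached small write on RAID 5 costs four disk operations (read old data, read old parity, write new data, write new parity), while a write cache that coalesces writes into full stripes pays only one parity write per stripe:

```python
# RAID 5 small-write penalty, illustrative numbers only.
spindles = 11          # live disks per enclosure in this thread
disk_iops = 150        # assumed random IOPS for one 7.2k-RPM disk

# Uncached random small write: read data + read parity + write data +
# write parity = 4 disk I/Os per host write.
uncached_write_iops = spindles * disk_iops / 4

# Cached full-stripe write: parity is computed in controller RAM, so
# every disk I/O carries payload (10 data + 1 parity per 11-disk stripe).
full_stripe_write_iops = spindles * disk_iops * (spindles - 1) / spindles

print(f"uncached: ~{uncached_write_iops:.0f} IOPS, "
      f"full-stripe: ~{full_stripe_write_iops:.0f} IOPS")
```

Under these assumptions the cached full-stripe path is roughly 3-4x faster, which is why the write side of the cache ratio earns its keep on parity RAID.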
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 » 1 person likes this post

Thanks! I might just as well go back to the 10/90 setting then. :-)
garrettt12
Novice
Posts: 3
Liked: 1 time
Joined: Oct 24, 2019 3:29 pm
Full Name: Garrett
Contact:

Re: Health check on large backups

Post by garrettt12 »

If you can spare the space, I would consider experimenting with something not mentioned here yet: periodic active fulls. These are more of a sequential write workload than a random I/O one, which may be very beneficial for your spinning-disk configuration.

I know this isn't a universally agreed good practice, but you may be able to forgo health checks with this method too, since your window of exposure to bit rot and similar corruption would only be the time between active fulls. Once health checks get so long that you start missing restore points, it becomes a serious consideration IMHO. I'd be willing to hear what someone more knowledgeable on the corruption side says about this, however.
mkh
Service Provider
Posts: 64
Liked: 18 times
Joined: Apr 20, 2018 6:17 am
Full Name: Michael Høyer
Contact:

Re: Health check on large backups

Post by mkh »

perjonsson1960 wrote: May 19, 2021 8:42 am Thanks for your reply! So, we cannot skip the health check. Right.

The backup storage: We have two identical HPE DL380 servers with a ridiculously large amount of RAM, and two disk enclosures each of about 90 TB. So, four backup repositories included in a scale-out rep. of approx. 360 TB. Each disk enclosure is a RAID 5 array with 11 "live" disks and one spare. Our supplier recommended RAID 5 over RAID 6 with the motivation that RAID 5 is faster. The file system is ReFS with dedup. and all the fancy stuff. ;-)
Hi Per

unrelated to the health check/performance etc part, with disks of this size please do RAID 6 instead of RAID 5

one link to info about raid 5 rebuilds with large disks, see the tables at the end - https://www.digistor.com.au/the-latest/ ... e-in-2019/
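The rebuild-risk argument can be put in numbers. Assuming ~9 TB disks (to match the ~90 TB enclosures described above) and a vendor unrecoverable-read-error spec of 1 in 10^14 bits, a common figure for SATA disks (NL-SAS is often specced at 1 in 10^15, which improves the odds considerably), the chance of hitting at least one URE while rebuilding a failed RAID 5 member is:

```python
import math

# RAID 5 rebuild must read every bit of all surviving disks.
disk_tb = 9                 # assumed disk size (~90 TB / 10 data disks)
surviving_disks = 10
ure_rate = 1e-14            # assumed URE spec: 1 error per 1e14 bits read

bits_read = surviving_disks * disk_tb * 1e12 * 8
expected_ures = bits_read * ure_rate
p_at_least_one = 1 - math.exp(-expected_ures)   # Poisson approximation

print(f"expected UREs during rebuild: {expected_ures:.1f}, "
      f"P(>=1) = {p_at_least_one:.3f}")
```

Under these assumptions a URE during rebuild is near certain, and on RAID 5 that means data loss; RAID 6 can still correct it from the second parity, which is the point of the linked article.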
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: Health check on large backups

Post by ejenner »

Had to turn off health checking on our large file server jobs for the same reason. I don't think our file server jobs are quite so large either. Big enough, but maybe only 80% of what you're backing up. We also have the HPE servers with direct attached disk.

Simple choice for us. It was either a backup without a health check or no backup at all.
bryanvaneeden
Novice
Posts: 8
Liked: never
Joined: Jan 15, 2021 10:45 am
Full Name: Bryan van Eeden
Contact:

Re: Health check on large backups

Post by bryanvaneeden »

We actually have the same issue with a similarly large 17 TB VM. The backup always succeeds within minutes, but the health check takes days. This is unacceptable. Like the other guys here, we have a very large, high-performing repository server with ReFS and all the bells and whistles. Fast Clone is being used and the backups run at over 2 GB/s. Not quite sure what else we can do to fix the health check speed.
perjonsson1960
Veteran
Posts: 463
Liked: 47 times
Joined: Jun 06, 2018 5:41 am
Full Name: Per Jonsson
Location: Sweden
Contact:

Re: Health check on large backups

Post by perjonsson1960 »

mkh wrote: May 25, 2021 6:55 am unrelated to the health check/performance etc part, with disks of this size please do RAID 6 instead of RAID 5
I am not sure if it is possible to convert without losing all the data? I have googled a little, and some say that it is possible, and some that it is not. And isn't RAID 6 slower than RAID 5 due to the parity calculations in all writes?