Health check on large backups

DonZoomik · May 27, 2021 3:49 pm

Can someone from Veeam comment on async read tuning opportunities? A larger read-ahead window could work around eventual fragmentation and allow for faster health checks on capable hardware.

mkh · May 31, 2021 7:33 am

perjonsson1960 wrote: ↑May 26, 2021 6:03 am I am not sure if it is possible to convert without losing all the data? I have googled a little, and some say that it is possible, and some that it is not. And isn't RAID 6 slower than RAID 5 due to the parity calculations in all writes?

Sadly I can't answer either of these questions for you; it depends on what your RAID card can do.
In my experience, a capable enough RAID card means there is little difference between RAID 5 and RAID 6 speed, since it has to calculate parity on RAID 5 anyway.
In the end, in my view there is no choice, given the high risk of not being able to rebuild the array after a disk failure. A bit slower beats the risk of total failure.

perjonsson1960 · May 31, 2021 8:43 am

We have actually replaced one failed disk already, maybe a year ago. The spare disk kicked in, and when the failed disk was replaced, it was rebuilt successfully. I don't remember how long it took, but there were no complications.

18436572 · Jun 01, 2021 2:28 pm

We had these issues with Veeam over a year ago. It was our main 14TB file share for the company. I decided to do Windows DFS and split the data up into 5 file servers. Best decision ever! Now our backups/copies/replicas are easy as pie. We also re-did our project folders to put each year into one folder \\server\share\project\year with one virtual server per year. After 5 years the projects go inactive... That's a major change and it took a few months to accomplish. It was definitely worth it!

DonZoomik · Jun 09, 2021 5:08 pm

I was observing Veeam behavior today and it seems to me that there is no read-ahead at all during health check.

A few days ago I was evacuating a SOBR extent and noticed that almost all IO was attributed to the System process, which means that Windows itself was doing read-ahead. However, this repository was very fragmented and I was seeing wildly fluctuating throughput from a low-end NAS. During high fragmentation, queue depth (QD) went up (with throughput dropping as low as 50 MB/s), while low fragmentation saw QD drop (with throughput rising to 500-600 MB/s).
Today I observed a health test and it was barely doing ~100MB/s on a quite high-end system (HPE Apollo 4200, P816 hardware RAID, 24*16TB disks in RAID60, SSD cache) that has been in operation for only two weeks (so low fragmentation). QD around one, all IO attributed to Veeam Process itself. I bet that if I checked handle attributes, the file was opened with flags that disable read-ahead. This chain is also reverse incremental so it is my understanding that health check largely just reads VBK file from start to end.

So no read-ahead at all or Veeam's async read is worse than Windows read-ahead?
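To illustrate the difference being discussed, here is a minimal Python sketch, not Veeam's implementation, that contrasts a verification pass issuing one read at a time (queue depth around one) with a pass that keeps several reads outstanding. The names `verify_sequential`, `verify_with_readahead`, `CHUNK`, and `depth` are hypothetical, and the 1 MiB chunk size is an arbitrary assumption.

```python
import hashlib
import os
from collections import deque
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024 * 1024  # 1 MiB per read request (arbitrary, for illustration)


def read_chunk(path, offset, size=CHUNK):
    """Read one chunk through its own handle so threads never share a seek position."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def verify_sequential(path):
    """Queue depth ~1: each read is issued only after the previous one returned."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            digest.update(chunk)
    return digest.hexdigest()


def verify_with_readahead(path, depth=8):
    """Keep up to `depth` reads outstanding while hashing chunks in file order."""
    digest = hashlib.sha256()
    size = os.path.getsize(path)
    pending = deque()  # futures for chunks that were requested but not hashed yet
    next_offset = 0
    with ThreadPoolExecutor(max_workers=depth) as pool:
        while next_offset < size or pending:
            # Top up the window so the storage always has work queued
            while next_offset < size and len(pending) < depth:
                pending.append(pool.submit(read_chunk, path, next_offset))
                next_offset += CHUNK
            # Consume the oldest request first to preserve file order
            digest.update(pending.popleft().result())
    return digest.hexdigest()
```

Both functions return the same digest for the same file; only the number of outstanding requests differs. Whether the second variant is actually faster depends on the storage, which is why a many-spindle RAID60 array would be expected to benefit more than a single disk.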

Gostev · Jun 09, 2021 6:45 pm

Looks like you found a bug and async read is not initialized for health check.

DonZoomik · Jun 09, 2021 8:50 pm

Umm... does that mean that the bug is a fact or just a theory? Should I push through support or can you confirm through QA/RD?

If it's a theory then I think v10 behaved the same. I don't have a v10 similar system to check but I pointed to similar behavior in another thread.

Jun 09, 2021 9:17 pm

It's a fact. The dev behind the async engine has already reviewed the source code for me and confirmed that the health check job does not initialize the required parameter. So there is no need to push this through support, as the bug has already been logged.

V10 did not have the async engine at all (outside of virtual full to tape where we piloted this functionality) so yeah V11 behaves the same as V10 in that sense.

DonZoomik · Jun 09, 2021 9:37 pm

Well that's good news (as in the situation isn't hopeless).
Sounds like a minor thing that could get fixed in one of the now quite frequent patches, within a reasonable timeframe (months)?

Gostev · Jun 09, 2021 9:37 pm

That's my hope too.

pirx · Jun 10, 2021 5:12 am

+1 Today my first health check of a single VM with 8 TB finished; it took ~8h (v10). In iotop I saw that the Veeam Agent was reading the data at ~300 MB/s. Our regular jobs are 20-40 TB, so this will take more than one day.

I'm still looking into SureBackup, but it's also not that fast, depending on the tests you run. And a simple boot + ping will not detect all possible bad blocks, right?

Gostev · Jun 10, 2021 11:59 am

Yes, you are right.

Gostev · Jun 10, 2021 12:04 pm

@pirx however, I should mention that SureBackup job also has an option to check all blocks. See the first checkbox on this step of the wizard. Thanks!

pirx · Jun 10, 2021 12:54 pm

@Gostev yeah, I've seen this and even used it in my test job (it also took ages, but I guess there's not much that can be changed there). It would still be nice to have a tool where I can randomly check single VMs. Something like right click -> check, or Validator for Linux.

It would be perfect to be able to check everything, but that just doesn't seem possible. So my idea would be to randomly pick a number of VMs and let them be checked. That's not really possible with SureBackup, right? Instead of checking a whole job once a month, which then runs very long and violates the RPO, let some VMs from different jobs be checked every day.

Gostev · Jun 10, 2021 1:04 pm

This is a good idea.

pirx · Jun 10, 2021 1:10 pm

Can I count that as a yes for a feature request? A very high-level description would be: let Veeam check a random set of x VMs each day, where a VM is rechecked only after all other VMs have been checked. So every VM (for which this check is activated) will be checked at some point. Maybe every 2 weeks, maybe only every 2 months. A priority list could hold VMs that should be checked daily/weekly.
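The selection logic described in this request can be sketched in a few lines of Python. This is only an illustration of the idea, not an existing Veeam feature or API; `pick_daily_batch` and all of its parameters are hypothetical names, and `per_day` here counts the randomly chosen VMs in addition to any priority VMs.

```python
import random


def pick_daily_batch(all_vms, already_checked, per_day, priority=(), rng=random):
    """Return (batch, updated_checked) for one day.

    Priority VMs are always included. The remaining `per_day` slots are filled
    at random from VMs not yet checked in the current cycle; once the cycle is
    exhausted it restarts, so every VM is eventually rechecked.
    """
    priority = [vm for vm in priority if vm in all_vms]
    candidates = [vm for vm in all_vms if vm not in priority]
    checked = set(already_checked) & set(candidates)
    unchecked = [vm for vm in candidates if vm not in checked]

    slots = min(per_day, len(candidates))
    batch = rng.sample(unchecked, min(slots, len(unchecked)))
    if len(batch) < slots:
        # Cycle complete: start a new one and top up from the remaining VMs
        checked = set()
        rest = [vm for vm in candidates if vm not in batch]
        batch += rng.sample(rest, slots - len(batch))

    checked.update(batch)
    return priority + batch, checked
```

The caller would persist `updated_checked` between days (for example in a small state file) and pass it back in on the next run. With 9 non-priority VMs and `per_day=3`, each VM is picked exactly once every three days.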

Jun 10, 2021 3:26 pm

In the interim, there's this great script from about 5 years ago that automates SureBackup via PowerShell to effectively provide this capability within the existing feature set of Veeam. Basically, you provide a list of systems in a text file, and it modifies the SureBackup job every day to test a different set of X systems, so that all of them are tested within the defined amount of time, say over the course of a month.

https://www.virtualtothecore.com/can-te ... urebackup/

It's been quite a while, and I suppose the script might not work without modification on the latest VBR, but most of the commands look pretty straightforward so if something is broken it would hopefully be easy to modify.

pirx · Jun 10, 2021 5:40 pm

Yes, this looks very interesting, I'll give it a try after my vacation.

pirx · Jun 11, 2021 11:58 am

I checked one other job with 117 VMs where I enabled health checks and was surprised that this job finished in just 35 minutes.

VM size: 10,4 TB
Backup files health check has been completed 35:48

The following are jobs with just one VM each.

VM size: 8,1 TB
Backup files health check has been completed 07:38:38

VM size: 3,2 TB
Backup files health check has been completed 02:32:54

Why is that one job so fast and the others so much slower?
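To put these durations side by side, the following short Python calculation converts each result into an average rate. It assumes decimal units (1 TB = 10^6 MB) and uses the reported VM size as the amount of data, which is an assumption: the health check reads the backup files, whose size on disk is not given here. The helper names are hypothetical.

```python
def to_seconds(duration):
    """Convert 'MM:SS' or 'HH:MM:SS' (hours may exceed 24) to seconds."""
    seconds = 0
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def throughput_mb_s(size_tb, duration):
    """Average MB/s relative to VM size, assuming 1 TB = 10**6 MB."""
    return size_tb * 1_000_000 / to_seconds(duration)


results = [(10.4, "35:48"), (8.1, "07:38:38"), (3.2, "02:32:54")]
for size_tb, duration in results:
    rate = throughput_mb_s(size_tb, duration)
    print(f"{size_tb} TB in {duration}: {rate:.0f} MB/s")
```

Relative to VM size, the 117-VM job works out to roughly 4.8 GB/s, while the two single-VM jobs land at about 0.3 GB/s each, so the gap is more than an order of magnitude.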

Gostev · Jun 11, 2021 12:06 pm

Due to the presence of many empty or repeating blocks perhaps. Remember Veeam has built-in deduplication and only stores such blocks once.
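As a rough illustration of why empty or repeating blocks shrink the amount of data a check has to read, here is a minimal block-level deduplication count in Python. It is a generic sketch, not Veeam's actual storage format; the 4 KiB block size and the name `dedup_stats` are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, for illustration only


def dedup_stats(data, block_size=BLOCK_SIZE):
    """Return (logical_blocks, unique_blocks) for a byte string."""
    seen = set()
    total = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        seen.add(hashlib.sha256(block).digest())  # identical blocks hash alike
        total += 1
    return total, len(seen)


# A mostly empty "disk": 100 zeroed blocks plus two blocks of real data
disk = bytes(BLOCK_SIZE) * 100 + b"a" * BLOCK_SIZE + b"b" * BLOCK_SIZE
logical, unique = dedup_stats(disk)
print(f"{logical} logical blocks, {unique} stored after deduplication")
```

In this toy case 102 logical blocks collapse to 3 stored blocks, so a verification pass over the stored data would touch only a small fraction of the nominal disk size.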

DonZoomik · Jun 12, 2021 9:06 am

Also maybe a lot of similarly sized backups combined with per-vm chains? This provides additional parallelism, greatly improving throughput on capable storage devices.

dweide · Jun 14, 2021 8:10 am

Just to add another example of really slow backup verification:

- 17.8 TB VM
- Backup files health check has been completed 67:07:03

perjonsson1960 · Jun 18, 2021 1:57 pm

What happens if the health check of a backup copy job is still running when the copy interval expires?

foggy · Jun 30, 2021 10:51 pm

The interval is extended and the health check continues.

dejan.ilic · Jul 02, 2021 8:47 pm

Just a quick question, why can't the health check be implemented in a separate job/process so that it won't break the normal backup job schedule?

If it found an error, it wouldn't matter that it is signaled later; Veeam B&R would have to do the error handling anyway.
The worst case is that any backups made after the health check started are invalid (which the health check could detect).
If it doesn't find anything (the normal case), it wouldn't interfere with the next backup run, which would pick up the backup data that the current implementation of "synchronous health check" jobs misses.

So in the best case, the backup jobs are not interfered with, and the worst case is no worse than the current implementation, where all the jobs that should run but are missed due to the health check don't do any backups.

(we had a file server with 21TB+ of data in one filesystem; health checks took 60 hours)

Jul 03, 2021 12:15 am

No reason why it cannot be; in fact, we're working on implementing this change right now.

garrettt12 · Jul 05, 2021 4:32 pm

Will this separation change allow us to have health checks that run after the original job's backup window would have terminated it?

We have customers with monster VMs and very "value engineered" virtual environments that need strict backup windows, but these cut off healthchecks which we'd have no problem running against backup storage during business hours otherwise.

DonZoomik · Jul 14, 2021 12:58 pm

Gostev wrote: ↑Jun 09, 2021 9:37 pm That's my hope too.

Any news?

DonZoomik · Aug 25, 2021 5:39 pm

Gostev, any hope of this making it into v11a?

Gostev · Aug 25, 2021 5:49 pm

I think so. @HannesK could you please check this did not get lost?
