-
- Veeam Software
- Posts: 21139
- Liked: 2141 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Health check on large backups
AFAIK, this is being implemented within v12.
-
- Product Manager
- Posts: 14840
- Liked: 3086 times
- Joined: Sep 01, 2014 11:46 am
- Full Name: Hannes Kasparick
- Location: Austria
- Contact:
Re: Health check on large backups
Just to clarify: what foggy meant is a health check process that is separate from the job itself.
For async read: yes, that should make it into V11a. Internal tests showed up to a 5x performance improvement for a 15TB backup file.
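For context, "async read" generally means keeping several read requests in flight against the backup file instead of asking for one block at a time, so the repository always has work queued. The following is a minimal, generic Win32 overlapped-I/O sketch of that idea; it only illustrates the technique and is not Veeam's actual implementation, and the file path, block size and queue depth are made-up example values.
Code: Select all
/* Generic illustration of asynchronous ("overlapped") reads: keep
 * QUEUE_DEPTH requests outstanding instead of reading one block at a
 * time, so the repository disks always have work queued.
 * NOTE: this is not Veeam's code; path, block size and depth are
 * arbitrary values chosen for the example. */
#include <windows.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE  (1024 * 1024)   /* 1 MB per request        */
#define QUEUE_DEPTH 8               /* requests kept in flight */

static char buf[QUEUE_DEPTH][BLOCK_SIZE];

int main(void)
{
    HANDLE f = CreateFileA("D:\\Backups\\sample.vbk", GENERIC_READ,
                           FILE_SHARE_READ, NULL, OPEN_EXISTING,
                           FILE_FLAG_OVERLAPPED, NULL);
    if (f == INVALID_HANDLE_VALUE) return 1;

    OVERLAPPED ov[QUEUE_DEPTH];
    memset(ov, 0, sizeof(ov));
    LONGLONG offset = 0;

    /* Prime the queue: submit QUEUE_DEPTH reads before waiting for any. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        ov[i].hEvent     = CreateEventA(NULL, TRUE, FALSE, NULL);
        ov[i].Offset     = (DWORD)offset;
        ov[i].OffsetHigh = (DWORD)(offset >> 32);
        if (!ReadFile(f, buf[i], BLOCK_SIZE, NULL, &ov[i]) &&
            GetLastError() != ERROR_IO_PENDING)
            return 1;
        offset += BLOCK_SIZE;
    }

    /* Complete requests in submission order and immediately refill each
     * slot, so the effective queue depth stays at QUEUE_DEPTH, not 1. */
    for (;;) {
        for (int i = 0; i < QUEUE_DEPTH; i++) {
            DWORD got = 0;
            if (!GetOverlappedResult(f, &ov[i], &got, TRUE) || got == 0)
                goto done;            /* end of file or error */
            /* ... verify/checksum buf[i] here ... */
            ov[i].Offset     = (DWORD)offset;
            ov[i].OffsetHigh = (DWORD)(offset >> 32);
            if (!ReadFile(f, buf[i], BLOCK_SIZE, NULL, &ov[i]) &&
                GetLastError() != ERROR_IO_PENDING)
                goto done;
            offset += BLOCK_SIZE;
        }
    }
done:
    CloseHandle(f);
    return 0;
}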
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
-
- Novice
- Posts: 8
- Liked: 1 time
- Joined: Apr 23, 2020 11:10 pm
- Full Name: Thomas Ng
- Contact:
Re: Health check on large backups
Can the health check and the backup job run simultaneously?
-
- Veeam Software
- Posts: 50
- Liked: 12 times
- Joined: Oct 21, 2010 8:54 am
- Full Name: Dmitry Vedyakov
- Contact:
Re: Health check on large backups
The health check is currently part of the running backup job. First all VMs are processed, then so-called "post-processing" starts, which does all the work regarding retention, health checks, etc.
-
- Novice
- Posts: 8
- Liked: 1 time
- Joined: Apr 23, 2020 11:10 pm
- Full Name: Thomas Ng
- Contact:
Re: Health check on large backups
Does the backup window restriction setting in Schedule apply to the "post-processing" phase? I don't want any VM backups running during production hours, but I'm OK with merges, health checks, etc. on the job.
-
- Product Manager
- Posts: 9848
- Liked: 2607 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: Health check on large backups
Yes, it does.
Example: the health check will be cancelled if it takes longer than the configured allowed window.
Product Management Analyst @ Veeam Software
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Actually, that would be a bug if so, because the allowed window was supposed to restrict only activities that touch the production environment, while all of the above-mentioned activities are isolated to the backup repository.
-
- Product Manager
- Posts: 9848
- Liked: 2607 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: Health check on large backups
Anton, I had that on V9 and V10 with some customers.
The backup window restriction setting cancelled the backup job.
OK, only the transport and health check. Good to know.
https://helpcenter.veeam.com/docs/backu ... ml?ver=110
The backup window affects only the data transport process and health check operations. Other transformation processes can be performed in the target repository outside the backup window.
Product Management Analyst @ Veeam Software
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Bugs can also be documented, but it is not right that the health check process is included, as, just like the other transformation processes, it does not touch the production environment.
The logic by the devs was probably that the health check process MAY result in a job retry at the end to obtain data for corrupted blocks from the source. But since corruptions happen so rarely, there's actually no point in restricting the health check from running outside of the allowed window merely based on this possibility. It should be allowed to run, but if a job retry is needed, then it can be failed with the corresponding error.
@Egor Yakovlev FYI, this is especially important as we uncouple the health check process from backup jobs, since the chances that they will be scheduled outside of the backup window are pretty high.
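To make that behaviour concrete, here is a small hypothetical sketch of the decision logic described above: the repository-only health check always runs, and only the production-facing retry is gated by the backup window and failed with an error when the window does not allow it. All names and the structure below are invented for illustration; this is not the product's actual code.
Code: Select all
/* Hypothetical sketch only: the health check reads the repository, so it
 * runs regardless of the backup window; only the retry that would re-read
 * production data is gated by the window. Names are invented. */
#include <stdbool.h>
#include <stdio.h>

/* Stub: scan backup blocks in the repository and report corruption. */
static bool verify_backup_blocks(void) { return true; }

/* Stub: is the current time inside the allowed backup window? */
static bool inside_backup_window(void) { return false; }

int main(void)
{
    /* Repository-only work: always allowed to run. */
    bool corruption_found = verify_backup_blocks();

    if (!corruption_found) {
        puts("Health check passed, nothing to retry.");
        return 0;
    }

    /* Only the retry touches production storage, so only the retry is
     * restricted by the backup window. */
    if (!inside_backup_window()) {
        puts("Error: corrupted blocks found, but the retry that would "
             "re-read them from production is outside the backup window.");
        return 1;
    }

    puts("Retrying job to re-read corrupted blocks from the source...");
    return 0;
}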
-
- Enthusiast
- Posts: 50
- Liked: 4 times
- Joined: Jun 03, 2015 8:32 am
- Full Name: Stephan
- Contact:
Re: Health check on large backups
A health check prevents tape jobs from running, which can mess with the schedule: when the tape job is delayed so much that it can't finish before the source job starts again the next day, the tape job fails. That happens once a month here. In that case it would be helpful to respect the backup window restrictions, right?
But I'm hoping the other changes will be great for me in that case.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
Having tested v11a...
When verification is running alone, it can easily hit 1.5 GB/s+, about 4-5x faster than before. That's actually much faster than reading data from a quite good SAN (after a disk extension, for example).
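For a rough sense of what such a rate means for large backup files, here is a small back-of-the-envelope calculation; the file sizes and the assumed "before" rate of ~0.3 GB/s are illustrative values, not measurements from this thread.
Code: Select all
/* Rough arithmetic only: time for a full read of a backup file at an
 * assumed ~0.3 GB/s (old, queue depth 1) vs the ~1.5 GB/s reported above.
 * File sizes are examples, not figures from this thread. */
#include <stdio.h>

int main(void)
{
    const double sizes_tb[]   = { 5.0, 15.0, 25.0 };
    const double old_rate_gbs = 0.3;   /* assumed pre-11a rate       */
    const double new_rate_gbs = 1.5;   /* rate reported in the post  */

    for (int i = 0; i < 3; i++) {
        double gb = sizes_tb[i] * 1024.0;
        printf("%4.0f TB backup: %5.1f h before vs %4.1f h after\n",
               sizes_tb[i],
               gb / old_rate_gbs / 3600.0,
               gb / new_rate_gbs / 3600.0);
    }
    return 0;
}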
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Haha!! Thanks for sharing, sounds like you have some pretty decent backup storage there.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
It's not *that* beefy (24x16TB SAS RAID60 on some MegaRAID, SSD cache)... One of my customers recently bought a Dell XE7100 that will have a 48-disk RAID60; that might show some more interesting numbers, but it's not delivered yet.
-
- VeeaMVP
- Posts: 1007
- Liked: 314 times
- Joined: Jan 31, 2011 11:17 am
- Full Name: Max
- Contact:
Re: Health check on large backups
That sounds promising. We do have some customers who experience very long health checks, some with RAID60 arrays, so I'm looking forward to seeing their results.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
The improved verification throughput seems to hit performance badly enough to cause some backup semi-failures.
After enabling verification on some very large jobs, I've seen many errors on other jobs while verification runs:
Code: Select all
Error: Failed to call RPC function 'FcWriteFileEx': The supplied user buffer is not valid for the requested operation. Failed to write data to the file [<always a temporary VBM file path>].
However, the jobs seem to actually succeed, as the last retry has status "Nothing to process...". So far I've tried reducing the number of tasks on the repository, with no visible improvement. I've got a busy week ahead, but I'll try to find time to create a support case...
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
Judging by the error, I'm not sure this is related. If the error were due to the backup storage now being too busy, I would expect timeouts as opposed to buffer errors. Actually, you would see "repository is too busy" warnings in the action log first, even before those I/O timeout errors start to appear. But let's see what support finds out.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
While I'm getting around to creating a support case (soon!), I noticed that DirectSAN is quite slow and went to reread some docs (a bit off-topic, but the same case of missing ADF).
Advanced Data Fetcher is still not used for DirectSAN mode, right? Ironically, reads (more specifically a re-read after extending a 25TB VMDK) from a powerful hybrid SAN (with DirectSAN) are now much slower than verification (same symptom, a queue depth of exactly 1).
Seems like low-hanging fruit: for Storage Snapshots, Veeam would also have to query the snapshotted VMDK layout in VMFS (or parse it from the snapshot file system) to perform ADF reads, and it's almost the same for DirectSAN. It's also a bit weird that DirectSAN has higher priority than HotAdd in that case; I'm not sure it has any benefits over HotAdd at all from a throughput perspective (ignoring CPU/memory usage in VMware).
-
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Health check on large backups
This is not going to work for DirectSAN (of course we tried this when we developed ADF). But let's not hijack the thread with this off-topic discussion.
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
#05083160
-
- Service Provider
- Posts: 372
- Liked: 120 times
- Joined: Nov 25, 2016 1:56 pm
- Full Name: Mihkel Soomere
- Contact:
Re: Health check on large backups
The support engineer found an interesting remark in the Win32 WriteFile API documentation: "The WriteFile function may fail with ERROR_INVALID_USER_BUFFER or ERROR_NOT_ENOUGH_MEMORY whenever there are too many outstanding asynchronous I/O requests."
Unsure why it is happening only on our largest and highest-performance repository though; investigation in progress.
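For anyone hitting the same error: the quoted remark is about overlapped (asynchronous) I/O, where a WriteFile submission can fail immediately with ERROR_INVALID_USER_BUFFER or ERROR_NOT_ENOUGH_MEMORY instead of being queued when too many requests are already outstanding. The generic sketch below shows one way a writer can bound its in-flight requests and back off on that error; it is only an illustration of the documented failure mode, not Veeam's code, and the file name, block size and limits are arbitrary example values.
Code: Select all
/* Generic illustration of the WriteFile remark quoted above: with
 * overlapped I/O, a submission can fail with ERROR_INVALID_USER_BUFFER or
 * ERROR_NOT_ENOUGH_MEMORY when too many asynchronous requests are already
 * outstanding. Bounding the queue depth and backing off on that error is
 * one way to cope. Not Veeam's code; all values here are arbitrary. */
#include <windows.h>
#include <stdio.h>
#include <string.h>

#define BLOCK        (1024 * 1024)
#define MAX_INFLIGHT 4
#define TOTAL_BLOCKS 64

static char buf[MAX_INFLIGHT][BLOCK];

int main(void)
{
    HANDLE f = CreateFileA("async_write_test.bin", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_FLAG_OVERLAPPED, NULL);
    if (f == INVALID_HANDLE_VALUE) return 1;

    OVERLAPPED ov[MAX_INFLIGHT];
    memset(ov, 0, sizeof(ov));
    for (int i = 0; i < MAX_INFLIGHT; i++)
        ov[i].hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

    LONGLONG offset = 0;
    for (int b = 0; b < TOTAL_BLOCKS; b++) {
        int slot = b % MAX_INFLIGHT;
        /* Cap the number of in-flight writes: wait for the previous
         * request in this slot before reusing its buffer. */
        if (b >= MAX_INFLIGHT) {
            DWORD done;
            if (!GetOverlappedResult(f, &ov[slot], &done, TRUE)) return 1;
        }
        ov[slot].Offset     = (DWORD)offset;
        ov[slot].OffsetHigh = (DWORD)(offset >> 32);
        for (;;) {
            if (WriteFile(f, buf[slot], BLOCK, NULL, &ov[slot]))
                break;                              /* finished at once  */
            DWORD err = GetLastError();
            if (err == ERROR_IO_PENDING)
                break;                              /* queued, fine      */
            if (err == ERROR_INVALID_USER_BUFFER ||
                err == ERROR_NOT_ENOUGH_MEMORY) {
                Sleep(50);                          /* too many requests */
                continue;                           /* in flight: retry  */
            }
            fprintf(stderr, "WriteFile failed: %lu\n", err);
            return 1;
        }
        offset += BLOCK;
    }

    /* Drain whatever is still outstanding before closing. */
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        DWORD done;
        GetOverlappedResult(f, &ov[i], &done, TRUE);
        CloseHandle(ov[i].hEvent);
    }
    CloseHandle(f);
    return 0;
}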
-
- Product Manager
- Posts: 9848
- Liked: 2607 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: Health check on large backups
I have to say, I'm really impressed by the new health-check speeds.
Some examples from one of our infrastructures:
Job with 16 VMs and 13.7TB Source Data:
Before V11a: 8.5h
With V11a: 1.5h
Job with 55 VMs and 12.8TB Source Data:
Before V11a: 5.5h
With V11a: 1.5h
Job with 15 VMs and 8.6TB Source Data:
Before V11a: 8.5h
With V11a: 1h
Product Management Analyst @ Veeam Software
-
- Expert
- Posts: 245
- Liked: 58 times
- Joined: Apr 28, 2009 8:33 am
- Location: Strasbourg, FRANCE
- Contact:
Re: Health check on large backups
Impressive!!! What kind of backend storage?
-
- Product Manager
- Posts: 9848
- Liked: 2607 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: Health check on large backups
HPE Apollo as a Linux Hardened Repo.
We are really happy with that product.
I am sure that other vendors' systems will show the same results after updating to V11a.
Product Management Analyst @ Veeam Software
-
- Veteran
- Posts: 389
- Liked: 54 times
- Joined: Sep 05, 2011 1:31 pm
- Full Name: Andre
- Contact:
Re: Health check on large backups
I can confirm, health check is much faster in V11a.
Backup File Size (vbk): ~5TB
Check Duration before 11a: ~6h
Check Duration with 11a: ~50min
Thanks Veeam Team
-
- Enthusiast
- Posts: 50
- Liked: 4 times
- Joined: Jun 03, 2015 8:32 am
- Full Name: Stephan
- Contact:
Re: Health check on large backups
Unfortunately, I cannot confirm those high rates. For one particular job it went from 12h to 8h, and other jobs still fail because of timeouts.
Edit: Just checked again and the 12h was an outlier; I was getting 8-9h even before the upgrade. So no noticeable change at all.
-
- Product Manager
- Posts: 9848
- Liked: 2607 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: Health check on large backups
Are you still using a NetApp E-Series as a backup target over FC?
Unfortunately, I cannot confirm those high rates. For one particular job it went from 12h to 8h, and other jobs still fail because of timeouts.
Edit: Just checked again and the 12h was an outlier; I was getting 8-9h even before the upgrade. So no noticeable change at all.
As far as I understand this implementation (system cache bypass), it only works with enterprise-grade RAID controllers with direct-attached disks and not with iSCSI or FC-connected LUNs.
But I'm not 100 percent sure.
Product Management Analyst @ Veeam Software
-
- Veteran
- Posts: 389
- Liked: 54 times
- Joined: Sep 05, 2011 1:31 pm
- Full Name: Andre
- Contact:
Re: Health check on large backups
Stephan, how is the volume/array configured on which the backup files are stored? We have a volume with 50x6TB disks.
-
- Product Manager
- Posts: 14840
- Liked: 3086 times
- Joined: Sep 01, 2014 11:46 am
- Full Name: Hannes Kasparick
- Location: Austria
- Contact:
Re: Health check on large backups
The performance gain for the health check comes from "asynchronous read". That is expected to improve performance on all kinds of storage systems.
I mean, if the storage is completely overloaded with other tasks, then the impact might be irrelevant. In general, storage systems connected via FC / iSCSI or whatever protocol also profit from async read. Maybe there are other bottlenecks involved (task configuration, other limits applied to the repository, or some compute resource shortage).
-
- Product Manager
- Posts: 9848
- Liked: 2607 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: Health check on large backups
Thanks Hannes for the clarification.
Product Management Analyst @ Veeam Software