foggy
Veeam Software
Posts: 21073
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Health check on large backups

Post by foggy » 1 person likes this post

AFAIK, this is being implemented within v12.
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Health check on large backups

Post by HannesK » 10 people like this post

Just to clarify: what foggy meant is a health check process that is separate from the job itself.

As for async read: yes, that should make it into V11a. Internal tests showed up to a 5x performance improvement for a 15TB backup file.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik » 6 people like this post

5x...
Thomas N.
Novice
Posts: 8
Liked: 1 time
Joined: Apr 23, 2020 11:10 pm
Full Name: Thomas Ng
Contact:

Re: Health check on large backups

Post by Thomas N. »

Can the health check and backup job be running simultaneously?
Dima V.
Veeam Software
Posts: 50
Liked: 12 times
Joined: Oct 21, 2010 8:54 am
Full Name: Dmitry Vedyakov
Contact:

Re: Health check on large backups

Post by Dima V. » 1 person likes this post

Health check is now a part of the running backup job. First all VMs are processed, then the so-called "post-processing" starts, which does all the work regarding retention, health checks, etc.
Thomas N.
Novice
Posts: 8
Liked: 1 time
Joined: Apr 23, 2020 11:10 pm
Full Name: Thomas Ng
Contact:

Re: Health check on large backups

Post by Thomas N. »

Does the Backup window restriction setting in Schedule apply to the "post-processing" process? I don't want any VM backups running during production hours, but I'm OK with merge, health checks, etc. on the job.
Mildur
Product Manager
Posts: 8735
Liked: 2294 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Health check on large backups

Post by Mildur » 1 person likes this post

Yes, it does.
Example: the health check will be cancelled if it takes longer than the configured allowed window.
Product Management Analyst @ Veeam Software
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Health check on large backups

Post by Gostev »

Actually, that would be a bug if so, because the allowed window was supposed to restrict only activities that touch the production environment, while all of the above-mentioned activities are isolated to the backup repository.
Mildur
Product Manager
Posts: 8735
Liked: 2294 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Health check on large backups

Post by Mildur »

Anton, I had that on V9 and V10 with some customers.
The backup window restriction setting cancelled the backup job.

OK, so it applies only to the data transport and the health check.
Good to know :)
https://helpcenter.veeam.com/docs/backu ... ml?ver=110
The backup window affects only the data transport process and health check operations. Other transformation processes can be performed in the target repository outside the backup window.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Health check on large backups

Post by Gostev » 3 people like this post

Bugs can also be documented :D but it is not right that the health check process is included, as just like other transformation processes it does not touch a production environment.

The devs' logic was probably that the health check process MAY result in a job retry at the end to obtain data for corrupted blocks from the source. But since corruption happens so rarely, there's actually no point in restricting the health check from running outside of the allowed window merely based on this possibility. It should be allowed to run, but if a job retry is needed, then it can be failed with the corresponding error.

@Egor Yakovlev FYI, this is especially important as we uncouple the health check process from backup jobs, since chances they will be scheduled outside of the backup window are pretty high ;)
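
For readers following along, the behaviour proposed in this post can be written down as a small decision rule. This is only a sketch of the proposal as described here, not how the product actually behaves; the names `Outcome` and `health_check_outcome` are made up for illustration.

```python
from enum import Enum


class Outcome(Enum):
    OK = "health check passed, nothing else to do"
    RETRY = "retry the job to re-read corrupted blocks from the source"
    FAILED = "fail with an error: a retry is needed outside the allowed window"


def health_check_outcome(corrupted_blocks: int, inside_backup_window: bool) -> Outcome:
    """Hypothetical rule: the health check itself always runs, because it only
    touches the repository. The allowed window matters only if a retry against
    the production source turns out to be necessary."""
    if corrupted_blocks == 0:
        return Outcome.OK
    if inside_backup_window:
        return Outcome.RETRY
    return Outcome.FAILED
```

The point of this shape is that the window check comes after the corruption check, so the rare corruption case is the only one that can be blocked by the window.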
Stephan23
Enthusiast
Posts: 50
Liked: 4 times
Joined: Jun 03, 2015 8:32 am
Full Name: Stephan
Contact:

Re: Health check on large backups

Post by Stephan23 »

A health check prevents tape jobs from running, which can mess with the schedule: when the tape job is delayed so much that it can't finish before the source job starts again the next day, the tape job fails. This happens once a month here. In that case it would be helpful to respect the backup window restrictions, right?
But I'm hoping the other changes will be great for me in that case.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik » 8 people like this post

Having tested v11a...
When verification is running alone, it can easily hit 1.5 GB/s+, about 4-5x faster than before. That's actually much faster than reading data from a quite good SAN (after a disk extension, for example).
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Health check on large backups

Post by Gostev » 1 person likes this post

Haha!! Thanks for sharing :D sounds like you have a pretty decent backup storage there.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik » 1 person likes this post

It's not *that* beefy (24*16TB SAS RAID60 on some MegaRAID, SSD cache)... One of my customers recently bought a Dell XE7100 that will have a 48-disk RAID60, which might show some more interesting numbers, but it's not delivered yet.
Regnor
VeeaMVP
Posts: 940
Liked: 291 times
Joined: Jan 31, 2011 11:17 am
Full Name: Max
Contact:

Re: Health check on large backups

Post by Regnor »

That sounds promising. We do have some customers who experience very long health checks, some with RAID60 arrays, so I'm looking forward to seeing their results.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik »

It seems that the improved verification throughput hits performance badly enough to cause some backup semi-failures.
After enabling verification on some very large jobs, I've seen many errors on other jobs while verification runs:

Error: Failed to call RPC function 'FcWriteFileEx': The supplied user buffer is not valid for the requested operation. Failed to write data to the file [<always a temporary VBM file path>].
However, the job seems to actually succeed, as the last retry has the status "Nothing to process...". So far I've tried reducing the number of tasks on the repository, with no visible improvement. I've got a busy week ahead, but I'll try to find time to create a support case...
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Health check on large backups

Post by Gostev » 1 person likes this post

Judging by the error, I'm not sure if this is related. If the error was due to the backup storage now being too busy, I would expect timeouts as opposed to buffer errors. Actually, you would see "repository is too busy" warnings in the action log first, even before those I/O timeout errors start to appear. But let's see what support finds out.
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik »

While I'm getting around to creating a support case (soon!), I noticed that DirectSAN is quite slow and went to reread some docs (a bit off-topic, but the same case of missing ADF).
Advanced Data Fetcher is still not used for DirectSAN mode, right? Ironically, reads (more specifically, a re-read after extending a 25TB VMDK) from a powerful hybrid SAN (with DirectSAN) are now much slower than verification (same symptom, a queue depth of exactly 1).
It seems like low-hanging fruit: for Storage Snapshots, Veeam would also have to query the snapshotted VMDK layout in VMFS (or parse it from the snapshot file system) to perform ADF reads, and it's almost the same for DirectSAN. It's also a bit weird that DirectSAN has higher priority than HotAdd in that case; I'm not sure if it has any benefits over HotAdd at all from a throughput perspective (ignoring CPU/Mem usage in VMware).
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Health check on large backups

Post by Gostev »

This is not going to work for DirectSAN (of course we tried this when we developed ADF). But let's not hijack the thread with this off-topic :)
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik »

#05083160
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: Health check on large backups

Post by DonZoomik »

Support guy found an interesting remark in Win32 API WriteFile doc: "The WriteFile function may fail with ERROR_INVALID_USER_BUFFER or ERROR_NOT_ENOUGH_MEMORY whenever there are too many outstanding asynchronous I/O requests."

Unsure why it is happening only on our largest and highest-performance repository though, investigation in progress.
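
To illustrate what "too many outstanding asynchronous I/O requests" means in general terms, here is a minimal sketch of capping the number of in-flight asynchronous operations with a semaphore. It is not what support recommended and has nothing to do with Veeam's internals or the Win32 API itself; `run_bounded`, `fake_write` and `demo` are hypothetical names, and `fake_write` is only a stand-in for a real asynchronous write.

```python
import asyncio


async def run_bounded(jobs, max_in_flight):
    """Run coroutine factories with at most max_in_flight outstanding at once.

    Returns the results in submission order plus the observed peak concurrency.
    """
    gate = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with gate:  # wait here when the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(j) for j in jobs))
    return results, peak


async def fake_write(i):
    # Stand-in for an asynchronous write request (assumption for the demo).
    await asyncio.sleep(0.01)
    return i


def demo():
    jobs = [lambda i=i: fake_write(i) for i in range(20)]
    return asyncio.run(run_bounded(jobs, max_in_flight=4))
```

Calling `demo()` submits 20 fake writes but never lets more than 4 be outstanding at the same time; the returned peak value can be used to check that.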
Mildur
Product Manager
Posts: 8735
Liked: 2294 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Health check on large backups

Post by Mildur » 4 people like this post

I have to say, I'm really impressed by the new health-check speeds. 8)
Some examples from one of our infrastructures:

Job with 16 VMs and 13.7TB Source Data:
Before V11a: 8.5h
With V11a: 1.5h

Job with 55 VMs and 12.8TB Source Data:
Before V11a: 5.5h
With V11a: 1.5h

Job with 15 VMs and 8.6TB Source Data:
Before V11a: 8.5h
With V11a: 1h
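
Working those durations out as speedup factors gives roughly 5.7x, 3.7x and 8.5x. A small sketch of the arithmetic, using only the before/after hours from the three jobs above (the `speedup` helper and the job labels are just for illustration):

```python
# (label, hours before V11a, hours with V11a), taken from the jobs listed above
jobs = [
    ("16 VMs, 13.7 TB", 8.5, 1.5),
    ("55 VMs, 12.8 TB", 5.5, 1.5),
    ("15 VMs, 8.6 TB", 8.5, 1.0),
]


def speedup(before_h: float, after_h: float) -> float:
    """How many times faster the health check finished."""
    return before_h / after_h


for label, before_h, after_h in jobs:
    print(f"{label}: {speedup(before_h, after_h):.1f}x faster")
```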
NightBird
Expert
Posts: 242
Liked: 57 times
Joined: Apr 28, 2009 8:33 am
Location: Strasbourg, FRANCE
Contact:

Re: Health check on large backups

Post by NightBird »

Impressive!!! What kind of backend storage?
Mildur
Product Manager
Posts: 8735
Liked: 2294 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Health check on large backups

Post by Mildur »

HPE Apollo as a Linux Hardened Repo.
We are really happy with that product.
I am sure that other vendors are getting the same results after updating to V11a :-)
agrob
Veteran
Posts: 383
Liked: 53 times
Joined: Sep 05, 2011 1:31 pm
Full Name: Andre
Contact:

Re: Health check on large backups

Post by agrob » 5 people like this post

I can confirm, health check is much faster in V11a.
Backup File Size (vbk): ~5TB
Check Duration before 11a: ~6h
Check Duration with 11a: ~50min
Thanks Veeam Team :-)
Stephan23
Enthusiast
Posts: 50
Liked: 4 times
Joined: Jun 03, 2015 8:32 am
Full Name: Stephan
Contact:

Re: Health check on large backups

Post by Stephan23 »

Unfortunately, I cannot confirm those high rates. For one particular job it went from 12h to 8h, so other jobs still fail because of timeouts.
Edit: Just checked again and the 12h was an outlier; I was getting 8-9h even before the upgrade. So no noticeable change at all.
Mildur
Product Manager
Posts: 8735
Liked: 2294 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Health check on large backups

Post by Mildur »

Stephan23 wrote:
Unfortunately, I cannot confirm those high rates. For one particular job it went from 12h to 8h, so other jobs still fail because of timeouts.
Edit: Just checked again and the 12h was an outlier; I was getting 8-9h even before the upgrade. So no noticeable change at all.

Are you still using a NetApp E-Series as a Backup Target over FC?
As far as I understand this implementation (system cache bypass), it only works with enterprise-grade RAID controllers with direct-attached disks, and not with iSCSI- or FC-connected LUNs.
But I'm not 100 percent sure.
agrob
Veteran
Posts: 383
Liked: 53 times
Joined: Sep 05, 2011 1:31 pm
Full Name: Andre
Contact:

Re: Health check on large backups

Post by agrob »

Stephan, how is the volume/array on which the backup files are stored configured? We have a volume with 50x 6TB disks.
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Health check on large backups

Post by HannesK » 1 person likes this post

The performance gain for health check comes from "asynchronous read". That is expected to improve performance on all kinds of storage systems.

I mean, if the storage is completely overloaded with other tasks, then the impact might be irrelevant. In general, storage systems connected via FC / iSCSI or whatever protocol also profit from async read. Maybe there are other bottlenecks involved as well (task configuration, any other limits applied to the repository, or a shortage of some compute resource).
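
As a generic illustration of why keeping several reads outstanding helps (this is not Veeam's implementation), here is a minimal sketch that checksums a file block by block while letting up to `queue_depth` reads run at once. The 1 MiB block size, the SHA-256 checksum and the function names are assumptions made for the example, and `os.pread` is only available on Unix-like systems.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1024 * 1024  # hypothetical block size (1 MiB)


def read_block(fd: int, index: int) -> bytes:
    """Read one block at its offset; os.pread does not move the file position,
    so several threads can safely share the same descriptor."""
    return os.pread(fd, BLOCK_SIZE, index * BLOCK_SIZE)


def block_digests(path: str, queue_depth: int) -> list:
    """Hash every block, with up to queue_depth reads executing concurrently.

    queue_depth=1 degenerates to one read at a time, which is the
    "queue depth of exactly 1" pattern mentioned earlier in the thread.
    """
    size = os.path.getsize(path)
    n_blocks = (size + BLOCK_SIZE - 1) // BLOCK_SIZE
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            # map() yields blocks in order even if reads finish out of order
            blocks = pool.map(lambda i: read_block(fd, i), range(n_blocks))
            return [hashlib.sha256(b).hexdigest() for b in blocks]
    finally:
        os.close(fd)
```

The digests are identical for any `queue_depth`; only the number of reads the storage sees at the same time changes, which is where arrays with many spindles can gain throughput. Note that this sketch does not bound how many finished blocks are buffered in memory, so it is meant for reading, not for production use.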
Mildur
Product Manager
Posts: 8735
Liked: 2294 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Health check on large backups

Post by Mildur »

Thanks Hannes for the clarification.