JPMS
Expert
Posts: 105
Liked: 31 times
Joined: Nov 02, 2019 6:19 pm
Contact:

Confused by v12 storage-level corruption guard

Post by JPMS »

I have been investigating why storage-level corruption guard hasn't run since we installed the v12 RTM. I see that v12 now has a time setting as part of storage-level corruption guard but the Veeam documentation for this is rather threadbare.

Looking at other sites, it now seems that it is possible to run storage-level corruption guard as a separate process, independent of a backup job. But is there any way to run or configure it to work the 'legacy' way and just run at the end of the backup job? Can I just 'fudge' it by setting the storage-level corruption guard time a couple of minutes after the backup job, so it will see there is a job running, wait for it to complete and then run? Or will it just see the files as not available and just not run?

Alternatively, as it is possible to run storage-level corruption guard as a separate process, are there now any powershell commands to run it directly?
Mildur
Product Manager
Posts: 8735
Liked: 2294 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Confused by v12 storage-level corruption guard

Post by Mildur » 1 person likes this post

JPMS wrote: Looking at other sites, it now seems that it is possible to run storage-level corruption guard as a separate process, independent of a backup job. But is there any way to run or configure it to work the 'legacy' way and just run at the end of the backup job?
No. Can you please share why it should run "the legacy way"?
JPMS wrote: Can I just 'fudge' it by setting the storage-level corruption guard time a couple of minutes after the backup job, so it will see there is a job running, wait for it to complete and then run? Or will it just see the files as not available and just not run?
The health check job will wait until a backup job has finished. Then it will start processing the latest restore point.
JPMS wrote: Alternatively, as it is possible to run storage-level corruption guard as a separate process, are there now any powershell commands to run it directly?
No, not yet. I put in a feature request to have options to manually run a health check.

Best,
Fabian
Product Management Analyst @ Veeam Software
JPMS
Expert
Posts: 105
Liked: 31 times
Joined: Nov 02, 2019 6:19 pm
Contact:

Re: Confused by v12 storage-level corruption guard

Post by JPMS » 1 person likes this post

Mildur wrote: No. Can you please share why it should run "the legacy way"?
Chaining

We have limited time availability for our repo. The most efficient way to use that time is to chain all our backup requirements so one job starts as soon as the last one is finished and there is no 'dead' unused time. We first tried doing this with the built-in facilities in the console but found greater flexibility in running all our jobs from a PowerShell script, which we have been doing since v9.

Veeam have now 'broken' that because the only way to run the health job is at a specific time. There is no way to chain it. Although it wasn't a separate job, the previous version was effectively chained because it was part of the backup job and our next job would not proceed until the backup job (including health check) was completed.

I can understand why some people may have requested the ability to run the health check separately but your solution seems half-baked and badly thought out. If you were going to implement this, why not do it properly and make the health check a proper job type in its own right, with the ability to link it to other jobs and with PowerShell commands? At the very least, don't take away the ability to run it as it has been in previous versions, just add a timed facility for those who want it.

As you may gather, I'm pretty annoyed about this. My initial thought, to set a time similar to the backup job, will cause a health check to run after the backup job but it will now overlap with our next scripted job which will start as soon as the backup job has finished. I can't see a way to make this work for us.
JPMS
Expert
Posts: 105
Liked: 31 times
Joined: Nov 02, 2019 6:19 pm
Contact:

Re: Confused by v12 storage-level corruption guard

Post by JPMS » 1 person likes this post

I have also noticed that this doesn't just affect our VM backup jobs but our Protection Group backups too.

This in itself is no surprise but does highlight another major failing of health checks, the lack of notifications! I'm not sure what happened with v11 as I don't think I ever saw a health check failure. I would guess that it would show a backup failure if the health check part of it failed but I don't know.

Whatever the case with v11, v12 now runs as a separate job (I even sat and watched it run in the early hours of this morning) and it seems that the only place you can find what happened is to look at the history in the console under a new section, 'System->Health Check'. I'm looking at 10 weeks of health check failures (because our repo is offline at the default time of 22:00 set up by Veeam) and have never received a single notification about the failures!

Maybe I am missing something and you can put me right but for part of a backup system to fail and not notify me is unacceptable. As I said before, the v12 implementation of health checks seems half-baked and badly thought out.
Mildur
Product Manager
Posts: 8735
Liked: 2294 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Confused by v12 storage-level corruption guard

Post by Mildur »

Hi JPMS

Thanks for your feedback and elaboration. I will think about it and discuss it with the team.
Are you seeing performance issues with health checks for one job running at the same time as another job is doing a backup?
JPMS wrote: This in itself is no surprise but does highlight another major failing of health checks, the lack of notifications!
We are aware of the missing notifications. It was a mistake on our side. Notifications will come back as soon as possible:
post482695.html#p482695
JPMS wrote: We first tried doing this with the built in facilities in the console but found greater flexibility in running all our jobs from a powershell script which we have been doing since v9.
Maybe you can open a new topic and elaborate a bit on why you need to schedule your jobs by PowerShell instead of using the internal Veeam scheduler? I want to understand the use case and what you are missing from our scheduler.

Best,
Fabian
Product Management Analyst @ Veeam Software
JPMS
Expert
Posts: 105
Liked: 31 times
Joined: Nov 02, 2019 6:19 pm
Contact:

Re: Confused by v12 storage-level corruption guard

Post by JPMS » 1 person likes this post

Mildur wrote: Are you seeing performance issues with health checks for one job running at the same time as another job is doing a backup?
It's early days, we have only just managed to successfully run our first health check last night! TBH it probably won't have a major impact for us but as we are still in an age of mechanical storage (100TB of SSD is way too expensive for a small company like us), we like to avoid overlapping disk-intensive operations as a matter of good practice. Obviously we are not your only customer and I would be surprised if this doesn't impact some sites.
Mildur wrote: We are aware of the missing notifications.
Thanks for the link, which I wasn't aware of. Normally when a new release comes out I try and regularly browse the forum to see what issues are arising but I have been just too busy to do this with the v12 release.
Mildur wrote: Maybe you can open a new topic and elaborate a bit on why you need to schedule your jobs by PowerShell instead of using the internal Veeam scheduler?
I will try and find the time but will make these general comments.

I first started down this route with v9. I couldn't chain jobs together how I wanted. I can't remember what the limitation was and it may not even still exist in v12. However, now doing it this way, I would never want to go back to using the console to do it. It is so much easier when it is all laid out in a script, than working your way through all the different jobs in the console and working out which are dependent on which and in what order they occur. We can also intersperse other, non-Veeam, operations within our schedule. I know you can link pre and post scripts with Veeam jobs but keeping track of it all is not easy. So much simpler with a single script that we can easily read through and that controls all the Veeam and non-Veeam jobs each evening. Please don't make me go back to using the console for this :wink:

I think this is a common experience for people with a lot of systems. We program nearly all our network equipment via the command line (rather than a GUI web interface) because it is easier to understand the configuration, faster to program and more flexible.

As I said before, I think the current health check change is half-baked. You moved health check to a separate job but it is still configured from within a backup job and you only have a very basic scheduling option with no ability to chain it to other jobs and wait for its completion before processing other jobs. The lack of notifications is maybe a reflection that changes in this feature may not have received the full analysis and attention they deserve.

For me, all I need now, is that Start-VBRHealthCheckJob command!

P.S. Will you change the timeouts on this website! Every time I write a post that requires a bit of thought and effort, when I click 'submit', I have to log in again and I'm taken to a blank post page. Fortunately I have discovered that using the browser 'back' button takes me back to my 'creation' and I can copy and paste it back in but I wonder how many people don't realise this and just give up.
jmc
Service Provider
Posts: 91
Liked: 8 times
Joined: Sep 12, 2011 11:49 am
Full Name: jmc
Location: Duisburg - Germany
Contact:

Re: Confused by v12 storage-level corruption guard

Post by jmc » 1 person likes this post

hi,

i would like to join the request for independent health checks. our backup is done at night and during the day the server and the storages twiddle their thumbs. if it was a separate job, then i could move the health checks to the day.

kind regards
jmc
Everybody ask why the dinosaurs are gone - nobody ask why they are lived so long
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Confused by v12 storage-level corruption guard

Post by Gostev »

@jmc but health check *is* a separate job that you can move to the day?
jmc
Service Provider
Posts: 91
Liked: 8 times
Joined: Sep 12, 2011 11:49 am
Full Name: jmc
Location: Duisburg - Germany
Contact:

Re: Confused by v12 storage-level corruption guard

Post by jmc »

hello gostev,

sorry. my mistake. you are right. but what i meant was to define it as a standalone job like a backup or backup copy or replica and build a chain. at the moment i can only say on what days and at what time.

it would be great to have it as a standalone job, so the health checks can be hung one after the other to e.g. not waste time, or not let them overlap.

thanks
jmc
Everybody ask why the dinosaurs are gone - nobody ask why they are lived so long
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Confused by v12 storage-level corruption guard

Post by Gostev »

You don't really need to introduce the complexity of managing separate jobs to achieve this goal. Health check has lower priority and will receive task slots only if there are no competing backup/restore activities. So if you schedule your health checks to start some time during your backup window, they will only receive task slots after ALL backups finish, which achieves "not waste time, or not let them overlap".
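To make the task-slot argument above concrete, here is a minimal Python sketch of a priority-based slot allocator. It is a toy model under stated assumptions, not Veeam's actual scheduler: the `Task` class, the `BACKUP` and `HEALTH_CHECK` priority constants and the `simulate` function are all hypothetical names invented for illustration, and time is measured in abstract units.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical priorities for this toy model: lower number is served first.
BACKUP, HEALTH_CHECK = 0, 1


@dataclass(order=True)
class Task:
    priority: int                       # BACKUP or HEALTH_CHECK
    submitted: int                      # time the task asks for a slot
    name: str = field(compare=False)
    duration: int = field(compare=False)


def simulate(tasks, slots):
    """Return {name: (start, end)} for a repository with `slots` task slots."""
    pending = sorted(tasks, key=lambda t: t.submitted)
    waiting = []    # heap of tasks ordered by (priority, submitted)
    running = []    # heap of end times of tasks holding a slot
    timeline = {}
    now, i = 0, 0
    while i < len(pending) or waiting or running:
        # Queue every task that has been submitted by now.
        while i < len(pending) and pending[i].submitted <= now:
            heapq.heappush(waiting, pending[i])
            i += 1
        # Free the slots of tasks that have finished.
        while running and running[0] <= now:
            heapq.heappop(running)
        # Hand free slots to the highest-priority waiting tasks.
        while waiting and len(running) < slots:
            task = heapq.heappop(waiting)
            end = now + task.duration
            timeline[task.name] = (now, end)
            heapq.heappush(running, end)
        # Jump to the next event: a task finishing or a new submission.
        upcoming = []
        if running:
            upcoming.append(running[0])
        if i < len(pending):
            upcoming.append(pending[i].submitted)
        if not upcoming:
            break
        now = min(upcoming)
    return timeline


if __name__ == "__main__":
    jobs = [
        Task(BACKUP, 0, "backup A", 3),
        Task(BACKUP, 0, "backup B", 2),
        Task(HEALTH_CHECK, 1, "health check", 2),
    ]
    print(simulate(jobs, slots=1))
```

In this model, with a single slot, the health check queued at t=1 only starts once both backups have finished, even though it was waiting before the second backup began. The model is non-preemptive: a health check that already holds a slot keeps it, so a backup arriving later would have to wait for a free slot.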

And more generally speaking, we recommend strongly against leveraging job chaining for any use cases at all. It just causes all sorts of problems in real-world environments when something unexpected happens in an infrastructure, messing up your perfect plan of chained jobs. You really want your backups to be as much "fire and forget" as possible so introducing artificial dependencies is never a good idea.
JPMS
Expert
Posts: 105
Liked: 31 times
Joined: Nov 02, 2019 6:19 pm
Contact:

Re: Confused by v12 storage-level corruption guard

Post by JPMS »

This works fine if you have a repo that is available 24 hours a day and dedicated to B&R. I appreciate we are probably unusual in that that is not the case for us but the new implementation is a big retrograde step for us.
Gostev wrote: You don't really need to introduce the complexity of managing separate jobs to achieve this goal.
One person's complexity is another person's flexibility :wink:

I actually think you have made it more complicated because Health Checks now function differently to other jobs. Leaving aside my own requirements, I think it still would be simpler if they worked the same way as other jobs because there would be a consistent way of doing things.
JPMS
Expert
Posts: 105
Liked: 31 times
Joined: Nov 02, 2019 6:19 pm
Contact:

Re: Confused by v12 storage-level corruption guard

Post by JPMS »

Gostev wrote: May 02, 2023 10:16 am And more generally speaking, we recommend strongly against leveraging job chaining for any use cases at all. It just causes all sorts of problems in real-world environments when something unexpected happens in an infrastructure, messing up your perfect plan of chained jobs. You really want your backups to be as much "fire and forget" as possible so introducing artificial dependencies is never a good idea.
It would be interesting to hear some real world examples of this. I am struggling to think of an example where not chaining would fare any better than job chaining if "something unexpected happens in an infrastructure". Recovery of a chained job is as easy as seeing what was the last successfully completed job and then continuing the chain from there.
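The "continue the chain from the last successful job" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `run_chain` and the step names are hypothetical, each step is any callable returning `True` on success, and no real Veeam cmdlet or API is called here.

```python
from typing import Callable, Optional, Sequence, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_chain(steps: Sequence[Step], resume_after: Optional[str] = None) -> Optional[str]:
    """Run steps in order and stop at the first failure.

    Returns the name of the last step known to have succeeded, or None.
    Pass that name back as `resume_after` to continue the chain later.
    """
    names = [name for name, _ in steps]
    start = 0 if resume_after is None else names.index(resume_after) + 1
    last_ok = resume_after
    for name, action in steps[start:]:
        if not action():
            break           # dependent steps are not started
        last_ok = name
    return last_ok


if __name__ == "__main__":
    # Hypothetical evening schedule mixing backup and non-backup steps.
    steps = [
        ("vm backup", lambda: True),
        ("offsite copy", lambda: False),   # simulated failure
        ("tape job", lambda: True),
    ]
    print(run_chain(steps))
```

The sketch also shows the other side of the argument: when a step fails, everything after it is simply not run until someone resumes the chain.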

Furthermore, B&R offers chaining of just about any job via the console anyway; Backup, Tape, and Surebackup jobs can all be scheduled 'After this job'. Are you saying you 'recommend strongly against' using the features you provide in your own software?
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Confused by v12 storage-level corruption guard

Post by Gostev »

One example is some job taking 10x longer due to API calls against some infrastructure component timing out due to overload, or because of suddenly 10x more disk changes in protected VMs, or because of network congestion (due to a firewall bottlenecking or some network port falling back to 100Mbps) etc... we've seen it all in 15 years. But one thing is always common: as a result, our customers end up without latest backups on dependent jobs at the worst possible moment.

And yes, that is exactly what I am saying. I was strongly against introducing this feature in the first place for the reasons I've mentioned above, however I was not the ultimate decision maker at the time (this was over 10 years ago) and the CTO insisted. It has been an endless source of support issues ever since due to customers chaining primary jobs... but I also can't remove it due to the significant number of people using it. Plus, optional non-critical activities can be fine to chain, so it would not be right to remove this functionality completely anyway.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Confused by v12 storage-level corruption guard

Post by Gostev »

JPMS wrote: May 02, 2023 10:34 am I actually think you have made it more complicated because Health Checks now function differently to other jobs. Leaving aside my own requirements, I think it still would be simpler if they worked the same way as other jobs because there would be a consistent way of doing things.
Considering it functions exactly the same way as other jobs: you just schedule health check to run at a certain day/time and it does, is "complicated" really the right word?

Having said that, now that we made Health Check a completely independent process in V12, we can actually consider adding a health check scheduling option to make it start after the job has finished executing. This was causing a massive number of support cases before due to performance issues (backup jobs slowing down and even timing out due to concurrent health check processes of already completed jobs overloading backup storage) because health check was an integral part of a backup job that had to process the entire backup at once. But starting from V12, it became a dedicated process with the lowest priority in terms of backup infrastructure resources access, and (I assume) it works on per-machine level now as everything else in V12.
JPMS
Expert
Posts: 105
Liked: 31 times
Joined: Nov 02, 2019 6:19 pm
Contact:

Re: Confused by v12 storage-level corruption guard

Post by JPMS »

Gostev wrote: Considering it functions exactly the same way as other jobs: you just schedule health check to run at a certain day/time and it does, is "complicated" really the right word?
'Inconsistent' would be a better word. Rather than setting it up as a job in itself (like other jobs), it is a setting within a backup job, in the 'storage' section, in advanced settings, even though it now runs independently of the backup job. So it is now a separate job but isn't configured like a separate job. Personally I would have given it its own job section, like Backup, Surebackup and Tape. You've also put the 'History' under 'System', rather than 'Jobs', which again I think is inconsistent (and I only noticed by accident when trying to find out what was happening with my Health Checks). I also won't dwell on the lack of notifications, which I assume was an oversight, and will be corrected at some stage, rather than a design decision.

I can understand your reasons for making the Health Check a separate process. I just think you should have implemented it in the same way as other B&R jobs.

With regards to chaining, I'm still unclear how chaining jobs is a worse solution than scheduling jobs, in the event of an infrastructure failure. Clearly I do not have the depth of experience that you have with dealing with failures and I am missing something. I also have the advantage of dealing with a comparatively small, simple, environment.
RubinCompServ
Service Provider
Posts: 261
Liked: 66 times
Joined: Mar 16, 2015 4:00 pm
Full Name: David Rubin
Contact:

Re: Confused by v12 storage-level corruption guard

Post by RubinCompServ » 2 people like this post

Gostev wrote: May 02, 2023 11:46 am One example is some job taking 10x longer due to API calls against some infrastructure component timing out due to overload, or because of suddenly 10x more disk changes in protected VMs, or because of network congestion (due to a firewall bottlenecking or some network port falling back to 100Mbps) etc... we've seen it all in 15 years. But one thing is always common: as a result, our customers end up without latest backups on dependent jobs at the worst possible moment.

And yes, that is exactly what I am saying. I was strongly against introducing this feature in the first place for the reasons I've mentioned above, however I was not the ultimate decision maker at the time (this was over 10 years ago) and the CTO insisted. It has been an endless source of support issues ever since due to customers chaining primary jobs... but I also can't remove it due to the significant number of people using it. Plus, optional non-critical activities can be fine to chain, so it would not be right to remove this functionality completely anyway.
OTOH, we have a customer with many backup jobs scheduled out so that the previous one should complete before the next one runs, but if the first one runs long (because, as you said, things happen), the second one starts competing for resources. So while they are both fighting, the third one then kicks off. And so on until the morning when they realize they've received no backup notifications (success or otherwise) and discover that there are 25 jobs running, with 15 of them still sitting at 0% and the other 10 sitting anywhere from 1%-99%, but can't be canceled because that leads to corrupt restore points and, occasionally, breaks the chain entirely.
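The trade-off between the two approaches discussed in this thread can be shown with a small Python sketch. It is a deliberately simplified model and every name in it (`fixed_schedule`, `chained`, `max_overlap`) is hypothetical. It assumes job durations are not affected by contention, which in reality would make the fixed-time case worse, and it uses minutes as an arbitrary time unit.

```python
def fixed_schedule(starts, durations):
    """Each job starts at its own clock time, whatever the others are doing."""
    return [(s, s + d) for s, d in zip(starts, durations)]


def chained(first_start, durations):
    """Each job starts only when the previous one has finished."""
    intervals, t = [], first_start
    for d in durations:
        intervals.append((t, t + d))
        t += d
    return intervals


def max_overlap(intervals):
    """Largest number of jobs running at the same moment."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal times an end (-1) sorts before a start (+1), so jobs that
    # run back to back are not counted as overlapping.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


if __name__ == "__main__":
    # Three jobs planned at 60 minutes each; the first one overruns to 200.
    actual = [200, 60, 60]
    timed = fixed_schedule([0, 60, 120], actual)
    linked = chained(0, actual)
    print("fixed  :", timed, "peak overlap", max_overlap(timed))
    print("chained:", linked, "peak overlap", max_overlap(linked))
```

In this example the fixed-time schedule ends up with two jobs on the repository at once, while the chained schedule never overlaps but pushes the last job's finish from minute 180 to minute 320. That is the same disagreement as in the posts above: overlap and contention on one side, late dependent backups on the other.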
JPMS
Expert
Posts: 105
Liked: 31 times
Joined: Nov 02, 2019 6:19 pm
Contact:

Re: Confused by v12 storage-level corruption guard

Post by JPMS » 1 person likes this post

Surebackup error tonight...

6/9/2023 8:22:17 PM Error [MyServer]: Error: Item [MyServer.934cea6e-4e05-48ec-8668-445e9d2257eD2023-06-09T200515_1C7A.vbk] is locked by running session HealthCheck MyBackup VMs [Backup Health Check]

Never had this issue when Health Check was part of the backup and wouldn't have it if we could chain the Health Check.
JPMS
Expert
Posts: 105
Liked: 31 times
Joined: Nov 02, 2019 6:19 pm
Contact:

Re: Confused by v12 storage-level corruption guard

Post by JPMS »

Any update on future plans for Health Check?

Most important is the lack of email notification for success/failure. I had expected this would have got patched pretty quickly once it was brought to your attention because it is not acceptable to have to check this manually. Manual processes always have the potential to get missed/forgotten and that is the last thing you need with a backup solution.

It would also be good to know if PowerShell commands will be made available and what plans there are (if any) to make Health Check more controllable as a job within the GUI.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Confused by v12 storage-level corruption guard

Post by Gostev »

Let's please not mix everything - from SureBackup issues to different feature requests - into the same thread, as this is totally unmanageable.

If you're interested in the status of the health check email notifications, then there's the dedicated thread about this functionality (linked above).

If you have a feature request for our PowerShell SDK, please post into the PowerShell subforum where the corresponding PM responsible for evolving our PowerShell capabilities can see it and comment. And same for any other feature requests.