Health Check Woes

Post by **RobMiller86** » Mar 06, 2024 3:21 pm

First, I just discovered our health checks haven’t been running. Apparently, with the changes in V12, the time you select for the health check is not entirely independent of the backup job schedule. I thought the health check schedule was now separate. Before, it just ran on the same day, as it didn’t have a specific time. Turns out, it now follows the same blackout window as the backup job. So if you have a maintenance window of 12am to 5am for the backup job, and you set your health check to run at 4am, it will simply never run and never alert you, and all the while you think health checks are running when they aren’t. Personally, I think that should be more obvious or intuitive.

Then I was examining changing things up and only using SureBackup jobs for health checks so that we can truly independently schedule and monitor our verification jobs. I opened a ticket and specifically asked, with screenshots, if the integrity check on the SureBackup job is the same as the storage level corruption guard check on a backup job as the wording was similar but different, and if this is simply two different ways to schedule the same routine. I was told no, that the SureBackup integrity check option doesn’t heal corruption. I then asked “well what does it do then, simply alert you to corruption?”. Some days later I received a response that it does in fact heal corruption and the first response was incorrect. They stated the description is different, but the routine is the same. Unfortunately, I had already made a training video based on the first response, which now has incorrect info and I will have to adjust our standards and go through that again.

So now I am attempting to standardize on a health check regime that will function the way we desire and not be subject to silent failures as well as being robust. I plan on using a SureBackup job instead, test booting VMs, and adding the SureBackup integrity option to that job, rather than using storage level corruption guard on the backup job itself. It seems this is the only way to have better visibility into health checks and have more control over scheduling. I also need to develop training docs and videos based on our standards, so I want to be sure we are using the best possible options based on accurate information.

Veeam can you confirm that the two different methods of running a health/integrity check are in fact the same? If they are, I think it would be helpful if the name and description were the same in the UI as apparently it even confuses Veeam support.

Thanks,
Rob

Mar 06, 2024 4:11 pm

Yes, it's the same method: reading data blocks from backup and checking their content against their hashes. This is not necessarily the same functionality though, as SureBackup merely reports an issue, while as far as I remember the job activity in many cases will act further to try and fix the issue automatically by collecting invalid data anew during the next job run.
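To make "reading data blocks from backup and checking their content against their hashes" concrete, here is a minimal Python sketch of the general idea. It is not Veeam's implementation: the block size, the choice of SHA-256, and the function names are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tiny, illustrative block size (assumption, not a real backup block size)


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def hash_block(block: bytes) -> str:
    """Hash a single block. SHA-256 is an assumption for this sketch."""
    return hashlib.sha256(block).hexdigest()


def find_corrupt_blocks(blocks: list, expected_hashes: list) -> list:
    """Return the indexes of blocks whose current hash differs from the stored hash."""
    return [
        index
        for index, (block, expected) in enumerate(zip(blocks, expected_hashes))
        if hash_block(block) != expected
    ]


# Hashes are recorded when the backup is written.
original = split_blocks(b"ABCDEFGHIJKL")
stored_hashes = [hash_block(block) for block in original]

# Later, one byte in the second block is silently damaged on disk.
damaged = split_blocks(b"ABCDEFxHIJKL")
corrupt = find_corrupt_blocks(damaged, stored_hashes)
```

A check like this can only report which blocks no longer match. Repairing them needs a good copy of the data from somewhere else, which is the distinction drawn above between merely reporting an issue and collecting the invalid data anew.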

@Egor Yakovlev it is of course totally unexpected that the blackout window (designed to prevent snapshot creation spilling into production hours) applies to the health check activity, especially considering it is scheduled separately now and the user specifically chooses the "good" time to perform it. I would guess they forgot to remove the blackout window check when separating health check from a job. Could you ask them to comment this check out in 12.1.2 if it is an easy change?

Mar 06, 2024 4:33 pm

Thanks Gostev. I'm glad I posted as Veeam support is giving out some incorrect information and these operations are critical to understand. So in that case, if a SureBackup job with the integrity check option selected detects corruption, will it then error out the SureBackup job so that an alarm is generated in VSPC? At that point, could we then add a storage level corruption guard check to the backup job to have it heal the corruption? I'm wary of using the scheduled storage level corruption guard check due to it potentially not running, like I just experienced, and we have a large team all learning Veeam as we roll out more of it.

For that matter, suppose the SureBackup job, without an integrity check and only booting the VM, passes and the VM boots. How does Veeam feel about that check alone rather than running lengthy corruption guard checks? How likely is it that a VM restore point of a Windows server boots properly, but there is corruption in the backup files anyway? Personally I'd love to only use SureBackup booting VMs as a test, rather than the lengthy health checks, but don't want to leave us high and dry with corrupted backups at some point. We do sync all backups to object storage as well, and I don't feel it's likely they would be corrupted in both locations if something did occur.

So SureBackup boot tests, without any integrity checks, while also syncing backups to object storage, good enough? Yay or nay?

Post by **RobMiller86** » Mar 06, 2024 4:51 pm

Btw, the exact final answer from Veeam support on this question was:

"It appears to be the same process, just attached to different jobs. If the integrity check on SureBackup fails, it should attempt to repair it just the same."

Post by **Gostev** » Mar 06, 2024 5:06 pm

Generally speaking, neither recoverability nor backup integrity tests are "good enough" on their own.

For example, a successful recoverability test does not guarantee that a full machine restore will be successful, as it may fail due to one corrupted data block... so you need a backup integrity check to guarantee success for such restores. The opposite is also true: you may have a successful backup integrity test and a successful full machine restore, but the machine will not boot (or some app will not start) due to corruption that already existed in the production machine when the backup was taken.

So ideally you should do both. Backup integrity checks are critically important for non-enterprise grade backup storage, as we see it lose data all the time in Support. Enterprise-grade storage, on the other hand, is quite unlikely to have such issues, just because it has many technologies built in to ensure you don't have to worry about it... so if you trust your storage vendor to do the One Job they have well, then you can skip integrity checks. For example, the "patrol reads" feature of enterprise-grade RAID controllers is conceptually close to our integrity check.

Thanks for sharing the response from support, perhaps this has evolved in the recent versions. I'm sure the PM from SureBackup will comment if this is not so.

Mar 06, 2024 5:33 pm

Thanks. All of our new repos are Dell servers with SAS drives and BBU PERC H755 controllers. Many of the old ones are Thinkmate Supermicro boxes with BBU LSI MegaRAID controllers and SATA drives, which we will be running until they give up the ghost (purchased before my time here). I'm guessing neither of these are considered Ent enough to forgo health checks. We know not to use NAS for backups, but I'm also thinking these servers simply aren't good enough to truly trust on their own. Perhaps we will isolate large servers to their own backup and SureBackup jobs to help the situation.

I'm also thinking to only use the SureBackup integrity and boot tests, and then opening a support ticket if corruption is found to determine the best way to preserve the backups prior to the point of corruption. I do like having more visibility into the verification jobs.

Mar 07, 2024 12:57 pm

Hi Rob.

A case where a health check session depends on the configured backup job window is of course an unexpected one. Our QA could not reproduce it, and the health check starts just fine regardless of the restrictions in the backup job schedule.
Please open a support case for that job and share its number with me. We will have to validate the scenario with more details from the logs.

Post by **RobMiller86** » Mar 07, 2024 4:16 pm

Well, another change I missed I guess. I may be going crazy, but I remember I used to see the health check just in the job logs, session stats or reports, but now it appears you have to go to a different section. Click on the tiny history button at bottom, and check the health check section there.

That then leads me to the next issue. The health checks have been attempting to run. Health Check kicks off at 4am, runs, still running at 5am, then the backup jobs start at 5am, and they stop the health check. "stopped by job XXX". It's not clear when you check the VBR console that health checks are failing unless you dig down into that specific section of the logs. Moreover, VSPC is giving no notification or alarms about the health checks failing.
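The 4am/5am collision described above can at least be caught on paper before it happens. The following is a rough Python sketch, not anything Veeam provides: it only does date arithmetic on a health check start time, an expected duration (a hypothetical figure you would take from past sessions), and the next backup start. The function names and the 30-minute margin are assumptions.

```python
from datetime import datetime, timedelta


def health_check_at_risk(check_start: datetime,
                         expected_duration: timedelta,
                         next_backup_start: datetime) -> bool:
    """True if the health check is expected to still be running when the backup job starts."""
    return check_start + expected_duration > next_backup_start


def latest_safe_start(next_backup_start: datetime,
                      expected_duration: timedelta,
                      margin: timedelta = timedelta(minutes=30)) -> datetime:
    """Latest start time that leaves room for the expected duration plus a safety margin."""
    return next_backup_start - expected_duration - margin


# Times from the post: health check at 4am, backup jobs at 5am.
check = datetime(2024, 3, 7, 4, 0)
backup = datetime(2024, 3, 7, 5, 0)
duration = timedelta(hours=1, minutes=30)  # hypothetical, e.g. taken from earlier sessions

at_risk = health_check_at_risk(check, duration, backup)
suggested_start = latest_safe_start(backup, duration)
```

This does not solve the monitoring gap, since a health check that runs longer than expected would still be stopped, but it flags schedules that cannot work even in the normal case.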

So it appears that I stated the problem incorrectly. How does one set it so that a backup job doesn't cancel a running health check? How do we monitor this with VSPC? The job session alarm in VSPC isn't showing us any failures. The only health check related alarm I can see to configure in VSPC is for SureBackup jobs. So do we need to use SureBackup jobs instead if we want alarms in VSPC for health check failures? It's entirely possible that I'm missing something else here. Thanks.

Post by **Gostev** » Mar 07, 2024 6:03 pm

Backup job will always cancel optional activities like health check. Backup is simply a higher priority action, you never want to end up with no backups because for example your health check takes a week due to some switch port dropping to 10Mbps.

As for VSPC questions, please create a separate topic in the VSPC subforum, so that VSPC PMs can see it and answer.

Mar 07, 2024 6:11 pm

Reasoning aside, the end result with the way Veeam works currently, is that we have failing checks that we didn't know about and it makes it very hard to detect without constant manual checks and apparently very careful scheduling attempting to balance a verification window with a backup window. This wasn't a problem before as the backup job would wait for the health check to complete. Yes that means you could have delays in backups, but your health checks didn't silently fail. Would SureBackup job integrity checks also be stopped by backup jobs? This seems highly limiting. I will make a post in the forum about the VSPC alarms. But this lack of visibility, combined with scheduling inflexibility is tough to swallow.

Mar 08, 2024 1:48 pm

Personally, I struggle to understand the logic behind this change to treat all health check related operations, including health checks and SureBackup jobs, as lowest priority and allow any other operation to cancel them. If someone schedules a health op, it’s because they value that health op and want it to run. If the operation is causing other jobs to be delayed in a way that impacts their production backup schedules negatively, then that should be on them to adjust their windows to fit it in, or simply not schedule a health op as their desired backup schedule doesn’t allow for the necessary time.

Instead, Veeam has decided for us that these operations are of the lowest priority. Veeam has flipped this around and made it so that we can have even fewer backups running, as we must carefully schedule a wide backup blackout window to cover a wide maint window to ensure those health operations complete.

As a leader in the data protection space, I’m a bit surprised that Veeam has taken the stance that verifications are considered “optional activities”. I find this to be the wrong approach to the situation. The onus should be on those scheduling health checks at inopportune times impacting their backups. The onus should not be on those who value health checks, and now must sacrifice additional backup time to allow for very wide health op windows. I couldn’t possibly disagree with this decision more. At a minimum, this should be selectable by the operator, even if via a registry key if necessary.

That said, I haven’t received an answer about the lack of VSPC alarms, but I do have some questions.
1. Will a backup blackout window, set to stop all backups even if not completed, still allow SOBR offloads to run during the blackout window? I assume so since SOBR offloads are not tied to the backup job.
2. Will SOBR offloads also interrupt health check operations?
3. Why is it that Veeam can't allow some form of health check, on previous backups, while allowing new backups to roll? Even if you have to exclude the most recent restore point, or make that selectable. "Test backup points excluding the most recent X number of restore points". Or something like that so we can run some form of health check that doesn't impede new backups?

Post by **jordi-simtec** » Mar 21, 2024 11:44 am

Hi,

I agree with what most people have commented on this thread. It seems that this new design has been implemented without taking into consideration the importance of the health checks, which is a feature that sets VB&R apart from other solutions.

In order to prevent the health check tasks from silently aborting without any notification, what would be the recommended best practise? Is there really no way to monitor this situation through the VSPC?

JPMS · Apr 29, 2024 9:23 am

I wrote about this a year ago here - veeam-backup-replication-f2/confused-by ... 86604.html

The TLDR version is that Veeam doesn't really seem to have thought this through. When Health Checks were part of the backup job, yes, they made the backup job longer, but at least you knew they were going to complete and not interfere with other jobs. Although they moved it to a separate job, they didn't give it all the controls that you can have over other jobs and only allowed it to be scheduled at a specific time, with no coordination possible as to what is happening with other jobs at the same time. The issue is not just Health Checks getting cancelled by backups but SureBackup jobs failing because of files in use by Health Checks.

We run all our jobs chained through a PowerShell script. We have a limited backup window and this has been the most efficient way to utilise that time. However, because there is no control over when the Health Check runs (apart from a set time), we often end up with Health Checks being cancelled or SureBackup jobs failing because the files are in use by a Health Check. We have made a request for PowerShell commands to be able to control the running of Health Checks but there is no commitment to do it yet - powershell-f26/ability-to-run-a-health- ... 93528.html

I'm disappointed and frustrated that, a year on, Veeam doesn't seem to have acknowledged that this change in v12 caused a problem, or made a commitment to address the issues it has caused.

Apr 29, 2024 4:32 pm

Hi all.

In terms of notification\awareness\monitoring of Health Check sessions:
- we have added a Health Check Daily Summary email (part of global email notification settings)
- new events 41700\41710 were added to Windows Event Log\Syslog stating status per-session with start\finish details
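For anyone who wants to pick those two event IDs out of an exported log or a syslog feed, here is a minimal Python sketch. Only the IDs 41700 and 41710 come from the post above; the sample lines, their layout, and the function name are invented for illustration, and the real message format should be checked on your own server.

```python
import re

# Event IDs quoted in the post above; what each one means is not spelled out in the thread.
HEALTH_CHECK_EVENT_IDS = {"41700", "41710"}

# Matches standalone five-digit numbers, so timestamps are not picked up by accident.
_ID_PATTERN = re.compile(r"\b(\d{5})\b")


def health_check_lines(lines):
    """Return (event_id, line) pairs for lines that mention a health check event ID."""
    matches = []
    for line in lines:
        for token in _ID_PATTERN.findall(line):
            if token in HEALTH_CHECK_EVENT_IDS:
                matches.append((token, line))
                break  # one match per line is enough
    return matches


# Hypothetical sample lines; the real line layout is an assumption.
sample = [
    "2024-04-29T04:00:01 vbr01 EventID=41700 health check session event",
    "2024-04-29T04:10:00 vbr01 EventID=12345 some unrelated event",
    "2024-04-29T05:00:02 vbr01 EventID=41710 health check session event",
]
found = health_check_lines(sample)
```

Forwarding only the matching lines to whatever alerting you already use would give a per-session signal instead of waiting for the daily summary email.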

As for the health check task priority, I am positive there was a service registry key added in one of the latest updates. Please, let me check that with the team tomorrow.

/Thanks!

Post by **JPMS** » Apr 29, 2024 4:46 pm

It came up in the post I linked to.

BackupHealthCheckPreventInterruption = 1

I've asked for clarification of how it works but haven't had a reply yet. What happens if a Backup Job tries to run? Does it wait for the Health Check to complete and run or does it fail the Backup Job and schedule a retry?

It also wouldn't help with one of the issues I have. If I am running a SureBackup job and a Health Check is running, the SureBackup job fails because the files are locked by the Health Check. I would really like the same control over Health Checks that I have with any other B&R job, particularly with regard to when and how I choose to run it. It all worked without any of these issues the way it was implemented in V11!

JPMS · Apr 30, 2024 9:07 am

The other thing it would be useful to know is whether it works with all jobs (SureBackup, Tape etc.) or just Disk Backup jobs.

Apr 30, 2024 11:11 am

Yes, that key is the one to bump health check priority. If the health check has started processing a backup, other jobs will fail to start (exclusive lock). A backup job will wait for the health check to end before it starts processing new restore points.
A SOBR offload will not even start in that case, and vice versa: if a SOBR offload is already working (has locked the backup), the health check will not take over from it.
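Putting the answers in this thread side by side, the behaviour can be written down as a small lookup table. This Python sketch is only a model of what has been stated here (backup cancels a health check by default; with BackupHealthCheckPreventInterruption = 1 the backup waits and other jobs do not start). The activity names, the `Outcome` enum, and the `resolve` function are hypothetical, and any combination the thread does not cover is returned as unknown rather than guessed.

```python
from enum import Enum


class Outcome(Enum):
    HEALTH_CHECK_CANCELLED = "running health check is stopped, backup job runs"
    INCOMING_WAITS = "incoming job waits for the health check to finish"
    INCOMING_NOT_STARTED = "incoming activity does not start"
    UNKNOWN = "not stated in this thread"


# Keys: (running activity, incoming activity, prevent-interruption key set to 1)
RULES = {
    ("health_check", "backup", False): Outcome.HEALTH_CHECK_CANCELLED,
    ("health_check", "backup", True): Outcome.INCOMING_WAITS,
    ("health_check", "sobr_offload", True): Outcome.INCOMING_NOT_STARTED,
    ("health_check", "other_job", True): Outcome.INCOMING_NOT_STARTED,
    ("sobr_offload", "health_check", True): Outcome.INCOMING_NOT_STARTED,
}


def resolve(running: str, incoming: str, prevent_interruption: bool) -> Outcome:
    """Look up what the thread says happens when `incoming` starts while `running` is active."""
    return RULES.get((running, incoming, prevent_interruption), Outcome.UNKNOWN)


default_behaviour = resolve("health_check", "backup", prevent_interruption=False)
with_key = resolve("health_check", "backup", prevent_interruption=True)
unanswered = resolve("health_check", "sobr_offload", prevent_interruption=False)
```

Laid out like this, the gaps are easy to see: whether a SOBR offload interrupts a health check without the key has not been answered here.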

JPMS · Apr 30, 2024 11:19 am

So all it achieves is making sure Health Checks progress to conclusion at the cost of failing other jobs. So we only move the problem elsewhere. This is not a solution.

Can you see why, as a user, I am not very happy at all with the changes made in v12 when the way it worked in v11 gave no problems at all?

As I said before, if you want to move Health Checks to a separate process (from backups) then give us the tools to run it like any other job with proper flexible scheduling options and Powershell support.

Post by **Egor Yakovlev** » May 02, 2024 9:22 am

Totally, and we will continue to tune health check behavior going forward, based on feedback like what we see in this thread.

May 03, 2024 7:12 pm

JPMS wrote: ↑Apr 30, 2024 11:19 am So all it achieves is making sure Health Checks progress to conclusion at the cost of failing other jobs. So we only move the problem elsewhere. This is not a solution.

Can you see why, as a user, I am not very happy at all with the changes made in v12 when the way it worked in v11 gave no problems at all?

As I said before, if you want to move Health Checks to a separate process (from backups) then give us the tools to run it like any other job with proper flexible scheduling options and Powershell support.

The whole reason they separated health checks was planning ability. Instead of also allowing the old style to run directly after the job, they just chose the scheduling option, with no option to run it directly after the backup. A separate job type would bring both benefits.

This created a lot of headaches for us as well, as we had to reevaluate the whole schedule.

JPMS · May 04, 2024 10:45 am

While you are also considering feedback about Health Checks, can you please revisit email notifications?

@Gostev posted this in another discussion...

Gostev wrote: ↑Dec 11, 2023 5:14 pm I guess the team was acting on my general guidance to not flood customers with instant emails for every single thing that happens. Because we have so many features by now with their own reports that customers tell us they stop reading them altogether!

But perhaps this can be made an option for those who want to be notified instantly.

Me personally, I think I would schedule health check for some weekend hours and would prefer to receive a summary email before I wake up on Monday morning, which is when I'm going to action it.

I don't want notifications Monday morning. I need notifications straight away, when the job has completed. If there is an issue, I need to address it straight away (even at weekends). It is also much easier to keep track of jobs if the notifications are grouped together not spaced over a couple of days.

ITP-Stan · May 06, 2024 8:38 am

In smaller environments it makes more sense to prioritize a backup over a health check.
So make it an option to configure to keep enterprise and SMB happy.
