RobMiller86
Service Provider
Posts: 142
Liked: 21 times
Joined: Oct 28, 2019 7:10 pm
Full Name: Rob Miller
Contact:

Health Check Woes

Post by RobMiller86 »

First, I just discovered our health checks haven’t been running. Apparently, even with the V12 change that lets you select a time for the health check, the health check is not entirely independent of the backup job schedule. I thought the schedule of the health check was now separate. Before, it just ran on the same day, as it didn’t have a specific time. It turns out it now follows the same blackout window as the backup job. So if you have a maintenance window of 12am to 5am for the backup job and you set your health check to run at 4am, it will simply never run and never alert you, and all the while you think health checks are running when they aren’t. Personally, I think that should be more obvious or intuitive.

Then I looked into changing things up and using only SureBackup jobs for health checks, so that we can schedule and monitor our verification jobs truly independently. I opened a ticket and specifically asked, with screenshots, whether the integrity check on the SureBackup job is the same as the storage-level corruption guard check on a backup job, since the wording is similar but not identical, and whether these are simply two different ways to schedule the same routine. I was told no, that the SureBackup integrity check option doesn’t heal corruption. I then asked, “Well, what does it do then, simply alert you to corruption?” Some days later I received a response that it does in fact heal corruption and that the first response was incorrect. They stated that the description is different, but the routine is the same. Unfortunately, I had already made a training video based on the first response, which now contains incorrect info, so I will have to adjust our standards and go through that again.

So now I am attempting to standardize on a health check regime that is robust, functions the way we want, and is not subject to silent failures. I plan on using a SureBackup job instead, test booting VMs and adding the SureBackup integrity option to that job, rather than using storage-level corruption guard on the backup job itself. It seems this is the only way to get better visibility into health checks and more control over scheduling. I also need to develop training docs and videos based on our standards, so I want to be sure we are using the best possible options based on accurate information.

Veeam, can you confirm that the two different methods of running a health/integrity check are in fact the same? If they are, I think it would be helpful if the name and description were the same in the UI, as apparently the difference confuses even Veeam support.

Thanks,
Rob
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Health Check Woes

Post by Gostev »

Yes, it's the same method: reading data blocks from the backup and checking their content against their hashes. This is not necessarily the same functionality though, as SureBackup merely reports an issue, whereas, as far as I remember, the job's health check will in many cases go further and try to fix the issue automatically by collecting the invalid data anew during the next job run.
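
As a rough illustration of "reading data blocks and checking their content against their hashes", here is a minimal Python sketch. It is not Veeam's implementation: the function name, the in-memory block layout, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def find_corrupt_blocks(blocks: Dict[int, bytes], expected: Dict[int, str]) -> List[int]:
    """Return the IDs of blocks whose content no longer matches the recorded hash.

    Hypothetical layout: `blocks` maps block ID -> data read back from storage,
    `expected` maps block ID -> hex digest recorded when the block was written.
    """
    corrupt = []
    for block_id, data in blocks.items():
        actual = hashlib.sha256(data).hexdigest()  # SHA-256 is an assumption
        if actual != expected.get(block_id):
            corrupt.append(block_id)
    return sorted(corrupt)


# Hypothetical example data: three blocks, hashes recorded at backup time.
blocks = {0: b"alpha", 1: b"bravo", 2: b"charlie"}
expected = {i: hashlib.sha256(d).hexdigest() for i, d in blocks.items()}

# Simulate silent corruption of block 1 on the repository.
blocks[1] = b"brav0"

print(find_corrupt_blocks(blocks, expected))  # [1]
```

A check of this kind can only detect and report the bad block. Repairing it requires a fresh copy of the data from the source, which matches the point above about the job collecting the invalid data anew on its next run.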

@Egor Yakovlev it is of course totally unexpected that the blackout window (designed to prevent snapshot creation spilling into production hours) applies to the health check activity, especially considering it is scheduled separately now and the user specifically chooses the "good" time to perform it. I would guess they forgot to remove the blackout window check when separating the health check from the job. Could you ask them to comment this check out in 12.1.2 if it's an easy change?

Re: Health Check Woes

Post by RobMiller86 »

Thanks Gostev. I'm glad I posted, as Veeam support is giving out some incorrect information and these operations are critical to understand. So in that case, if a SureBackup job with the integrity check option selected detects corruption, will the SureBackup job then error out so that an alarm is generated in VSPC? At that point, could we add a storage-level corruption guard check to the backup job to have it heal the corruption? I'm wary of using the scheduled storage-level corruption guard check because it may silently not run, as I just experienced, and we have a large team all learning Veeam as we roll out more of it.

For that matter, suppose the SureBackup job, without an integrity check and only booting the VM, passes and the VM boots. How does Veeam feel about that check alone, rather than running lengthy corruption guard checks? How likely is it that a VM restore point of a Windows server boots properly, but there is corruption in the backup files anyway? Personally, I'd love to use only SureBackup VM boot tests rather than the lengthy health checks, but I don't want to leave us high and dry with corrupted backups at some point. We do sync all backups to object storage as well, and I don't feel it's likely they would be corrupted in both locations if something did occur.

So, SureBackup boot tests without any integrity checks, while also syncing backups to object storage: good enough? Yea or nay?

Re: Health Check Woes

Post by RobMiller86 »

Btw, the exact final answer from Veeam support on this question was:

"It appears to be the same process, just attached to different jobs. If the integrity check on SureBackup fails, it should attempt to repair it just the same."

Re: Health Check Woes

Post by Gostev »

Generally speaking, neither recoverability nor backup integrity tests are "good enough" on their own.

For example, a successful recoverability test does not guarantee that a full machine restore will be successful, as it may fail due to one corrupted data block... so you need a backup integrity check to guarantee success for such restores. The opposite is also true: you may have a successful backup integrity test and a successful full machine restore, but the machine will not boot (or some app will not start) due to corruption that already existed in the production machine when the backup was taken.

So ideally you should do both. Backup integrity checks are critically important for non-enterprise-grade backup storage, as we see it lose data all the time in Support. Enterprise-grade storage, on the other hand, is quite unlikely to have such issues, just because it has many technologies built in to ensure you don't have to worry about it... so if you trust your storage vendor to do the One Job they have well, then you can skip integrity checks. For example, the "patrol reads" feature of enterprise-grade RAID controllers is conceptually close to our integrity check.

Thanks for sharing the response from support; perhaps this has evolved in recent versions. I'm sure the SureBackup PM will comment if this is not so.

Re: Health Check Woes

Post by RobMiller86 »

Thanks. All of our new repos are Dell servers with SAS drives and BBU PERC H755 controllers. Many of the old ones are Thinkmate Supermicro boxes with BBU LSI MegaRAID controllers and SATA drives, which we will be running until they give up the ghost (purchased before my time here). I'm guessing neither of these is considered enterprise enough to forgo health checks. We know not to use NAS for backups, but I'm also thinking these servers simply aren't good enough to truly trust on their own. Perhaps we will isolate large servers to their own backup and SureBackup jobs to help the situation.

I'm also thinking of using only the SureBackup integrity and boot tests, and then opening a support ticket if corruption is found to determine the best way to preserve the backups prior to the point of corruption. I do like having more visibility into the verification jobs.
Egor Yakovlev
Veeam Software
Posts: 2537
Liked: 683 times
Joined: Jun 14, 2013 9:30 am
Full Name: Egor Yakovlev
Location: Prague, Czech Republic
Contact:

Re: Health Check Woes

Post by Egor Yakovlev »

Hi Rob.

A case where the health check session depends on the configured backup job window is of course unexpected. Our QA could not reproduce it: the health check starts just fine regardless of the restrictions in the backup job schedule.
Please open a support case for that job and share its number with me; we will have to validate the scenario with more details from the logs.

Re: Health Check Woes

Post by RobMiller86 »

Well, another change I missed, I guess. I may be going crazy, but I remember I used to see the health check right in the job logs, session stats or reports, whereas now it appears you have to go to a different section: click on the tiny history button at the bottom and check the health check section there.

That then leads me to the next issue. The health checks have in fact been attempting to run. The health check kicks off at 4am and is still running at 5am, when the backup jobs start and stop it ("stopped by job XXX"). It's not clear from the VBR console that health checks are failing unless you dig down into that specific section of the logs. Moreover, VSPC is giving no notifications or alarms about the health checks failing.
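
To make the collision described above concrete, here is a minimal Python sketch of the timing arithmetic. It does not call any Veeam API; the function names, the example dates, and the three-hour duration estimate are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def will_be_interrupted(check_start: datetime, expected_duration: timedelta,
                        next_backup_start: datetime) -> bool:
    """True if the health check would still be running when the backup job starts."""
    return check_start < next_backup_start < check_start + expected_duration


def latest_safe_start(next_backup_start: datetime,
                      expected_duration: timedelta) -> datetime:
    """Latest start time that lets the check finish before the backup job starts."""
    return next_backup_start - expected_duration


# Hypothetical values mirroring the 4am check / 5am backup case.
check_start = datetime(2024, 1, 6, 4, 0)
backup_start = datetime(2024, 1, 6, 5, 0)
duration = timedelta(hours=3)  # assumed estimate of how long the check takes

print(will_be_interrupted(check_start, duration, backup_start))  # True
print(latest_safe_start(backup_start, duration))  # 2024-01-06 02:00:00
```

With these assumed numbers the check would need to start by 2am to finish before the 5am backup. The estimate itself is the weak point, since the real duration depends on backup size and repository speed.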

So it appears that I stated the problem incorrectly. How does one set it so that a backup job doesn't cancel a running health check? How do we monitor this with VSPC? The job session alarm in VSPC isn't showing us any failures. The only health check related alarm I can see to configure in VSPC is for SureBackup jobs. So do we need to use SureBackup jobs instead if we want alarms in VSPC for health check failures? It's entirely possible that I'm missing something else here. Thanks.

Re: Health Check Woes

Post by Gostev »

A backup job will always cancel optional activities like a health check. Backup is simply a higher-priority action: you never want to end up with no backups because, for example, your health check takes a week due to some switch port dropping to 10 Mbps.

As for VSPC questions, please create a separate topic in the VSPC subforum, so that VSPC PMs can see it and answer.

Re: Health Check Woes

Post by RobMiller86 »

Reasoning aside, the end result of the way Veeam currently works is that we have failing checks that we didn't know about, and they are very hard to detect without constant manual checks and, apparently, very careful scheduling that tries to balance a verification window against a backup window. This wasn't a problem before, as the backup job would wait for the health check to complete. Yes, that means you could have delays in backups, but your health checks didn't silently fail. Would SureBackup job integrity checks also be stopped by backup jobs? This seems highly limiting. I will make a post in the forum about the VSPC alarms. But this lack of visibility, combined with scheduling inflexibility, is tough to swallow.

Re: Health Check Woes

Post by RobMiller86 »

Personally, I struggle to understand the logic behind this change, which treats all health-related operations, whether health checks or SureBackup jobs, as lowest priority and allows any other operation to cancel them. If someone schedules a health op, it’s because they value that health op and want it to run. If the operation delays other jobs in a way that negatively impacts their production backup schedules, then it should be on them to adjust their windows to fit it in, or simply not schedule a health op because their desired backup schedule doesn’t allow the necessary time.

Instead, Veeam has decided for us that these operations are of the lowest priority. Veeam has flipped this around and made it so that we can have even fewer backups running, as we must carefully schedule a wide backup blackout window to cover a wide maintenance window and ensure those health operations complete.

As a leader in the data protection space, I’m a bit surprised that Veeam has taken the stance that verifications are considered “optional activities”. I find this to be the wrong approach to the situation. The onus should be on those scheduling health checks at inopportune times impacting their backups. The onus should not be on those who value health checks, and now must sacrifice additional backup time to allow for very wide health op windows. I couldn’t possibly disagree with this decision more. At a minimum, this should be selectable by the operator, even if via a registry key if necessary.

That said, I haven’t received an answer about the lack of VSPC alarms, but I do have some questions.
1. Will a backup blackout window, set to stop all backups even if not completed, still allow SOBR offloads to run during the blackout window? I assume so since SOBR offloads are not tied to the backup job.
2. Will SOBR offloads also interrupt health check operations?
3. Why is it that Veeam can't allow some form of health check, on previous backups, while allowing new backups to roll? Even if you have to exclude the most recent restore point, or make that selectable. "Test backup points excluding the most recent X number of restore points". Or something like that so we can run some form of health check that doesn't impede new backups?
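
To illustrate the kind of selection rule suggested in question 3, here is a minimal Python sketch. It is purely hypothetical: no such option is confirmed to exist in the product, and the function name, the `exclude_latest` parameter, and the example restore point labels are invented for the example.

```python
from typing import List, Sequence


def points_to_verify(restore_points: Sequence[str], exclude_latest: int) -> List[str]:
    """Return restore points eligible for a health check, skipping the newest ones.

    `restore_points` must be ordered oldest to newest. `exclude_latest` is the
    hypothetical "X" from "excluding the most recent X number of restore points".
    """
    if exclude_latest < 0:
        raise ValueError("exclude_latest must be non-negative")
    if exclude_latest == 0:
        return list(restore_points)
    return list(restore_points[:-exclude_latest])


# Hypothetical chain of four restore points, oldest first.
points = ["mon", "tue", "wed", "thu"]

print(points_to_verify(points, 1))  # ['mon', 'tue', 'wed']
```

The idea is that the newest points, which an active backup job may still be writing or merging, stay untouched while older points are verified. Whether the backup chain format would actually allow this is a question for Veeam.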
jordi-simtec
Service Provider
Posts: 3
Liked: 1 time
Joined: Sep 20, 2019 9:21 pm
Full Name: Jordi Casas
Contact:

Re: Health Check Woes

Post by jordi-simtec »

Hi,

I agree with what most people have commented in this thread. It seems that this new design was implemented without taking into consideration the importance of health checks, a feature that sets VB&R apart from other solutions.

What would be the recommended best practice to prevent health check tasks from silently aborting without any notification? Is there really no way to monitor this situation through VSPC?