Backup Health Checks Randomly Failing

Post by **cgsm** » Oct 05, 2021 4:22 pm

My backup health checks are randomly failing. The health check runs daily and fails on an incremental backup. However, when a synthetic full occurs (weekly on Fridays), the synthetic full is created successfully and subsequent backup health checks succeed (for a while). I have verified the result of the backup health check (failure or success) by trying to restore the VM the health check fails on (I identify the VM by running Veeam.Backup.Validator.exe manually).

It is confusing to me that the synthetic fulls are created successfully, and that the next backup health check succeeds, given that the synthetic full is created from prior incrementals where the backup health check fails.

My backups are stored on a ReFS formatted (64k) drive mounted via iSCSI. The underlying storage is a Synology device with 8x HDDs in a RAID 10 served over a 10Gb connection. My VBR server is running as a VM and I do not see any resource starvation issues. Proxy appliances are also not maxed out.

I have had a few cases open about this and we cannot figure out what is going on. Current open case: 05058401. Previous cases: 05027708 & 04991159.

The exact error message: Failed to perform backup file verification Error: A data integrity checksum error occurred. Data in the file stream is corrupt.
Asynchronous request operation has failed. [readsize = 1114112] [offset = 864611926016]
Agent failed to process method {Signature.FullRecheckBackup}.
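As a side calculation on the numbers in the error above, the following sketch converts the reported offset and read size into 64 KiB units, the cluster size mentioned for the ReFS volume. Treating the offset as a byte position inside the backup file is an assumption, and `locate` is a hypothetical helper; the result says nothing about where the data sits on the physical disks.

```python
CLUSTER_SIZE = 64 * 1024  # 64 KiB, the ReFS cluster size mentioned in the post


def locate(offset: int, read_size: int, cluster: int = CLUSTER_SIZE) -> dict:
    """Express a byte range as cluster indices (hypothetical helper)."""
    first = offset // cluster
    last = (offset + read_size - 1) // cluster
    return {
        "first_cluster": first,
        "last_cluster": last,
        "clusters_touched": last - first + 1,
        "offset_aligned": offset % cluster == 0,
        "offset_gib": offset / 2**30,
    }


# Values copied from the error message: [readsize = 1114112] [offset = 864611926016]
info = locate(offset=864611926016, read_size=1114112)
for key, value in info.items():
    print(f"{key}: {value}")
```

Both values are whole multiples of 64 KiB: the failed read starts at cluster 13,192,931, roughly 805 GiB into the file, and spans 17 clusters (1 MiB plus 64 KiB).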

Post by **Gostev** » Oct 05, 2021 6:25 pm

Well, clearly it's a storage issue. Our Health Check functionality was specifically created to catch those.

Please note that we don't recommend low-end NAS as a backup storage specifically due to reliability considerations.
These devices are simply not built for enterprise IT... they don't even have proper hardware RAID controllers.

They are built for consumer/SOHO because almost no applications used by such customers check for data integrity, and so all data losses just go unnoticed.
For example, it's next to impossible for a human to spot a tiny data corruption in most of the media files formats...

You should definitely look at replacing your NAS. We recommend general-purpose servers with enterprise-grade RAID controllers.

Post by **cgsm** » Oct 05, 2021 10:10 pm

Gostev,

First, my Synology unit is mid-range in my mind (RP2818RP+) given it has 18 bays, 10Gb Ethernet, dual PSUs, and above-entry-level CPU and memory. Regardless, I have a hard time saying "it's a storage issue" since this issue is intermittent and quite odd: synthetic fulls are fine after corrupt incrementals, and backups, restores, SureBackups, etc. run successfully 99.9% of the time.

Second, I ran a SureBackup job on the associated backup and it completes successfully, with Veeam detecting a heartbeat from the marked-as-corrupt VM. Since SureBackup is booting the VM from the same backup that is marked as corrupt by the backup health check, this just doesn't make sense.

Third, I could successfully perform an Instant Recovery of the marked-as-corrupt VM (to a new VM). This raises a similar concern to my second point: why does Instant Recovery work when Entire VM restore does not? This leads me to think this isn't a storage hardware issue!

Lastly, just for reference, I tried to manually restore the marked-as-corrupt VM (Entire VM restore) and it failed. I successfully restored another VM from the same backup.

Oct 05, 2021 10:16 pm

cgsm wrote: "Since SureBackup is booting the VM from the same backup that is marked as corrupt by the backup health check, this just doesn't make sense."

An Instant Recovery or SureBackup session doesn't read all blocks from the backup file, only the ones that are needed to boot Windows. All other blocks are not touched until something in the VM needs that data. It's possible that the corrupted blocks are not needed to boot the VM.

In an Entire VM restore, every block from the backup needs to be copied from the backup repository back to the production datastore.

If you want to test every block in a SureBackup session, run the "Backup File Integrity Scan" in the SureBackup job:
https://helpcenter.veeam.com/docs/backu ... ml?ver=110
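The difference between the two read patterns described above can be sketched with a toy model. This is not how Veeam is implemented; the class, the function names, the block counts, and the assumption that booting touches only the first 50 blocks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackupFile:
    """Toy backup: block_count blocks, some of which are silently corrupt."""
    block_count: int
    corrupt_blocks: frozenset

    def read_block(self, index: int) -> None:
        # A real product would compare a stored checksum here; the toy just
        # looks the index up in the set of known-bad blocks.
        if index in self.corrupt_blocks:
            raise IOError(f"checksum mismatch in block {index}")


def instant_recovery(backup: BackupFile, blocks_needed_to_boot) -> bool:
    """Read only the blocks the guest requests while booting."""
    try:
        for index in blocks_needed_to_boot:
            backup.read_block(index)
        return True
    except IOError:
        return False


def entire_vm_restore(backup: BackupFile) -> bool:
    """Copy every block back to production storage."""
    try:
        for index in range(backup.block_count):
            backup.read_block(index)
        return True
    except IOError:
        return False


backup = BackupFile(block_count=1000, corrupt_blocks=frozenset({742}))
boot_blocks = range(0, 50)  # hypothetical: booting touches only these blocks

print("Instant Recovery boots:", instant_recovery(backup, boot_blocks))
print("Entire VM restore succeeds:", entire_vm_restore(backup))
```

In this model the boot succeeds because block 742 is never requested, while the full restore fails as soon as it reaches that block.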

Oct 05, 2021 11:55 pm

I would only add to an excellent explanation by Fabian that the only thing that really matters in the bigger picture is that your storage device cannot be trusted with storing data. The error about the checksum not matching the data means the device returns different content than what was written to it. We checksum every block (and verify these checksums during restores and health checks) specifically because we do not trust any storage by default, and want to be able to catch it red-handed when it starts to silently corrupt data.
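A minimal sketch of the checksum-on-write, verify-on-read idea follows. It uses CRC32 from Python's `zlib` purely for illustration; the checksum algorithm, block size, and on-disk layout Veeam actually uses are not stated in this thread, and the function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 4096  # arbitrary block size for the sketch


def write_blocks(data: bytes):
    """Split data into blocks and record a CRC32 for each at write time."""
    blocks = [bytearray(data[i:i + BLOCK_SIZE])
              for i in range(0, len(data), BLOCK_SIZE)]
    checksums = [zlib.crc32(bytes(block)) for block in blocks]
    return blocks, checksums


def health_check(blocks, checksums):
    """Re-read every block and return the indices that no longer match."""
    return [i for i, (block, expected) in enumerate(zip(blocks, checksums))
            if zlib.crc32(bytes(block)) != expected]


data = bytes(range(256)) * 64           # 16 KiB of sample data, i.e. 4 blocks
blocks, checksums = write_blocks(data)
clean_result = health_check(blocks, checksums)

blocks[2][100] ^= 0x01                  # simulate storage flipping a single bit
corrupt_result = health_check(blocks, checksums)

print("before corruption:", clean_result)
print("after corruption:", corrupt_result)
```

The application never needs to know why the block changed; a mismatch between the stored checksum and the data read back is enough to flag it.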

Personally, I would be shopping for new storage the moment I saw such an error for the first time, because using such storage is like using a bank that randomly loses money from your account. It's the classic "You Had One Job" situation, and eventually it will bite you at the worst possible moment.

Post by **cgsm** » Oct 06, 2021 1:22 pm

Gostev,
So by your reasoning, do you not trust any Synology device? Or any NAS? Even business oriented models? I ask because we do like to consolidate our hardware for space and cost efficiencies and we have other needs for a NAS besides backups.

Now, regarding new storage, I see recommendations that any general-purpose server will do. Is it acceptable to use something on the lower end of the server spectrum, like a Dell R340, or are higher-end servers a must?

Post by **Gostev** » Oct 06, 2021 1:43 pm

Yes, our support statistics made me stop trusting ANY non-enterprise class NAS long ago, regardless of the vendor. In other words, anything that is built for Consumer or SOHO markets (you call the latter "business oriented").

Higher-end servers are only required if you want a fat all-in-one Veeam appliance with all backup infrastructure components installed on the same server. This is an extremely popular Veeam deployment approach these days. You noted above that you like to consolidate hardware for space efficiency, so it would be an excellent option for you too. But for such an appliance you'd better ask your local Veeam specialist to recommend the correct specification based on your environment size, so that you don't under- or over-spec.

If you only want a backup repository, then these days it would be hard to find a server that does not meet our System Requirements. So primarily, you should choose a model based on the number of LFF drives it fits, RAID controller options, the vendor you trust, and the price.

HPE and Cisco servers are what most Veeam customers are using (since they are our resale partners), so they are "safe choices" which will give you the fewest surprises. But it does not mean other vendors should be avoided. Plenty of customers go with SuperMicro due to the price, for example. Your Dell R340 looks fine as well; the only question is whether it has an enterprise-grade RAID controller option, so you'd have to research this aspect yourself. But this is really one of the keys to reliability. We highly recommend RAID controllers with BBWC (Battery Backed Write Cache).

Post by **cgsm** » Oct 06, 2021 3:01 pm

Gostev,
Thanks for the reply. I will have to investigate new server/storage options further. I am still quite confused/disappointed that my "business oriented" Synology is seemingly the cause of the issues, although we cannot determine this until we have some other storage to compare against!

Post by **Gostev** » Oct 06, 2021 3:18 pm

I think your confusion comes from underestimating the importance of enterprise-grade hardware RAID controllers. These don't have features like patrol reads and BBWC randomly for no good reason... rather, they come from decades of experience in supporting Enterprise IT environments.

Many of our largest customers and service providers have been storing PBs of data on general-purpose servers like Cisco S3260 or HPE Apollo 4510 for years without experiencing a single silent data corruption that led to a data loss event, so we're very confident recommending this approach. In contrast, the stream of customers with corrupted/unrecoverable backups sitting on low-end NAS is never-ending in our Customer Support.

cgsm · Oct 07, 2021 4:59 pm

Gostev,
I have also noticed another detail: the corruption only occurs when using ReFS. I have a Backup Copy job copying my backups to an NTFS-formatted drive (just another iSCSI device on the same Synology box). Maybe this is more of a ReFS issue? Or is the corruption simply not being detected on the NTFS-formatted volume? Quite odd to me; hopefully you can provide some insight.

Post by **Gostev** » Oct 08, 2021 4:12 pm

Actually, most of those customers I referenced in the previous post run ReFS, as its benefits are just too huge to even consider anything else for a Windows-based repository.

NTFS is quite basic in terms of functionality, as you would expect from a file system that is almost 30 years old. For example, it does not have the metadata structures that enable block cloning functionality. In other words, there are fewer things in NTFS for misbehaving storage to damage. As such, you're less likely to experience the impact of such corruption.
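To make the point about extra metadata concrete, here is a toy model of reference-counted block sharing, the general idea behind block cloning. It is not the ReFS on-disk format, and the class, its methods, and the file names are hypothetical.

```python
class CloningVolume:
    """Toy volume in which files are lists of cluster ids that may be shared."""

    def __init__(self):
        self.clusters = {}   # cluster id -> data
        self.refcount = {}   # cluster id -> number of files referencing it
        self.files = {}      # file name -> ordered list of cluster ids
        self._next_id = 0

    def write_file(self, name, chunks):
        """Store each chunk in a freshly allocated cluster."""
        ids = []
        for chunk in chunks:
            cid = self._next_id
            self._next_id += 1
            self.clusters[cid] = chunk
            self.refcount[cid] = 1
            ids.append(cid)
        self.files[name] = ids

    def clone_into(self, target, sources):
        """Build target from (file, position) pairs without copying any data."""
        ids = [self.files[name][pos] for name, pos in sources]
        for cid in ids:
            self.refcount[cid] += 1
        self.files[target] = ids

    def read_file(self, name):
        return b"".join(self.clusters[cid] for cid in self.files[name])


vol = CloningVolume()
vol.write_file("full", [b"A0", b"B0", b"C0"])
vol.write_file("incremental", [b"B1"])          # only the second block changed
vol.clone_into("synthetic_full",
               [("full", 0), ("incremental", 0), ("full", 2)])

print(vol.read_file("synthetic_full"))
print("clusters allocated:", len(vol.clusters))
```

The cloned file consumes no new clusters, but its correctness now depends on the mapping and reference-count tables, which are additional structures a plain copy on a simpler file system would not need.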

fsuchta@sitel.sk · Aug 05, 2022 8:09 am

Hello cgsm. I would love to ask you: any progress on the Synology data corruption? Still no corruption with ReFS after disabling HDD write cache? Is NTFS still OK?

Post by **cgsm** » Aug 05, 2022 12:56 pm

fsuchta@sitel.sk wrote: "Hello cgsm. I would love to ask you: any progress on the Synology data corruption? Still no corruption with ReFS after disabling HDD write cache? Is NTFS still OK?"

Simply put, I am no longer using Synology to store any Veeam data.

However, I did do a lot of testing with Synology before I switched away from it. The two big things I noticed that improved the stability of the Synology hosted repositories were (1) use NTFS instead of ReFS, and (2) turn off HDD write cache. With these two settings I did not encounter any errors, if I recall. I did play around with some iSCSI and btrfs settings on the Synology to turn off a lot of "extra" stuff in attempts to get to the most "basic" or "raw" state for serving the volume, but I don't remember my exact settings and if any improved stability. I do remember NTFS and disabled write cache helped the most.

Note, I saw your private message as well.

fsuchta@sitel.sk · Aug 05, 2022 1:41 pm

thank you for your answer.
