Understanding health check corruption on backup copy job

lucius_the · Nov 23, 2021 11:19 am

I am having a strange issue with health checks on a backup copy job. Let me first briefly describe the setup.
- 1 larger local repo (Windows 10 Veeam Backup Proxy VM with iSCSI to a larger QNAP, 12 drives in RAID6), 10 Gbps link to proxies
- 1 offsite repo (Linux hardened + immutable repo, Debian 10 on a PC box with XFS+reflink, with iSCSI to a smaller QNAP, 4 drives in RAID10), 100 Mbps link to main site

(Anton made me well aware that QNAPs are not the best choice for a repo, but that's what they still use there)

Now the problem. As I now have a Linux box in the offsite location, I've set health checks to run every day, because they no longer consume network bandwidth. (Previously I had NFS mounted directly from the main site to the offsite location; health checks were slow and I only ran them on weekends.)

On my main backup I run active fulls on Saturdays, and on my offsite repo I also keep GFS weeklies on Saturdays.
The health check in the offsite repo runs every day, and everything has been green and OK. Last Saturday the health check returned this:

20.11.2021. 23:49:33 :: Failed to perform backup file verification Error: Backup file contains corrupted blocks, an active full is required.
Agent failed to process method {Signature.FullRecheckBackup}.

I immediately went to see which backups were marked as Corrupted. I found one VM marked as Corrupted, but interestingly, the whole backup chain is marked as corrupted, not just the last increment. The health check found the last full to be corrupted. This would imply that there was a bit flip on the storage. OK. Next, I used the Veeam Backup Validator to check precisely that increment of that VM on the offsite repo. It ran and reported that everything is fine!

Now the part that I don't get, and can't seem to find any indication of:
- did the health check run another session to replace the failed chunks? If so, how do I know for sure that this is what happened?
- also, if the above happened, why did Veeam tell me that an active full is required? That makes no sense. If the backup is OK now, why do I need another active full?

The thing that bothers me is that I cannot find enough data to answer this question:
- did the backup repair itself?
- or was there a transient, one-time read error while the health check was running, which didn't show up again when I ran a manual check from the command line (using Backup Validator)?

How can I know for sure what happened here ?

I do understand that QNAP NAS devices are where Veeam support sees most recovery problems; I know this already. I am also aware that using this storage is risky, and that it doesn't really matter whether this was a one-time read error, because that shouldn't be happening and my backup copy is in trouble either way. But I still want to understand exactly how the health check runs: did it repair the backup or not, and how do I know this? Currently there is not enough visibility into this. At least not through the UI...

Any insight ?

Btw, I had this same problem about a month ago (or maybe 2 months, not really sure). That time, too, the health check reported corruption, but afterwards the validator reported that everything was fine.
For every restore point I tried, the validator reported that everything was fine. At that time a much smaller VM's backup was marked as corrupted, so I could run the validator many times, and each time everything was OK. This time I ran it only once, as the corruption was detected on a bigger VM and the run takes 2 hours to complete.

Post by **veremin** » Nov 23, 2021 12:33 pm

Have you already checked the User Guide section that describes the transform process in detail?

I think the information you are looking for is described in "Retries" section:

If the health check detects corrupted data, Veeam Backup & Replication completes the backup job with the Error status and starts the health check retry process. The health check retry starts as a separate backup job session. During the health check retry, Veeam Backup & Replication attempts to transport data blocks for the corrupted restore point from the source datastore.

Basically you need to find the other session and check the repair status there.

Thanks!

Post by **lucius_the** » Nov 23, 2021 1:05 pm

Well, that's the thing. Where is that other "separate backup job session" and how do I find it? I went through all the job history and it is not mentioned anywhere. This is what I mean by "not enough visibility".

The manual says that a new session will be started, but I don't see it. I don't see any extra files on the backup storage either (looking at the files in the offsite repo in Linux), on the day the health check detected corruption or on any other day. So how do I know if a repair was done?

Nov 23, 2021 3:11 pm

Sorry, I missed the fact that you are talking about backup copy jobs (not backup jobs), so I described how the transform process works for a backup job.

For a backup copy job, the transform repair process looks a bit different:

- it is part of the health check activity (so there is no separate session)
- once corrupted blocks are found, it nullifies them (opens the backup file, opens the backup file metadata, and writes something like "the selected blocks are null" there)
- during the next session it copies valid blocks from the production backup; as a result, the backup chain gets “fixed” and you can restore data from the restore points again

The second operation is not possible in your situation because the files are located on immutable storage, so only a new active full cycle can help here.
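
For readers who find code easier to follow, the three steps above can be modelled as a toy Python sketch. Everything in it is an assumption made for illustration: the names `BackupFile`, `health_check`, `nullify` and `refill` are hypothetical, and per-block SHA-256 stands in for whatever checksums the product actually keeps. It is not Veeam's implementation, only a model of "detect, nullify, refill on the next session" and of why an immutable file blocks the nullify step.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


def digest(data: bytes) -> str:
    """Checksum of one block (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class BackupFile:
    """Hypothetical model of a backup file: blocks plus expected checksums."""
    blocks: List[Optional[bytes]]
    checksums: List[str]
    immutable: bool = False


def health_check(backup: BackupFile) -> List[int]:
    """Return indexes of blocks that are null or fail their checksum."""
    return [
        i
        for i, (block, expected) in enumerate(zip(backup.blocks, backup.checksums))
        if block is None or digest(block) != expected
    ]


def nullify(backup: BackupFile, bad: List[int]) -> None:
    """Mark bad blocks as null; this is a write, so immutability forbids it."""
    if backup.immutable:
        raise PermissionError("file is immutable; only a new active full helps")
    for i in bad:
        backup.blocks[i] = None


def refill(backup: BackupFile, source_blocks: List[bytes]) -> None:
    """Next session: copy valid blocks from the source backup into null slots."""
    for i, block in enumerate(backup.blocks):
        if block is None:
            backup.blocks[i] = source_blocks[i]


if __name__ == "__main__":
    source = [b"aaaa", b"bbbb", b"cccc"]
    copy = BackupFile(blocks=list(source), checksums=[digest(b) for b in source])
    copy.blocks[1] = b"bXbb"           # simulated corruption at rest
    bad = health_check(copy)           # the damaged block is reported
    nullify(copy, bad)
    refill(copy, source)
    print("bad blocks after repair:", health_check(copy))
```

In this model, setting `immutable=True` makes `nullify` raise, which corresponds to the "active full is required" outcome described above.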

The fact that health check failed and identified some changes to immutable restore points suggests that there was a certain hardware level problem, so might be worth reviewing it as well.

Thanks!

Post by **lucius_the** » Nov 23, 2021 5:00 pm

Thank you for this additional information.
But why does Veeam Backup Validator then report that everything is OK with the backup? How can it be OK if nothing was written or rewritten in the existing files and no new files were added?

If so, the only explanation I have is this: there was a silent read error when data was being read from storage (not reported as an error by the storage system; wrong data was simply returned), just this one time, when the health check reported corruption. This I find a bit hard to swallow :/ This is why I opened this topic. All previous health checks were OK, except on Friday when corruption was detected (in the whole chain back to the last full), but then all health checks after Friday report that everything is OK again. Also, Veeam Backup Validator reports that everything is OK.

How can it show that everything is OK now if there was corruption *and* no remediation was done to copy new blocks over?
Are these storage devices really that bad? They have WD Red Pros inside. QNAP, I think, uses Linux software RAID for this. But RAID10 is not really that complex, so how can this be happening? Can bit flips manifest themselves in such a way on a RAID10? One time you get wrong data, the next time you get good data back, just like that? My iSCSI setup uses checksums for both header and data (it also uses MPIO, but that shouldn't do anything bad, it's just for bandwidth).

This is more of an investigatory question. I just want to know more about what's going on. This looks too strange.

Could it be the Linux box itself? Is that likely? It's just a normal PC (a Lenovo SFF desktop box), so there's no ECC memory inside it, just 8 GB of plain DDR RAM.
I don't know if there's any way I can know for sure what the true reason was for this strange case of transient corruption. It has happened twice in this setup already since I built it and put it to use some 2 months ago. Both times it was the same: corruption was detected, but afterwards I could no longer detect it, and the data is on immutable storage.
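
One way to probe the "transient read error" hypothesis independently of Veeam is to hash the same backup file several times on the repository host and compare the digests. The sketch below is only that, a sketch: the function names are made up for illustration, SHA-256 is an arbitrary choice, and it says nothing about Veeam's own block checksums. Note also that repeated reads of a file that fits in RAM may be served from the Linux page cache rather than from the QNAP, so differing digests would be strong evidence of a problem, while identical digests prove little unless the cache is dropped or the file is much larger than memory.

```python
import hashlib
import sys
from typing import List


def file_digest(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Read the whole file in chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def repeated_digests(path: str, passes: int = 3) -> List[str]:
    """Hash the same file several times, one full read per pass."""
    return [file_digest(path) for _ in range(passes)]


def is_stable(digests: List[str]) -> bool:
    """True if every pass produced the same digest."""
    return len(set(digests)) == 1


if __name__ == "__main__" and len(sys.argv) > 1:
    results = repeated_digests(sys.argv[1])
    for n, d in enumerate(results, start=1):
        print(f"pass {n}: {d}")
    print("stable" if is_stable(results) else "DIGESTS DIFFER: reads are not repeatable")
```

Run it with the path of one backup file as the only argument; reading an immutable file is allowed, so the immutability flag does not get in the way.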

Post by **veremin** » Nov 23, 2021 6:01 pm

There can be various reasons why this situation happened, but to validate the assumptions we would need debug logs and potentially access to the infrastructure. This can be done within a support ticket, so kindly open one; a health check identifying storage corruption is a good reason to bring in additional assistance and expertise.

Also, Backup Validator is not the same operation as a health check.

Thanks!

Nov 23, 2021 6:05 pm

David, in case you can't "reproduce" the issue, then from your description I would also suspect an intermittent network equipment issue resulting in data loss during data transfer. Last time I checked our support big data, it showed this happening in 3-4 environments out of 1000 - so not that uncommon. This is one of many reasons we recommend against backing up to NAS: SMB/NFS protocols provide no guaranteed intact data delivery.

Post by **lucius_the** » Nov 23, 2021 6:24 pm

Hi Anton, we are using iSCSI here with both header and data digest (confirmed to be in use). Communication between the main and offsite repos goes through Veeam components. I'm not sure it could be a network-related issue, as everything that goes over the network here has additional checksum validation in the layers above.

I'll open a ticket tomorrow (we have a licence) if you are willing to take a look. Remote access to the system can also be organized if needed.

Nov 23, 2021 6:37 pm

Ah, I did not catch the iSCSI part. It is good that you use it and have digests enabled, as this should exclude the possibility of any transport-layer data losses.
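
As background on why digests rule out the transport layer: iSCSI header and data digests are CRC32C checksums computed over each PDU, so a bit flipped on the wire makes the digest mismatch at the receiver. The following is a minimal, unoptimised Python sketch of CRC32C (real initiators and targets use table-driven or hardware implementations), shown only to illustrate that a single flipped bit changes the checksum. A digest of this kind covers data in transit only; it cannot notice corruption that happens in RAM or on disk before the digest is computed or after it is verified.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial if the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    payload = bytearray(b"one block of backup data")
    sent_digest = crc32c(bytes(payload))

    payload[3] ^= 0x01                       # simulate a single bit flip in transit
    received_digest = crc32c(bytes(payload))

    print(f"sent     {sent_digest:08x}")
    print(f"received {received_digest:08x}")
    print("mismatch detected" if sent_digest != received_digest else "no mismatch")
```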

Dec 09, 2021 5:27 pm

Just curious - any update here? I too am curious as to what would cause this scenario (health check reports corruption, corruption is not fixed due to immutability, then backup validator still says all is A-OK).

Thanks!

Post by **lucius_the** » Dec 10, 2021 4:49 pm

No updates, since I haven't opened a support ticket yet :/ I will try next week (if the exported logs still contain everything they need to figure this out...)
