- 1 larger local repo (Windows 10 Veeam Backup Proxy VM with iSCSI to a larger QNAP, 12 drives in RAID6), 10 Gbps link to the proxies
- 1 offsite repo (hardened, immutable Linux repo: Debian 10 on a PC box with XFS+reflink, with iSCSI to a smaller QNAP, 4 drives in RAID10), 100 Mbps link to the main site
(Anton made me well aware that QNAPs are not the best choice for a repo, but that's what they still use there)
Now the problem. Since I now have a Linux box at the offsite location, I've set health checks to run every day. They no longer consume network bandwidth (previously I had an NFS share mounted directly from the main site to the offsite location, so health checks were slow and I only ran them on weekends; now I can run them daily).
On my main backup job I run active fulls on Saturdays, and on the offsite copy I keep weekly GFS points, also created on Saturdays.
The health check on the offsite repo runs every day and everything has been green and OK. Last Saturday, though, the health check returned this:
Code:
20.11.2021. 23:49:33 :: Failed to perform backup file verification Error: Backup file contains corrupted blocks, an active full is required.
Agent failed to process method {Signature.FullRecheckBackup}.
Now here is the part that I don't get, and can't seem to find any indication of:
- Did the health check run another session to replace the failed blocks? If so, how do I know for sure that this is what happened?
- Also, if the above happened, why did Veeam tell me that an active full is required? That makes no sense: if the backup is OK now, why do I need another active full?
What bothers me is that I cannot find enough data to answer these questions:
- Did the backup repair itself?
- Or was there a transient, one-time read error while the health check ran (see the log check just below), one that didn't show up again the next time I ran a manual health check from the command line (using the backup validator)?
How can I know for sure what happened here?
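For the transient read error theory, the best I can think of is to grep the kernel/iSCSI/XFS messages on the Debian box around the time of the failed health check, something like this (time window taken from the error line above; adjust to your setup):

Code:
# look for I/O, iSCSI or XFS errors around the health check window
sudo journalctl --since "2021-11-20 23:00" --until "2021-11-21 01:00" | grep -iE "i/o error|iscsi|xfs"
# or, if the journal isn't persistent on this box, the same grep over syslog
sudo grep -iE "i/o error|iscsi|xfs" /var/log/syslog*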
I do understand that QNAP NASes are where Veeam support sees the most recovery problems; I know this already. I am also aware that using this storage is risky, and that it doesn't really matter whether it was a one-time read error, because that shouldn't be happening at all and my backup copy is in trouble. But I still want to understand exactly how the health check runs: did it repair the backup or not, and how do I know? Currently there is not enough visibility into this, at least not through the UI...
Any insight?
Btw, I had this same problem about a month ago (or maybe two months, I'm not really sure). That time as well, the health check reported corruption, but afterwards the validator reported that everything was fine.
For every restore point I've tried, the validator has returned that everything is fine. Back then it was a much smaller VM's backup that was marked as corrupted, so I could run the validator many times, and each time everything was OK. This time I ran it only once, as the corruption was detected on a bigger VM and a validator run takes 2 hours to complete.
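For reference, this is roughly how I run the validator on the backup server; the job name and report path below are placeholders for my real ones, and the install path is the default one (yours may differ):

Code:
cd "C:\Program Files\Veeam\Backup and Replication\Backup"
Veeam.Backup.Validator.exe /backup:"Offsite Backup Copy" /report:C:\Temp\validator-report.html /format:html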