A short summary of the case I'm talking about: VM backups produced by Veeam (both Active Full and Incremental) are corrupt, probably due to a VMware CBT-related issue. A discussion of this issue can be found in the following forum topic: https://forums.veeam.com/post277653.html.
What I propose is a way for Veeam to provide an eventually bulletproof solution to such issues, whether they are caused by VMware CBT, Microsoft Hyper-V RCT or any other change tracking technology - for simplicity, I'll just use the term "CBT" for all of them from now on. In fact, what I'm proposing is a way to (eventually) ensure the backup is 100% identical to the source.
First, let's examine the steps Veeam currently takes to perform a VM backup:
- Veeam instructs the underlying Virtualization infrastructure to perform a VM snapshot
- For each Virtual Disk, Veeam queries CBT for the list of blocks that were either ever written to (for the first or Active Full backups) or changed since the previous backup (for Incremental backups)
- Veeam reads only the blocks reported by the CBT query and writes them to the Backup Repository
- Veeam instructs the underlying Virtualization infrastructure to remove the VM snapshot it used for backup
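The four steps above can be sketched in a few lines of code. This is a minimal, self-contained simulation, not Veeam's actual implementation: a disk is modeled as a dict of `{block_id: bytes}`, the CBT query as a plain set of block ids, and all function names are illustrative placeholders.

```python
# Minimal simulation of the current CBT-based backup flow.
# A disk is {block_id: bytes}; the repository is the same shape.
# All names here are illustrative, not real Veeam/VMware APIs.

def query_cbt(changed_blocks, full, all_blocks):
    # Full backups read every block ever written; incrementals read
    # only what CBT reports as changed since the previous run.
    return set(all_blocks) if full else set(changed_blocks)

def backup_vm(snapshot_disk, cbt_changed, repository, full=False):
    # Read ONLY the blocks reported by the CBT query and write them
    # to the repository - Veeam trusts the list blindly at this point.
    to_read = query_cbt(cbt_changed, full, snapshot_disk.keys())
    for block_id in to_read:
        repository[block_id] = snapshot_disk[block_id]
    return repository
```

After a full run followed by an incremental run with a correct CBT list, the repository matches the source disk - the corruption scenario discussed below is exactly the case where `cbt_changed` is incomplete.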
But what happens if CBT returns an incomplete list of blocks? As Veeam currently relies blindly on the CBT technology, it would produce a corrupt backup, which is exactly what our case is all about.
Can anything be done about it? I strongly believe the answer is a clear YES! And the nice thing is that it shouldn't even be too hard to implement.
What are all the possible outcomes of running the CBT query?
- The query runs correctly, returning the list of all (and only) the blocks Veeam needs to read in order to produce a consistent backup
- The query returns a list of all the blocks Veeam really needs to read and some extra blocks - in this case the backup will still be consistent, but extra (and unnecessary) IO operations will be performed
- The query returns an incomplete list of blocks - relying on the incomplete list, Veeam would silently produce a corrupt backup
- A mix of #2 and #3
#3 (and, by extension, #4) are the most critical ones: for the customer, silent backup corruptions are extremely hard to catch in some cases. I've explained why here: https://forums.veeam.com/vmware-vsphere-f24/relevant-system-characteristics-affected-by-vsphere-cbt-bug-t49960.html#p277267. Fortunately, this is not so for the backup software vendors: for them, it seems like the easiest one to catch - provided they implement the following proposal, that is.
So back to the steps Veeam takes in order to perform a VM backup: at "Step 3" it reads the blocks according to the list provided by the CBT query, and at "Step 4" it removes the VM snapshot. My proposal is not to remove the VM snapshot at this point, but to use it for backup verification. The reason behind this is that at this stage, the backup and the snapshot MUST be identical (apart from the blocks skipped by BitLooker, of course). Veeam can read some blocks - those NOT included in the list returned by the CBT query - and compare them to those it has in its Backup Repository. If they are not identical (and not skipped by BitLooker), then the backup (or the entire backup chain) can immediately be considered corrupt (or tainted at least).

Of course, it's impractical to read the entire content of the disk this way - otherwise it would be easier to just disable CBT completely. To make this approach practical, the customer must be able to control the amount of blocks Veeam reads for verification purposes - it should be possible to limit it by bytes, percentage of disk size, IO operations, time consumed, desired time window, etc. For non-critical VMs, Veeam could be configured to read just a few percent of each virtual disk on each job run, while for the critical ones a higher percentage could make sense. Say for your critical VMs you set this to 10%. Considering the daily change rate, chances are Veeam would read all the blocks within a week (a daily change rate of a few percent + 10% of blocks read for verification purposes). A more sophisticated block selection algorithm could even prefer the blocks not read for the longest time, making the corruption detection even faster.
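The proposed verification step could look like the following sketch, run while the snapshot still exists. All names are illustrative, and the "least recently verified first" ordering is one possible implementation of the more sophisticated block selection mentioned above; a real implementation would also exclude BitLooker-skipped blocks from the comparison.

```python
# Sketch of the proposed verification pass. Disks and the repository
# are modeled as {block_id: bytes}; cbt_reported is the set of blocks
# CBT claimed were changed (and which were therefore already copied).
# All names are placeholders, not real Veeam APIs.

def verify_backup(snapshot_disk, cbt_reported, repository,
                  last_verified, run_counter, sample_pct=10):
    # Candidates: blocks CBT did NOT report, i.e. blocks Veeam trusted
    # to be unchanged. Prefer blocks not verified for the longest time,
    # so repeated runs eventually cover the whole disk.
    candidates = sorted(
        (b for b in snapshot_disk if b not in cbt_reported),
        key=lambda b: last_verified.get(b, -1))
    # Customer-controlled budget; here expressed as a percentage of
    # the disk's block count (could equally be bytes, IOPS or time).
    budget = max(1, len(snapshot_disk) * sample_pct // 100)
    corrupt = []
    for block_id in candidates[:budget]:
        if snapshot_disk[block_id] != repository.get(block_id):
            corrupt.append(block_id)   # backup differs from the source
        last_verified[block_id] = run_counter
    return corrupt
```

If `verify_backup` returns a non-empty list, the snapshot and the backup disagree on a block CBT never reported - the signature of outcome #3.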
After a successful verification, the VM snapshot could be removed, but if a backup corruption was found, the VM snapshot could be left for further investigation.
My proposal would produce additional IO (to an extent controllable by the customer), but it is guaranteed to eventually catch any difference between the backup and the source. And the beauty of it is that it's completely data agnostic - whatever the guest OS, whatever filesystem, whatever files, DBs or applications reside on the VM - since the verification happens at the block level, ANY difference (apart from those caused by BitLooker) must be considered a corruption. At this stage, the VM snapshot must be identical to the backup - if it's not, there is a corruption.
So at the expense of some additional IO, Veeam could provide eventually-bulletproof backup verification. "Eventually" because corruptions may not be detected right away, but they are guaranteed to be detected within a timeframe controlled by the customer, who balances the amount of additional IO (and a longer backup window) against the data safety requirements.
Going even further, Veeam could initiate a CBT reset for VMs it found backup corruption for and force an Active Full for the next backup run (or even initiate an immediate Active Full run), starting a new clean backup chain.
Re possible outcome #2: although it is far less critical than #3 (at least it causes no backup corruption, only unnecessary IO), it's possible to count the blocks reported by CBT as changed that are in fact identical to those currently in the backup repository. This approach is not as bulletproof as the one I've proposed for dealing with outcome #3, but such statistics could help identify cases when CBT returns excessive block ranges in its list.
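Counting the over-reported blocks is straightforward: before overwriting a CBT-reported block in the repository, compare it to the copy already there. The sketch below uses the same illustrative data model as above; consistently high counts over many runs would point at CBT over-reporting.

```python
# Illustrative counter for outcome #2: blocks CBT reported as changed
# that are byte-identical to what the repository already holds.
# Must be computed BEFORE the reported blocks are (re)written.

def count_excess_blocks(snapshot_disk, cbt_reported, repository):
    excess = 0
    for block_id in cbt_reported:
        # A reported block whose content already matches the repository
        # copy caused an unnecessary read and write.
        if repository.get(block_id) == snapshot_disk[block_id]:
            excess += 1
    return excess
```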
My proposal does not completely replace backup testing, but it (eventually) ensures the backup is 100% identical to the source, which is a lot considering it's almost free from the customer's perspective - the customer only pays with additional IO, in an amount acceptable on a per-VM (or per-Backup-Job) basis.
As a byproduct, a much higher detection rate for cases when the produced backup is not identical to the source will help the vendors (both Veeam and others like VMware, Microsoft, etc.) identify (and thus fix) the related bugs in their products.