A short summary of the case I'm talking about: VM backups produced by Veeam (both Active Full and Incremental) are corrupt, probably due to a VMware CBT-related issue. A discussion of this issue can be found in the following forum topic: https://forums.veeam.com/post277653.html.
What I propose is a way for Veeam to provide an eventually bulletproof solution to such issues, whether they are caused by VMware CBT, Microsoft Hyper-V RCT or any other change tracking technology - for simplicity, I'll just use the term "CBT" for all of them from now on. In fact, what I'm proposing is a way to (eventually) ensure the backup is 100% identical to the source.
First, let's examine the steps Veeam currently takes to perform a VM backup:
- Veeam instructs the underlying Virtualization infrastructure to perform a VM snapshot
- For each Virtual Disk, Veeam queries CBT for the list of blocks that were either ever written to (for the first or Active Full backups) or changed since the previous backup (for Incremental backups)
- Veeam reads only the blocks reported by the CBT query and writes them to the Backup Repository
- Veeam instructs the underlying Virtualization infrastructure to remove the VM snapshot it used for backup
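The four steps above can be sketched in a few lines of code. This is a minimal, self-contained simulation, not Veeam's actual implementation: a disk is modeled as a dict of `{block_id: bytes}`, the CBT query as a plain set of block ids, and all function names are illustrative placeholders.

```python
# Minimal simulation of the current CBT-based backup flow.
# A disk is {block_id: bytes}; the repository is the same shape.
# All names here are illustrative, not real Veeam/VMware APIs.

def query_cbt(changed_blocks, full, all_blocks):
    # Full backups read every block ever written; incrementals read
    # only what CBT reports as changed since the previous run.
    return set(all_blocks) if full else set(changed_blocks)

def backup_vm(snapshot_disk, cbt_changed, repository, full=False):
    # Read ONLY the blocks reported by the CBT query and write them
    # to the repository - Veeam trusts the list blindly at this point.
    to_read = query_cbt(cbt_changed, full, snapshot_disk.keys())
    for block_id in to_read:
        repository[block_id] = snapshot_disk[block_id]
    return repository
```

After a full run followed by an incremental run with a correct CBT list, the repository matches the source disk - the corruption scenario discussed below is exactly the case where `cbt_changed` is incomplete.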
But what happens if CBT returns an incomplete list of blocks? As Veeam currently relies blindly on the CBT technology, it would produce a corrupt backup, which is exactly what our case is all about.
Can anything be done about it? I strongly believe the answer is a clear YES! And the nice thing is that it shouldn't even be too hard to implement.
What are all the possible outcomes of running the CBT query?
- The query runs correctly, returning the list of all (and only) the blocks Veeam needs to read in order to produce a consistent backup
- The query returns a list of all the blocks Veeam really needs to read and some extra blocks - in this case the backup will still be consistent, but extra (and unnecessary) IO operations will be performed
- The query returns an incomplete list of blocks - relying on the incomplete list, Veeam would silently produce a corrupt backup
- A mix of #2 and #3
#3 (and, by extension, #4) are the most critical ones: for the customer, silent backup corruptions are extremely hard to catch in some cases. I've explained why here: https://forums.veeam.com/vmware-vsphere-f24/relevant-system-characteristics-affected-by-vsphere-cbt-bug-t49960.html#p277267. Fortunately, this is not so for the backup software vendors: for them, it seems like the easiest one to catch - provided they implement the following proposal, that is.
So back to the steps Veeam takes in order to perform a VM backup: at "Step 3" it reads the blocks according to the list provided by the CBT query, and at "Step 4" it removes the VM snapshot. My proposal is not to remove the VM snapshot at this point, but to use it for backup verification. The reason behind this is that at this stage, the backup and the snapshot MUST be identical (apart from the blocks skipped by BitLooker, of course). Veeam can read some blocks - those NOT included in the list returned by the CBT query - and compare them to those it has in its Backup Repository. If they are not identical (and not skipped by BitLooker), then the backup (or the entire backup chain) can immediately be considered corrupt (or tainted at least).

Of course, it's impractical to read the entire content of the disk this way - otherwise it would be easier to just disable CBT completely. To make this approach practical, the customer must be able to control the amount of blocks Veeam reads for verification purposes - it should be possible to limit it by bytes, percentage of disk size, IO operations, time consumed, desired time window, etc. For non-critical VMs, Veeam could be configured to read just a few percent of each virtual disk on each job run, while for the critical ones a higher percentage could make sense. Say for your critical VMs you set this to 10%. Considering the daily change rate, chances are Veeam would read all the blocks within a week (a daily change rate of a few percent + 10% of blocks read for verification purposes). A more sophisticated block selection algorithm could even prefer the blocks not read for the longest time, making the corruption detection even faster.
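The proposed verification step could look like the following sketch, run while the snapshot still exists. All names are illustrative, and the "least recently verified first" ordering is one possible implementation of the more sophisticated block selection mentioned above; a real implementation would also exclude BitLooker-skipped blocks from the comparison.

```python
# Sketch of the proposed verification pass. Disks and the repository
# are modeled as {block_id: bytes}; cbt_reported is the set of blocks
# CBT claimed were changed (and which were therefore already copied).
# All names are placeholders, not real Veeam APIs.

def verify_backup(snapshot_disk, cbt_reported, repository,
                  last_verified, run_counter, sample_pct=10):
    # Candidates: blocks CBT did NOT report, i.e. blocks Veeam trusted
    # to be unchanged. Prefer blocks not verified for the longest time,
    # so repeated runs eventually cover the whole disk.
    candidates = sorted(
        (b for b in snapshot_disk if b not in cbt_reported),
        key=lambda b: last_verified.get(b, -1))
    # Customer-controlled budget; here expressed as a percentage of
    # the disk's block count (could equally be bytes, IOPS or time).
    budget = max(1, len(snapshot_disk) * sample_pct // 100)
    corrupt = []
    for block_id in candidates[:budget]:
        if snapshot_disk[block_id] != repository.get(block_id):
            corrupt.append(block_id)   # backup differs from the source
        last_verified[block_id] = run_counter
    return corrupt
```

If `verify_backup` returns a non-empty list, the snapshot and the backup disagree on a block CBT never reported - the signature of outcome #3.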
After a successful verification, the VM snapshot could be removed, but if a backup corruption was found, the VM snapshot could be left for further investigation.
My proposal would produce additional IO (to an extent controllable by the customer), but it is guaranteed to eventually catch any difference between the backup and the source. And the beauty of it is that it's completely data agnostic - whatever the guest OS, whatever filesystem, whatever files, DBs or applications reside on the VM - since the verification happens at the block level, ANY difference (apart from those caused by BitLooker) must be considered a corruption. At this stage, the VM snapshot must be identical to the backup - if it's not, there is a corruption.
So at the expense of some additional IO, Veeam could provide eventually-bulletproof backup verification. "Eventually" because corruptions may not be detected right away, but they are guaranteed to be detected within a timeframe controlled by the customer, who balances the amount of additional IO (and a longer backup window) against the data safety requirements.
Going even further, Veeam could initiate a CBT reset for VMs it found backup corruption for and force an Active Full for the next backup run (or even initiate an immediate Active Full run), starting a new clean backup chain.
Re possible outcome #2: although it is far less critical than #3 (at least it causes no backup corruption, only unnecessary IO), it's possible to count the blocks reported by CBT as changed that are in fact identical to those currently in the backup repository. This approach is not as bulletproof as the one I've proposed for dealing with outcome #3, but such statistics could help identify cases when CBT returns excessive block ranges in its list.
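Counting the over-reported blocks is straightforward: before overwriting a CBT-reported block in the repository, compare it to the copy already there. The sketch below uses the same illustrative data model as above; consistently high counts over many runs would point at CBT over-reporting.

```python
# Illustrative counter for outcome #2: blocks CBT reported as changed
# that are byte-identical to what the repository already holds.
# Must be computed BEFORE the reported blocks are (re)written.

def count_excess_blocks(snapshot_disk, cbt_reported, repository):
    excess = 0
    for block_id in cbt_reported:
        # A reported block whose content already matches the repository
        # copy caused an unnecessary read and write.
        if repository.get(block_id) == snapshot_disk[block_id]:
            excess += 1
    return excess
```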
My proposal does not completely replace backup testing, but it (eventually) ensures the backup is 100% identical to the source, which is a lot considering it's almost free from the customer's perspective - the customer only pays with additional IO, in an amount acceptable on a per-VM (or per-Backup-Job) basis.
As a byproduct, a much higher detection rate for cases when the produced backup is not identical to the source will help the vendors (both Veeam and others like VMware, Microsoft, etc.) identify (and thus fix) the related bugs in their products.