VMWare issue with CBT ?

Post by **MB-NS** » May 11, 2010 7:16 am

Hello,

I was just reading the following which got me quite worried about our backups.
http://planetvm.net/blog/?p=1520&utm_so ... lanetVM%29
For the moment it seems unconfirmed, so does Veeam know about this, and have you already started investigating it?

Post by **Gostev** » May 11, 2010 10:24 am

Thanks for the link, I will ask the devs about it. However, at first sight I must say the whole scenario seems far-fetched to me, as it requires:

1. Running VMs off a manually created snapshot for a long time, across backup windows. I know most people avoid manual snapshots altogether, because they hurt VM performance while the snapshot is open and especially on commit, and they fill up the datastore, which stops the VM and can corrupt it once free space runs out.

2. But more importantly, this scenario requires reverting (not committing) this old and large snapshot. This effectively means massive data loss on a production server. Any data you had accumulated in the snapshot is simply gone! I cannot imagine anyone doing this kind of thing to a production server.

And the fact that we have not seen anyone running into this issue for 8 months now (since v4 release) only proves that the whole scenario is really made up just to make some noise.

Anyhow, if you are concerned about CBT reliability, you can always disable the use of changed block tracking completely in the job settings; the job will then use our own proprietary changed block tracking mechanism.

Meanwhile, we will do the research and I will update on our findings later.

Post by **tsightler** » May 11, 2010 12:55 pm

Gostev wrote: Thanks for the link, I will ask the devs about it. However, at first sight I must say the whole scenario seems far-fetched to me, as it requires:

1. Running VMs off a manually created snapshot for a long time, across backup windows. I know most people avoid manual snapshots altogether, because they hurt VM performance while the snapshot is open and especially on commit, and they fill up the datastore, which stops the VM and can corrupt it once free space runs out.

2. But more importantly, this scenario requires reverting (not committing) this old and large snapshot. This effectively means massive data loss on a production server. Any data you had accumulated in the snapshot is simply gone! I cannot imagine anyone doing this kind of thing to a production server.

Anton, I really don't agree with your analysis here at all. First, while we have both production and non-production systems, we expect the backups of both to be 100% reliable. In many cases the only difference between "production" and "non-production" systems is their RPO and RTO targets, but both are expected to be 100% available. It's quite common for us to create snapshots of development systems that cross backup windows, and even more common for systems to have snapshots that cross replication windows. We had exactly this case this week as we tested the application of a large patchset to one of our development environments. It took several days to work through the application of the patchset, and then we reverted the environment and tried again. If this causes all future backups of our development environment to be invalid, that would be a HUGE flaw; I would consider it catastrophic.

Gostev wrote: And the fact that we have not seen anyone running into this issue for 8 months now (since v4 release) only proves that the whole scenario is really made up just to make some noise.
...
Meanwhile, we will do the research and I will update on our findings later.

I will anxiously await your findings. I'm hoping that Veeam developers were perhaps aware of this issue and already handle it in a different way, or that the method the Veeam engine uses simply isn't affected for some reason. Or perhaps it's "made up" like you suggested, but I'm not really sure what's to gain from making up an issue like that. It seems more likely that it's a misunderstanding of how they expected CBT to work in that scenario. I personally would have expected a snapshot reversion to revert the UUID, and thus the backup software would have to know how to deal with that.

Post by **Gostev** » May 11, 2010 4:46 pm

Not sure why you don't agree at all with my analysis, when I was talking specifically about production systems. The points you are making are valid for test/development systems, and I will not argue that on such systems snapshot reversal is a common operation. In fact, I do this myself all the time.

I don't do daily backups of my test systems though (since I just revert the state), just a periodic VM Copy of some states. But I realize some other people may have a requirement to do this, so I am with you here.

tsightler wrote: We had exactly this case this week as we tested the application of a large patchset to one of our development environments. It took several days to work through the application of the patchset, and then we reverted the environment and tried again. If this causes all future backups of our development environment to be invalid, that would be a HUGE flaw; I would consider it catastrophic.

I happened to have such a test system in my lab as well (a few days of running off a snapshot, with backups). I reverted this snapshot and did another backup. The logs showed that Veeam Backup handled this scenario correctly: it realized the provided changeID was invalid/unexpected, and automatically triggered a "native" (v3 style) full image scan to determine the incremental changes since the last pass - instead of relying on CBT data.

Now we are trying to replicate the specific sequence that gets a matching changeID returned, as described in the article, to see what happens in that case.

Post by **MB-NS** » May 11, 2010 5:09 pm

Gostev wrote: I happened to have such a test system in my lab as well (a few days of running off a snapshot, with backups). I reverted this snapshot and did another backup. The logs showed that Veeam Backup handled this scenario correctly: it realized the provided changeID was invalid/unexpected, and automatically triggered a "native" (v3 style) full image scan to determine the incremental changes since the last pass - instead of relying on CBT data.

Now we are trying to replicate the specific sequence that gets a matching changeID returned, as described in the article, to see what happens in that case.

Hello,
First, thanks for these quick answers. I'm looking forward to the results of your investigation.

NB: I pretty much agree with you that you *should not* run VMs with snapshots on them.
But I also agree with tsightler, because that doesn't mean it won't happen. I have customers who do this despite the best practices.
And as for the data loss incurred by reverting to a snapshot, it is very limited if the server in question is stateless, for instance a front-end for an app whose database runs on another instance.

Regards

Post by **tsightler** » May 11, 2010 10:19 pm

Gostev wrote: Not sure why you don't agree at all with my analysis, when I was talking specifically about production systems. The points you are making are valid for test/development systems, and I will not argue that on such systems snapshot reversal is a common operation. In fact, I do this myself all the time. I don't do daily backups of my test systems though (since I just revert the state), just a periodic VM Copy of some states. But I realize some other people may have a requirement to do this, so I am with you here.

OK, I'll try to be more clear regarding exactly what I disagreed with. Basically, the following is my summary of what you said in your previous post:

This seems far fetched since it's best practice not to run snapshots for a long time and uncommon to revert long snapshots on production system

So the part I disagree with is "this seems far fetched", since I have many systems within my environment that are not production systems, where running with snapshots and reverting snapshots are commonplace, yet it's very important that I be able to restore those systems reliably. Take our development ERP environment. We might take a snapshot prior to applying a large patch set and run with it for several days, then revert the snapshot, after which developers go back to work for weeks. If all the backups after we reverted the snapshot were invalid, and then we had a disaster, we could lose weeks of developer work.

Basically I thought the attitude was a little flippant, and I'm not a big fan of my backup vendor being flippant with regard to the safety/integrity of my data. Perhaps you didn't mean it that way, but that's what saying it was "far fetched" implied to me: not that Veeam handled the issue, but that it didn't seem like a "big deal" because it was an uncommon scenario. Uncommon scenarios are what cause most disasters, so that's not an acceptable argument for why something isn't a big deal.

Post by **Gostev** » May 11, 2010 10:51 pm

OK, I definitely did not mean to say that a potential data integrity issue is not a big deal for us. That's an issue of not being a native speaker; sometimes certain phrases have a stronger hidden meaning for native speakers than my brain's dictionary thinks.

I can assure you that I got the best dev resources researching this immediately after we first found out about the blog post. And even if there is only a minor chance of data corruption, creating a hotfix will be the highest priority for us, no matter how low that chance is.

I will update once I have more information, so far my own basic test did not show issues with reverting snapshots. But I tested just one scenario (similar to what you have described), and until we test all possible scenarios, we cannot be sure if the issue is there or not.

Post by **tsightler** » May 12, 2010 1:36 pm

Not a problem, perhaps I read too much into your previous comment. I actually suspect that Veeam handles this issue correctly already, but it will be good to have confirmation.

Post by **Gostev** » May 13, 2010 7:12 pm

We have completed the testing of various scenarios involving snapshot reversals around this CBT issue (or rather bug, now that VMware has confirmed it indeed exists).

It turned out that our existing ChangeID validation algorithm detects an invalid ChangeID supplied by CBT and correctly fails over to our proprietary change determination mechanism in all scenarios except one specific sequence of actions, described below.

This scenario requires the following actions to be performed in exactly this order:

B1 > S > B2 > R > B3, where:

B is backup
S is manual snapshot
R is snapshot reversal

These actions have to be performed strictly one after another. With this specific order of actions, during the B3 incremental pass our ChangeID validation engine does not recognize that the returned ChangeID is invalid, which results in an invalid B3 backup.
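To make the B1 > S > B2 > R > B3 sequence more concrete, here is a toy Python model of how stale change-tracking data can leave an incremental backup out of sync with the disk. It is only a sketch: `ToyDisk`, its write log and the `incremental` helper are hypothetical, and the model simply assumes that a revert rolls blocks back without the tracker noticing. It does not reproduce VMware's CBT internals or Veeam's ChangeID validation.

```python
class ToyDisk:
    """Hypothetical disk with naive change tracking (illustration only)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.write_log = []  # block indexes, in the order they were written

    def write(self, index, value):
        self.blocks[index] = value
        self.write_log.append(index)

    def change_id(self):
        # Toy change ID: position in the write log at the time of the call.
        return len(self.write_log)

    def changed_since(self, change_id):
        return set(self.write_log[change_id:])

    def snapshot(self):
        return list(self.blocks)

    def revert(self, snapshot):
        # Assumption: the rollback bypasses the tracker, so reverted
        # blocks are never reported as changed.
        self.blocks = list(snapshot)


def incremental(disk, backup, last_change_id):
    """Copy only the blocks the tracker reports as changed."""
    for index in disk.changed_since(last_change_id):
        backup[index] = disk.blocks[index]
    return disk.change_id()


disk = ToyDisk(["a0", "b0", "c0"])
backup = list(disk.blocks)            # B1: full backup
cid = disk.change_id()
snap = disk.snapshot()                # S: manual snapshot
disk.write(1, "b1")                   # VM keeps running on the snapshot
cid = incremental(disk, backup, cid)  # B2: picks up block 1
disk.revert(snap)                     # R: block 1 is "b0" again
cid = incremental(disk, backup, cid)  # B3: tracker reports no changes
print(backup == disk.blocks)          # False: backup still holds "b1"
```

In this toy model the backup ends up holding data that no longer exists on the disk, which is the kind of mismatch a change ID validation step is meant to catch before trusting the reported changes.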

Modifying the above sequence in any way, by performing an additional backup, manual snapshot, VMotion or VM power-off anywhere inside this chain, eliminates the issue (the returned ChangeID still remains invalid, but this is properly detected). One example is running more than one backup cycle while the snapshot is open (as in my test mentioned above).

The fix for this was pretty simple in our case (not the same as the original article suggests, but "smarter" - without completely disabling CBT if VM has snapshot). The fix is now in testing for validation. I will update this thread once it becomes available through our support.

Until the hotfix is made available, I recommend disabling the use of changed block tracking in the Advanced job settings for all jobs which process VMs where a manual snapshot reversal may happen, and triggering a Full Backup on these jobs to heal the backup file (in case you believe this scenario may already have happened for some VMs).

Post by **MB-NS** » May 14, 2010 1:23 pm

Hello, thanks for the feedback.

I will keep waiting for the hotfix then.

I assume that when you speak of "proprietary change determination mechanism", it is the previous pre-CBT system (old behavior) which is not as quick as CBT-enabled backups ?

I have however some additional questions :
- when correctly handled, does it revert to old behavior just for that VM or for all the VMs in the same job ?
- when not correctly handled, is it only this VM backup which is corrupted, or all VMs in the same job ?
- does it revert to old behavior forever or subsequent backups go back to normal CBT behavior ?

Post by **Gostev** » May 14, 2010 1:36 pm

MB-NS wrote:I assume that when you speak of "proprietary change determination mechanism", it is the previous pre-CBT system (old behavior) which is not as quick as CBT-enabled backups ?

That is correct. This involves a full source image scan: calculating a hash for each block and comparing it with the hashes of the blocks stored in the backup file to determine which blocks have changed.
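For readers who want to picture the full-scan approach, here is a minimal Python sketch of hash-based change detection. The block size, the choice of SHA-256 and the function names are assumptions made for the example; the actual block size and hash used by Veeam Backup are not stated in this thread.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def block_hashes(image: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash every fixed-size block of a disk image."""
    return [
        hashlib.sha256(image[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(image), block_size)
    ]


def changed_blocks(image: bytes, stored_hashes: list[str],
                   block_size: int = BLOCK_SIZE) -> list[int]:
    """Return indexes of blocks whose hash differs from the stored one."""
    current = block_hashes(image, block_size)
    return [
        index
        for index, digest in enumerate(current)
        if index >= len(stored_hashes) or digest != stored_hashes[index]
    ]


previous_image = b"AAAABBBBCCCC"
stored = block_hashes(previous_image)   # hashes kept with the backup file
current_image = b"AAAAXXXXCCCCDDDD"     # block 1 rewritten, block 3 is new
print(changed_blocks(current_image, stored))  # [1, 3]
```

The whole source image has to be read and hashed on every pass, which is why this mode is slower than asking CBT for the changed areas, but it does not depend on the change data reported by the hypervisor.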

MB-NS wrote:when correctly handled, does it revert to old behavior just for that VM or for all the VMs in the same job ?

Just for that VM only.

MB-NS wrote:when not correctly handled, is it only this VM backup which is corrupted, or all VMs in the same job ?

Just that VM only.

MB-NS wrote:does it revert to old behavior forever or subsequent backups go back to normal CBT behavior ?

Subsequent backup will be CBT-based (as long as valid ChangeID is returned by VMware during the next run, of course).

Post by **MB-NS** » May 14, 2010 1:50 pm

Hello again,

Your responsiveness is most appreciated, thank you. Just one last confirmation, since one can't be too sure.

Gostev wrote: Subsequent backup will be CBT-based (as long as valid ChangeID is returned by VMware during the next run, of course).

Which will be the case without touching anything as long as the bad scenario you described isn't reproduced, right ?
No need to make a Full or something ?
(by the way, sorry for the shameless plug, will there be a way in v5 to launch a full backup on just one VM, not all VMs in the job ?)

Post by **Gostev** » May 14, 2010 1:57 pm

MB-NS wrote: Which will be the case without touching anything as long as the bad scenario you described isn't reproduced, right ? No need to make a Full or something ?

Correct, no need for fulls or anything. Should the ChangeID validation that is in place today detect an invalid ChangeID during any incremental run, it fails over the processing of that specific VM to "full-scan" mode. This does not affect subsequent passes for this VM in any way.
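As a rough illustration of that behavior, the sketch below picks the processing mode per VM and per run, with nothing carried over between runs. The function name, the VM names and the validation results are all hypothetical; this is not Veeam's code, just the decision rule as described in this thread.

```python
def pick_mode(change_id_is_valid: bool) -> str:
    """Choose the change-detection mode for one VM in one job run."""
    return "cbt" if change_id_is_valid else "full-scan"


# Hypothetical validation results for two VMs over three consecutive runs.
runs = [
    {"vm-a": True, "vm-b": True},
    {"vm-a": False, "vm-b": True},   # invalid ChangeID detected for vm-a
    {"vm-a": True, "vm-b": True},    # next run: vm-a is back on CBT
]

history = [
    {vm: pick_mode(valid) for vm, valid in validity.items()}
    for validity in runs
]

for number, modes in enumerate(history, start=1):
    print(number, modes)
```

Only the VM with the invalid ChangeID switches to the full scan, and only for that run; the other VM in the job and the following runs are unaffected.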

MB-NS wrote: (by the way, sorry for the shameless plug, will there be a way in v5 to launch a full backup on just one VM, not all VMs in the job ?)

No, we are not planning this feature for v5. It has not been requested before. Please PM me the use case (why do you need it).

Post by **Gostev** » May 19, 2010 1:05 pm

Hello all, the fix for this issue is now available through our support. Thank you for your patience!
