Host-based backup of VMware vSphere VMs.
Reimold
Enthusiast
Posts: 41
Liked: 1 time
Joined: Sep 07, 2009 11:58 am
Full Name: Dirk Reimold
Contact:

VMware CBT bug KB 2090639

Post by Reimold »

Hello,

I have just read about the CBT bug in all ESXi versions and now have a few questions about it:

- The KB document tells me "A virtual machine may be at risk if the vmdk file was extended to a size greater than 128 G." Does this mean that a VM with an initial disk size greater than 128 GB would not be affected? (Example: a VM disk of 200 GB expanded to 300 GB.)

- How can I tell if I have a corrupt backup of a VM affected by the CBT bug? Will Instant VM Recovery fail, or will just some files on the restored disk be unreadable? Example: a file server with 2 disks (40 GB, and 700 GB expanded to 900 GB). Will a SureBackup job be able to recognize a corrupt disk?

- Is there a way to bypass wrong CBT information without shutting the VM down to disable CBT? I am thinking of creating a new backup job, as this would result in reading and backing up the whole disk again.

We have a lot of VMs with disks greater than 128 GB that were expanded over the past few years.

Thank you for your comments

Dirk
MrSpock
Service Provider
Posts: 49
Liked: 3 times
Joined: Apr 24, 2009 10:16 pm
Contact:

Re: VMware CBT bug KB 2090639

Post by MrSpock »

Let me add one question to Dirk's list:

- Will the CBT be automatically reset if I manually make an "Active Full" backup?

Best regards,

Johan
Vitaliy S.
VP, Product Management
Posts: 27375
Liked: 2799 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: VMware CBT bug KB 2090639

Post by Vitaliy S. »

Hi Dirk and Johan,
Reimold wrote:- The KB document tells me "A virtual machine may be at risk if the vmdk file was extended to a size greater than 128 G." Does this mean that a VM with an initial disk size greater than 128 GB would not be affected? (Example: a VM disk of 200 GB expanded to 300 GB.)
I have passed this question to our QC team and will let you know after we perform these tests.
Reimold wrote:- How can I tell if I have a corrupt backup of a VM affected by the CBT bug? Will Instant VM Recovery fail, or will just some files on the restored disk be unreadable? Example: a file server with 2 disks (40 GB, and 700 GB expanded to 900 GB). Will a SureBackup job be able to recognize a corrupt disk?
Yes, it is recommended to configure and run SureBackup jobs for all mission-critical VMs you protect with the Veeam B&R server. These jobs will allow you to detect all problems with the boot procedure.
Reimold wrote:- Is there a way to bypass wrong CBT information without shutting the VM down to disable CBT? I am thinking of creating a new backup job, as this would result in reading and backing up the whole disk again.
CBT data should be reset; our support team should have instructions on how to do that.
MrSpock wrote:- Will the CBT be automatically reset if I manually make an "Active Full" backup?
CBT will not be reset in this case. A new active full backup should create a new valid restore point, but if you're affected by this issue, you need to reset CBT first.

Thanks!
Reimold
Enthusiast
Posts: 41
Liked: 1 time
Joined: Sep 07, 2009 11:58 am
Full Name: Dirk Reimold
Contact:

Re: VMware CBT bug KB 2090639

Post by Reimold »

Vitaliy S. wrote: Yes, it is recommended to configure and run SureBackup jobs for all mission-critical VMs you protect with the Veeam B&R server. These jobs will allow you to detect all problems with the boot procedure.
But I am not talking about boot disks here. In most cases my boot disks are between 40 and 60 GB, but data disks have often grown beyond 128 GB. Will SureBackup find "CBT bug" problems on those disks too?

Thank you

Dirk
Vitaliy S.
VP, Product Management
Posts: 27375
Liked: 2799 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: VMware CBT bug KB 2090639

Post by Vitaliy S. »

Dirk, no, I believe you will need to check that you have all recent data on these disks manually.
Reimold
Enthusiast
Posts: 41
Liked: 1 time
Joined: Sep 07, 2009 11:58 am
Full Name: Dirk Reimold
Contact:

Re: VMware CBT bug KB 2090639

Post by Reimold »

Vitaliy S. wrote:Dirk, no, I believe you will need to check that you have all recent data on these disks manually.
So this would mean that I cannot trust any backup I have made of my bigger VMs during the past years, and the statement from Gostev's recent newsletter, "but only SureBackup can guarantee you the ability to recover.", does not hold when we talk about bigger file and database servers.

Is there any indication of how often this bug corrupts a VM backed up with Veeam? Are we talking about any VM that has expanded disks or only a small percentage?

Thank you

Dirk
cffit
Veteran
Posts: 338
Liked: 35 times
Joined: Jan 20, 2012 2:36 pm
Full Name: Christensen Farms
Contact:

Re: VMware CBT bug KB 2090639

Post by cffit » 6 people like this post

I agree with the others on here. The weekly email that brought this topic up was good for informing us of the issue, but lacked specifics. Since this is such a critical issue, I think an in-depth explanation of what it affects and how to resolve it would be important. Thanks
JeremyS132
Novice
Posts: 8
Liked: never
Joined: Feb 07, 2014 2:40 pm
Full Name: Jeremy Schwarzrock
Contact:

Re: VMware CBT bug KB 2090639

Post by JeremyS132 »

cffit wrote:I agree with the others on here. The weekly email that brought this topic up was good for informing us of the issue, but lacked specifics. Since this is such a critical issue, I think an in-depth explanation of what it affects and how to resolve it would be important. Thanks
I have to agree with this statement.
maddog2050
Lurker
Posts: 1
Liked: 2 times
Joined: Mar 21, 2012 4:45 pm
Full Name: Adam Stirk
Contact:

Re: VMware CBT bug KB 2090639

Post by maddog2050 » 2 people like this post

Hi,

In the community forum digest that came out highlighting this bug, one of the methods of disabling CBT was using PowerCLI. Is this a proven and supported way of disabling CBT? VMware's stated procedure is to power off the VM, disable CBT, and then power the VM back on. Looking at the script, all it does is disable CBT and then create and remove a snapshot.

Thanks

Adam
namiko78
Expert
Posts: 117
Liked: 4 times
Joined: Mar 03, 2011 1:49 pm
Full Name: Steven Stirling
Contact:

[MERGED] Surebackup and CBT bug

Post by namiko78 »

Regarding Gostev's forum message (posted below): if I had a D: drive that was expanded and thus affected by the bug, would SureBackup catch this? Would only the D: drive fail to come back online, or would the entire VM fail to recover?

>>>>

Unfortunately, I also have some not so good news to share. Earlier this month, VMware quietly published a KB article about a rather terrible CBT bug that exists in all versions of ESX(i) since changed block tracking functionality was first introduced. We have been working directly with VMware to confirm the exact scope of the issue and update the KB article with more details. But the main point is that your backups and replicas for all VMs that had their virtual disk size expanded beyond 128 GB at some point may be unrecoverable. We are working on a hot fix for both 7.0 Patch 4 and 8.0 code branches that will reset CBT automatically upon detecting a source virtual disk size change. Meanwhile, I recommend a manual CBT reset for all VMs that had their virtual disks expanded at some point by disabling CBT (the following Veeam job run will re-enable CBT automatically). Perhaps just disabling CBT on all VMs with a PowerCLI script might be the best idea, but keep in mind that the following job runs will take much longer, so it is best to do this before the weekend.

These kinds of issues always make me stress the importance of SureBackup. Many users consider setting up SureBackup jobs to be a low priority when compared to actual backups; however, only SureBackup is able to catch these kinds of issues. Interestingly enough, a lot of people seem to recognize the importance of backup integrity testing; in fact, our Backup Validator tool seems to be very popular. I do agree that integrity checks are important; in fact, I have dedicated an entire VeeamON breakout session to "classic" data corruption issues. However, integrity checks will not detect corruption issues similar to the above. And yet, these sorts of issues are much more common. I cannot stress this enough, especially in light of the enhancements we are adding to our Backup Validator tool in v8. These enhancements are based on your feedback, but they do not mean that Backup Validator is the future. It has its use in detecting storage-level corruption, but only SureBackup can guarantee you the ability to recover.
Stoo
Service Provider
Posts: 5
Liked: never
Joined: Aug 23, 2013 8:42 am
Full Name: Stu P.
Contact:

Re: VMware CBT bug KB 2090639

Post by Stoo »

Has anyone actually come across this bug 'in the wild' yet and have personal experience of how it manifests?

Keen to know whether this will trash the entire 128 GB+ disk's structure and file headers, making it effectively unusable/unmountable. Or, theoretically, if I'm able to use the Windows guest File-Level Restore wizard (which creates VMDK mount points in C:\veeamflr on my backup server) and it successfully enumerates the entire drive's contents and directory structure, should I be in the clear?
Reimold
Enthusiast
Posts: 41
Liked: 1 time
Joined: Sep 07, 2009 11:58 am
Full Name: Dirk Reimold
Contact:

Re: VMware CBT bug KB 2090639

Post by Reimold » 1 person likes this post

I have opened a ticket at VMware this morning and just got a call from their support:

- They will check whether this problem only affects disks that are expanded from under 128 GB to a size greater than 128 GB and get back to me.
- There is no fix available in the near future.
- There are not many support requests about this CBT bug, and only one real case is linked to that KB document.
- To disable CBT, the VM has to be powered off; no other way is available.
jklimo
Lurker
Posts: 1
Liked: never
Joined: Apr 30, 2014 2:20 pm
Full Name: Johnathan Klimo
Contact:

Re: VMware CBT bug KB 2090639

Post by jklimo »

MrSpock wrote:Let me add one question to Dirk's list:

- Will the CBT be automatically reset if I manually make an "Active Full" backup?

Best regards,

Johan
Great question. I've been researching this as well, and haven't found clarity/details yet re: CBT when Veeam completes Active Full backup jobs.

John
Khue
Enthusiast
Posts: 67
Liked: 3 times
Joined: Sep 26, 2013 6:01 pm
Contact:

Re: VMware CBT bug KB 2090639

Post by Khue »

Reimold wrote: - to disable CBT the VM has to be powered off - no other way is available
Data Protector (an HP product) used to have CBT issues all the time, which would require you to disable and re-enable CBT. You may want to check, but I'd imagine you also have to delete the existing change block database (*.ctk files).
jklimo wrote: I've been researching this as well, and haven't found clarity/details yet re: CBT when Veeam completes Active Full backup jobs.
Just a guess based on my statement above, but I would imagine CBT would have to be completely turned off and then re-enabled by the job. The takeaway is that the change block tracking database gets fubar'd and the only workaround is to wipe it and start from scratch. I would imagine that when you add blocks to the VMDK past 128 GB, the CBT database has to go through some process of growth, and VMware's method of adding that to the existing database is destructive.
dzeleski
Novice
Posts: 3
Liked: 1 time
Joined: Sep 12, 2014 2:54 am
Full Name: Dylan Zeleski
Contact:

Re: VMware CBT bug KB 2090639

Post by dzeleski » 1 person likes this post

Reimold wrote: - to disable CBT the VM has to be powered off - no other way is available
As far as I am aware, you can set CBT to disabled, snapshot the VM, and remove the snapshot, and that should reset CBT (via PowerCLI). There are a few blogs stating this is the case; I'm waiting on my callback from Veeam now.

I know for a fact that a Storage vMotion will reset CBT as well.

# Requires VMware PowerCLI and an active Connect-VIServer session.
$choice = Read-Host 'Press 1 to select a VM or press 2 to select a cluster.'

# Build the list of target VMs from either a single VM or a whole cluster
switch ($choice)
{
    1 {
        $vmName = Read-Host 'Please type in a VM name to reset CBT on.'
        $vms = Get-VM -Name $vmName
    }
    2 {
        $clusterName = Read-Host 'Please type in a cluster name to reset CBT on.'
        $vms = Get-Cluster -Name $clusterName | Get-VM
    }
    default {
        Write-Host 'Invalid input, exiting...'
        $vms = @()
    }
}

# Spec that switches changed block tracking off
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.ChangeTrackingEnabled = $false

foreach ($vm in $vms) {
    $vm.ExtensionData.ReconfigVM($spec)
    # Snapshot create/remove cycle, as described in the post above
    $snap = $vm | New-Snapshot -Name 'Disable CBT'
    $snap | Remove-Snapshot -Confirm:$false
}
johndoe10110
Influencer
Posts: 22
Liked: 5 times
Joined: Oct 23, 2013 12:49 pm
Full Name: John Dooe
Contact:

Re: VMware CBT bug KB 2090639

Post by johndoe10110 »

Just subscribing to this thread awaiting more info.
Reimold
Enthusiast
Posts: 41
Liked: 1 time
Joined: Sep 07, 2009 11:58 am
Full Name: Dirk Reimold
Contact:

Re: VMware CBT bug KB 2090639

Post by Reimold »

VMware Support has just updated my ticket:

- To check that the backup of a VM is OK, they suggest doing a "Veeam Instant VM Recovery" and then running chkdsk / fsck on the expanded drive.
- VMware is working on a fix, but there is no timeframe yet.
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: VMware CBT bug KB 2090639

Post by joergr » 2 people like this post

As far as I have learned in the last few hours: CBT can be reset without powering the VM off, BUT you have to take some action (there are several possibilities). To keep it simple, one easy way is to take and remove a snapshot after the tracking parameter is set to false.

The script Anton provided via link in his mail should do the trick:

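# Requires VMware PowerCLI and an active Connect-VIServer session.
# Collect every VM that currently has CBT enabled
$vms = Get-VM | Where-Object { $_.ExtensionData.Config.ChangeTrackingEnabled -eq $true }
# Spec that switches changed block tracking off
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.ChangeTrackingEnabled = $false
foreach ($vm in $vms) {
    $vm.ExtensionData.ReconfigVM($spec)
    # Snapshot create/remove cycle, as described above
    $snap = $vm | New-Snapshot -Name 'Disable CBT'
    $snap | Remove-Snapshot -Confirm:$false
}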

Credits to Benj (http://www.itwalkthru.com/2012/03/disab ... hange.html, https://www.blogger.com/profile/04023318055860494153)

This will disable CBT on all VMs where CBT is enabled and leave it disabled. Veeam B&R will automatically enable it again during the next run.

Be aware: the next run takes as long as an active full.
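
Before running the script on everything, it may help to preview which VMs it would touch. Below is a minimal read-only sketch: Select-CbtEnabledVM is a hypothetical helper name (not part of PowerCLI), and it applies the same ChangeTrackingEnabled test as the script above without changing anything.

```powershell
# Hypothetical helper (name is an assumption, not a PowerCLI cmdlet).
# Passes through only the objects whose ExtensionData.Config.ChangeTrackingEnabled
# is $true, i.e. the same filter the reset script uses. It modifies nothing.
function Select-CbtEnabledVM {
    param(
        [Parameter(ValueFromPipeline)] $VM
    )
    process {
        if ($VM.ExtensionData.Config.ChangeTrackingEnabled -eq $true) {
            $VM
        }
    }
}

# Usage against a live connection (requires PowerCLI and Connect-VIServer):
# Get-VM | Select-CbtEnabledVM | Select-Object Name
```

Keeping the filter separate from the reset makes it possible to review the list, or to process it in smaller batches, since every VM that gets reset will need a full-length job run afterwards.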

Update1: PLEASE check it in your test lab before using it. Personally, I will do some tests before I apply it to all of my VMs. So far I have applied it to 5 VMs and backed them up right after the change. Everything seems to complete OK; only the run time is that of an active full. Veeam B&R just backs them up without throwing any error or warning, and CBT is enabled quietly before the backup runs. Seems good so far.

Update2: @Reimold: Could you ask VMware if that will do the trick (just so we can be 100% sure)?

Best regards
Joerg
mloeckle
Service Provider
Posts: 7
Liked: 9 times
Joined: May 30, 2013 10:04 pm
Full Name: Michael Loeckle
Contact:

Re: VMware CBT bug KB 2090639

Post by mloeckle »

Stoo wrote:Has anyone actually come across this bug 'in the wild' yet and have personal experience of how it manifests?

Keen to know whether this will trash the entire 128 GB+ disk's structure and file headers, making it effectively unusable/unmountable. Or, theoretically, if I'm able to use the Windows guest File-Level Restore wizard (which creates VMDK mount points in C:\veeamflr on my backup server) and it successfully enumerates the entire drive's contents and directory structure, should I be in the clear?
I've been dealing with it for roughly 2 years. But we do a lot of disk growing though. I never was able to give VMware or Veeam enough information or able to reproduce the problem on demand for them to find the cause. We were always pretty sure it was a CBT bug and not a problem with Veeam. But to your question about FLR, I don't think you can feel safe just because you are able to mount the volume in FLR. I found on one occasion that a resized VM took about a week or two before SureBackup started failing. The best bet is to immediately reset CBT on every VM in your environment and always reset CBT after resizing a VM. That has worked for us without fail so far.
isaako
Service Provider
Posts: 26
Liked: never
Joined: Sep 15, 2010 11:31 am
Full Name: Isaac González
Contact:

Re: VMware CBT bug KB 2090639

Post by isaako »

Just subscribing to this thread awaiting more info.
Peejay62
Expert
Posts: 235
Liked: 37 times
Joined: Aug 06, 2013 10:40 am
Full Name: Peter Jansen
Contact:

Re: VMware CBT bug KB 2090639

Post by Peejay62 »

Me too, subscribing to this thread awaiting more info.
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: VMware CBT bug KB 2090639

Post by joergr »

Update 3: Personally, I am doing it with all VMs now. I slept on it and decided this way this morning. I can't stand the thought that many of my backups might be at risk or might be unusable. BUT please note: this is only to update this thread with my personal thoughts; this is NO ADVICE at all to you. We are at a very early stage of knowing details about this CBT bug. Thus it MIGHT be a good idea to wait, or it might be a good idea to act now; I honestly don't know. My decision is based on my personal gut feeling.

Best regards,
Joerg
stuartmacgreen
Expert
Posts: 149
Liked: 34 times
Joined: May 01, 2012 11:56 am
Full Name: Stuart Green
Contact:

Re: VMware CBT bug KB 2090639

Post by stuartmacgreen »

For me the KB is too vague, so I am awaiting more detail. But for now my points are:

1. Can you check whether a VMDK was ever extended beyond 128 GB at some point in its history?
2. Can this bug lie dormant and appear later, or does it show up right after the disk is extended beyond 128 GB?
3. I might have come across this bug myself, as I could not explain the behavior otherwise until this KB was announced.

Let me explain #3.
I have a backup job with a single VM. It does a full once a week and incrementals the rest of the time. It is a critical VM. I have a SureBackup job for it.
The VM has a SLES 11 64-bit guest OS. The VM has 3 HDDs: 10 GB (OS), 2 GB (swap) and 248 GB (data).
Up until 2 months ago I have not been able to get the SB job to complete successfully, that is, to boot. So the VM could not even get to the heartbeat check.
On the console it just hung while doing an fsck on the 248 GB disk, sticking at somewhere around 39%.

This was very worrying. Every week my SB job was failing. So would my production VM fail to boot if I so much as restarted it?

As a result of yesterday's mail from Gostev, I had the idea to reset CBT while maintaining uptime by doing a Storage vMotion of the VM.
I then performed an Active Full, then kicked off a SB job: instant success.

This would be far more worrying if I had to restore my entire VM from backup data prior to this Active Full.
I cannot draw any conclusion about the 128 GB disk size, as no work was done on this VM's disks 2 months ago.

SUMMARY: If my VM backup cannot boot or hangs in SureBackup, perform a Storage vMotion (which resets CBT) and run SB verification against a backup taken after the Storage vMotion.
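
On point 1, the sketch below cannot recover a disk's resize history, but it can narrow the candidates: a disk that was ever extended beyond 128 GB is still larger than 128 GB today, assuming it was never shrunk. Select-LargeDisk is a hypothetical helper name (not part of PowerCLI); CapacityGB is the size property on the objects returned by PowerCLI's Get-HardDisk.

```powershell
# Hypothetical helper (name is an assumption, not a PowerCLI cmdlet).
# Passes through only disks whose current capacity exceeds the threshold.
# It says nothing about whether a disk was actually resized; it only rules
# out disks that are too small to have been extended beyond 128 GB.
function Select-LargeDisk {
    param(
        [Parameter(ValueFromPipeline)] $Disk,
        [double] $ThresholdGB = 128
    )
    process {
        if ($Disk.CapacityGB -gt $ThresholdGB) {
            $Disk
        }
    }
}

# Usage against a live connection (requires PowerCLI and Connect-VIServer):
# Get-VM | Get-HardDisk | Select-LargeDisk | Select-Object Parent, Name, CapacityGB
```

Any disk on the resulting list still has to be checked against change records or memory to know whether it was ever expanded.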
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: VMware CBT bug KB 2090639

Post by joergr »

Update4: OK did it.

IMHO there is no absolute need to do an active full afterwards, as B&R is smart enough to get the changes right (mainly important for reverse incremental jobs).

@Veeam: Am I right with this assumption?
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: VMware CBT bug KB 2090639

Post by joergr »

Can someone at Veeam comment on this one please? This information would be important for reverse incremental jobs and even more so for replication jobs.

I assume that after resetting CBT data with the described method you don't need to run an active full because B&R will handle it 100%. I have seen various cases in the past where this worked 100%.

But we need to be absolutely sure, which is why I am asking.

Joerg
Gostev
Chief Product Officer
Posts: 31806
Liked: 7299 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: VMware CBT bug KB 2090639

Post by Gostev » 1 person likes this post

We are in the process of testing all scenarios around this issue. There are a lot of things to test. So far, we have not run into a situation where an Active Full was needed. In theory, with CBT being reset, the B&R job fails over to "full scan" incremental processing, which reads the entire virtual disk and transfers the delta between the previous and actual state. This includes the contents of all virtual disk blocks that once fell out of CBT tracking due to the CBT disk expansion bug.

One other thing that we have confirmed by now is that the size of the virtual disk before or after expansion does not seem to matter. What matters is whether the virtual disk was increased by more than 128 GB in size at once. For example, a 200 GB > 300 GB expansion is fine, but 200 GB > 350 GB will cause the CBT bug.
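
As a rough illustration of the rule described above, here is a minimal PowerShell sketch. The function name Test-CbtExpansionRisk and its parameters are hypothetical, and it encodes only the preliminary finding from this post (a single expansion of more than 128 GB), not an official VMware or Veeam check.

```powershell
# Hypothetical helper (name and parameters are assumptions).
# Returns $true when one resize step grew the disk by more than 128 GB,
# which is the condition described in this post as triggering the CBT bug.
function Test-CbtExpansionRisk {
    param(
        [Parameter(Mandatory)] [double] $OldSizeGB,
        [Parameter(Mandatory)] [double] $NewSizeGB
    )
    # Growth in a single resize operation
    $growthGB = $NewSizeGB - $OldSizeGB
    return ($growthGB -gt 128)
}

# The two examples from the post:
Test-CbtExpansionRisk -OldSizeGB 200 -NewSizeGB 300   # growth of 100 GB -> False
Test-CbtExpansionRisk -OldSizeGB 200 -NewSizeGB 350   # growth of 150 GB -> True
```

Note that the check needs the size before and after each individual resize, so it is only useful where that history is known.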
cffit wrote:I agree with the others on here. The weekly email that brought this topic up was good for informing us of the issue, but lacked specifics. Since this is such a critical issue, I think an in-depth explanation of what it affects and how to resolve it would be important. Thanks
This is what happens when we find out about the issue on Friday night. There were no specifics available at the time. So, I had a simple choice: either keep quiet about a potential data corruption issue, or warn everyone that it exists for sure, but without many details. I chose the latter, but perhaps this was the wrong choice?
JeremyS132
Novice
Posts: 8
Liked: never
Joined: Feb 07, 2014 2:40 pm
Full Name: Jeremy Schwarzrock
Contact:

Re: VMware CBT bug KB 2090639

Post by JeremyS132 »

Gostev wrote: This is what happens when we find out about the issue on Friday night. There were no specifics available at the time. So, I had a simple choice: either keep quiet about a potential data corruption issue, or warn everyone that it exists for sure, but without many details. I chose the latter, but perhaps this was the wrong choice?
While I understand what you mean Gostev, I appreciate the fact that you did notify us. I would rather have Veeam acknowledge the bug and state that we are working on it vs. not saying anything. Just my two cents.
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: VMware CBT bug KB 2090639

Post by joergr »

Yeah thanks for bringing all this unprettified and transparent to attention for the community - this is very important information and thus highly appreciated.

For me personally, this is just as important as this one: http://www.computerworld.com/article/25 ... -mess.html

@Anton: This was NO wrong choice, you did exactly the right thing. The ONLY right thing! I can assure you that!

Best regards,
Joerg
Mac
Novice
Posts: 4
Liked: 1 time
Joined: Apr 17, 2013 9:02 am
Full Name: Mac

Re: VMware CBT bug KB 2090639

Post by Mac » 1 person likes this post

Gostev wrote:There were no specifics available at the time. So, I had a simple choice: either keep quiet about a potential data corruption issue, or warn everyone that it exists for sure, but without many details. I chose the latter, but perhaps this was the wrong choice?
IMO it was the right choice 100%. A solution is already available, and even though the exact scope of the affected VMs hasn't been determined yet, at least people are already on alert and can do some planning. Better safe than sorry.
lightsout
Expert
Posts: 227
Liked: 62 times
Joined: Apr 10, 2014 4:13 pm
Contact:

Re: VMware CBT bug KB 2090639

Post by lightsout » 1 person likes this post

No, I think it is the right choice to let us know. The only thing that I'd suggest is creating a forum thread, like this one, and linking to it and saying we'll add more information here as we find it.