VMware CBT bug KB 2090639

kimhansen · Nov 04, 2014 11:15 pm

For people asking about confirming that CBT has been reset, this is a way to do it:

1. Reset CBT using any of the previously posted methods: manually with a reboot, or scripted with the snapshot method (both work). Then browse the VM's files and confirm that the <diskname>-ctk.vmdk files are gone. This should be enough, but if you really want to make sure, also do the next step.

2. Let Veeam run a backup job on the VM after the reset and look for the following in the Veeam logs (%programdata%\Veeam\Backup\VMNAME\Task.VMNAME.vm-nnn.log):

[04.11.2014 06:30:17] <60> Info VM information: name "VM NAME", ref "vm-nnn", uuid "564d6345-3311-21da-9f59-a7188eb2062e", host "vsphere.hostname.local", resourcePool "resgroup-73", connectionState "Connected", powerState "PoweredOn", template "False", changeTracking "False", configVersion "vmx-07"

As you can see, Veeam reports changeTracking as "False".

A few lines below, you can then find:

[04.11.2014 06:30:25] <60> Info [Soap] SetVmChangeTracking, vmRef 'vm-nnn', changeTrackingEnabled 'True'
[04.11.2014 06:30:25] <60> Info [VimApi] ReconfigVM, type "VirtualMachine", ref "vm-nnn"

So, Veeam is turning CBT back on.
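
For anyone who would rather not eyeball the log, the two checks above can be sketched in PowerShell. This is only a rough sketch: the function name Get-CbtLogSummary is made up for illustration, and the patterns are based solely on the log lines quoted in this post, so other Veeam versions may word them differently.

```powershell
# Sketch: summarize the CBT state recorded in a Veeam task log.
# Get-CbtLogSummary is a hypothetical helper, not part of Veeam or PowerCLI.
function Get-CbtLogSummary {
    param(
        [string[]]$Line   # log lines, e.g. the output of Get-Content
    )

    $state = $null        # value of changeTracking "..." in the VM information line
    $reEnabled = $false   # becomes $true if the job turns CBT back on

    foreach ($l in $Line) {
        if ($l -match 'VM information:.*changeTracking "(True|False)"') {
            $state = $Matches[1]
        }
        elseif ($l -match "SetVmChangeTracking.*changeTrackingEnabled 'True'") {
            $reEnabled = $true
        }
    }

    [pscustomobject]@{
        ChangeTrackingAtJobStart = $state
        ReEnabledByJob           = $reEnabled
    }
}

# Usage (path pattern taken from the post; VMNAME and vm-nnn are placeholders):
# Get-CbtLogSummary -Line (Get-Content "$env:ProgramData\Veeam\Backup\VMNAME\Task.VMNAME.vm-nnn.log")
```

After a successful reset, you would expect ChangeTrackingAtJobStart to be "False" and ReEnabledByJob to be True on the first job run.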

That's all folks

Brgs,
Kim Alexander Hansen

Gostev · Nov 05, 2014 12:34 am

@DrColonel this information from VMware support is incorrect.
@cliffm all your questions have already been answered earlier in this topic.

kwells · Nov 05, 2014 9:49 am

MrSpock wrote: I got the same error on one host. Solution: http://www.veeam.com/kb1113

Best regards,

Johan

Hi Johan,
The link you suggested is the one I used when I said I carried out the manual reset method. The interesting thing is that I set both "ctkEnabled" and "scsi0:x.ctkEnabled" to false and then deleted the -ctk files, but when I checked the options I found that ctkEnabled had set itself back to True.

Time to talk to Veeam support?
I tried this again last night, and again the backup gave the Soap error. I have also just checked the log files and cannot see changeTracking "False" anywhere. Does this mean that CBT is not being reset on this VM?

MrSpock · Nov 05, 2014 9:55 am

Hi, kwells.

Yes, I did also notice that ctkEnabled was set back to True automatically.

So you did set "scsi0:x.ctkEnabled" back to True manually as stated in step 7? That did the trick for me at least.

Can you see any "-ctk" files now?

Best regards,

Johan

DrColonel · Nov 05, 2014 3:06 pm

Gostev wrote:@DrColonel this information from VMware support is incorrect.
@cliffm all your questions have already been answered earlier in this topic.

Their response did seem a bit suspect to me. I had already reset CBT on our affected VMs, so it was no longer an immediate issue, but I thought their response was interesting. Can you elaborate or refer me to a past post showing why it's incorrect, so that I can let them know that the info they're giving out about this issue is wrong?

tsightler · Nov 05, 2014 3:43 pm

DrColonel wrote: Their response did seem a bit suspect to me. I had already reset CBT on our affected VMs, so it was no longer an immediate issue, but I thought their response was interesting. Can you elaborate or refer me to a past post showing why it's incorrect, so that I can let them know that the info they're giving out about this issue is wrong?

The actual VMware KB article that this thread refers to is really all you need. It's also important to note that VMware KB2090639 has changed significantly since it was originally released. At first it claimed that the bug only affected one specific call to QueryChangedDiskAreas(): the call with "*", a special form that should return a list of all blocks that have been used/allocated in the entire VMDK. This call is useful for full backups because it saves time reading disk areas that were never written and thus obviously hold no data. This information led some vendors that didn't use this specific call to claim they were not impacted by the bug.

However, the KB article has since been updated and now notes that any call to this API can return inconsistent data if the VMDK has been expanded beyond the given thresholds. They have also added some information about those thresholds; specifically, the Q&A section now includes the following:

Are virtual machines grown in smaller increments affected?
The amount of space the virtual disk is extended is not relevant, the increment of space by which a virtual disk is extended is not relevant.
Virtual machine is affected when the disk is grown past the 128G boundary in absolute size. The issue is triggered at other sizes which are a power of 2 from 128G up. For example: 256G, 512G, and 1024G.
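
Read literally, the Q&A text above describes a simple rule: a disk is affected if an extension takes it past 128G or any doubling of that size. A minimal PowerShell sketch of that rule follows. The function name Test-CbtBoundaryCrossed is hypothetical, and treating "past" as "old size at or below the boundary, new size strictly above it" is an assumption, since the KB wording does not spell this out. VMware's description was still being revised when this was posted, so a False result is no proof that a disk is safe.

```powershell
# Sketch of the rule as quoted from the KB Q&A, not an official check.
# Test-CbtBoundaryCrossed is a hypothetical helper name.
function Test-CbtBoundaryCrossed {
    param(
        [double]$OldSizeGB,   # disk size before the extension
        [double]$NewSizeGB    # disk size after the extension
    )

    # Boundaries named in the quoted text: 128G and each doubling above it.
    $boundary = 128
    while ($boundary -lt $NewSizeGB) {
        # Assumption: "grown past" means old size <= boundary < new size.
        if ($OldSizeGB -le $boundary) { return $true }
        $boundary *= 2
    }
    return $false
}

# Example calls:
# Test-CbtBoundaryCrossed -OldSizeGB 100 -NewSizeGB 200   # crosses 128G
# Test-CbtBoundaryCrossed -OldSizeGB 200 -NewSizeGB 250   # stays between 128G and 256G
```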

A full backup that didn't use the QueryChangedDiskAreas() API at all should not be impacted, but incremental backups using CBT from that point on could still be impacted and thus invalid, since the API would fail to return changed blocks located beyond the extended thresholds. Veeam uses QueryChangedDiskAreas() even during full backups to identify "used" blocks in a VMDK, so the bug impacts both full and incremental backups until the CBT data is reset.

So to clarify, the only "safe" thing to do is to reset CBT for any VM that is over 128GB, unless you have a full history of change control information for a VM and know it was never expanded.

I would continue to monitor this thread as well as the VMware KB article as it's obvious that information is still being discovered about this issue and we may not yet be at the final state.

joergr · Nov 05, 2014 7:31 pm

tsightler wrote:The issue is triggered at other sizes which are a power of 2 from 128G up. For example: 256G, 512G, and 1024G.

Tom, I would not count on that 100%. As Anton verified, 200G to 300G was fine in the Veeam lab. Thus, I think VMware has to do way more research regarding this issue.

Again, personally, I did reset CBT on all VMs, regardless of size. And at present, if we change a vdisk size, we trigger a CBT reset for that particular VM via script.

Best regards,
Joerg

jai64 · Nov 05, 2014 8:30 pm

Has anyone come across the CBT getting stuck after following the steps in KB 1113?

I ran the steps as per KB 1113, and now every server I trigger a CBT reset on gives a "Cannot use CBT: Soap fault." on every backup.

Even stranger, I triggered a full backup and it ran OK with no warnings, but the next job the "Cannot use CBT: Soap fault." came back.

xx/xx/xxxx xx:xx:xx PM :: Cannot use CBT: Soap fault. A specified parameter was not correct. . deviceKeyDetail: '<InvalidArgumentFault xmlns="urn:internalvim25" xsi:type="InvalidArgument"><invalidProperty>deviceKey</invalidProperty></InvalidArgumentFault>', endpoint: ''

Support suggested I redo the CBT reset, but I have a bunch of servers to do that run 24/7, and I am using the vCenter Appliance (no PowerShell).

So additionally, does anyone know a way to trigger a CBT reset without PowerShell or the Windows version of vCenter?

tsightler · Nov 05, 2014 8:42 pm

joergr wrote: Tom, I would not count on that 100%. As Anton verified, 200G to 300G was fine in the Veeam lab. Thus, I think VMware has to do way more research regarding this issue.

I wasn't suggesting you trust it, so sorry if that was implied. I was attempting to make exactly your point: we don't have the final answer yet, as even VMware continues to change their information, and the best thing to do is to continue to monitor. I agree with you that resetting CBT on every disk size change is the most prudent approach for now.

Resqman · Nov 05, 2014 10:59 pm

foggy wrote: No, all the missing blocks will be copied after scanning the entire VM image.

@Foggy, in regard to that last part, I want to make sure that I understand this correctly. So if I disable CBT and then let my nightly backup jobs run tonight, the entire VM will be scanned, and any blocks that were missed by prior backups because of this CBT issue PLUS the blocks that have changed since my last backup (last night's) will be included in this new backup, correct? If this is the case, then that means that yesterday's (and the day before, and the day before that ...) backups are technically useless, right? So, since from this point forward I can only rely on tonight's incremental backup to confidently restore anything on that VM, am I not better off taking a full backup and getting rid of all the prior incrementals and fulls up to this point?

VeForum · Nov 06, 2014 10:46 am

For all of us who want to reset CBT one by one, I made a PowerCLI script to handle this easily.
To find out whether a VM has already had its CBT reset, I use a custom attribute, which needs to be set up first:

Code: Select all

New-CustomAttribute -Name "CBTReset" -TargetType VirtualMachine

As the value for that attribute I use a date, so the script can be used again later by changing $Marker (I really hope it won't be necessary).
Here is the script:

Code: Select all

$Marker = "2014-11-05"
# Pick which VMs to list: all with CBT enabled, those already reset, or those not yet reset
$menu = Read-Host "[1] All Machines, [2] CBT reset, [3] CBT not reset"
switch ($menu) {
    1 {$vms=get-vm -Verbose | ?{($_.ExtensionData.Config.ChangeTrackingEnabled -eq $true)}}
    2 {$vms=get-vm -Verbose | ?{($_.ExtensionData.Config.ChangeTrackingEnabled -eq $true) -and ($_.CustomFields.Item("CBTReset") -eq $Marker)}}
    3 {$vms=get-vm -Verbose | ?{($_.ExtensionData.Config.ChangeTrackingEnabled -eq $true) -and ($_.CustomFields.Item("CBTReset") -ne $Marker)}}
    default {Write-Host "Invalid Option!"}
    }

# Show the candidates with their current CBTReset value and let the user pick one
$vmSelected=$null
$vmSelected=$vms | select Name -ExpandProperty CustomFields| Where{$_.key -eq "CBTReset"} | Out-GridView -OutputMode Single
if ($vmSelected -ne $null)
    {
    # $(...) is needed so that the Name property, not the whole object, is expanded in the string
    switch (Read-Host "Reset CBT on $($vmSelected.Name) (y/n)?")
        {
            y 
                {
                # Disable CBT, then create and remove a snapshot (the snapshot method, no reboot)
                $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
                $spec.ChangeTrackingEnabled = $false
                $vm = Get-VM -Name $vmSelected.Name
                $vm.ExtensionData.ReconfigVM($spec)
                $snap=$vm | New-Snapshot -Name 'Disable CBT' 
                $snap | Remove-Snapshot -confirm:$false
                # Reload the VM; if CBT no longer reports as enabled, stamp it with the marker
                $vmReload = Get-VM -Name $vmSelected.Name | ?{($_.ExtensionData.Config.ChangeTrackingEnabled -eq $true)}
                if ($vmReload -eq $null)
                    {Set-Annotation -Entity $vm -CustomAttribute "CBTReset" -Value $Marker}
                else
                    {Write-Host "Something went wrong with $($vm.Name)"}
                }
            default {Write-Host "Bye"}
        }
    }
else
    {Write-Host "Bye Bye"}

Good luck
Herby

foggy · Nov 06, 2014 5:28 pm

Resqman wrote: @Foggy, in regard to that last part, I want to make sure that I understand this correctly. So if I disable CBT and then let my nightly backup jobs run tonight, the entire VM will be scanned, and any blocks that were missed by prior backups because of this CBT issue PLUS the blocks that have changed since my last backup (last night's) will be included in this new backup, correct?

Correct.

Resqman wrote: If this is the case, then that means that yesterday's (and the day before, and the day before that ...) backups are technically useless, right? So, since from this point forward I can only rely on tonight's incremental backup to confidently restore anything on that VM, am I not better off taking a full backup and getting rid of all the prior incrementals and fulls up to this point?

Previous backups of VMs that have disks with size over 128GB are at risk, yes. So, as described earlier in this thread, creating new restore points for them is recommended, after resetting CBT. It is completely up to you though, whether to delete older backup files, since they might still be recoverable (at least FLR may work).
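
To get a quick list of disks above that size, a filter like the following could help. It's a sketch only: Select-DiskOverThreshold is a made-up helper name, and it assumes the objects piped in expose a CapacityGB property, as hard disk objects returned by Get-HardDisk do in recent PowerCLI versions.

```powershell
# Sketch: pass through only disks larger than a size threshold (default 128 GB).
# Select-DiskOverThreshold is a hypothetical helper, not a PowerCLI cmdlet.
function Select-DiskOverThreshold {
    param(
        [Parameter(ValueFromPipeline = $true)]
        $Disk,                       # any object with a CapacityGB property
        [double]$ThresholdGB = 128   # size named in the post above
    )
    process {
        if ($Disk.CapacityGB -gt $ThresholdGB) { $Disk }
    }
}

# Usage, assuming PowerCLI is loaded and connected to vCenter:
# Get-VM | Get-HardDisk | Select-DiskOverThreshold | Select-Object Parent, Name, CapacityGB
```

Keep in mind that this only reports current size. It cannot tell whether a disk was ever extended, so it narrows down the list of VMs to look at rather than proving which ones are affected.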

jai64 · Nov 06, 2014 5:34 pm

If SureBackup ran successfully for a device, does that guarantee no corruption and a good, recoverable VM?

nreutemann · Nov 06, 2014 6:36 pm

How safe is it to run the script from Veeam KB1940?

The release note of v8 says, inside "upgrade to v8", this:

"11. Reset CBT for all VMs in the environment. For more information, refer to Veeam support article KB1940."

I've been following this thread for the last few days and I'm a little scared to run the script.
I did some tests and didn't have any trouble, but I need to run it over all the VMs and I need some encouragement!

Thanks in advance!

cliffm · Nov 06, 2014 7:33 pm

chrisdearden wrote: Sure Replica is in v7.

I have V7 but can't find any SureReplica in it?

tsightler · Nov 06, 2014 7:57 pm

cliffm wrote: I have V7 but can't find any SureReplica in it?

It's there! There's nothing specifically called "SureReplica" in the GUI, but in v7 when you create an application group you'll notice that there's both an "Add Backup" and an "Add Replica" option for selecting VMs for the app group. Not only that, but when you create the "SureBackup" job itself, you can select both backup and replica jobs; you can even mix and match in the same job! Some links:

Veeam Helpcenter: SureReplica Documentation
Video: Put your replicas to work

Gostev · Nov 06, 2014 9:45 pm

nreutemann wrote: How safe is it to run the script from Veeam KB1940?

Super safe. Basically zero impact, except that the next run for all jobs will take longer than usual.

nreutemann · Nov 06, 2014 11:09 pm

Gostev wrote: Super safe. Basically zero impact, except that the next run for all jobs will take longer than usual.

Excellent, thanks Gostev.

Tomorrow, before Friday's incremental + synthetic run, I will run the script.

After the full run, I will post the results here.

Again, thanks.

_richiix · Nov 07, 2014 10:52 am

Hi guys,

Firstly thanks for all the information, it has been a fantastic help in diagnosing this problem on our infrastructure.

Quick question though: I am seeing this error appear on VMs that are far below the threshold.
I have seen this error crop up on VMs that only have a 25GB disk.

Anything else I should search for regarding this?

Cheers guys!

Nov 07, 2014 11:44 am

Do you mean the "Cannot use CBT: Soap fault." error? There's a dedicated thread on it, but better contact support directly.

ptcruisergt · Nov 07, 2014 4:19 pm

There is word over in the EMC forums (https://community.emc.com/thread/201841) that this VMware bug has been around since 2007. If that's true, I'm at a loss for words. There is also mention of a hotfix that can be obtained by calling support.

Apologies if this information was already posted here.

tsightler · Nov 07, 2014 5:06 pm

ptcruisergt wrote:There is word over in the EMC forums (https://community.emc.com/thread/201841) that this VMware bug has been around since 2007.

Thanks for the info. CBT wasn't an available feature until ESX/ESXi 4.0, and I don't believe 4.0 was released publicly until 2009, but yes, this bug impacts every single version of ESX/ESXi that had the CBT feature. I guess it could have existed in 2007 in beta versions of ESX 4.0. Definitely would be good to confirm a hotfix and whether it addresses VMs that already have broken CBT or will it still require a manual CBT reset for those VMs and simply prevent the issue in the future?

lobo519 · Nov 07, 2014 8:23 pm

Maybe I missed it in the 10 pages but - Veeam disables CBT when there is a disk size change does it not?

I just did one last week and I am looking at the log that says "Change Block tracking is disabled".

What am I missing?

Gostev · Nov 07, 2014 8:26 pm

That CBT tables in VMware get messed up and this affects all future use of CBT data.
Our message only means that the job will not use CBT data during this one specific run.

lobo519 · Nov 07, 2014 8:27 pm

Got it - Thanks!

cffit · Nov 07, 2014 8:53 pm

I opened a case with VMware on this to get the hotfix mentioned a few posts above. Here is their reply to me:

Thank you for your Support Request. Just tried to call but hit your voicemail. I'm not sure where you were given information that a hotfix is available, but that is not the case. At this point there is a workaround available in http://kb.vmware.com/kb/2090639, and a fix is scheduled for a future release. You can also subscribe to an RSS feed of the KB and you'll receive an update when the fix is released.

mrt · Nov 07, 2014 9:37 pm

Can someone provide a modified version of the script at http://www.veeam.com/kb1940 that uses a single named VM instead of a query for every VM that has CBT enabled? I'd like to handle the few VMs in my environment that are potentially affected individually, not every single one. Also, can it be confirmed that the script touches all the disks in the VM? Thanks

xadamz23 · Nov 07, 2014 10:28 pm

Code: Select all

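# Reset CBT on a single named VM
$myvm="Your_VM_name_goes_here"
$vm=get-vm $myvm
# Turn CBT off in the VM configuration
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.ChangeTrackingEnabled = $false
$vm.ExtensionData.ReconfigVM($spec)
# Create and remove a snapshot (the snapshot method, no reboot)
$snap=$vm | New-Snapshot -Name 'Disable CBT'
$snap | Remove-Snapshot -confirm:$false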

Yes, the script resets the CBT for the entire VM, not individual vmdk disks.

cliffm · Nov 08, 2014 5:52 am

tsightler wrote: It's there! There's nothing specifically called "SureReplica" in the GUI, but in v7 when you create an application group you'll notice that there's both an "Add Backup" and an "Add Replica" option for selecting VMs for the app group. Not only that, but when you create the "SureBackup" job itself, you can select both backup and replica jobs; you can even mix and match in the same job! Some links:

Veeam Helpcenter: SureReplica Documentation
Video: Put your replicas to work

AAAAAHHHHHHH! I had no idea, that is truly wonderful news!
Thank you

nreutemann · Nov 08, 2014 10:16 pm

Well, my five cents.

I ran the script from KB1940. After 30 minutes of work, no news, no problems.

After that, I upgraded to v8. My next round of backups started as planned, took more time, and everything went fine, both with the incrementals and with the synthetic.

So, that's all. I hope I got this problem solved.

Cya.

(Sorry, I know, my English sucks)
