CBT behavior after SAN crash, power failure

Post by **ashleyw** » Dec 20, 2011 7:38 pm

Hi,

Thanks to some inexplicable stupidity by one of our cabling contractors (don't ask!), the SAN layer of our VMware farm was disconnected from the compute layer for several hours. Once connectivity was restored, I was forced to reboot our 7 ESXi 5 hosts (hosting around 400 VMs) and then start up all the VMs to restore service.

Unfortunately, now when the backup jobs run, Veeam v6 reports "Cannot use CBT: Soap Fault" for each and every disk on each and every VM. Prior to the "crash" we weren't having any CBT-related issues.

How do we fix this quickly across our entire farm (without taking any VMs offline) so that normal CBT behaviour can resume?

thanks!

Post by **Gostev** » Dec 20, 2011 7:49 pm

Hi Ashley, the only way I am aware of is described in the sticky "Known Issues" topic. Unfortunately, this process involves power cycling each VM twice to fully reset the CTK database in VMware. Thanks.

Post by **ashleyw** » Dec 20, 2011 10:29 pm

oh deary deary me. This is bad news.

Is it possible for Veeam to reproduce this scenario and work out any alternative solutions, or to help with PowerShell scripting to do this for running machines across a vCenter instance?

From my limited understanding, CBT for a running VM can be reset via the VMware APIs, after which a snapshot has to be created and then removed for the change to take effect.

This would help a lot of other clients in the same situation as us and remove the need for manual intervention.

Post by **Gostev** » Dec 20, 2011 10:51 pm

In fact, such research was done at Veeam long ago. This problem is not exactly new: storage problems happen to everyone occasionally, and they do affect CBT sometimes. The process was developed based on our own testing, and it is the only way we were able to make CBT work again. There are simply no other options; when data gets corrupted, there is little you can do other than a hard reset of all data structures. Of course, you may also try opening a support case with VMware, but I doubt they will be able to provide a resolution that does not require VM downtime.

If your primary concern is short downtime for each VM - then just treat this as required downtime to fully recover from the disaster. That SAN issue and ESXi host reset already caused you much more downtime. I am sure you were not too excited about having to do that either, but you had to do it anyway. Now, rebooting all your VMs one by one is really not too big of a deal compared to what you have already been through.

If your concern is the amount of manual work required - then keep in mind that the whole process should be easily scriptable with PowerCLI. If you don't know PowerShell, then as a starting point it might be a good idea to post a request in the PowerShell subforum and see if anyone would be willing to help you.

Post by **ashleyw** » Dec 21, 2011 12:49 am

Thanks Gostev. The part I want to double check before I go searching for PowerShell scripts is this: will a stun cycle on a running machine, done by creating and then removing a snapshot, rectify the issue? Once I know what the actual process is, I'll automate it.

i.e. for each VM in each vCenter instance:
1. Change CBT to off via VMware APIs (both overall and for each disk).
2. Issue a snapshot create then snapshot remove to "stun the VM" to acknowledge the settings
3. Change CBT to on via VMware APIs
4. Issue a snapshot create then snapshot remove to "stun the VM" to acknowledge the settings

or does the VM actually need to be power cycled?
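
The four steps above can be written down as a dry run. This is only a minimal sketch of the proposed order of operations: `DryRunOps`, `set_cbt` and `snapshot_cycle` are hypothetical names that record calls instead of talking to vCenter, and the VM names are placeholders. Whether this sequence is enough to reset CBT is exactly the open question here; a later reply in this thread reports that power cycling was required in Veeam's testing.

```python
from dataclasses import dataclass, field


@dataclass
class DryRunOps:
    """Hypothetical stand-in for a vSphere client: records calls instead of making them."""
    calls: list = field(default_factory=list)

    def set_cbt(self, vm: str, enabled: bool) -> None:
        # A real version would reconfigure the VM and each of its disks.
        self.calls.append((vm, "set_cbt", enabled))

    def snapshot_cycle(self, vm: str) -> None:
        # A real version would create a snapshot and remove it right away (the "stun").
        self.calls.append((vm, "snapshot_create", None))
        self.calls.append((vm, "snapshot_remove", None))


def reset_cbt_without_power_cycle(ops, vm: str) -> None:
    """Run steps 1-4 from the list above, in order, for one running VM."""
    ops.set_cbt(vm, enabled=False)   # step 1
    ops.snapshot_cycle(vm)           # step 2
    ops.set_cbt(vm, enabled=True)    # step 3
    ops.snapshot_cycle(vm)           # step 4


ops = DryRunOps()
for name in ["vm-a", "vm-b"]:        # placeholder VM names
    reset_cbt_without_power_cycle(ops, name)

for call in ops.calls:
    print(call)
```

Keeping the operations behind a small interface means the same loop could later be pointed at a real client (PowerCLI or the vSphere API) once the correct procedure is confirmed.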

Dec 21, 2011 3:44 am

After a SAN crash it is "normal" to see these issues on the first run after the SAN is restored. This is because there was not a "clean" shutdown of the CBT state, thus there is no way for it to be sure that CBT information is accurate. However, subsequent runs should work again without having to disable it completely, unless the CBT information is actually corrupt. Are you actually seeing this behavior on the subsequent runs? I can certainly see this happening on some VMs (perhaps VMs that were actively busy) however, even after a crash I've never seen this issue impact every VM except on the first run.
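
A simple way to apply this before resetting anything is to look at the runs since the outage for each VM and only flag the ones where CBT is still unusable after the first run. The sketch below assumes you have collected, per VM, a list of booleans (oldest run first) saying whether CBT was used; how that data is gathered, the VM names and the labels are all assumptions made for illustration.

```python
def triage(history: dict[str, list[bool]]) -> dict[str, str]:
    """Classify each VM from its post-outage runs (True = CBT was used on that run)."""
    result = {}
    for vm, runs in history.items():
        if runs and runs[-1]:
            result[vm] = "recovered"   # latest run used CBT again
        elif len(runs) < 2:
            result[vm] = "wait"        # at most the first run so far; a failure here is expected
        else:
            result[vm] = "reset"       # still failing on a subsequent run
    return result


# Placeholder data: one entry per VM, one boolean per backup run since the outage.
history = {
    "vm-a": [False, True],
    "vm-b": [False],
    "vm-c": [False, False],
}
print(triage(history))
```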

Post by **Gostev** » Dec 21, 2011 8:02 am

Hi Ashley, in our testing power cycling was required - other methods did not re-create CBT data structures, but were instead re-using the existing, corrupted ones. Thanks.

Post by **UltraSub** » Dec 22, 2011 7:44 pm

I can confirm: after the first successful run, CBT status is reset, and the 2nd run uses CBT again. No power cycle needed. (v6 and ESXi 4.1)

Post by **rollster** » Aug 07, 2012 4:36 pm

After running the PowerShell script, do the ctk.vmdk files need to be deleted from the VM's folder in the datastore?

Post by **foggy** » Aug 08, 2012 10:30 am

No need to delete files manually.

Post by **JeffN825** » Jul 14, 2013 4:05 am

We've had a couple of power failures in the past month or two and I've noticed that even though all VMs come up without any problems, it seems that CBT is becoming corrupt upon any power failure.

I can tell this because doing a full backup takes approximately 30 hours and lists the GB processed as equal to the GB read, whereas normally, the GB processed is substantially less and a full backup takes about 6 hours.
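
The same check can be written as a small heuristic. This is a rough sketch, not something Veeam provides: the function name, the 5% tolerance, the job names and the numbers are all made up for illustration, and the processed and read figures are assumed to have been copied from the job statistics by hand.

```python
def cbt_probably_unused(processed_gb: float, read_gb: float, tolerance: float = 0.05) -> bool:
    """Flag a run where nearly everything processed was also read, hinting that CBT was not used."""
    if processed_gb <= 0:
        return False
    return read_gb >= processed_gb * (1 - tolerance)


# Placeholder figures: job name -> (GB processed, GB read)
jobs = {
    "nightly-a": (2000.0, 1998.0),
    "nightly-b": (2000.0, 140.0),
}
suspect = [name for name, (processed, read) in jobs.items()
           if cbt_probably_unused(processed, read)]
print(suspect)
```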

Resetting CBT on all VMs fixes this issue, but I'm wondering if there is a way to prevent CBT from becoming corrupt so routinely.

Thank you.

Post by **veremin** » Jul 15, 2013 8:27 am

> ...but I'm wondering if there is a way to prevent CBT from becoming corrupt so routinely

Nothing we're aware of. As mentioned above, this problem has to do with unexpected, occasional storage failures and the way they affect the CBT mechanism. So, the only solution you have in this case is to reset CBT. To make this task easier, you can use a PS script that resets CBT on the given VM(s).

Thanks.

Post by **JeffN825** » Jul 15, 2013 3:09 pm

I created such a PowerShell script that loops through all my VMs and does that... but what about my off-site (WAN) backups? Won't that mean they need to do a full backup and it will take days?

Post by **veremin** » Jul 15, 2013 3:17 pm

Generally speaking, it will still be an incremental run, not a full one. After a CBT reset, though, it will certainly take some time for VB&R to work out which blocks have to be transferred to the target location. In other words, the run will take about as long as a full backup, but only changed blocks will be backed up. Thanks.
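
One way to picture this: without CBT, every block has to be read and compared against what is known from the previous run, but only the blocks that differ need to be sent. The sketch below uses per-block SHA-256 digests and a tiny block size purely for illustration; that is an assumption, not a description of how VB&R actually detects changes.

```python
import hashlib

BLOCK_SIZE = 4  # bytes; tiny value chosen only to keep the example readable


def block_digests(disk: bytes) -> list[str]:
    """Digest of every fixed-size block on the 'disk'."""
    return [hashlib.sha256(disk[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(disk), BLOCK_SIZE)]


def incremental_without_cbt(disk: bytes, previous: list[str]):
    """Read every block, but return only the indexes whose content changed."""
    current = block_digests(disk)  # full read of the disk
    changed = [i for i, digest in enumerate(current)
               if i >= len(previous) or digest != previous[i]]
    return changed, current


old_disk = b"AAAABBBBCCCCDDDD"
new_disk = b"AAAAXXXXCCCCDDDD"
changed, digests = incremental_without_cbt(new_disk, block_digests(old_disk))
print(f"blocks read: {len(digests)}, blocks transferred: {len(changed)}")
# prints: blocks read: 4, blocks transferred: 1
```

The read cost scales with the whole disk, while the transfer cost scales only with what changed, which is why the WAN copy stays incremental even though the run is slow.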

Post by **Ferrari-Dude** » Dec 05, 2019 3:51 pm

I am in a similar situation to the OP. We just had a total power failure and the ESXi hosts, switches and SAN all went down. I want to be clear on what I need to do after all the equipment is brought online again:

1. Power on/off each VM two times - this will reset the CBT state
2. Create new backup jobs and/or backup copy jobs

Please advise,

-Don

Post by **Gostev** » Dec 05, 2019 9:34 pm

Don - actually, this thread is many years old, and it relates to a really old ESXi version that is no longer supported. The issue hasn't been reported even once since. Are you getting any API errors? If yes, the issue can be much more severe, and it's best to investigate with VMware support. Thanks!

Post by **Ferrari-Dude** » Dec 05, 2019 9:42 pm

Hi Gostev - I power cycled a sample VM two times and created a new backup job and ran a full backup. No errors were reported. In this case, do you think I'm good to go or should I still investigate with VMware support?

Post by **Gostev** » Dec 06, 2019 5:51 pm

You're good to go!
