Host-based backup of VMware vSphere VMs.
ashleyw
Service Provider
Posts: 208
Liked: 43 times
Joined: Oct 28, 2010 10:55 pm
Full Name: Ashley Watson
Contact:

CBT behavior after SAN crash, power failure

Post by ashleyw »

Hi,

Due to completely inexplicable stupidity by one of our cabling contractors (don't ask!), our SAN layer was disconnected from our compute layer on our VMware farm for several hours. When connectivity was eventually restored, I was forced to reboot our 7 ESXi 5 hosts (hosting around 400 VMs) and then start up all VMs to restore service.

Unfortunately now when all the backup jobs run, for each and every disk on each and every VM I get a message from Veeam v6 "Cannot use CBT: Soap Fault". Prior to the "crash" we weren't having any CBT related issues.

How do we fix this quickly across our entire farm (without taking offline any VMs) so that normal CBT behaviour can commence?

thanks!
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: CBT behavior after SAN crash.

Post by Gostev »

Hi Ashley, the only way I am aware of is described in the sticky "Known Issues" topic. Unfortunately, this process involves power cycling each VM twice to fully reset CTK database in VMware. Thanks.
ashleyw
Service Provider
Posts: 208
Liked: 43 times
Joined: Oct 28, 2010 10:55 pm
Full Name: Ashley Watson
Contact:

Re: CBT behavior after SAN crash.

Post by ashleyw »

oh deary deary me. This is bad news.

Is it possible for Veeam to reproduce this scenario to work out any possible alternative solutions, or help with any PowerShell scripting to do this for running machines across a vCenter instance?

From my limited understanding the CBT for a running VM can be reset via the VMware APIs and then a snapshot triggered and then removed for the new effect to take place.

This would help a lot of other clients in the same situation as us and remove the need for manual intervention.
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: CBT behavior after SAN crash.

Post by Gostev »

In fact, such research was done at Veeam long ago. This problem is not exactly new: storage problems happen to everyone occasionally, and they do sometimes affect CBT. The process was developed based on our own testing, and it is the only way we were able to make CBT work again. There are simply no other options; there's little you can do when data gets corrupted other than a hard reset of all data structures. Of course, you may also try opening a support case with VMware, but I doubt they will be able to provide a resolution that does not require VM downtime.

If your primary concern is the short downtime for each VM, then just treat this as required downtime to fully recover from the disaster. That SAN issue and ESXi host reset already caused you much more downtime. I am sure you were not too excited about having to do that either, but you had to do it anyway. Now, rebooting all your VMs one by one is really not too big of a deal compared to what you have already been through.

If your concern is the amount of manual work required, then keep in mind that the whole process should be easily scriptable with PowerCLI. If you don't know PowerShell, then as a starting point it might be a good idea to post a request in the PowerShell subforum and see if anyone would be willing to help you.
ashleyw
Service Provider
Posts: 208
Liked: 43 times
Joined: Oct 28, 2010 10:55 pm
Full Name: Ashley Watson
Contact:

Re: CBT behavior after SAN crash.

Post by ashleyw »

thanks Gostev, the part I want to double check before I go searching for PowerShell scripts is whether a stun cycle on a running machine, through a snapshot create and snapshot remove, will rectify the issue. Once I know what the actual process is, I'll automate it.

i.e. for each VM in each vCenter instance:
1. Change CBT to off via VMware APIs (both overall and for each disk).
2. Issue a snapshot create then snapshot remove to "stun the VM" to acknowledge the settings
3. Change CBT to on via VMware APIs
4. Issue a snapshot create then snapshot remove to "stun the VM" to acknowledge the settings

or does the VM actually need to be power cycled?
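
For illustration only, the four steps listed above could be sequenced as in the Python sketch below. The vSphere operations are passed in as hypothetical callables (`set_cbt`, `create_snapshot` and `remove_snapshot` are names made up for this sketch, not part of any VMware API), so the ordering can be shown and exercised without a vCenter connection. Whether this stun-cycle sequence actually clears a damaged CBT state is exactly the open question of this post; this is a sketch of the proposal, not a confirmed fix.

```python
from typing import Callable, List, Tuple


def stun_cycle_cbt_reset(
    vm_name: str,
    set_cbt: Callable[[str, bool], None],
    create_snapshot: Callable[[str, str], str],
    remove_snapshot: Callable[[str, str], None],
) -> None:
    """Apply the proposed off / stun / on / stun sequence to one running VM."""
    for enabled in (False, True):
        # Steps 1 and 3: switch CBT off, then on.
        # set_cbt is assumed to cover both the VM-level and per-disk settings.
        set_cbt(vm_name, enabled)
        # Steps 2 and 4: create and remove a snapshot to stun the VM.
        snapshot_id = create_snapshot(vm_name, "cbt-on" if enabled else "cbt-off")
        remove_snapshot(vm_name, snapshot_id)


def record_calls(vm_names: List[str]) -> List[Tuple]:
    """Dry run: record the calls that would be issued, in order."""
    calls: List[Tuple] = []

    def set_cbt(vm: str, enabled: bool) -> None:
        calls.append(("set_cbt", vm, enabled))

    def create_snapshot(vm: str, name: str) -> str:
        calls.append(("create_snapshot", vm, name))
        return name  # the fake uses the snapshot name as its identifier

    def remove_snapshot(vm: str, snapshot_id: str) -> None:
        calls.append(("remove_snapshot", vm, snapshot_id))

    for vm in vm_names:
        stun_cycle_cbt_reset(vm, set_cbt, create_snapshot, remove_snapshot)
    return calls


if __name__ == "__main__":
    for call in record_calls(["vm-01"]):
        print(call)
```

In a real script the three callables would wrap the vSphere API calls of whichever toolkit is used (PowerCLI or a Python SDK); the dry run only prints the intended order of operations.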
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: CBT behavior after SAN crash.

Post by tsightler »

After a SAN crash it is "normal" to see these issues on the first run after the SAN is restored. This is because there was not a "clean" shutdown of the CBT state, so there is no way to be sure that the CBT information is accurate. However, subsequent runs should work again without having to disable it completely, unless the CBT information is actually corrupt. Are you actually seeing this behavior on the subsequent runs? I can certainly see this happening on some VMs (perhaps VMs that were actively busy). However, even after a crash, I've never seen this issue impact every VM except on the first run.
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: CBT behavior after SAN crash.

Post by Gostev »

Hi Ashley, in our testing power cycling was required - other methods did not re-create CBT data structures, but were instead re-using the existing, corrupted ones. Thanks.
UltraSub
Lurker
Posts: 1
Liked: never
Joined: Dec 22, 2011 7:31 pm
Full Name: Robert
Contact:

Re: CBT behavior after SAN crash.

Post by UltraSub »

I can confirm: after the first successful run, CBT status is reset, and the 2nd run uses CBT again. No power cycle needed. (v6 and ESXi 4.1)
rollster
Lurker
Posts: 1
Liked: never
Joined: Aug 07, 2012 4:33 pm
Full Name: Rolando Rodriguez
Contact:

Re: CBT behavior after SAN crash.

Post by rollster »

After running the powershell script, do the ctk.vmdk files need to be deleted from the vm container in the datastore?
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: CBT behavior after SAN crash.

Post by foggy »

No need to delete files manually.
JeffN825
Novice
Posts: 5
Liked: never
Joined: Jun 07, 2013 2:51 am
Full Name: Jeff Nevins
Contact:

[MERGED] CBT Routinely Corrupt

Post by JeffN825 »

We've had a couple of power failures in the past month or two and I've noticed that even though all VMs come up without any problems, it seems that CBT is becoming corrupt upon any power failure.

I can tell this because doing a full backup takes approximately 30 hours and lists the GB processed as equal to the GB read, whereas normally, the GB processed is substantially less and a full backup takes about 6 hours.
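
As a rough illustration of that check, the comparison can be written down as a small Python helper. The function name, the 5% tolerance and the sample figures are assumptions made for this sketch; the numbers would have to be copied from the job statistics by hand, since nothing here talks to Veeam.

```python
def cbt_probably_not_used(processed_gb: float, read_gb: float,
                          tolerance: float = 0.05) -> bool:
    """Return True when read and processed sizes are nearly equal.

    Nearly equal figures suggest the whole disk was read, i.e. change
    tracking was not in effect for this run. This is a heuristic only.
    """
    if processed_gb <= 0:
        return False  # nothing processed, nothing to conclude
    return abs(processed_gb - read_gb) / processed_gb <= tolerance


if __name__ == "__main__":
    # Hypothetical figures, for demonstration only.
    print(cbt_probably_not_used(2000.0, 2000.0))
    print(cbt_probably_not_used(2000.0, 150.0))
```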

Resetting CBT on all VMs fixes this issue, but I'm wondering if there is a way to prevent CBT from becoming corrupt so routinely.

Thank you.
veremin
Product Manager
Posts: 20415
Liked: 2302 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: CBT behavior after SAN crash, power failure

Post by veremin »

"...but I'm wondering if there is a way to prevent CBT from becoming corrupt so routinely"
Nothing we're aware of. As mentioned above, this problem has to do with unexpected, occasional storage failures and the way they affect the CBT mechanism. So the only solution in this case is to reset CBT. To make this task easier, you can use a PS script that resets CBT on the given VM(s).

Thanks.
JeffN825
Novice
Posts: 5
Liked: never
Joined: Jun 07, 2013 2:51 am
Full Name: Jeff Nevins
Contact:

Re: CBT behavior after SAN crash, power failure

Post by JeffN825 »

I created such a powershell script that loops through all my VMs and does that...but what about my off site (WAN) backups? Won't that mean they need to do a full backup and it will take days?
veremin
Product Manager
Posts: 20415
Liked: 2302 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: CBT behavior after SAN crash, power failure

Post by veremin »

Generally speaking, it will still be an incremental run, not a full one. After a CBT reset, though, it will certainly take some time for VB&R to work out which blocks have to be transferred to the target location. In other words, the run will take about as long as a full backup, but only changed blocks will be backed up. Thanks.
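
To make the distinction concrete, here is a minimal Python sketch of how changed blocks can be found without any change-tracking data: every block is read and hashed, but only blocks whose hash differs from the previous run are selected for transfer. This illustrates the general idea only; the thread does not describe how VB&R does this internally, and the 4-byte block size, function names and sample data are assumptions for the demo.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # unrealistically small, to keep the demo readable


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Read the whole disk image and hash it block by block."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(previous_digests: List[str], current: bytes,
                   block_size: int = BLOCK_SIZE) -> List[int]:
    """Return indexes of blocks that are new or differ from the previous run."""
    changed = []
    for index, digest in enumerate(block_digests(current, block_size)):
        if index >= len(previous_digests) or previous_digests[index] != digest:
            changed.append(index)
    return changed


if __name__ == "__main__":
    previous = block_digests(b"AAAABBBBCCCC")
    current = b"AAAAXXXXCCCCDDDD"
    # All four blocks are read, but only two need to be transferred.
    print(changed_blocks(previous, current))
```

The full read is what makes the run take as long as a full backup, while the comparison is what keeps the transferred data down to the changed blocks.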
Ferrari-Dude
Service Provider
Posts: 11
Liked: 1 time
Joined: Jun 14, 2017 4:59 pm
Full Name: Don Neumann
Contact:

Re: CBT behavior after SAN crash, power failure

Post by Ferrari-Dude »

I am in a similar situation to the OP. We just had a total power failure and the ESXi hosts, switches and SAN all went down. I just want to be clear on what I need to do after all the equipment is brought online again:

1. Power on/off each VM two times - this will reset the CBT state
2. Create new backup jobs and/or backup copy jobs

Please advise,

-Don
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: CBT behavior after SAN crash, power failure

Post by Gostev »

Don - actually, this thread is many years old, and it relates to a really old ESXi version that is no longer supported. The issue hasn't been reported even once since then. Are you getting any API errors? If yes, the issue may be much more severe, and it's best to investigate with VMware support. Thanks!
Ferrari-Dude
Service Provider
Posts: 11
Liked: 1 time
Joined: Jun 14, 2017 4:59 pm
Full Name: Don Neumann
Contact:

Re: CBT behavior after SAN crash, power failure

Post by Ferrari-Dude »

Hi Gostev - I power cycled a sample VM two times and created a new backup job and ran a full backup. No errors were reported. In this case, do you think I'm good to go or should I still investigate with VMware support?
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: CBT behavior after SAN crash, power failure

Post by Gostev »

You're good to go!