-
- Service Provider
- Posts: 204
- Liked: 38 times
- Joined: Oct 28, 2010 10:55 pm
- Full Name: Ashley Watson
- Contact:
CBT behavior after SAN crash, power failure
Hi,
Due to reasons of completely unknown stupidity by one of our cabling contractors (don't ask!), our SAN layer was disconnected from our compute layer on our VMware farm for several hours. Eventually, when connectivity was restored, I was forced to reboot our 7 ESXi 5 hosts (hosting around 400 VMs) and then start up all VMs to restore service.
Unfortunately now when all the backup jobs run, for each and every disk on each and every VM I get a message from Veeam v6 "Cannot use CBT: Soap Fault". Prior to the "crash" we weren't having any CBT related issues.
How do we fix this quickly across our entire farm (without taking offline any VMs) so that normal CBT behaviour can commence?
thanks!
-
- Chief Product Officer
- Posts: 31707
- Liked: 7212 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: CBT behavior after SAN crash.
Hi Ashley, the only way I am aware of is described in the sticky "Known Issues" topic. Unfortunately, this process involves power cycling each VM twice to fully reset the CTK database in VMware. Thanks.
-
- Service Provider
- Posts: 204
- Liked: 38 times
- Joined: Oct 28, 2010 10:55 pm
- Full Name: Ashley Watson
- Contact:
Re: CBT behavior after SAN crash.
oh deary deary me. This is bad news.
Is it possible for Veeam to reproduce this scenario to work out any possible alternate solutions, or help with any powershell scripting to do this for running machines across a vCentre instance?
From my limited understanding the CBT for a running VM can be reset via the VMware APIs and then a snapshot triggered and then removed for the new effect to take place.
This would help a lot of other clients in the same situation as us and remove the need for manual intervention.
-
- Chief Product Officer
- Posts: 31707
- Liked: 7212 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: CBT behavior after SAN crash.
In fact, such research was done at Veeam long ago. This problem is not exactly new: storage problems happen to everyone occasionally, and they do affect CBT sometimes. The process was actually developed based on our own testing, and it is the only way we were able to make CBT work again. There are simply no other options: when data gets corrupted, there is little you can do other than a hard reset of all data structures. Of course, you may also try opening a support case with VMware, but I doubt they will be able to provide a resolution that does not require VM downtime.
If your primary concern is the short downtime for each VM, then just treat this as required downtime to fully recover from the disaster. That SAN issue and the ESXi host reset have already caused you much more downtime. I am sure you were not too excited about having to do that either, but you had to do it anyway. Rebooting all your VMs one by one is really not too big of a deal compared to what you have already been through.
If your concern is the amount of manual work required, then keep in mind that the whole process should be easily scriptable with PowerCLI. If you don't know PowerShell, then as a starting point it might be a good idea to post a request in the PowerShell subforum and see if anyone would be willing to help you.
-
- Service Provider
- Posts: 204
- Liked: 38 times
- Joined: Oct 28, 2010 10:55 pm
- Full Name: Ashley Watson
- Contact:
Re: CBT behavior after SAN crash.
Thanks Gostev. The part I want to double-check before I go searching for PowerShell scripts is whether a stun cycle on a running machine, through a snapshot create and snapshot remove process, will rectify the issue. Once I know what the actual process is, I'll automate it.
i.e. for each VM in each vCenter instance:
1. Change CBT to off via VMware APIs (both overall and for each disk).
2. Issue a snapshot create then snapshot remove to "stun the VM" to acknowledge the settings
3. Change CBT to on via VMware APIs
4. Issue a snapshot create then snapshot remove to "stun the VM" to acknowledge the settings
or does the VM actually need to be power cycled?
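For illustration only, the sequence proposed above can be written down as an ordered plan. This is a minimal sketch in Python: `Step` and `snapshot_stun_plan` are hypothetical names, the VM names are made up, and nothing here calls a real VMware API. Whether a snapshot stun is sufficient, or a power cycle is needed instead, is exactly the open question in this thread.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Step:
    vm: str       # VM name (hypothetical)
    action: str   # human-readable description of the operation


def snapshot_stun_plan(vm_names: Iterable[str]) -> List[Step]:
    """Return the proposed per-VM sequence, in order, for every VM."""
    actions = (
        "disable CBT (VM-level and per-disk)",
        "create snapshot",
        "remove snapshot",
        "enable CBT",
        "create snapshot",
        "remove snapshot",
    )
    # All six actions for one VM are emitted before moving to the next VM.
    return [Step(vm, action) for vm in vm_names for action in actions]


if __name__ == "__main__":
    for step in snapshot_stun_plan(["vm-01", "vm-02"]):
        print(f"{step.vm}: {step.action}")
```

Keeping the plan as data makes it easy to review the order before wiring each action to a real call.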
-
- VP, Product Management
- Posts: 6027
- Liked: 2855 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: CBT behavior after SAN crash.
After a SAN crash it is "normal" to see these issues on the first run after the SAN is restored. This is because there was no "clean" shutdown of the CBT state, so there is no way to be sure that the CBT information is accurate. However, subsequent runs should work again without having to disable CBT completely, unless the CBT information is actually corrupt. Are you actually seeing this behavior on the subsequent runs? I can certainly see this happening on some VMs (perhaps VMs that were actively busy); however, even after a crash I've never seen this issue impact every VM except on the first run.
-
- Chief Product Officer
- Posts: 31707
- Liked: 7212 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: CBT behavior after SAN crash.
Hi Ashley, in our testing power cycling was required - other methods did not re-create CBT data structures, but were instead re-using the existing, corrupted ones. Thanks.
-
- Lurker
- Posts: 1
- Liked: never
- Joined: Dec 22, 2011 7:31 pm
- Full Name: Robert
- Contact:
Re: CBT behavior after SAN crash.
I can confirm: after the first successful run, the CBT status is reset, and the 2nd run uses CBT again. No power cycle needed. (v6 and ESXi 4.1)
-
- Lurker
- Posts: 1
- Liked: never
- Joined: Aug 07, 2012 4:33 pm
- Full Name: Rolando Rodriguez
- Contact:
Re: CBT behavior after SAN crash.
After running the powershell script, do the ctk.vmdk files need to be deleted from the vm container in the datastore?
-
- Veeam Software
- Posts: 21128
- Liked: 2137 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: CBT behavior after SAN crash.
No need to delete files manually.
-
- Novice
- Posts: 5
- Liked: never
- Joined: Jun 07, 2013 2:51 am
- Full Name: Jeff Nevins
- Contact:
[MERGED] CBT Routinely Corrupt
We've had a couple of power failures in the past month or two and I've noticed that even though all VMs come up without any problems, it seems that CBT is becoming corrupt upon any power failure.
I can tell this because doing a full backup takes approximately 30 hours and lists the GB processed as equal to the GB read, whereas normally, the GB processed is substantially less and a full backup takes about 6 hours.
Resetting CBT on all VMs fixes this issue, but I'm wondering if there is a way to prevent CBT from becoming corrupt so routinely.
Thank you.
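The symptom described above (GB read equal to GB processed) can be checked across many VMs at once. The following is a minimal sketch in Python, assuming the per-VM figures have already been exported from the job report into a dictionary; `vms_without_cbt`, the `tolerance` parameter, and the sample numbers are all hypothetical. A VM with a genuinely high change rate would be flagged as well, so treat the result as a hint rather than proof of corruption.

```python
from typing import Dict, List, Tuple


def vms_without_cbt(stats: Dict[str, Tuple[float, float]],
                    tolerance: float = 0.02) -> List[str]:
    """Flag VMs whose 'read' size is about equal to their 'processed' size.

    stats maps VM name -> (processed_gb, read_gb). With working CBT the
    read figure is normally well below the processed figure; near-equality
    suggests the whole disk was read.
    """
    flagged = []
    for name, (processed_gb, read_gb) in sorted(stats.items()):
        if processed_gb <= 0:
            continue  # nothing processed, nothing to judge
        if read_gb >= processed_gb * (1.0 - tolerance):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    sample = {"web": (100.0, 3.5), "db": (500.0, 500.0)}
    print(vms_without_cbt(sample))
```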
-
- Product Manager
- Posts: 20353
- Liked: 2285 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: CBT behavior after SAN crash, power failure
Nothing we’re aware of. As mentioned above, this problem has to do with unexpected, occasional storage failures and the way they affect the CBT mechanism. So the only solution you have in this case is to reset CBT. To make this task easier, you can use a PS script that will be responsible for resetting CBT on the given VM(s).
Thanks.
-
- Novice
- Posts: 5
- Liked: never
- Joined: Jun 07, 2013 2:51 am
- Full Name: Jeff Nevins
- Contact:
Re: CBT behavior after SAN crash, power failure
I created such a powershell script that loops through all my VMs and does that...but what about my off site (WAN) backups? Won't that mean they need to do a full backup and it will take days?
-
- Product Manager
- Posts: 20353
- Liked: 2285 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: CBT behavior after SAN crash, power failure
Generally speaking, it will still be an incremental run, not a full one; though after a CBT reset it will certainly take some time for VB&R to understand which blocks have to be transferred to the target location. In other words, the run will take about as long as a full backup, but only changed blocks will be backed up. Thanks.
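To illustrate why such a run takes as long as a full backup but transfers little data, here is a minimal sketch in Python of the general idea: every block is read and hashed, yet only blocks whose hash differs from the previous backup are selected for transfer. This is an assumption-laden toy, not VB&R's actual implementation; `block_digests`, `changed_blocks`, the block size, and the use of SHA-256 are all hypothetical choices.

```python
import hashlib
from typing import Dict, List


def block_digests(disk: bytes, block_size: int = 1024 * 1024) -> Dict[int, str]:
    """Hash every block of the disk image, keyed by block index."""
    return {
        offset // block_size:
            hashlib.sha256(disk[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(disk), block_size)
    }


def changed_blocks(disk: bytes, previous: Dict[int, str],
                   block_size: int = 1024 * 1024) -> List[int]:
    """Read the whole disk, but return only indexes of blocks that differ."""
    current = block_digests(disk, block_size)
    return [i for i in sorted(current) if previous.get(i) != current[i]]


if __name__ == "__main__":
    old = b"a" * 4 + b"b" * 4 + b"c" * 4
    new = b"a" * 4 + b"X" * 4 + b"c" * 4
    print(changed_blocks(new, block_digests(old, 4), 4))
```

The cost of the read is paid for the full disk, which is why the source-side time resembles a full run, while the amount sent to the target depends only on the blocks returned.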
-
- Service Provider
- Posts: 11
- Liked: 1 time
- Joined: Jun 14, 2017 4:59 pm
- Full Name: Don Neumann
- Contact:
Re: CBT behavior after SAN crash, power failure
I am in a similar situation to the OP. We just had a total power failure and the ESXi hosts, switches and SAN all went down. I just want to be clear on what I need to do after all equipment is brought online again:
1. Power on/off each VM two times - this will reset the CBT state
2. Create new backup jobs and/or backup copy jobs
Please advise,
-Don
-
- Chief Product Officer
- Posts: 31707
- Liked: 7212 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: CBT behavior after SAN crash, power failure
Don - actually, this thread is many years old, and it relates to a really old ESXi version that is no longer supported. The issue hasn't been reported once since. Are you getting any API errors? If yes, the issue can be much more severe, and it's best to investigate with VMware support. Thanks!
-
- Service Provider
- Posts: 11
- Liked: 1 time
- Joined: Jun 14, 2017 4:59 pm
- Full Name: Don Neumann
- Contact:
Re: CBT behavior after SAN crash, power failure
Hi Gostev - I power cycled a sample VM two times and created a new backup job and ran a full backup. No errors were reported. In this case, do you think I'm good to go or should I still investigate with VMware support?
-
- Chief Product Officer
- Posts: 31707
- Liked: 7212 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: CBT behavior after SAN crash, power failure
You're good to go!