Failback question

Post by **KOPFteam** » Oct 04, 2017 5:52 pm

One of our 2 ESX servers failed today. The RAID controller is broken. It will be replaced tomorrow.

This is the current situation:
This morning at 5am Veeam (current version, fully patched) created replicas of all VMs (VMware 6.0; no vCenter) from Server A to Server B. People were working with the files that were available through the shares on Server A. At 12pm the RAID controller stopped working. At 3pm I decided that I couldn't repair this myself today and used Veeam to do a "Failover" of 2 VMs to Server B. That worked perfectly, the VMs are "back", they can continue to work.

I hope that Server A will be repaired tomorrow with the original hard disks intact, so I think I will start a "Failback" tomorrow evening. But what will be the result? Will the changes that the users are going to create during the day tomorrow (on the "failover replica") still be there after the failback? I think/hope so! But what about the changes they made today in the hours before the RAID controller failed? Will they be replaced by the failback or integrated?

Thanks for your ideas!
Florian Pürner
KOPFteam GmbH
Munich, Germany

Post by **Deon** » Oct 04, 2017 6:52 pm

Hello Mr. Pürner,

Your current situation is the following:
(0) Replica was created at 5AM.
(1) Production VM stopped functioning at 12PM. It has changes in the period of 5AM-12PM, let's call them "5-12PROD".
(2) You started replica failover from the state of 5AM. The users started working on it, so it now has changes "5AM-current time", but they are different, so let's call them "5-CurREP".
(3) When you decide to trigger the Failback to production, the disks of the production machine will be changed to the state of the disks on the replica. The production machine will then have the "5-CurREP" state.

It means that you won't have "5-12PROD" changes anymore.

There's no way to automate a "merge" of any sort, because in reality when your production machine becomes available and you still have the replica, you will have two different versions of the machine with different data, both with useful changes after the last replication.
For Veeam, disk state means the state of the data blocks. There is no "file analysis to compare two states of the machine and merge them" or any magic like that (unfortunately; I do hope that in the far future we will have such technology).
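To illustrate the point about block states, here is a toy sketch in Python. It is not how Veeam is implemented; the dictionaries, the `failback` function and the block contents are all made up for illustration. It only shows why the "5-12PROD" changes disappear when production is brought to the replica's state.

```python
# Toy model (assumption, not Veeam code): a disk is a mapping
# from block number to block content.
replica_5am = {0: "base", 1: "base", 2: "base"}

# Production kept changing until the controller failed ("5-12PROD").
production = dict(replica_5am)
production[1] = "edited on production, 5AM-12PM"

# The replica was failed over from the 5AM state and users kept
# working on it ("5-CurREP").
replica = dict(replica_5am)
replica[2] = "edited on replica after failover"


def failback(production_disk, replica_disk):
    """Bring the production disk to the replica's block state. No merging."""
    return dict(replica_disk)


production_after = failback(production, replica)

# Blocks that were changed on production after 5AM and are gone after failback.
lost_blocks = {
    n: content
    for n, content in production.items()
    if content != replica_5am[n] and production_after[n] != content
}
print(lost_blocks)
```

In this model the block edited on production between 5AM and 12PM is the only entry in `lost_blocks`, while the block edited on the replica survives.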

Since it's an important situation, I would advise doing some "manual work" if possible. You could pick the critical machines, check the differences between files/application items with external tools, and move these changes to the replica before doing the failback. It may be unrealistic if you have a huge number of machines, but something CAN be done.
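As one possible form of that manual check, the sketch below uses Python's standard `filecmp` module to list files that exist only on the production copy or differ from the replica copy. It assumes both file trees are reachable as ordinary directories (for example a mounted share and a restored folder); the function name and the paths in the usage note are hypothetical.

```python
import filecmp
from pathlib import Path


def production_only_changes(prod_root, replica_root):
    """Return (reason, relative path) pairs for files that exist only
    under prod_root or whose content differs from replica_root."""
    findings = []

    def walk(cmp, rel):
        # Files and folders present on production but missing on the replica.
        for name in cmp.left_only:
            findings.append(("only on production", str(rel / name)))
        # Files present on both sides that do not compare equal.
        for name in cmp.diff_files:
            findings.append(("differs", str(rel / name)))
        # Recurse into subdirectories that exist on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, rel / name)

    walk(filecmp.dircmp(prod_root, replica_root), Path("."))
    return sorted(findings)
```

A call such as `production_only_changes("/mnt/prod_share", "/mnt/replica_share")` (hypothetical paths) would return the candidates to copy over by hand. Note that `filecmp.dircmp` compares files by size and modification time first and only reads contents when those differ, so it is a quick screening, not a guarantee.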

What I would advise for sure:
Before you decide to go for the failback to production, create a single backup of all your critical production machines. This way, if you notice something critical missing after the failback, you can get it out of this "transition" backup with File-level restore or Application-item restore.

Post by **KOPFteam** » Oct 06, 2017 6:55 pm

Hello Deon,

thanks for the clarification! Today the server was repaired (new mainboard, new RAID controller) and I started a "Failback" for the first VM (a Domain Controller with 120 GB), and it failed.

It finished the "Calculating original signature..." step but failed in the "Replicating restore point ...." step after 7 minutes with this error:

06.10.2017 20:04:11 Error    Failed to perform failback Error: Failed to open VDDK disk [[ESX1 - SAS-RAID5] pp-vserver1.polyplan.local/pp-vserver1.polyplan.local.vmdk] ( is read-only mode - [false] )
                             Logon attempt with parameters [VC/ESX: [PP-ESX1];Port: 902;Login: [root];VMX Spec: [moref=5];Snapshot mor: [5-snapshot-6275];Transports: [nbd];Read Only: [false]] failed because of the following errors:
                             Failed to process [replicateVddkDiskIncremental].
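For anyone comparing several of these errors, a small sketch like the following pulls the connection parameters out of the "Logon attempt with parameters" line, so that host, port and transport mode can be seen at a glance. The function name is hypothetical and the parsing is based only on the log format shown above.

```python
import re


def parse_logon_parameters(line):
    """Extract the 'key: value' pairs from a 'Logon attempt with parameters [...]' line."""
    match = re.search(r"parameters \[(.*)\] failed", line)
    if match is None:
        return {}
    params = {}
    for field in match.group(1).split(";"):
        key, _, value = field.partition(":")
        # Values are usually wrapped in [brackets]; drop them.
        params[key.strip()] = value.strip().strip("[]")
    return params


log_line = (
    "Logon attempt with parameters [VC/ESX: [PP-ESX1];Port: 902;Login: [root];"
    "VMX Spec: [moref=5];Snapshot mor: [5-snapshot-6275];Transports: [nbd];"
    "Read Only: [false]] failed because of the following errors:"
)
params = parse_logon_parameters(log_line)
print(params["VC/ESX"], params["Port"], params["Transports"])
```

Here the line shows a direct connection to the ESX host on port 902 using the `nbd` (network) transport.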

I opened a support case (#02336918) but because we only have "Basic Support", I guess nobody will work on it before Monday. I planned the weekend for the failback (the other VM is the file server with about 2 TB). Does anyone have an idea what's wrong here?

[Update] Looking at the repaired server with the vSphere Client, it now tells me that the disk of the VM that didn't fail back needs consolidation. Is that because the failback failed, or is that the reason why it failed? Should I do it?

Thanks,
Florian

Post by **KOPFteam** » Oct 06, 2017 8:37 pm

OK, I learned from the Veeam KB that the "needs consolidation" is to be expected after a failed failback.

I tried a 2nd failback to a "specified location". That doesn't use the (now probably corrupt) original VM but should simply copy over the whole VM to the specified location. But - after a few minutes I got nearly the same error ..... Failed to process [replicateVddkDiskIncremental].

What is going on here?
There is no vCenter used anymore for more than a year now. Both ESX are standalone.
What I saw is that the failback process chose as proxy the very VM that it should fail back (I left it on automatic). Maybe it can't proxy itself?

Post by **KOPFteam** » Oct 06, 2017 9:29 pm

Sorry for reposting this, but simply replying under the "Failback question" below was not a good idea I think. Because now this is not a question, this is a failure!

Two days ago one of our servers failed and I used "Failover" without problems to start the replicas of a Domain Controller and a File Server on a second ESX.
Today our server (ESX1) was repaired (new mainboard, new RAID controller) and I started a "Failback" for the first VM (PP-VSERVER1, a Domain Controller with 120 GB) and it failed.

It finished the "Calculating original signature..." step but failed in the "Replicating restore point ...." step after 7 minutes with this error:

    06.10.2017 20:04:11 Error    Failed to perform failback Error: Failed to open VDDK disk [[ESX1 - SAS-RAID5] pp-vserver1.polyplan.local/pp-vserver1.polyplan.local.vmdk] ( is read-only mode - [false] )
                                 Logon attempt with parameters [VC/ESX: [PP-ESX1];Port: 902;Login: [root];VMX Spec: [moref=5];Snapshot mor: [5-snapshot-6275];Transports: [nbd];Read Only: [false]] failed because of the following errors:
                                 Failed to process [replicateVddkDiskIncremental].

I opened a support case (#02336918) but because we only have "Basic Support", I guess nobody will work on it before Monday. I planned the weekend for the failback (the other VM is the file server with about 2 TB). Does anyone have an idea what's wrong here?

Looking at the repaired server with the vSphere Client, it now tells me that the disk of the VM that didn't fail back needs consolidation. I learned from the Veeam KB that the "needs consolidation" is to be expected after a failed failback.

I tried a 2nd failback to a "specified location". That doesn't use the (now probably corrupt) original VM but should simply copy over the whole VM to the specified location. But - after a few minutes I got nearly the same error ..... Failed to process [replicateVddkDiskIncremental].

What is going on here?
There is no vCenter used anymore for more than a year now. Both ESX are standalone.
What I saw is that the failback process chose as proxy exactly the VM that it should fail back (I left the proxy on automatic). Maybe it can't proxy itself?

Thanks,
Florian

Post by **Gostev** » Oct 08, 2017 6:19 pm

Hi Florian, I've merged the two topics together. Please note that according to the forum rules provided when you click New Topic, this forum is not an alternative way to obtain support when you are unable to get it from our technical support team for whatever reason. Please do not force thousands of community members to read through multiple topics about the same environment-specific issue by creating duplicate discussions. Please help us keep this forum easily consumable and interesting for everyone. Thanks!

Post by **bsquillace** » Oct 09, 2017 4:02 pm

Hi Florian,

I had the same problem when trying to fail back to production, with the same error message:
10/4/2017 4:17:40 PM Error Failed to perform failback Error: Failed to open VDDK disk [[PEC-SAN] WebHelpDesk/WebHelpDesk.vmdk] ( is read-only mode - [false] )
Logon attempt with parameters [VC/ESX: [Vcenter1.premiereyecare.net];Port: 443;Login: [administrator@vsphere.local];VMX Spec: [moref=vm-15138];Snapshot mor: [snapshot-15195];Transports: [nbd];Read Only: [false]] failed because of the following errors: Failed to process [replicateVddkDiskIncremental

I was able to fix the problem by doing two things:
1) During Failback to Production, choose the DR Site Proxy and the Veeam server as the Prod Site Proxy. Instead of letting Veeam select these automatically, click "Pick Backup Proxies for data transfer" in the Failback wizard.

2) In Backup Infrastructure under Backup Repositories change the transport mode to Network for the Veeam Server and the DR site proxy server.

Both of these steps solved the problem for me. I had a ticket with Veeam support and this was the fix after a few attempts. Until this was done, I kept seeing the same failback errors and ended up with a corrupted production VM.

Note: If your production VM is corrupted (consolidation needed), you may have to delete the production VM from the production vCenter and then do a failback to production, choosing the "Failback to the specified location" option.

Post by **KOPFteam** » Oct 09, 2017 8:52 pm

Thanks for your answer, Brian!

Today (before I read your post) I managed to move the 1st VM (Domain Controller) back to the repaired production server. Veeam Support pointed me to https://www.veeam.com/kb2018 and this way it worked.

That leaves the 2nd (and last) VM: the file server with 2.2 TB! I changed the settings for the transport mode (the settings are under Proxies, not under Repositories) and started the failback wizard with manual proxy settings. When it started "Calculating original signature", it needed 7 minutes for 1%. I canceled the wizard (early enough not to "destroy" the production VM) and will try again next weekend.

I will let you know what happens.

Thanks again,
Florian
