Failback after Restore vs Quick Migration after Permanent Failover

Post by **ihussain** » Dec 31, 2020 6:53 pm

Hi

I have a VM recovery scenario that I would like some expert input on.

VM01 is a large VM (>5TB) which resides on the prod site (10.0.0.1/24) and is backed up locally. Due to the size of the vmdks (and their individually required retention policies) backup jobs are split into 3 smaller jobs instead of 1 big job.
This VM01 is also replicated to a DR site (10.0.99.0/24) twice daily.
VM01 has been accidentally deleted from the prod site and is now running from the replica failover VM01_replica at the DR site.
There is a 100Meg VPN link between the 2 sites and all backup & replica jobs for VM01 have been stopped in the interim.

In order to minimise further disruption and data loss, what is the best and most efficient way to get VM01 running on the prod site again?

1). Restore VM01 entirely from the multiple backup restore points at the prod site and then initiate the replica failback from the DR site. Finally, re-enable and re-run the stopped backup & replication jobs.

Or

2). Initiate the Permanent Failover on VM01_replica at the DR site and then carry out a Veeam Quick Migration from the DR site to get VM01 running again on the vCenter at the prod site. Finally, re-enable and re-run the stopped backup & replication jobs.

Or

3). Any other suggestions?

Please can someone advise on this? Thanks!

Post by **HannesK** » Jan 04, 2021 6:33 am

Hello,

> Due to the size of the vmdks (and their individually required retention policies) backup jobs are split into 3 smaller jobs instead of 1 big job.

I recommend fixing the infrastructure performance instead of implementing such workarounds. VMs 10x bigger than yours are no problem with proper configuration & hardware.

I assume that "100Meg" means 100Mbit/s... so that means around 120h full transfer without any compression. I assume that you can use the Veeam WAN accelerator in "High bandwidth mode" https://helpcenter.veeam.com/docs/backu ... ml?ver=100
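For reference, the "around 120h" figure can be reproduced with a quick back-of-the-envelope sketch. It assumes the VM is exactly 5 TB (the original post only says >5TB), that the whole 100 Mbit/s is usable, and that there is no compression or WAN acceleration. The function name and the `efficiency` parameter are only illustrative.

```python
TB = 10**12   # decimal terabyte
TIB = 2**40   # binary tebibyte


def transfer_hours(size_bytes: float, link_mbit_s: float, efficiency: float = 1.0) -> float:
    """Hours needed to send size_bytes over a link of link_mbit_s (decimal Mbit/s).

    efficiency is the assumed usable fraction of the link (1.0 = no overhead).
    """
    if link_mbit_s <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_bytes * 8
    usable_bits_per_second = link_mbit_s * 1_000_000 * efficiency
    return bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    print(f"5 TB : {transfer_hours(5 * TB, 100):.0f} h")
    print(f"5 TiB: {transfer_hours(5 * TIB, 100):.0f} h")
```

With these assumptions the result is roughly 111 h for 5 TB and 122 h for 5 TiB, so "around 120h" (about 5 days) is the right order of magnitude. Any protocol overhead or other traffic on the VPN makes it longer, while compression or WAN acceleration makes it shorter.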

1. I have no experience with that "split job scenario", but it should work, yes. The calculation of changes will take some time.

2. I would go for 3 instead of quick migration.

I would go for option 3 because it sounds like you never tested that scenario and it looks like the "safest" and "most predictable" way without knowing any details about your infrastructure performance.

3. Do a permanent failover. Create a new replication job that points from the DR site to the production site. Wait 120 hours or less for the initial replication to finish, then do a planned failover. That way you have zero data loss. Then create new backup and replication jobs, because the VM ID (MoRefID) changed and re-using the old jobs is complicated (please use the forum search to see which options are available).

Best regards,
Hannes

Post by **ihussain** » Jan 04, 2021 10:19 am

Thanks Hannes

Would option 1 still entail data loss even when doing a replica failback onto the restored (from backup) VM01 at the prod site?

Can you elaborate more on option 3 please?
After a permanent failover, would VM01_replica become VM01 and adopt its IP settings?
And after the permanent failover, where should the new replication job be created and run from? The VBR server at the prod site or the one at the DR site?

Post by **HannesK** » Jan 04, 2021 11:42 am

Reading these questions, I have the feeling that we are talking about a production environment, so I recommend involving someone (a Veeam partner, for example) who has worked with replication before. Or at least try out failover and failback with a test VM first.

As far as I see, you did not commit failover yet. That means that you are running on snapshots on the target side and you might run out of disk space on the VMware datastore. Please follow the user guide and do not manually delete the snapshots!

1. I'm not sure whether you are really asking for "failback" or whether you are talking about "undo failover". "undo failover" causes data loss, yes. Please check the user guide https://helpcenter.veeam.com/docs/backu ... ml?ver=100

3. Failover already adopted the IP settings according to your replication job configuration. I'm talking about https://helpcenter.veeam.com/docs/backu ... ml?ver=100

Hmm, I did not see that you have two backup servers. If you really have two backup servers, then it depends on your design (I don't know how to guess that). The new replication job in the scenario I had in my mind (one backup server) takes the VM from the DR site as source. Once everything is done, you can revert the direction again.

Post by **ihussain** » Jan 04, 2021 12:49 pm

Yes, that's correct, there are 2 VBR servers: a physical one at prod (for backup jobs and backup copy jobs only) and a virtual one at DR (for replication jobs only). The virtual VBR at the DR site is hosted on the same ESXi server that VM01_replica resides on and is running from.

So after "committing" the failover (via permanent failover) are you saying to replicate this back to the vCenter on the prod site?

Post by **HannesK** » Jan 05, 2021 12:30 pm

Okay, then the backup server that is responsible for the replicas sounds like "the right one". From a performance perspective, I assume that you have a proxy server for replication tasks on the prod site.

I repeat my recommendation to try it out with a small machine to get used to the software (or ask somebody for help who knows your environment).

> So after "committing" the failover (via permanent failover) are you saying to replicate this back to the vCenter on the prod site?

Yes. As I said: I only recommend that because it seems to me the safest way to avoid full disks or performance issues in a production environment.

The "normal" way would be "failback to production" (to a new location, because you deleted the original VM) https://helpcenter.veeam.com/docs/backu ... ml?ver=100 . But I did not want to recommend that, because you are still running on snapshots, and during failback you would run on snapshots for another 5 days. I have no idea how much free space you have and which other things might be untested in your environment. That's why I went for the "safest way".
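To make the free-space concern more concrete, here is a rough sketch for estimating how much the replica snapshots could grow while a 5-day failback runs. The daily change rate, free space and headroom used in the example are made-up placeholders, not values from this thread, and the function names are only illustrative. Treating all changed data as snapshot growth is a simplification that tends to overestimate, because blocks rewritten several times are counted each time.

```python
def snapshot_growth_gb(change_gb_per_day: float, days: float, disk_size_gb: float) -> float:
    """Rough upper bound on snapshot growth, capped at the size of the disk itself."""
    if change_gb_per_day < 0 or days < 0 or disk_size_gb < 0:
        raise ValueError("inputs must not be negative")
    return min(change_gb_per_day * days, disk_size_gb)


def fits_on_datastore(growth_gb: float, free_gb: float, headroom: float = 0.2) -> bool:
    """True if the growth fits while keeping `headroom` (a fraction) of the free space unused."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return growth_gb <= free_gb * (1 - headroom)


if __name__ == "__main__":
    # Hypothetical numbers: 100 GB of changes per day, 5 days, 5000 GB of disks,
    # 1000 GB free on the DR datastore.
    growth = snapshot_growth_gb(change_gb_per_day=100, days=5, disk_size_gb=5000)
    print(f"estimated snapshot growth: {growth:.0f} GB")
    print("fits with 20% headroom:", fits_on_datastore(growth, free_gb=1000))
```

If the estimate comes anywhere near the free space on the DR datastore, committing the failover first (the permanent failover in option 3) is the safer route.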

Post by **ihussain** » Jan 05, 2021 2:11 pm

Thanks Hannes, this has been most helpful.

There is plenty of disk space.
It sounds like an initial replication of VM01 from the DR site to the prod site (with the WAN accelerator in high bandwidth mode), followed by a planned failover, is the way to go.
