Failover Replica Now Live

Post by **homerjnick** » Dec 06, 2012 10:54 am

We have had Veeam just over 6 months...fantastic application and we have had to restore two live servers in that 6 months and had no problem.

We also set up replication to our DR site and whilst we have tested it in the IT Department we struggle to get it tested in live use.

We have a VM that has an 80GB OS disk and a 350GB data disk that is used by our entire company.

An upgrade to an application went horribly wrong. We never used snapshots but we took a backup via Veeam before the upgrade. We restored just the C drive from the backup prior to the upgrade and the server blue screens with a rather nasty STOP error.

We then failed over to the replica...quick DNS change and users are happy...

I then deleted the original VM and restored it from the incremental backup taken prior to the upgrade, as my thinking was perhaps it was just restoring the OS disk and not the data disk that caused it to blue screen, but alas after a full restore last night it still blue screens. I tried Startup Repair and all that but nothing works.

I will restore from a Full backup taken the day before the incremental backup that I took just prior to the upgrade.

But my question is this.

Even if the restore fails to boot I still have my working failed over replica running.

What is the best procedure to fail this back so that the server is running at our main site, bearing in mind the original VM has been corrupted, been deleted, been restored and is still corrupted...

I am worried if I fail it back that I will lose all the data that has changed since the replica has been running and that the original VM will still be corrupted...can that occur?

Post by **foggy** » Dec 06, 2012 11:06 am

Nick, you can always undo failback in case you find that the original VM is corrupt. Undo will revert the replica to the protective failover snapshot without committing changes made while the VM replica was in the failback state.

Alternatively, you can just Quick Migrate your currently running replica VM to the main site.

Post by **homerjnick** » Dec 06, 2012 11:14 am

Hi foggy, I don't understand...I need the changes that are now in the replica to be replicated/failover back to original VM...if I undo failover does that not just discard all the changes made on the replica and simply power up the original VM?

And it is a 430GB VM and we have a 10Mb connection, so I can't see all of that coming down the line even over the weekend if a Quick Migrate just moves the whole thing...
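As a rough sanity check on that worry, here is a back-of-the-envelope estimate. It assumes "10Mb" means 10 megabits per second, decimal gigabytes, and a link running at full line rate with no compression, deduplication or protocol overhead; none of that is stated in the thread, and `transfer_hours` is just an illustrative helper name.

```python
def transfer_hours(size_gb: float, link_mbit_per_s: float) -> float:
    """Hours needed to push size_gb (decimal GB) through a link at full line rate."""
    bits = size_gb * 1e9 * 8                  # GB -> bits
    seconds = bits / (link_mbit_per_s * 1e6)  # bits / (bits per second)
    return seconds / 3600


hours = transfer_hours(430, 10)
print(f"{hours:.1f} hours (~{hours / 24:.1f} days)")  # 95.6 hours (~4.0 days)
```

Under those assumptions a full copy of the VM needs roughly four days of uninterrupted transfer, which is longer than a weekend.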

Post by **foggy** » Dec 06, 2012 11:22 am

homerjnick wrote:Hi foggy, I don't understand...I need the changes that are now in the replica to be replicated/failover back to original VM...if I undo failover does that not just discard all the changes made on the replica and simply power up the original VM?

I was talking about undoing the failback operation (not failover) in case the original VM is corrupt after failback.

Post by **homerjnick** » Dec 06, 2012 12:17 pm

Ok thanks foggy, I was not aware of that option...it gets a bit more confusing though...

My replica at my DR site was out of sync for 2 weeks...ie replication from the main site VM had not occurred for 2 weeks due to networking work...not a big issue...

If I restore from a Full backup from last weekend to my main site so that this VM will have the 2 weeks worth of data NOT in my replica what will happen to:

a) the 2 weeks worth of data missing in the replica that exists in the original VM when failback occurs?
b) the data added/changed on the replica that has occurred since it has been running that is not on the original VM when failback occurs?

Does the process merge the data together or will it simply be a copy of the replica that will then exist in my main site and I'll be missing 2 weeks of data?

Post by **foggy** » Dec 06, 2012 12:57 pm

Well, the data will not be merged: you will get a copy of the replica VM at your original location (missing the changes that occurred on the original VM). The original VM will be synchronized with its replica.

I'm not sure, though, why you need both the 2 weeks of changes that occurred on the original VM and the changes on the replica VM. The whole replication scenario is supposed to be used in cases where only one of the two VMs (the original one and the replica VM) is in use at any moment in time, so you need to keep only one set of changes. Are you saying that both VMs were up and running and used by your employees during these 2 weeks, so that some of them used the original VM and others used its replica?
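A toy model of "copy, not merge", treating each VM as a dictionary of files. This is only an illustration: the names `failback` and `only_on_original` and the file names are made up here, and real replication works on disk blocks rather than file names.

```python
def failback(original: dict, replica: dict) -> dict:
    """State of the original VM after failback: a plain copy of the replica."""
    return dict(replica)


def only_on_original(original: dict, replica: dict) -> dict:
    """Files that exist on the original VM but not on the replica."""
    return {name: data for name, data in original.items() if name not in replica}


# Hypothetical contents: the original holds two weeks the replica never received,
# the replica holds work done after failover.
original = {"report_old.doc": "v1", "week1.doc": "v1", "week2.doc": "v1"}
replica = {"report_old.doc": "v1", "after_failover.doc": "v1"}

to_copy_by_hand = only_on_original(original, replica)  # collect before failback
after = failback(original, replica)

print(sorted(after))            # what the production VM holds after failback
print(sorted(to_copy_by_hand))  # what would be lost unless copied manually
```

Anything that exists only on the original VM has to be saved before the failback overwrites it.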

Post by **homerjnick** » Dec 06, 2012 2:02 pm

No...my replica is 2 weeks out of sync...in other words it has not received any data from the original VM in 2 weeks...thus when we failed over we told users the last 2 weeks were missing..they were fine and carried on working adding data to the replica...

I will then tonight restore the original VM thus the VM will have the 2 weeks missing from the replica but of course will be missing all the new data that has been added/changed on the replica since users have been using it.

So it is a little bit of a split brain scenario...I need the 2 weeks data from the original VM (since the replica had not replicated in 2 weeks due to network installs) but I want the added data since the replica has been running.

I guess if there is no merge of data then I just want the replica version to be the one that is live, if our users can do without the 2 weeks of data then so be it.

So is it a case of JUST failing back and all the changes on the DR version will be replicated back to the original VM? Of course, I can manually mount the data drive from the corrupted VM prior to failback and manually copy the missing two weeks so no big deal...

Post by **foggy** » Dec 06, 2012 2:08 pm

homerjnick wrote:So is it a case of JUST failing back and all the changes on the DR version will be replicated back to the original VM?

Yes, that's right.

homerjnick wrote:Of course, I can manually mount the data drive from the corrupted VM prior to failback and manually copy the missing two weeks so no big deal...

Great, this will allow you to have all the data and miss nothing!

Post by **homerjnick** » Dec 06, 2012 2:21 pm

Great...and you are saying if I failback to the original VM and it is STILL corrupted then there is an "undo failback" option?

I'll need to read up on that as I'm not sure what that is...is it a case that the replica comes live again and all changes made whilst failing back are discarded? So the replica carries on since it is working...

Post by **foggy** » Dec 06, 2012 2:24 pm

You can read about that in the product user guide (p.54) in detail.

Post by **homerjnick** » Dec 14, 2012 9:59 pm

I'm just failing back just now...bit confused as to what is going on...because my original VM was corrupted I had deleted it and restored a backup...thus when I failed back it complained the original VM was not there but I could map to the restored VM fine...

It then calculated disk digests then replicated "RP Harddisks"...job done or so I thought...

It then came up with powering down the replica and then started replicating the hard drives again????

I really hope all the live data on my replica is ok!

Post by **dellock6** » Dec 14, 2012 10:06 pm

Well, when you do a failback, you are actually getting a VM in DR back to your production site.
After the replica is completed, the production VM needs to be protected again, which is why it starts replicating it to DR again. My suggestion is, after failback is completed, and you see the production VM powered up and the DR replica powered down, keep your replica job disabled, do all your checks on the production VM, and then you can safely start replicating it again.

Luca.

Post by **homerjnick** » Dec 14, 2012 10:21 pm

Thanks Luca...but why is there a double replication? My replication job is indeed disabled but am confused as to why it replicated "RP Harddisks" before the power down of the replica which I expected (replica HD's replicated back to Production) but in the failback job it started replicating harddisks after it powered down the replica...

Post by **tsightler** » Dec 15, 2012 12:25 am

Notice that the "RP Hard disk" portion happens while the replica is still powered on. This can help to minimize downtime during the failback since it's possible you have been running on the replica for hours, days, or even weeks before you choose to failback, so it could be a LOT of data. This stage replicates the VM back to the original VM while the replica is still powered on, however, once this portion is complete, to completely failback the final state of the replica, it must power off the replica VM and replicate the final changes that occurred during the "RP Hard disk" phase. Typically, this portion is much smaller/shorter than the "RP Hard disk" portion, minimizing the time that the server must be offline.

For example, in my most recent failback, the entire operation took about 18 minutes, of which the final "powered off" replication took about 1.5 minutes and the total time spent from power off of the replica to power on of the failback replica was about 4 minutes. Without this intermediate phase, the system would have had to power off the replica and it would have taken 14 minutes. So the difference is 4 minutes of downtime vs 14 minutes, although the overall process took slightly longer.

Also, Luca is exactly right, you should verify that the failback is working as expected before starting the replication job again. This is supported in the product because the replica job actually stays disabled after a failback until you "commit" the failback, at which point the normal replica job will be re-enabled.

Post by **homerjnick** » Dec 15, 2012 8:52 am

Perfect answer in theory...thank you...but something doesn't seem right...before I started the failback I disconnected the virtual NIC so no user could use it whilst I failback...

Replicating the RP Harddisks took 3 hours but now after powerdown of the replica and 8 hours later it is still replicating...as you say it is only supposed to be replicating the changes from the time the failover started but no data has changed since the VM is not on the network...

Post by **dellock6** » Dec 15, 2012 5:16 pm

Nick, so instead of diving deep into the failback procedure, which seems to work well, you should concentrate on the excessive time it takes to do the final replica once the VM is powered down. As Tom explained, the last piece of the replica should be quick; even if during a powered-off state the VM has no way to use CBT records, Veeam still has the metadata information about changed bits.
Can you take a look at the failback job and see the different steps it's doing, and see their times?

Post by **homerjnick** » Dec 15, 2012 6:10 pm

Yeah the replica HD's took 3 hours to replicate pre-replica-power down....then the replica powered down and replicated again the HD's which took 12 hours then the job completed.

All is good, the production VM is now on and looks good, after 30 mins of testing I can confirm server is good and data intact and up-to-date. I have committed the failback.

Now I am doing a full backup...as ALL my backups were corrupt on this VM it was the replica that saved it...this success of Veeam will feature in our staff newsletter!!!

We have a 10 meg link to our DR site and it is a 500GB VM...hence the times involved...glad to say all went well!!! But as you say, why did the last part of the replication take so long?
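To put those times in perspective, this sketch computes the upper bound on raw data that the link could have carried in each phase. It assumes "10 meg" means 10 megabits per second, decimal gigabytes, and no compression on the wire; `max_gb_transferred` is a hypothetical helper, not anything from Veeam.

```python
def max_gb_transferred(hours: float, link_mbit_per_s: float) -> float:
    """Upper bound on raw data (decimal GB) a link can carry in the given time."""
    seconds = hours * 3600
    return link_mbit_per_s * 1e6 * seconds / 8 / 1e9  # bits -> bytes -> GB


# Phase durations as reported in the post above.
phase_hours = {"before replica power-off": 3, "after replica power-off": 12}

for phase, h in phase_hours.items():
    print(f"{phase}: at most {max_gb_transferred(h, 10):.1f} GB")

total = max_gb_transferred(sum(phase_hours.values()), 10)
print(f"total: at most {total:.1f} GB on the wire")
```

Under these assumptions the 15 hours could have carried only a fraction of a 500GB VM, which suggests the failback moved differences rather than the full disks. If throttling was active, the real figure would be lower still.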

I had throttling enabled between 9am-6pm weekdays but at weekends Veeam can have the whole pipe...but it looked like throttling was still being applied...I'll look into that...

Thanks to all who replied on here! Not sure if Veeam would be interested in a copy of our staff newsletter if it gets a mention?

Post by **Gostev** » Dec 15, 2012 6:58 pm

homerjnick wrote:Thanks to all who replied on here! Not sure if Veeam would be interested in a copy of our staff newsletter if it gets a mention?

Absolutely! This would certainly make a great internal case study for us!
Please forward it to me once out (email is my forum nick at veeam.com).

Thanks and congratulations on building a DR strategy that worked in need!

Post by **tsightler** » Dec 17, 2012 4:47 pm

Just as a followup, I think my initial description of the failback isn't exactly as it happens. In testing it appears that the first step of failback, when it replicates "RP Disk", is basically to return the failback target to the known state of the most recent replica restore point prior to the failover; then, once the VM is powered off, all changes are replicated back. This is probably why this failback took longer, although it can vary based on how different the failback target VM is from the restore point.

I think there's some room for continued enhancement here. Ideally it would work the way I initially described: effectively, create a "failback restore point", replicate all changes, then finally have a point where you fail back with only the most recent changes, perhaps even continuing this cycle until the amount of data is below a threshold to keep the failback time to a minimum. Of course, you can always do this with a manual replication job, with replica mapping, back the other direction rather than performing a failback.
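The "continue the cycle until the data is below a threshold" idea can be sketched as a small simulation. This is not how the product behaves; it is a toy model with made-up parameters (constant change rate on the running replica, constant link speed), and `plan_failback_passes` is a hypothetical name.

```python
def plan_failback_passes(initial_delta_gb, change_rate_gb_per_h, link_gb_per_h,
                         threshold_gb, max_passes=10):
    """Return (sizes of the online passes in GB, delta left for the offline pass in GB)."""
    if change_rate_gb_per_h >= link_gb_per_h:
        raise ValueError("changes accumulate faster than the link can drain them")
    passes = []
    pending = initial_delta_gb
    while pending > threshold_gb and len(passes) < max_passes:
        hours = pending / link_gb_per_h  # time to ship the current delta
        passes.append(pending)
        # While this pass ran, the still-running replica produced new changes.
        pending = change_rate_gb_per_h * hours
    return passes, pending


# Assumed figures: 50 GB initial delta, 0.5 GB/h of new changes,
# 4.5 GB/h link (10 Mbit/s at full rate), stop when under 0.1 GB.
passes, final_delta = plan_failback_passes(50, 0.5, 4.5, 0.1)
for i, size in enumerate(passes, start=1):
    print(f"online pass {i}: {size:.2f} GB")
print(f"offline pass: {final_delta:.3f} GB, "
      f"about {final_delta / 4.5 * 60:.1f} minutes of downtime for the transfer")
```

Each online pass shrinks the remaining delta by the ratio of change rate to link speed, so only the last small delta has to be copied while the VM is powered off.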

Still, good to know that everything worked for you.
