Hey guys, I'd like to chime in here, albeit a bit late. Hopefully someone sees this! Please excuse the long post, but I just want to be clear and detailed.
Environment and situation:
Tenant replicating to a cloud connect replication environment (hosted by me)
Test VM: 350 GB provisioned, only about 50 GB actually used
100 Mbit/s pipe between tenant and service provider
I replicated the test VM in a few hours with no issue. Did a partial failover, which worked spectacularly. It only took about 5 minutes from starting the failover to being able to ping the VM at the DR site. To simulate some data change, I downloaded a 2.8 GB ISO (and for fun, it was a Veeam ISO) and left it in my downloads folder. I let the VM run for a few hours, just sitting there, not doing anything.
I went to fail back, and I was pretty surprised by the result, and not in a good way. I did a quick rollback, using CBT and restoring the VM to the original location; however, for 2.8 GB worth of changes, my VM was still down for nearly an hour! What puzzles me is that the job log says "Replicating restore point for Hard disk 1 (350.0 GB) 2.6 GB processed", and that part only took 3 minutes, 57 seconds. That 2.6 GB lines up closely with the size of the ISO I downloaded.
What I really don't understand is that the next phase, "Replicating changes Hard disk 1 (350.0 GB) 37.6 GB processed", took 43 minutes, all while the VM was powered off. I noticed it averaged only 15-20 MB/s the whole time.
Link to a picture of the failback log: https://ibb.co/hyAsy7
My questions are:
1. What in the world is it doing while "Replicating changes Hard disk 1"? I thought a large portion of the changed data was copied during the "Replicating restore point for Hard disk 1" phase. If that isn't the case, what does that phase do then?
2. Why did it take so long for a VM that presumably only had 2.8 GB of changed data? I understand logs and whatnot make changes and take space, but even if we double that, it should have only taken roughly 5 minutes at an average speed of 18 MB/s (see the quick math below).
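For what it's worth, here's the back-of-the-envelope math I'm doing. Just a rough sketch in Python; the 2x fudge factor for logs and the speeds are my own assumptions based on what I observed:

# Back-of-the-envelope transfer times; nothing Veeam-specific, just arithmetic.
GB = 1024  # MB per GB

# What I expected: the 2.8 GB ISO plus a 2x fudge factor for logs etc., at the ~18 MB/s I saw
expected_min = (2.8 * GB * 2) / 18 / 60
print(round(expected_min))   # ~5 minutes

# What actually happened: the "Replicating changes" phase reported 37.6 GB at ~15 MB/s
actual_min = (37.6 * GB) / 15 / 60
print(round(actual_min))     # ~43 minutes, which matches the job log

So the transfer speed itself is consistent with the 43 minutes I saw; the real question is why the "Replicating changes" phase had to move 37.6 GB at all when only ~3 GB actually changed.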
I guess I'm just confused as to what is happening, and why we can't leverage the awesome replication features Veeam has built in to essentially do a few reverse replication passes (as someone here suggested), power off the replica, do one more quick replication pass (5-10 minutes tops), and then power on at the original site.
Failing over is nice and simple, but quite frankly, I'm terrified to use it because of the implications of failing back. I don't want to have to tell my client, "We can fail you over, but honestly I have no idea how long you'll be down while we fail you back, and I have no idea when Veeam will actually decide to take you down to finish the failback." If a 50 GB VM with roughly 3 GB of changed data took that long, what happens if we have to fail over a client's Exchange server for an extended period? I have no idea how long it would take for a 3 TB Exchange server with potentially a week's worth of changed data on it, and I don't want to find out the hard way.
The only workaround I can see is to build a VPN tunnel between our DR site and the customer's network, then use another Veeam server to replicate the changes from our DR site back to the customer site, power off the replicas during a maintenance window, run one more replication pass to catch the last changes, and then power the customer's servers on in the original production environment. That would give me the flexibility to decide exactly when they go down, while also keeping the downtime to a minimum. However, I absolutely know my networking guys are going to ask, "Why didn't we do that in the first place, and why don't we just do the normal replications from the tenant that way as well?" And honestly, I don't have an answer, because it seems to make more sense to do it that way than to deal with the uncertainty and mystery around Veeam's built-in failback process that we apparently have to use for Cloud Connect.
Now, if I've missed something, and there's some slick feature I'm not aware of, or if what I'm experiencing is out of the ordinary after Update 2 (which added the option to use CBT to skip calculating disk digests; we're on U3, by the way), please let me know. I would love to know about it. Actually, I'm practically begging to know about it at this point.
Make me look dumb, I don't care, I just want to know.
If this is normal behavior, then please take this as my feature request to continue development here and use the same mechanism that already works wonderfully for replicating hot data over, but for the replication back. If anyone has a better suggestion than my VPN tunnel idea for a Cloud Connect environment, please chime in.