FEATURE REQUEST - Speed Up the Planned Failback Process

DrWhy · May 21, 2015 7:23 pm

Problem Overview:
In some situations, it can take up to 48-72hrs to fully complete and return from a planned failover/failback operation.

Request:
Make the "Calculating Original Signature Hard Disk" operation optional for a planned failback.

Background:
Veeam has designed a feature called "Planned Failover" into their product. This is a great feature that can allow you to operate a near-highly available environment without the cost of shared storage or vMotion. The feature is described on the website (http://helpcenter.veeam.com/backup/80/v ... lover.html) as follows:

"If you know that your primary VMs are about to go offline, you can proactively switch the workload to their replicas. A planned failover is smooth manual switching from a primary VM to its replica with minimum interrupting in operation. You can use the planned failover, for example, if you plan to perform datacenter migration, maintenance or software upgrade of the primary VMs. You can also perform planned failover if you have an advance notice of a disaster approaching that will require taking the primary servers offline."

The Problem in Detail:
In many of the intended use cases, a "Planned Failover" is only as good as the planned failback that follows it. Veeam's Planned Failback feature currently has two noticeable limitations when used in practice.

1. When performing a failback after a "Planned Failover" operation, Veeam requires a task called "Calculating Original Signature Hard Disk" to be performed. This task reads the entire source VM from disk in order to verify the integrity of the VM data prior to failing back to it. Because of this costly read operation, the task takes too long to complete and places a heavy I/O load on the source host for its entire duration. Additionally, your RPOs may not be met if the failback process takes too long to complete.

Example - The time to complete this task depends on how large the VM is and how fast the storage is. If you have a 100GB VM and your storage can read at 200MB/s, a failback task will take about 9min, while placing a heavy I/O load on your primary storage during this time. Maybe 9min isn’t that bad. However, consider the scenarios that this is designed for, a common one being host maintenance (e.g. an upgrade to the hypervisor) where you have to move all VMs off of a host. In this scenario, say you have 15 x 100GB VMs and a single 2TB file server VM for a total of 3.5TB of data (probably a common workload for many). Once the primary host is back online and you attempt to perform a planned failback, this operation will take about 5hrs. If you have more data than that, say a large file server with 15TB of data, this operation will take you close to a full 24hrs! On top of this, your backups are not being performed during this window either, unless you want to remap them to the new host, which, by the way, will require the same full read operation from disk.

2. After a planned failback has been completed, presumably you will want to restart replication of that VM to ensure you are protected. The first time this replication job runs, it performs a task called "Processing <Job Name>". During this operation, it reads the entire source VM from disk (yes, it has to do it again!). During this time, a heavy I/O load is placed on the host server (while all VMs are in production), negatively impacting the performance of the entire host. Additionally, no backups of the VMs can be performed during this time, further extending your RPO, quite possibly past your defined policy.

Example - In the example discussed above, we would have no backups for 24hrs. This operation would add an additional 24hrs, so we would now be up to 48hrs with no backups before we were back to business as usual.
Note – I have not tested this scenario while backups were being performed, only replication. It stands to reason that after a planned failback, the backups would also need to perform this same "Calculating Original Signature Hard Disk" operation which would take you another 24hrs, totaling 72hrs with no backups performed. However, I have not confirmed this.
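The arithmetic behind these estimates can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes a perfectly sustained sequential read at the 200MB/s quoted in the post and decimal units (1GB = 1000MB), and the function name is made up for illustration. Under those assumptions the results come out at roughly 8 minutes, 4.9 hours and 20.8 hours per full read, so the 9min, 5hrs and 24hrs figures in the post are rounded up.

```python
def full_read_hours(size_gb: float, read_mb_per_s: float) -> float:
    """Hours needed to read size_gb once at a sustained read_mb_per_s."""
    seconds = size_gb * 1000 / read_mb_per_s  # decimal units: 1 GB = 1000 MB
    return seconds / 3600


if __name__ == "__main__":
    rate = 200  # MB/s, the storage read speed assumed in the post
    workloads = [
        ("100GB VM", 100),
        ("15 x 100GB VMs + 2TB file server", 3500),
        ("15TB file server", 15000),
    ]
    for label, size_gb in workloads:
        print(f"{label}: {full_read_hours(size_gb, rate):.1f} h per full read")

    # Three consecutive full reads of the 15TB server: failback, first
    # replication pass, and (unconfirmed in the post) first backup pass.
    print(f"three passes: {3 * full_read_hours(15000, rate):.1f} h")
```

Real storage under production load will rarely sustain its peak read rate, so actual times are likely to be longer than this simple model suggests.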

The Solution:
While I’m not a Veeam engineer and I do not know how everything works under the covers, I see one logical flaw in the current design. The “Planned Failover / Failback” process isn’t designed with the assumption that this is a “planned” operation. In a planned operation, the source and replica VM are identical at the time of failover and the software should know this. Because they are identical at the time of failover, all that needs to be done is to track the changes made on the replica VM while it is failed over and sync those changes back to the source. That’s that. There is no need to re-verify the source VM. I would like to see this implemented in the software as an option. Those who want to perform the expensive read operation to make doubly sure their data is intact can do so. Those who want to rely on CBT should be given that option. Likewise, when resuming replication of the VM (and potentially backup jobs), the software should use this same logic.
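The idea of tracking changes on the replica and syncing only those blocks back can be illustrated with a toy model. This is a sketch of the general changed-block-tracking concept only, not of how Veeam or VMware CBT is implemented; the `TrackedDisk` class, the `sync_back` function and the tiny block size are all hypothetical, and writes are assumed to stay within the bounds of the disk.

```python
BLOCK = 4  # toy block size in bytes; real products use far larger blocks


class TrackedDisk:
    """A toy virtual disk that records which blocks have been written."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.changed = set()  # indexes of blocks modified since last sync

    def write(self, offset: int, payload: bytes) -> None:
        self.data[offset:offset + len(payload)] = payload
        first = offset // BLOCK
        last = (offset + len(payload) - 1) // BLOCK
        self.changed.update(range(first, last + 1))


def sync_back(replica: TrackedDisk, source: bytearray) -> int:
    """Copy only the tracked blocks from replica to source; return bytes copied."""
    copied = 0
    for idx in sorted(replica.changed):
        start, end = idx * BLOCK, (idx + 1) * BLOCK
        chunk = replica.data[start:end]
        source[start:start + len(chunk)] = chunk
        copied += len(chunk)
    replica.changed.clear()
    return copied


if __name__ == "__main__":
    source = bytearray(b"AAAABBBBCCCCDDDD")
    replica = TrackedDisk(bytes(source))  # identical at planned failover time
    replica.write(5, b"xy")               # changes made while failed over
    print(sync_back(replica, source), "bytes copied;",
          "in sync" if source == replica.data else "out of sync")
```

The sketch only works because both disks start out identical and every write goes through the tracker, which is exactly the assumption the proposal relies on.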

Gostev · May 21, 2015 10:17 pm

DrWhy wrote: This task reads the entire source VM from disk in order to verify the integrity of the VM data.

Actually, it's not for verifying the integrity of VM data. Rather, it is used to understand the contents of virtual disks, so that the failback process knows what virtual disk blocks it needs to synchronize. This is why this process is mandatory and cannot simply be made optional, although in theory it could be replaced by a CBT query.
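To make the distinction concrete, here is a minimal sketch of how per-block digests can tell a sync process which blocks differ. It is an illustration of the general technique, not Veeam's actual digest format or algorithm: the function names, the SHA-256 choice and the block size are all assumptions. The key point is that `block_digests` has to read every byte of the disk, which is where the long run time comes from.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # hypothetical 1 MiB block; not Veeam's actual value


def block_digests(disk: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash every block; this is the step that needs a full read of the disk."""
    return [
        hashlib.sha256(disk[i:i + block_size]).digest()
        for i in range(0, len(disk), block_size)
    ]


def blocks_to_sync(source_digests: list, replica_digests: list) -> list:
    """Indexes of blocks whose digests differ between source and replica."""
    differing = [
        i
        for i, (s, r) in enumerate(zip(source_digests, replica_digests))
        if s != r
    ]
    # Blocks present on only one side (disk was resized) also need syncing.
    shorter, longer = sorted((len(source_digests), len(replica_digests)))
    return differing + list(range(shorter, longer))


if __name__ == "__main__":
    source = b"a" * 8 + b"b" * 8
    replica = b"a" * 8 + b"c" * 8
    print(blocks_to_sync(block_digests(source, 8), block_digests(replica, 8)))
```

A changed-block query would yield the same kind of list of block indexes without hashing the whole disk, at the cost of trusting the tracking mechanism.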

DrWhy · May 21, 2015 10:55 pm

Hey Gostev, thank you for your comment. I believe you are right and that a CBT query would accomplish this. Is this an improvement that you and the development team would seriously consider implementing anytime soon? I am currently evaluating Veeam, and this limitation is one that is making it hard for us to meet our requirements. It would greatly impact my decision if I knew this improvement was in the pipeline for a future release. Thanks again.

DrWhy · May 21, 2015 10:55 pm

Additionally - I believe Unitrends Reliable DR (formerly PHD Virtual) is doing a CBT query, as their product does not require this expensive read operation upon failback or when you go to resume the replication job.

Gostev · May 21, 2015 11:14 pm

Hi, Caleb. We have not considered this feature yet, because you are the first to ask for it. We will need to perform some research on the potential reliability implications of this approach before we can put this feature into the pipeline. But even in the best case scenario, I don't expect this feature to make it into the next release, because the feature set for that one was finalized many months ago (besides, with a single request so far we cannot prioritize this feature high enough). So, if this feature is critical for you for some imminent project, you should go with another solution that meets your needs. Thanks!

DrWhy · May 21, 2015 11:49 pm

I understand, thanks Gostev. While I may be the first to officially make this specific request, there are others who are experiencing the effects of this in the following thread.

http://forums.veeam.com/vmware-vsphere- ... 11581.html

I have several more questions:
1. Can you comment on why, after the failback has completed, the next replication job for that VM has to do a second full read of the entire source VM?

1a. It appears that there is some re-reading of the replica VM after replication has resumed as well. I'm not sure about this, but I saw a noticeable spike in disk utilization while this job ran. Can you clarify what's going on here?

2. Aside from replication, if we have backups being performed of these VMs, once a failback has been performed, will those backups have to do yet a third full read of the entire source VM (or a re-read of the backup files, for that matter)?

Gostev · May 22, 2015 12:37 am

1. This is because CBT cannot be used in the first job pass for a VM that had its disks modified by an external process. The job has to discover the baseline virtual disk state to apply future changed block information to, and for that it needs to read the entire VM, because it may already have some changes in its virtual disks since the failback was completed.

1a. What do you mean by "resume" when it comes to the replication job?

2. Yes for the source VM (see 1), No for backup files.

DrWhy · May 22, 2015 4:07 pm

1a. This is related to question one. After a VM has been failed back, it's assumed that you will want to start performing the replication job again. As you have stated, the first time this replication job is run, a full read of the source VM has to be performed. Also, during this process, I've observed that the replica VM has to be read from disk. I'm not sure if this is a full read of the replica VM or what is going on. Can you please clarify how much of the replica VM must be read from disk during this process?

DrWhy · May 28, 2015 12:11 am

Hey Gostev, does the additional info help you answer my questions?

DrWhy · May 29, 2015 4:21 pm

After you have answered the question above, can you please respond to the following as well? I want to make sure I understand everything correctly.

Is the following accurate? Please correct me if I’m wrong.
1. During Failback - The entire Source VM must be fully read from disk, prior to the failback completing.
2. After Failback - The entire Source VM must be fully read from disk, the first time the replication job runs for that VM.
3. After Failback - The entire Replica VM must be fully read from disk, the first time the replication job runs for that VM.
4. After Failback - The entire Source VM must be fully read from disk, the first time the backup job runs for that VM.

Gostev · May 29, 2015 4:25 pm

Team, please confirm with the devs and respond.

foggy · May 29, 2015 5:02 pm

1. Correct (discussed above).
2. Correct (discussed above).
3. Reading of the entire replica VM should not be required, if its digests are updated during failback (which would be reasonable). I will check this on Monday.
4. Correct (discussed above).

DrWhy · May 29, 2015 5:06 pm

foggy wrote: 3. Reading of the entire replica VM should not be required, if its digests are updated during failback (which would be reasonable). I will check this on Monday.

I have observed noticeable disk usage on the replica during this process. While I don't know if it's doing the entire thing, there is certainly noticeable disk usage going on. Thanks for looking into this.

foggy · Jun 03, 2015 2:21 pm

The replica VM shouldn't be read during the first replication cycle after failback, unless its digests are missing for some reason. Could you please check whether there's a corresponding record (something like "Digests are missing, calculating digests...") in the job session log for that run?

DrWhy · Jun 09, 2015 9:30 pm

I have not had a chance to do this yet. And sadly, I'm not sure if I will ever get to it as Veeam doesn't meet our requirements because of the slow failback process. I appreciate the response by your team and look forward to seeing this feature implemented into Veeam in the future.

DrWhy · May 02, 2016 7:41 pm

Hey Gostev, has there been any movement on this feature request? This is the one thing that's keeping us from being able to use Veeam for our large file servers.

Gostev · May 02, 2016 8:12 pm

In fact yes, there was. This feature has made it to the highest priority features list, and is on the edge of making it into 9.5 (fingers crossed).

DrWhy · May 16, 2016 6:55 pm

That's fantastic news!! Can't wait!

DrWhy · Jun 21, 2016 11:02 pm

Did this get into 9.5?

rreed · Jun 22, 2016 1:32 pm

A very big +1. I've just very recently started playing around w/ Replication, and w/ just a simple ~14GB test VM taking upwards of half an hour to fail over in either direction, I'm afraid to present this as a viable alternative to VMware's SRM to management.

DrWhy · Jun 22, 2016 4:54 pm

The only thing that scares me about this solution is that it will presumably rely on VMware CBT, which has been notoriously unreliable in vSphere 6. I'm curious to hear what Gostev's thoughts on this are.

nunciate · Sep 19, 2016 7:12 pm

I haven't read all of the posts here so hopefully this isn't a repeat. I was performing some tests of planned failover jobs this weekend in preparation for our yearly BCP test. I noticed the long failback times and this gave me great concern. We have several VMs that need to be failed back with changes after our BCP test is complete. One of these is a very large file server, about 10TB in size (maybe 5TB when you consider deleted blocks). There is absolutely no way I can fail that over to DR and then fail back with changes. It would take up to 48 hours to recalculate disk digests on the way back. I typically rely on Windows DFSR for file servers; however, it doesn't work reliably with this server for some reason.

One flaw I see in this process is that the job shuts down the replica in DR during the fail back, then it performs a replication in which it calculates disk digests.
Why does the job need to shut down the server in DR first? Why can't it simply run the replication first while keeping the server online? I know you can do that when doing a normal replication job while also calculating digests so why isn't it the same coming back? If you simply changed the process to leave the VM online in DR while the job replicates back that would help a lot. I could then leave the VM online in DR while it is replicating back. Once that first pass is done the job can shut down the VM, perform a second pass replication which should be quick and then power back up in production. It isn't perfect but it is better than hours or even days of down time just to do a fail back operation.
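The benefit of the two-pass approach proposed here can be roughly estimated. The sketch below is a simplified model under loudly stated assumptions: the first pass reads the whole disk while the VM stays online, the second (offline) pass only has to transfer the data that changed during the first pass, and both run at the same sustained rate. The function name and all numbers in the usage example (200MB/s, 5GB of changes per hour) are hypothetical and not taken from any Veeam documentation.

```python
def two_pass_hours(size_gb: float, rate_mb_s: float, change_gb_per_hour: float):
    """Return (online first-pass hours, offline second-pass hours)."""
    # Pass 1: full read and transfer while the VM keeps running in DR.
    first_pass_h = size_gb * 1000 / rate_mb_s / 3600
    # Data modified while pass 1 was running has to be sent again.
    changed_gb = change_gb_per_hour * first_pass_h
    # Pass 2: VM is shut down, only the changed data is transferred.
    second_pass_h = changed_gb * 1000 / rate_mb_s / 3600
    return first_pass_h, second_pass_h


if __name__ == "__main__":
    online_h, offline_h = two_pass_hours(
        size_gb=10_000, rate_mb_s=200, change_gb_per_hour=5
    )
    print(f"online first pass: {online_h:.1f} h")
    print(f"offline second pass (downtime): {offline_h * 60:.1f} min")
```

In this model the total work is slightly larger than a single offline pass, but the downtime shrinks from the full first-pass duration to the time needed to send the accumulated changes.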

veremin · Sep 20, 2016 10:09 am

Have you thought about replicating the failed-over VMs back and then executing a planned failover operation for them? That should allow you to achieve the described scenario. Thanks.

nunciate · Sep 20, 2016 12:23 pm

Yea that was going to be my next test. Just do a planned fail over then a permanent fail over to DR and then reverse the process.

veremin · Sep 21, 2016 8:32 am

Correct, that approach should allow you to achieve more or less the same goal you're after. Also, it should minimize time needed to transfer final changes. Thanks.

DrWhy · Sep 21, 2016 3:33 pm

Is there any update as to whether the feature request mentioned in this thread will be implemented in Veeam 9.5?

foggy · Sep 21, 2016 4:14 pm

As far as I know, this was postponed until the next major release.

DrWhy · Sep 21, 2016 4:18 pm

Got it, thanks for the update.

DrWhy · Jan 19, 2017 9:44 pm

Hello, can you please provide another update on the status of this feature request?

veremin · Jan 20, 2017 12:05 pm

It didn't make it into the release, having been superseded by features with higher priority. Thanks.
