In some situations, it can take up to 48-72hrs to fully complete and return from a planned failover/failback operation.
Request:
Make the "Calculating Original Signature Hard Disk" operation optional for a planned failback.
Background:
Veeam has designed a feature called "Planned Failover" into their product. This is a great feature that lets you operate a near-highly-available environment without the cost of shared storage or vMotion. This is how the feature is described on the website (http://helpcenter.veeam.com/backup/80/v ... lover.html):
"If you know that your primary VMs are about to go offline, you can proactively switch the workload to their replicas. A planned failover is smooth manual switching from a primary VM to its replica with minimum interrupting in operation. You can use the planned failover, for example, if you plan to perform datacenter migration, maintenance or software upgrade of the primary VMs. You can also perform planned failover if you have an advance notice of a disaster approaching that will require taking the primary servers offline."
The Problem in Detail:
In many of the intended use cases, a "Planned Failover" is only as good as the planned failback that follows it. Veeam's Planned Failback feature currently has two noticeable limitations when used in practice.
1. When performing a failback after a "Planned Failover" operation, Veeam requires a task called "Calculating Original Signature Hard Disk" to be performed. This task reads the entire source VM from disk in order to verify the integrity of the VM data prior to failing back to it. Because of this costly read operation, the task takes too long to complete and places a heavy I/O load on the source host for the duration of the operation. Additionally, your RPOs may not be met if the failback process takes too long to complete.
Example - The time to complete this task depends on how large the VM is and how fast the storage is. If you have a 100GB VM and your storage can read at 200MB/s, a failback task will take about 9min, while placing a heavy I/O load on your primary storage during this time. Maybe 9min isn’t that bad. However, consider the scenarios that this is designed for, a common one being host maintenance (e.g. an upgrade to the hypervisor) where you have to move all VMs off of a host. In this scenario, say you have 15 x 100GB VMs and a single 2TB file server VM for a total of 3.5TB of data (probably a common workload for many). Once the primary host is back online and you attempt to perform a planned failback, this operation will take about 5hrs. If you have more data than that, say a large file server with 15TB of data, this operation will take you close to a full 24hrs! Add on top of this the fact that your backups are not being performed during this window either, unless you want to remap them to the new host, which by the way will require the same full read operation from disk.
2. After a planned failback has been completed, presumably you will want to restart replication of that VM to ensure you are protected. The first time this replication job runs, it performs a task called "Processing <Job Name>". During this operation, it reads the entire source VM from disk (yes, it has to do it again!). During this time, a heavy I/O load is placed on the host server (while all VMs are in production), negatively impacting the performance of the entire host. Additionally, no backups of the VMs can be performed during this time, further extending the window in which your RPO is not met, quite possibly past your defined policy.
Example - In the 15TB example discussed above, we would have no backups for 24hrs during the failback. This operation would add an additional 24hrs, so we would now be up to 48hrs with no backups before we were back to business as usual.
Note – I have not tested this scenario while backups were being performed, only replication. It stands to reason that after a planned failback, the backups would also need to perform this same "Calculating Original Signature Hard Disk" operation which would take you another 24hrs, totaling 72hrs with no backups performed. However, I have not confirmed this.
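For anyone who wants to plug in their own numbers, the arithmetic behind the examples above can be sketched in a few lines of Python. This is only a rough estimate: it assumes a sustained 200MB/s sequential read, 1GB = 1000MB, and that each operation (failback signature calculation, first replication run, and, unconfirmed, first backup run) reads the whole VM once. The function name and the pass count are my own, not anything from Veeam.

```python
def full_read_hours(size_gb: float, read_mb_per_s: float) -> float:
    """Hours needed to read size_gb once at a sustained read_mb_per_s (1GB = 1000MB)."""
    return size_gb * 1000 / read_mb_per_s / 3600


READ_MB_PER_S = 200  # storage read speed used in the examples above

# VM sizes in GB, taken from the examples in this post
scenarios = {
    "single 100GB VM": 100,
    "15 x 100GB VMs + 2TB file server": 15 * 100 + 2000,
    "15TB file server": 15000,
}

# Assumption: three full reads in total (failback, first replication,
# and the first backup run, which I have not confirmed).
FULL_READ_PASSES = 3

for name, size_gb in scenarios.items():
    per_pass = full_read_hours(size_gb, READ_MB_PER_S)
    total = per_pass * FULL_READ_PASSES
    print(f"{name}: {per_pass:.1f} hrs per full read, "
          f"{total:.1f} hrs for {FULL_READ_PASSES} reads")
```

With these inputs a single pass comes out to roughly 8 minutes, 4.9 hours and 20.8 hours respectively. I rounded the 15TB case up to a full day in the text, since real storage rarely sustains its best-case read speed while production VMs are running.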
The Solution:
While I’m not a Veeam engineer and I do not know how everything works under the covers, I see one logical flaw in the current design. The “Planned Failover / Failback” process isn’t designed with the assumption that this is a “planned” operation. In a planned operation, the source and replica VMs are identical at the time of failover, and the software should know this. Because they are identical at the time of failover, all that needs to be done is to track the changes made on the replica VM while it is failed over, and sync those changes back to the source. That’s that. There is no need to re-verify the source VM. I would like to see this implemented in the software as an option. For those who want to perform the expensive read operation to make doubly sure their data is intact, they can. For those who want to rely on CBT, give them the option. Likewise, when resuming replication of the VM (and potentially backup jobs), the software should use this same logic.
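To illustrate the idea, here is a toy sketch in Python of what I mean by syncing only the changes. It is not how Veeam or VMware CBT actually works internally; the `TrackedDisk` class, the `failback` function and the 4-byte block size are all made up for the example. The only point is that if both disks start out identical and every write on the replica is tracked, the failback only has to copy the touched blocks.

```python
BLOCK_SIZE = 4  # tiny, hypothetical block size so the example is easy to follow


class TrackedDisk:
    """Toy disk that records which blocks were written (a CBT-like idea)."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)  # private copy of the disk contents
        self.changed = set()         # indexes of blocks written since the last sync

    def write(self, offset: int, payload: bytes) -> None:
        if not payload:
            return
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write outside of disk")
        self.data[offset:offset + len(payload)] = payload
        first = offset // BLOCK_SIZE
        last = (offset + len(payload) - 1) // BLOCK_SIZE
        self.changed.update(range(first, last + 1))


def failback(source: bytearray, replica: TrackedDisk) -> int:
    """Copy only the changed blocks from replica to source; return bytes copied."""
    copied = 0
    for block in sorted(replica.changed):
        start = block * BLOCK_SIZE
        chunk = replica.data[start:start + BLOCK_SIZE]
        source[start:start + len(chunk)] = chunk
        copied += len(chunk)
    replica.changed.clear()
    return copied


# Source and replica are identical at the moment of the planned failover.
source = bytearray(b"AAAABBBBCCCCDDDD")
replica = TrackedDisk(source)

# Writes that happen while running on the replica.
replica.write(5, b"xy")   # touches block 1
replica.write(11, b"zz")  # spans blocks 2 and 3

copied = failback(source, replica)
print(f"copied {copied} of {len(source)} bytes, in sync: {source == replica.data}")
```

In this toy case only the three touched blocks are copied and the untouched first block is never read. On a multi-terabyte file server where only a small fraction of the data changed during the maintenance window, that difference is the whole point of the request.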