FEATURE REQUEST - Speed Up the Planned Failback Process

by DrWhy » Thu May 21, 2015 7:23 pm

Problem Overview:
In some situations, it can take up to 48-72hrs to fully complete and return from a planned failover/failback operation.

Request:
Make the "Calculating Original Signature Hard Disk" operation optional for a planned failback.

Background:
Veeam has designed a feature called "Planned Failover" into their product. This is a great feature that can allow you to operate a near-highly available environment without the cost of shared storage or vMotion. This is how the feature is described on the website (http://helpcenter.veeam.com/backup/80/vsphere/planned_failover.html):
"If you know that your primary VMs are about to go offline, you can proactively switch the workload to their replicas. A planned failover is smooth manual switching from a primary VM to its replica with minimum interrupting in operation. You can use the planned failover, for example, if you plan to perform datacenter migration, maintenance or software upgrade of the primary VMs. You can also perform planned failover if you have an advance notice of a disaster approaching that will require taking the primary servers offline."


The Problem in Detail:
In many of the intended use circumstances, a "Planned Failover" is only as good as a planned failback. Veeam's Planned Failback feature currently has two noticeable limitations when used in practice.

1. When performing a failback after a "Planned Failover" operation, Veeam requires a task called "Calculating Original Signature Hard Disk" to be performed. This task reads the entire source VM from disk in order to verify the integrity of the VM data, prior to failing back to it. Because of this costly read operation, the task takes too long to complete and places a heavy i/o load on the source host for the duration of the operation. Additionally, your RPOs may not be met during this time if the failback process takes too long to complete.

Example - The time to complete this task depends on how large the VM is and how fast the storage is. If you have a 100GB VM and your storage can read at 200MB/s, a failback task will take about 9min, while placing a heavy i/o load on your primary storage during this time. Maybe 9min isn’t that bad, though. However, consider the scenarios that this is designed for, a common one being host maintenance (i.e. an upgrade to the hypervisor) where you have to move all VMs off of a host. In this scenario, say you have 15 x 100GB VMs and a single 2TB File Server VM for a total of 3.5TB of data (probably a common workload for many). Once the primary host is back online and you attempt to perform a planned failback, this operation will take about 5hrs. If you have more data than that, say a large file server with 15TB of data, this operation will take you close to a full 24hrs! On top of this, your backups are not being performed during this window either, unless you want to remap them to the new host, which, by the way, will require the same full read operation from disk.
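
For anyone who wants to plug in their own numbers, here is a quick back-of-the-envelope sketch of the arithmetic in the example above. It assumes a sustained sequential read rate and 1GB = 1000MB, and the function name is made up for illustration. The figures in the example are rounded up, so the output comes out slightly lower.

```python
def full_read_hours(size_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to read size_gb of data at throughput_mb_s (assumes 1 GB = 1000 MB)."""
    seconds = size_gb * 1000 / throughput_mb_s
    return seconds / 3600


# Workloads from the example, at an assumed 200MB/s sustained read rate
workloads = [
    ("single 100GB VM", 100),
    ("15 x 100GB VMs + 2TB file server", 15 * 100 + 2000),
    ("15TB file server", 15000),
]

for label, size_gb in workloads:
    print(f"{label}: {full_read_hours(size_gb, 200):.1f} hrs")
```

Real throughput will be lower when the host is also serving production i/o, so treat these as best-case numbers.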

2. After a planned failback has been completed, presumably, you will want to re-start your replication of that VM to ensure you are protected. The first time this replication job runs, it performs a task called "Processing <Job Name>". During this operation, it reads the entire source VM from disk (yes, it has to do it again!). During this time, a heavy i/o load is placed on the host server (while all VMs are in production), negatively impacting the performance of the entire host. Additionally, during this time, no backups of the VMs can be performed, further extending your RPO, quite possibly past your defined policy.

Example - In our example discussed above, we would have no backups for 24hrs. This operation would add an additional 24hrs, and we would now be up to 48hrs with no backups before we were back to business as usual.
Note – I have not tested this scenario while backups were being performed, only replication. It stands to reason that after a planned failback, the backups would also need to perform this same "Calculating Original Signature Hard Disk" operation which would take you another 24hrs, totaling 72hrs with no backups performed. However, I have not confirmed this.

The Solution:
While I’m not a Veeam engineer and I do not know how everything works under the covers, I see one logical flaw in the current design. The “Planned Failover / Failback” process isn’t designed with the assumption that this is a “planned” operation. In a planned operation, the source and replica VM are identical at the time of failover and the software should know this. Because they are identical at the time of failover, all that needs to be done is to track the changes made on the replica VM during the time of the failover, and sync those changes back to the source. That’s that. There is no need to re-verify the source VM. I see implementing this in the software as an option. For those who want to perform the expensive read operation to make double sure their data is intact, they can. For those who want to rely on CBT, give them the option. Likewise, when resuming replication of the VM (and potentially backup jobs), the software should use this same logic.
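
To make the idea concrete, here is a toy sketch of "track the changes on the replica, then sync only those back". This is only an illustration of the concept, not how Veeam or VMware CBT actually work internally. The block size, the in-memory "disks" and both function names are made up for the example.

```python
BLOCK_SIZE = 4  # bytes per block, tiny on purpose for the illustration


def write_tracked(disk: bytearray, changed: set, offset: int, data: bytes) -> None:
    """Write data to the disk and record which blocks were touched."""
    disk[offset:offset + len(data)] = data
    first = offset // BLOCK_SIZE
    last = (offset + len(data) - 1) // BLOCK_SIZE
    changed.update(range(first, last + 1))


def sync_changed_blocks(replica: bytearray, source: bytearray, changed: set) -> int:
    """Copy only the changed blocks from replica back to source; return bytes copied."""
    copied = 0
    for block in sorted(changed):
        start = block * BLOCK_SIZE
        end = min(start + BLOCK_SIZE, len(replica))
        source[start:end] = replica[start:end]
        copied += end - start
    return copied


source = bytearray(b"AAAABBBBCCCCDDDD")
replica = bytearray(source)  # identical at the moment of the planned failover
changed = set()

# Writes made while production runs on the replica
write_tracked(replica, changed, 5, b"xyz")

# Failback: only the tracked blocks are copied, the source is never fully read
copied = sync_changed_blocks(replica, source, changed)
print(f"copied {copied} of {len(source)} bytes")
```

The whole scheme depends on the source not being modified while it is offline, which is exactly the assumption a "planned" operation should allow.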
DrWhy
Enthusiast
 
Posts: 38
Liked: 2 times
Joined: Tue May 12, 2015 7:05 pm
Full Name: Caleb

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by Gostev » Thu May 21, 2015 10:17 pm

DrWhy wrote:This task reads the entire source VM from disk in order to verify the integrity of the VM data.

Actually, it's not for verifying the integrity of VM data. Rather, it is used to understand the contents of virtual disks, so that the failback process knows what virtual disk blocks it needs to synchronize. This is why this process is mandatory and cannot be made optional, although in theory it could be replaced by a CBT query.
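
As a rough illustration only (this is not the product's actual implementation), a signature-based comparison can be sketched like this: read every block of both disks, hash it, and synchronize only the blocks whose hashes differ. The block size, the hash function and the function names are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # bytes per block, tiny on purpose for the illustration


def block_digests(disk: bytes) -> list:
    """Read the whole disk and return one digest per block (the expensive full read)."""
    return [
        hashlib.sha256(disk[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(disk), BLOCK_SIZE)
    ]


def blocks_to_sync(source_digests: list, replica_digests: list) -> list:
    """Indexes of the blocks whose contents differ between source and replica."""
    return [
        index
        for index, (src, rep) in enumerate(zip(source_digests, replica_digests))
        if src != rep
    ]


source = b"AAAABBBBCCCCDDDD"
replica = b"AAAABxyzCCCCDDzz"  # blocks 1 and 3 changed while running on the replica

to_sync = blocks_to_sync(block_digests(source), block_digests(replica))
print(to_sync)
```

The cost is in computing the digests, since that touches every block of the disk regardless of how little has changed.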
Gostev
Veeam Software
 
Posts: 21390
Liked: 2349 times
Joined: Sun Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by DrWhy » Thu May 21, 2015 10:55 pm

Hey Gostev, thank you for your comment. I believe you are right and that a CBT query would accomplish this. Is this an improvement that you and the development team would seriously consider implementing anytime soon? I am currently evaluating Veeam, and this limitation is one that is making it hard for us to meet our requirements. It would greatly impact my decision if I knew this improvement was in the pipe for a future release. Thanks again.

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by DrWhy » Thu May 21, 2015 10:55 pm

Additionally - I believe Unitrends Reliable DR (formerly PHD Virtual) is doing a CBT query, as their product does not require this expensive read operation upon failback, nor when you go to resume the replication job.

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by Gostev » Thu May 21, 2015 11:14 pm

Hi, Caleb. We have not considered this feature yet, because you are the first to ask for it. We will need to perform some research on potential reliability implications of this approach before we can put this feature into the pipeline. But even in the best case scenario, I don't expect this feature making it into the next release, because the feature set for that one was finalized many months ago (besides, with a single request so far we cannot prioritize this feature high enough). So, if this feature is critical for you for some imminent project, you should go with another solution that meets your needs. Thanks!

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by DrWhy » Thu May 21, 2015 11:49 pm

I understand, thanks Gostev. While I may be the first to officially make this specific request, there are others who are experiencing the effects of this in the following thread.

vmware-vsphere-f24/replication-failover-and-failback-t11581.html

I have several more questions:
1. Can you comment on why, after the failback has completed, the next replication job for that VM has to do a 2nd full read of the entire source VM?

1a. It appears that there is some re-reading of the replica VM after replication has resumed as well. I'm not sure about this but saw a noticeable spike in disk utilization while this job ran. Can you clarify what's going on here?

2. Aside from replication, if we have backups being performed of these VMs, once a failback has been performed, will those backups have to do yet a 3rd re-read of the entire source VM (or a re-read of the backup files, for that matter)?

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by Gostev » Fri May 22, 2015 12:37 am

1. This is because CBT cannot be used in the first job pass for a VM that had its disks modified by an external process. The job has to discover the baseline virtual disk state to apply future changed blocks information to, and for that it needs to read the entire VM, because it may already have some changes in its virtual disks since the failback was completed.

1a. What do you mean by "resume" when it comes to a replication job?

2. Yes for the source VM (see 1), No for backup files.

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by DrWhy » Fri May 22, 2015 4:07 pm

1a. This is related to question one. After a VM has been failed back, it's assumed that you will want to start performing the replication job again. As you have stated, the first time this replication job is run, a full read of the source VM has to be performed. Also, during this process, I've observed that the replica VM has to be read from disk. I'm not sure if this is a full read of the replica VM or what is going on. Can you please clarify how much of the replica VM must be read from disk during this process?

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by DrWhy » Thu May 28, 2015 12:11 am

Hey Gostev, does the additional info help you answer my questions?

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by DrWhy » Fri May 29, 2015 4:21 pm

After you have answered the question above, can you please respond to the following as well? I want to make sure I understand everything correctly.

Is the following accurate? Please correct me if I’m wrong.
1. During Failback - The entire Source VM must be fully read from disk, prior to the failback completing.
2. After Failback - The entire Source VM must be fully read from disk, the first time the replication job runs for that VM.
3. After Failback - The entire Replica VM must be fully read from disk, the first time the replication job runs for that VM.
4. After Failback - The entire Source VM must be fully read from disk, the first time the backup job runs for that VM.

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by Gostev » Fri May 29, 2015 4:25 pm

Team, please confirm with the devs and respond.

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by foggy » Fri May 29, 2015 5:02 pm

1. Correct (discussed above).
2. Correct (discussed above).
3. Reading of the entire replica VM should not be required, if its digests are updated during failback (which would be reasonable). I will check this on Monday.
4. Correct (discussed above).
foggy
Veeam Software
 
Posts: 14728
Liked: 1078 times
Joined: Mon Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by DrWhy » Fri May 29, 2015 5:06 pm

foggy wrote:3. Reading of the entire replica VM should not be required, if its digests are updated during failback (which would be reasonable). I will check this on Monday.

I have observed noticeable disk usage on the replica during this process. While I don't know if it's doing the entire thing, there is certainly noticeable disk usage going on. Thanks for looking into this.

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by foggy » Wed Jun 03, 2015 2:21 pm

The replica VM shouldn't be read during the first replication cycle after failback, unless its digests are missing for some reason. Could you please check whether there's a corresponding record (something like "Digests are missing, calculating digests...") in the job session log for that run?

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

by DrWhy » Tue Jun 09, 2015 9:30 pm

I have not had a chance to do this yet. And sadly, I'm not sure if I will ever get to it as Veeam doesn't meet our requirements because of the slow failback process. I appreciate the response by your team and look forward to seeing this feature implemented into Veeam in the future.
