Host-based backup of VMware vSphere VMs.
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy » 1 person likes this post

Problem Overview:
In some situations, it can take up to 48-72hrs to fully complete and return from a planned failover/failback operation.

Request:
Make the "Calculating Original Signature Hard Disk" operation optional for a planned failback.

Background:
Veeam has designed a feature called "Planned Failover" into their product. This is a great feature that lets you operate a near-highly-available environment without the cost of shared storage or vMotion. The Veeam help center (http://helpcenter.veeam.com/backup/80/v ... lover.html) describes the feature as follows:
"If you know that your primary VMs are about to go offline, you can proactively switch the workload to their replicas. A planned failover is smooth manual switching from a primary VM to its replica with minimum interrupting in operation. You can use the planned failover, for example, if you plan to perform datacenter migration, maintenance or software upgrade of the primary VMs. You can also perform planned failover if you have an advance notice of a disaster approaching that will require taking the primary servers offline."
The Problem in Detail:
In many of the intended use circumstances, a "Planned Failover" is only as good as a planned failback. Veeam's Planned Failback feature currently has two noticeable limitations when used in practice.

1. When performing a failback after a "Planned Failover" operation, Veeam requires a task called "Calculating Original Signature Hard Disk" to be performed. This task reads the entire source VM from disk in order to verify the integrity of the VM data prior to failing back to it. Because of this costly read operation, the task takes too long to complete and places a heavy I/O load on the source host for the duration of the operation. Additionally, your RPOs may not be met during this time if the failback process takes too long to complete.

Example - The time to complete this task depends on how large the VM is and how fast the storage is. If you have a 100GB VM and your storage can read at 200MB/s, a failback task will take about 9min, while placing a heavy I/O load on your primary storage during this time. Maybe 9min isn’t that bad. However, consider the scenarios this is designed for, a common one being host maintenance (e.g. an upgrade to the hypervisor), where you have to move all VMs off of a host. In this scenario, say you have 15 x 100GB VMs and a single 2TB file server VM for a total of 3.5TB of data (probably a common workload for many). Once the primary host is back online and you attempt a planned failback, this operation will take about 5hrs. If you have more data than that, say a large file server with 15TB of data, this operation will take close to a full 24hrs (roughly 21hrs at a sustained 200MB/s)! On top of this, your backups are not being performed during this window either, unless you remap them to the new host, which, by the way, will require the same full read operation from disk.
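
For anyone who wants to plug in their own numbers, here is a rough back-of-the-envelope sketch in Python. It assumes a sustained sequential read rate, decimal units (1GB = 1000MB) and no other overhead. The function name is only illustrative, and the 15TB case comes out a little under the rounded 24hrs figure used in this post.

```python
def full_read_hours(size_gb: float, throughput_mb_s: float = 200.0) -> float:
    """Hours needed to read size_gb once at a sustained throughput_mb_s."""
    seconds = size_gb * 1000 / throughput_mb_s  # decimal units: 1GB = 1000MB
    return seconds / 3600


# Scenarios taken from the example above (sizes in GB).
scenarios = {
    "single 100GB VM": 100,
    "15 x 100GB VMs + 2TB file server": 15 * 100 + 2000,
    "15TB file server": 15000,
}

for name, size_gb in scenarios.items():
    hours = full_read_hours(size_gb)
    print(f"{name}: {hours:.1f} h ({hours * 60:.0f} min)")
```

Each additional full read (failback, first replication run, first backup run) adds the same amount of time again.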

2. After a planned failback has been completed, presumably, you will want to restart your replication of that VM to ensure you are protected. The first time this replication job runs, it performs a task called "Processing <Job Name>". During this operation, it reads the entire source VM from disk (yes, it has to do it again!). During this time, a heavy I/O load is placed on the host server (while all VMs are in production), negatively impacting the performance of the entire host. Additionally, during this time, no backups of the VMs can be performed, further extending your RPO, quite possibly past your defined policy.

Example - In our example discussed above, we would have no backups for 24hrs. This operation would add an additional 24hrs, so we would now be up to 48hrs with no backups before we were back to business as usual.
Note – I have not tested this scenario while backups were being performed, only replication. It stands to reason that after a planned failback, the backups would also need to perform this same "Calculating Original Signature Hard Disk" operation which would take you another 24hrs, totaling 72hrs with no backups performed. However, I have not confirmed this.

The Solution:
While I’m not a Veeam engineer and I do not know how everything works under the covers, I see one logical flaw in the current design. The “Planned Failover / Failback” process isn’t designed with the assumption that this is a “planned” operation. In a planned operation, the source and replica VMs are identical at the time of failover, and the software should know this. Because they are identical at the time of failover, all that needs to be done is to track the changes made on the replica VM while it is failed over, and sync those changes back to the source. That’s that. There is no need to re-verify the source VM. I would implement this in the software as an option. Those who want to perform the expensive read operation to make doubly sure their data is intact can do so. Those who want to rely on CBT should have that option. Likewise, when resuming replication of the VM (and potentially backup jobs), the software should use this same logic.
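
To make the idea concrete, here is a minimal toy sketch in Python. It is not how Veeam or VMware CBT is implemented; the disks are plain lists of blocks, and `failback_sync` and the `changed` tracker are hypothetical names. The sketch assumes that source and replica are identical at failover and that every write on the replica is recorded.

```python
def failback_sync(source: list[bytes], replica: list[bytes], changed: set[int]) -> int:
    """Copy only the blocks flagged as changed from the replica back to the source.

    Returns the number of blocks that had to be read from the replica.
    """
    for idx in sorted(changed):
        source[idx] = replica[idx]
    return len(changed)


# Toy disks of 8 blocks, identical at the moment of the planned failover.
source = [bytes([i]) * 4 for i in range(8)]
replica = list(source)

# While production runs on the replica, each write is recorded per block.
changed: set[int] = set()
for idx, data in [(2, b"AAAA"), (5, b"BBBB"), (2, b"CCCC")]:
    replica[idx] = data
    changed.add(idx)

blocks_read = failback_sync(source, replica, changed)
print(blocks_read, source == replica)  # 2 True
```

Only 2 of the 8 blocks are read during failback, instead of the whole disk. The catch is that the result is only as trustworthy as the change tracking itself.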
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by Gostev »

DrWhy wrote:This task reads the entire source VM from disk in order to verify the integrity of the VM data.
Actually, it's not for verifying the integrity of VM data. Rather, it is used to understand the contents of the virtual disks, so that the failback process knows which virtual disk blocks it needs to synchronize. This is why this process is mandatory and cannot simply be made optional, although in theory it could be replaced by a CBT query.
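
As a rough illustration only (a toy sketch, not the actual product code): comparing per-block digests of the two disks tells you which blocks to synchronize, but producing the digests requires reading the whole disk. SHA-256, the 4-byte block size and the function names here are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # toy value; real block sizes are far larger


def disk_digests(disk: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Read the entire disk and return one digest per block (the expensive part)."""
    return [
        hashlib.sha256(disk[off:off + block_size]).hexdigest()
        for off in range(0, len(disk), block_size)
    ]


def blocks_to_sync(source_digests: list[str], replica_digests: list[str]) -> list[int]:
    """Indexes of the blocks whose digests differ between source and replica."""
    return [
        i
        for i, (src, rep) in enumerate(zip(source_digests, replica_digests))
        if src != rep
    ]


source_disk = b"aaaabbbbccccdddd"
replica_disk = b"aaaaXXXXccccYYYY"

diff = blocks_to_sync(disk_digests(source_disk), disk_digests(replica_disk))
print(diff)  # [1, 3]
```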
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

Hey Gostev, thank you for your comment. I believe you are right and that a CBT query would accomplish this. Is this an improvement that you and the development team would seriously consider implementing anytime soon? I am currently evaluating Veeam, and this limitation is one that is making it hard for us to meet our requirements. It would greatly impact my decision if I knew this improvement was in the pipeline for a future release. Thanks again.
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

Additionally - I believe Unitrends Reliable DR (formerly PHD Virtual) is doing a CBT query, as their product does not require this expensive read operation upon failback, nor when you go to resume the replication job.
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by Gostev »

Hi, Caleb. We have not considered this feature yet, because you are the first to ask for it. We will need to perform some research on the potential reliability implications of this approach before we can put this feature into the pipeline. But even in the best-case scenario, I don't expect this feature to make it into the next release, because the feature set for that one was finalized many months ago (besides, with a single request so far, we cannot prioritize this feature high enough). So, if this feature is critical for you for some imminent project, you should go with another solution that meets your needs. Thanks!
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

I understand, thanks Gostev. While I may be the first to officially make this specific request, there are others who are experiencing the effects of this in the following thread.

http://forums.veeam.com/vmware-vsphere- ... 11581.html

I have several more questions:
1. Can you comment on why, after the failback has completed, the next replication job for that VM has to do a 2nd re-read of the entire source VM?

1a. It appears that there is some re-reading of the replica VM after replication has resumed as well. I'm not sure about this but saw a noticeable spike in disk utilization while this job ran. Can you clarify what's going on here?

2. Aside from replication, if we have backups being performed of these VMs, once a failback has been performed, will those backups have to do yet a 3rd re-read of the entire source VM (or a re-read of the backup files, for that matter)?
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by Gostev »

1. This is because CBT cannot be used in the first job pass for a VM that had its disks modified by an external process. The job has to discover the baseline virtual disk state to apply future changed block information to, and for that it needs to read the entire VM, because the VM may already have some changes in its virtual disks since failback was completed.

1a. What do you mean by "resume" when it comes to the replication job?

2. Yes for the source VM (see 1), No for backup files.
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

1a. This is related to question one. After a VM has been failed back, it's assumed that you will want to start performing the replication job again. As you have stated, the first time this replication job is run, a full read of the source VM has to be performed. Also, during this process, I've observed that the replica VM has to be read from disk. I'm not sure if this is a full read of the replica VM or what is going on. Can you please clarify how much of the replica VM must be read from disk during this process?
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

Hey Gostev, does the additional info help you answer my questions?
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

After you have answered the question above, can you please respond to the following as well? I want to make sure I understand everything correctly.

Is the following accurate? Please correct me if I’m wrong.
1. During Failback - The entire Source VM must be fully read from disk, prior to the failback completing.
2. After Failback --- The entire Source VM must be fully read from disk, the first time the replication job runs for that VM.
3. After Failback --- The entire Replica VM must be fully read from disk, the first time the replication job runs for that VM.
4. After Failback --- The entire Source VM must be fully read from disk, the first time the backup job runs for that VM.
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by Gostev »

Team, please confirm with the devs and respond.
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by foggy »

1. Correct (discussed above).
2. Correct (discussed above).
3. Reading of the entire replica VM should not be required, if its digests are updated during failback (which would be reasonable). I will check this on Monday.
4. Correct (discussed above).
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

foggy wrote: 3. Reading of the entire replica VM should not be required, if its digests are updated during failback (which would be reasonable). I will check this on Monday.
I have observed noticeable disk usage on the replica during this process. While I don't know if it's doing the entire thing, there is certainly noticeable disk usage going on. Thanks for looking into this.
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by foggy »

The replica VM shouldn't be read during the first replication cycle after failback, unless its digests are missing for some reason. Could you please check whether there's a corresponding record (something like "Digests are missing, calculating digests...") in the job session log for that run?
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

I have not had a chance to do this yet. And sadly, I'm not sure if I will ever get to it as Veeam doesn't meet our requirements because of the slow failback process. I appreciate the response by your team and look forward to seeing this feature implemented into Veeam in the future.
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

Hey Gostev, has there been any movement on this feature request? This is the one thing that's keeping us from being able to use Veeam for our large file servers.
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by Gostev »

In fact yes, there was. This feature has made it to the highest priority features list, and is on the edge of making it into 9.5 (fingers crossed).
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

That's fantastic news!! Can't wait!
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

Did this get into 9.5?
rreed
Veteran
Posts: 354
Liked: 72 times
Joined: Jun 30, 2015 6:06 pm
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by rreed »

A very big +1. I've just very recently started playing around w/ Replication, and w/ just a simple ~14GB test VM taking upwards of half an hour to fail either direction, I'm afraid to present this to management as a viable alternative to VMware's SRM.
VMware 6
Veeam B&R v9
Dell DR4100's
EMC DD2200's
EMC DD620's
Dell TL2000 via PE430 (SAS)
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

The only thing that scares me about this solution is that it will presumably rely on VMware CBT, which has been notoriously unreliable in vSphere 6. I'm curious to hear what Gostev's thoughts on this are.
nunciate
Expert
Posts: 247
Liked: 39 times
Joined: May 21, 2013 9:08 pm
Full Name: Alan Wells
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by nunciate »

I haven't read all of the posts here, so hopefully this isn't a repeat. I was performing some tests doing planned failover jobs this weekend in preparation for our yearly BCP test. I noticed the long failback times, and this gave me great concern. We have several VMs that need to be failed back with changes after our BCP test is complete. One of these is a very large file server, about 10TB (maybe 5TB when you consider deleted blocks). There is absolutely no way I can fail that over to DR and then fail back with changes. It would take up to 48 hours to recalculate disk digests on the way back. I typically rely on Windows DFSR for file servers; however, it doesn't work reliably with this server for some reason.

One flaw I see in this process is that the job shuts down the replica in DR during the fail back, then it performs a replication in which it calculates disk digests.
Why does the job need to shut down the server in DR first? Why can't it simply run the replication first while keeping the server online? I know you can do that when doing a normal replication job while also calculating digests, so why isn't it the same coming back? If you simply changed the process to leave the VM online in DR while the job replicates back, that would help a lot. Once that first pass is done, the job can shut down the VM, perform a second-pass replication, which should be quick, and then power the VM back up in production. It isn't perfect, but it is better than hours or even days of downtime just to do a failback operation.
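
The ordering proposed above can be sketched as a toy model in Python. This only illustrates the suggestion and is not how the product behaves today; the disks are plain lists of blocks, and `two_pass_failback` is a hypothetical name.

```python
def two_pass_failback(dr_disk, prod_disk, writes_during_first_pass):
    """Return (blocks copied while the DR VM is online, blocks copied during downtime)."""
    # Pass 1: the DR VM stays online; copy every block that differs.
    first = [i for i in range(len(dr_disk)) if dr_disk[i] != prod_disk[i]]
    for i in first:
        prod_disk[i] = dr_disk[i]

    # The DR VM keeps running, so new writes land while pass 1 is in flight.
    dirty = set()
    for i, data in writes_during_first_pass:
        dr_disk[i] = data
        dirty.add(i)

    # Pass 2: the DR VM is shut down; only blocks dirtied since pass 1 are copied.
    for i in sorted(dirty):
        prod_disk[i] = dr_disk[i]
    return len(first), len(dirty)


dr = [b"n0", b"n1", b"n2", b"n3", b"n4", b"n5"]
prod = [b"n0", b"o1", b"o2", b"n3", b"o4", b"n5"]

online, offline = two_pass_failback(dr, prod, [(3, b"w3")])
print(online, offline, dr == prod)  # 3 1 True
```

In this model the downtime is proportional only to what changed during the first pass, not to the full size of the VM.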
veremin
Product Manager
Posts: 20270
Liked: 2252 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by veremin »

Have you thought about replicating the failed-over VMs back and then executing a planned failover operation for them? That should allow you to achieve the described scenario. Thanks.
nunciate
Expert
Posts: 247
Liked: 39 times
Joined: May 21, 2013 9:08 pm
Full Name: Alan Wells
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by nunciate »

Yeah, that was going to be my next test. Just do a planned failover, then a permanent failover to DR, and then reverse the process.
veremin
Product Manager
Posts: 20270
Liked: 2252 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by veremin »

Correct, that approach should allow you to achieve more or less the same goal you're after. Also, it should minimize the time needed to transfer the final changes. Thanks.
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

Is there any update as to whether the feature request mentioned in this thread will be implemented in Veeam 9.5?
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by foggy »

As far as I know, this was postponed until the next major release.
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

Got it, thanks for the update.
DrWhy
Enthusiast
Posts: 38
Liked: 2 times
Joined: May 12, 2015 7:05 pm
Full Name: Caleb
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by DrWhy »

Hello, can you please provide another update on the status of this feature request?
veremin
Product Manager
Posts: 20270
Liked: 2252 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: FEATURE REQUEST - Speed Up the Planned Failback Process

Post by veremin »

It didn't make it into the release, having been superseded by features with higher priority. Thanks.