Hey guys, I'd like to chime in here, albeit a bit late. Hopefully someone sees this! Please excuse the long post, but I just want to be clear and detailed.
Environment and situation:
Tenant replicating to a cloud connect replication environment (hosted by me)
Test VM: 350 GB provisioned, only about 50 GB actually used
100 Mbit/s pipe between tenant and service provider
I replicated the test VM in a few hours with no issue. Did a partial failover, which worked spectacularly. It only took about 5 minutes from starting the failover to being able to ping the VM at the DR site. To simulate some data change, I downloaded a 2.8 GB ISO (and for fun, it was a Veeam ISO) and left it in my downloads folder. I let the VM run for a few hours, just sitting there, not doing anything.
I went to fail back, and I was pretty surprised by the result, and not in a good way. I did a quick rollback, using CBT and restoring the VM to the original location; however, for 2.8 GB worth of changes, my VM was still down for nearly an hour! What puzzles me is that the job log says "Replicating restore point for Hard disk 1 (350.0 GB) 2.6 GB processed", and that part only took 3 minutes, 57 seconds. That 2.6 GB lines up closely with the size of the ISO I downloaded.
What I really don't understand is that the next phase, "Replicating changes Hard disk 1 (350.0 GB) 37.6 GB processed", took 43 minutes, all while the VM was powered off. I noticed it averaged only 15-20 MB/s the whole time.
Link to a picture of the failback log: https://ibb.co/hyAsy7
My questions are:
1. What in the world is it doing while "Replicating changes Hard disk 1"? I thought a large portion of the changed data was copied during the "Replicating restore point for Hard disk 1" phase. If that isn't the case, what does that phase do then?
2. Why did it take so long for a VM that presumably only had 2.8 GB of changed data? I understand logs and whatnot make changes and take space, but even if we double that, it should have only taken roughly 5 minutes at an average speed of 18 MB/s (see the quick math below).
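For what it's worth, here's the back-of-the-envelope math I'm doing. Just a rough sketch in Python; the 2x fudge factor for logs and the speeds are my own assumptions based on what I observed:

# Back-of-the-envelope transfer times; nothing Veeam-specific, just arithmetic.
GB = 1024  # MB per GB

# What I expected: the 2.8 GB ISO plus a 2x fudge factor for logs etc., at the ~18 MB/s I saw
expected_min = (2.8 * GB * 2) / 18 / 60
print(round(expected_min))   # ~5 minutes

# What actually happened: the "Replicating changes" phase reported 37.6 GB at ~15 MB/s
actual_min = (37.6 * GB) / 15 / 60
print(round(actual_min))     # ~43 minutes, which matches the job log

So the transfer speed itself is consistent with the 43 minutes I saw; the real question is why the "Replicating changes" phase had to move 37.6 GB at all when only ~3 GB actually changed.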
I guess I'm just confused as to what is happening, and why we can't leverage the awesome replication features Veeam has built in to essentially do a few reverse replication passes (as someone here suggested), power off the replica, do one more quick replication pass (5-10 minutes tops), and then power on at the original site.
Failing over is nice and simple, but quite frankly, I'm terrified to use it because of the implications of failing back. I don't want to have to tell my client, "We can fail you over, but honestly I have no idea how long you'll be down while we fail you back, and I have no idea when Veeam will actually decide to take you down to finish the failback." If a 50 GB VM with roughly 3 GB of changed data took that long, what happens if we have to fail over a client's Exchange server for an extended period? I have no idea how long it would take for a 3 TB Exchange server with potentially a week's worth of changed data on it, and I don't want to find out the hard way.
The only workaround I can see is to build a VPN tunnel between our DR site and the customer's network, then use another Veeam server to replicate the changes from our DR site back to the customer site, power off the replicas during a maintenance window, run one more replication pass to catch the last changes, and then power the customer's servers on in the original production environment. That would give me the flexibility to decide exactly when they go down, while also keeping the downtime to a minimum. However, I absolutely know my networking guys are going to ask, "Why didn't we do that in the first place, and why don't we just do the normal replications from the tenant that way as well?" And honestly, I don't have an answer, because it seems to make more sense to do it that way than to deal with the uncertainty and mystery around Veeam's built-in failback process that we apparently have to use for Cloud Connect.
Now, if I've missed something, and there's some slick feature I'm not aware of, or if what I'm experiencing is out of the ordinary after Update 2 (which added the option to use CBT to skip calculating disk digests; we're on U3, by the way), please let me know. I would love to know about it. Actually, I'm practically begging to know about it at this point.
Make me look dumb, I don't care, I just want to know.
If this is normal behavior, then please take this as my feature request to continue development here and use the same mechanism that already works wonderfully for replicating hot data over, but for the replication back. If anyone has a better suggestion than my VPN tunnel idea for a Cloud Connect environment, please chime in.