I haven't read all of the posts here, so hopefully this isn't a repeat. I was running some planned failover jobs this weekend in preparation for our yearly BCP test, and the long failback times gave me great concern. We have several VMs that need to be failed back with changes after our BCP test is complete. One of these is a very large file server, about 10 TB (maybe 5 TB of actual data once you account for deleted blocks). There is absolutely no way I can fail that over to DR and then fail back with changes; it could take up to 48 hours to recalculate disk digests on the way back. I typically rely on Windows DFSR for file servers, but for some reason it doesn't work reliably with this server.
One flaw I see in the process is that the failback job shuts down the replica in DR first, and only then performs the replication pass in which it recalculates disk digests.
Why does the job need to shut down the server in DR first? Why can't it run the replication first while keeping the server online? You can already do that with a normal replication job, even while it calculates digests, so why isn't it the same coming back? If the job simply left the VM online in DR while it replicates back, that would help a lot. Once that first pass is done, the job could shut down the VM, run a second, delta-only pass (which should be quick), and then power the VM back up in production. It isn't perfect, but it is better than hours or even days of downtime just to do a failback operation.
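To make the idea concrete, here is a minimal sketch of the two-pass failback sequence I'm describing, modeled on pre-copy live migration. Everything here is hypothetical illustration on my part (the `ReplicaVM` class, its methods, the block counts); it is not Veeam's actual API or internals.

```python
# Hypothetical model: a DR replica with blocks that changed since failover.
class ReplicaVM:
    def __init__(self, dirty_blocks):
        self.dirty_blocks = dirty_blocks   # blocks changed since failover
        self.online = True
        self.downtime_passes = []          # was the VM down during each pass?

    def replicate_back(self):
        # Ship all currently dirty blocks to production and record whether
        # the VM was offline while this pass ran.
        shipped = self.dirty_blocks
        self.dirty_blocks = 0
        self.downtime_passes.append(not self.online)
        return shipped

def two_pass_failback(vm, delta_during_pass1):
    # Pass 1: the VM stays online in DR while the bulk of the changes
    # (and any slow digest recalculation) replicate back to production.
    bulk = vm.replicate_back()
    vm.dirty_blocks += delta_during_pass1  # writes that landed during pass 1
    # Only now take the outage window.
    vm.online = False
    # Pass 2: a short, delta-only pass while the VM is briefly offline.
    delta = vm.replicate_back()
    vm.online = True                       # powered back up in production
    return bulk, delta
```

The point the sketch makes: downtime only covers pass 2, which ships the small delta accumulated while pass 1 ran, instead of covering the entire multi-hour bulk transfer.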