We have a very large file server that can take a long time to replicate to our DR site. I've been experimenting with backing up the VM to a local disk and then having a replication job run after the backup completes. So far this has been working very well with the backup job running every 6 hours, and in most cases the jobs do not take very long with this setup.
The file server has deduplication enabled, and once a month it does a thorough Garbage Collection and Scrub as part of the deduplication process. The problem is that this causes a lot of data churn on the system, which in turn makes for a larger-than-average backup and replica to transmit. This caused the replication to take longer than normal: the backup started, finished, and then kicked off another replication job while the last one was still running.
What happened was that the newer replication stopped the running job. From the looks of it, the DR snapshot never got merged, and another snapshot was added to the chain. The second replica didn't have enough time to finish either and got killed by yet another replication job, so another unmerged snapshot was added to the mix. Eventually the DR disk filled up with snapshots, and once the disk space was exhausted it couldn't do anything at all.
Is it possible to add some sort of overlap management in case something like this happens? For example, an option that sees the job is already running and doesn't start a new one until it has completed? Or perhaps, if a job is still running, skip the new run entirely until the job is idle?
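For what it's worth, the "skip if still running" behavior I have in mind is the kind of thing a simple lock-based guard provides. Here's a minimal sketch in Python to illustrate the idea (the lock path and the `run_exclusive`/`job` names are my own placeholders, not anything the product actually exposes):

```python
import fcntl

def run_exclusive(lock_path, job):
    """Run job() only if no other instance holds the lock.

    If a previous run is still active (the lock is held), skip this
    run instead of killing the running one or queuing behind it.
    """
    f = open(lock_path, "w")
    try:
        # Non-blocking exclusive lock: fails fast if already held.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return "skipped"  # previous job still running
    try:
        job()
        return "ran"
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()
```

The key point is the non-blocking lock attempt: an overlapping trigger simply skips rather than interrupting the in-progress job, so no unmerged snapshots pile up.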