foggy wrote: Ideally, if the specified copy interval length is enough, all VMs should be in the same state, provided each of them has a fresh restore point to copy.
The copy interval is plenty long enough. As suggested, and in order to keep things as synchronised as possible with the normal window, yesterday I disabled this particular copy job and re-enabled it just before leaving the office at around 17:00. The sync point for the copy job is 20:40; the parent local backup starts at 20:30 and generally runs for around 20-30 minutes.
I checked it remotely last night at around 23:30 and things were progressing exactly as I would have expected.
I came in this morning to find that, for some unknown reason, the current "start time" for the copy job session was around 08:30 (so it had been running for about 28 minutes at that point) and it was processing again, with a bunch of VMs showing a status of "latest restore point already copied", some in progress and some pending.
Looking at the stats for yesterday's run, the job claims to have run for 15:29 hrs before it reported "Copy interval has expired" and failed the remaining 14 VMs. However, according to the log files this happened at around 01:00! I also don't see how it could have run for 15:29 hrs when it was started (from a disabled state) at around 16:55 and failed at 01:00.
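For what it's worth, here's a quick sanity check on that arithmetic (Python; the dates themselves are arbitrary since only the clock times are known from the session stats):

```python
from datetime import datetime, timedelta

# Arbitrary dates; only the clock times come from the session stats/logs.
started = datetime(2016, 6, 1, 16, 55)   # job re-enabled, session start
failed  = datetime(2016, 6, 2, 1, 0)     # "Copy interval has expired" per the logs

# Actual elapsed time between start and failure:
print(failed - started)                  # 8:05:00, nowhere near 15:29

# Where a 15:29 duration measured from 16:55 would actually land:
print(started + timedelta(hours=15, minutes=29))  # 2016-06-02 08:24:00
```

Curiously, 15:29 counted from 16:55 lands at about 08:24 the next morning, close to when the job started copying again, though I can only speculate whether that's how it's measuring the interval.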
So it appears to have sat there doing nothing from 01:00 until some time around 08:30, at which point it started copying again, back in the "split-brain" scenario where some restore points are "already copied" but some are not (the ones currently in progress). Once again I'm going to end up with a half-completed copy job that will likely keep the restore points locked until some arbitrary sync time-out I can't fathom, probably some time tomorrow (though who knows when), which again means I get no off-site replica processing until it has completed.
foggy wrote: This is as well expected, since the job is writing to the backup chain and provided target cache is enabled, target agent holds a lock on the files.
So why is this lock held for hours when there are perfectly good restore points that could be copied? (See above.)
foggy wrote: I'd suggest to just let the job do its job (sorry) and not bother with manual sync, since it could cause a bit more mess. Instead, I'd investigate the reasons why the VMs went out of sync in the first place; probably you have too short an interval, which doesn't allow the job to process all the VMs' data.
I appreciate your suggestion, but I have learned from past experience that it can take DAYS for such jobs to sort themselves out, something I can't afford when management comes to me asking what our RPO/RTO figures are.
I originally logged a ticket (01788614) for this issue on one server, but am now seeing it on our second server. The original server now appears to get through the copy interval but is complaining at the merge step, claiming "Failed to merge full backup file Error: Unable to find scale-out repository extent with previous backup files." even though the extents are there, functioning, and showing no errors at all (I've just uploaded a fresh set of log files to that ticket).
I would also like to get the second B&R server's copy interval issue looked at. Would you suggest I upload the logs for that one to the same case, or would I be better off opening a fresh case?
Apologies if I come across as rather terse, but having loved Veeam for the last three years or so for its "it just works" feeling, I seem to be having more and more problems for no good reason since upgrading to v9...