Struggles with backup copy job logic :-/

Post by **pkelly_sts** » May 10, 2016 11:04 am

I really do struggle getting my head around the "logic" of backup copy jobs sometimes!

Thankfully it doesn't happen too often, but once again my copy job somehow seems to have got "out of sync" and I'm left staring at a job that for some reason has copied the data for, let's say, half of the VMs in the parent job (the parent job having completed perfectly in one pass) and is sitting "waiting for restore points" with the remaining VMs. However, if I click "sync now" then it'll start processing those remaining VMs, but then the previous half of the VMs have a state of "latest restore point is already copied", so to my eyes it then seems to sit there again waiting for those VMs' next restore points.

All the while this is going on, the restore points in the repository seem to be in a "locked" state meaning that nothing else can be done with them (e.g. replica-from-backup etc.)

What's even more confusing is I then end up with a job that (in this case right now) is sitting at "66% complete" with a stats window full of a mixed-up bunch of sync statuses which is hard to understand (see above) and seemingly nothing copying at all.

The only option I can see (which I'm waiting to try again now) is to force the sync as above, and after that sync has completed, disable the job, then enable it again so it hopefully starts a "clean" cycle again.

Clearly I must be missing something obvious as I haven't really seen much on this subject in the forums?

Post by **pkelly_sts** » May 10, 2016 11:12 am

And further to that, having now (temporarily) disabled the copy job, I can see it now "Building VMs List" and it's 2.5 mins into doing *something* when it should actually be disabled! It even has a "Next run" status of <DISABLED>, but with a bunch of VMs "Pending".

Whilst I typed this it's now finally changed to "stopping".

It REALLY doesn't make much sense to me at all...

Post by **foggy** » May 10, 2016 12:07 pm

Ideally, if the specified copy interval is long enough, all VMs should be in the same state, provided each of them has a fresh restore point to copy.

pkelly_sts wrote: However, if I click "sync now" then it'll start processing those remaining VMs, but then the previous half of the VMs have a state of "latest restore point is already copied", so to my eyes it then seems to sit there again waiting for those VMs' next restore points.

This is expected, since at that moment the restore points for these VMs that are currently in the source repository were already processed by the copy job.
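
A toy model may make this status easier to picture. It is only a sketch of the behaviour described above, not Veeam's actual implementation; the function name `vm_status`, the VM names, and the timestamps are all hypothetical.

```python
from datetime import datetime

def vm_status(latest_source_point, last_copied_point):
    """Toy per-VM decision (hypothetical, not Veeam's real logic)."""
    if latest_source_point is None:
        # Nothing in the source repository to copy yet.
        return "waiting for restore points"
    if last_copied_point is not None and latest_source_point <= last_copied_point:
        # The newest source restore point was already processed.
        return "latest restore point is already copied"
    return "pending copy"

# Assumed example: both VMs have the same nightly restore point,
# but only vm-a was copied before the job stalled.
nightly = datetime(2016, 5, 9, 20, 30)
source = {"vm-a": nightly, "vm-b": nightly}
copied = {"vm-a": nightly, "vm-b": None}

statuses = {vm: vm_status(source[vm], copied[vm]) for vm in source}
for vm, status in statuses.items():
    print(vm, "->", status)
```

In this model a manual sync only has work to do for `vm-b`; `vm-a` stays idle until the parent job produces a newer restore point.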

pkelly_sts wrote: All the while this is going on, the restore points in the repository seem to be in a "locked" state meaning that nothing else can be done with them (e.g. replica-from-backup etc.)

This is expected as well: the job is writing to the backup chain and, provided target cache is enabled, the target agent holds a lock on the files.

pkelly_sts wrote: The only option I can see (which I'm waiting to try again now) is to force the sync as above, and after that sync has completed, disable the job, then enable it again so it hopefully starts a "clean" cycle again.

I'd suggest just letting the job do its job (sorry) and not bothering with manual sync, since it could cause a bit more mess. Instead, I'd investigate why the VMs went out of sync in the first place; probably your interval is too short, which doesn't allow the job to process all the VMs' data.

Post by **pkelly_sts** » May 11, 2016 9:23 am

foggy wrote: Ideally, if the specified copy interval is long enough, all VMs should be in the same state, provided each of them has a fresh restore point to copy.

The copy interval is plenty long enough. As I suggested, & in order to keep things as synchronised as possible with the normal window, yesterday I disabled this particular copy job and enabled it again just before leaving the office at around 17:00, when the sync point for the copy job is 20:40, the parent local backup starts at 20:30 & generally runs for around 20-30 mins.

I checked it remotely last night at around 23:30 and things were progressing exactly as I would have expected.

I came in this morning to find that, for some unknown reason the current "start time" for the copy job session was around 08:30 (so it had been running for about 28mins at that point) and it was processing again with a bunch of VMs with a status of "latest restore point already copied", some in progress & some pending.

Looking at the stats for the job run yesterday, the job claims to have run for 15:29 hrs before it reported "Copy interval has expired" and failed the remaining 14 VMs. However, this was at around 01:00 according to the logfiles!! Also, I don't know how it could have run for 15:29hrs when it was started (from a disabled state) at around 16:55, and failed at 01:00!
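
For anyone wanting to check the arithmetic, here is a quick sketch of the mismatch. The calendar dates are assumed from the post timestamps (job re-enabled on May 10 at about 16:55, failure logged at about 01:00 on May 11); the variable names are just for illustration.

```python
from datetime import datetime, timedelta

# Assumed wall-clock times, taken from the figures quoted in this post.
started = datetime(2016, 5, 10, 16, 55)   # job re-enabled
failed = datetime(2016, 5, 11, 1, 0)      # "Copy interval has expired"

wall_clock = failed - started                   # time actually elapsed
reported = timedelta(hours=15, minutes=29)      # duration shown in the job stats
gap = reported - wall_clock                     # unexplained difference
implied_start = failed - reported               # start time the stats would imply

print("elapsed:", wall_clock)
print("reported:", reported)
print("unaccounted for:", gap)
print("implied start:", implied_start)
```

So the reported duration is 7 hours 24 minutes longer than the time that actually passed, which would put the session start at about 09:31 on May 10, long before the job was re-enabled.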

So, it appears to have sat there doing nothing from 01:00 until some time around 08:30ish, at which point it started copying again, going back to the "split-brain" scenario where some restore points are "already copied" but some are not (the ones currently in progress). So, once again, I'm going to end up with a half-completed copy job that will likely keep the restore points locked until some arbitrary sync time-out that I can't fathom, probably some time tomorrow (though god knows when), which again means I get no off-site replica processing until that has been completed.

foggy wrote: This is expected as well: the job is writing to the backup chain and, provided target cache is enabled, the target agent holds a lock on the files.

So why is this lock held for hours when there are perfectly good restore points that could be copied?? (See above)

foggy wrote: I'd suggest just letting the job do its job (sorry) and not bothering with manual sync, since it could cause a bit more mess. Instead, I'd investigate why the VMs went out of sync in the first place; probably your interval is too short, which doesn't allow the job to process all the VMs' data.

I appreciate your suggestion but have learned from past experience that it can take DAYS for such jobs to sort themselves out, something I can't afford to explain to management when they come to me asking what our RPO/RTO details are.

I originally logged a ticket (01788614) for this issue on one server but am now getting it on our second server, and the original server appears to be getting through the copy interval but is now complaining at the merge process, claiming "Failed to merge full backup file Error: Unable to find scale-out repository extent with previous backup files." even though they are there, functioning and with no errors at all (I've just uploaded a fresh set of log files for that ticket).

I would also like to get the second B&R server copy interval issue looked at, would you suggest I upload the logs for that one to the same case, or would I be better opening a fresh case?

Apologies if I come across rather terse, but having loved Veeam for the last 3yrs or so for its "it just works" feeling, I seem to be having more & more problems for no good reason since upgrading to v9...

Paul

Post by **foggy** » May 11, 2016 10:43 am

I'd suggest opening a new case, since the first one already covers a different issue (if I'm getting you right). You can reference your previous case for clarity.

What is your 'copy every' setting, btw?

Post by **pkelly_sts** » May 11, 2016 10:46 am

OK will do now. Copy every is only 24hrs so it's not even as if I'm trying to get multiple copies / day...

Post by **foggy** » May 11, 2016 10:52 am

Got it. Let's see what our engineers say after reviewing the log files.

Post by **pkelly_sts** » May 11, 2016 11:16 am

Now logged as 01796138. The current copy process has now "completed" in that it has processed those VMs that failed in the first pass but again it's now sitting on "39% complete" with a bunch of VMs "pending" and the others "success" so I'm stuck in the half-n-half cycle again until, presumably, another "copy interval expired" error after which at some point it might start again cleanly.

I shall leave it this time pending support taking a look at it in its current state...
