billcouper
Service Provider
Posts: 153
Liked: 34 times
Joined: Dec 18, 2017 8:58 am
Full Name: Bill Couper
Contact:

I don't understand Backup Copy jobs

Post by billcouper » 1 person likes this post

I don't think I understand Backup Copy jobs very well. They seem illogical to me, or perhaps it is my logic that is flawed.

I have 41 backup jobs that run at 6pm. They take a few hours to complete usually. Each primary backup job has two associated copy jobs.
As each primary backup completes, the associated copies start. The backup copies process quickly and everything works great. All the copies complete in very short order and I have 3 copies of the backup data every night.

Each morning when I come in to work I check Veeam and everything has completed. No jobs 'working' or otherwise in progress. All expected restore points are on disk where they are meant to be. Great.
But then, at 12pm every day I get a 'Failed' notification for every single copy job that I have (82 of them).
The notification says "Error: Another job process already in progress."

Every backup copy job is configured to Copy every 1 Day starting at 12:00 PM
The backup copy jobs start at 12pm and have nothing to process to begin with, which is fine. Once the primary backups create new restore points the copy jobs do their thing and are WELL AND TRULY finished before 12pm the following day.

The reason I start the copy jobs at 12pm and have them do "nothing" for at least 6 hours is that it gives time for the backup health check to complete during business hours, which is what I want. Backup at night. Verify during the day. I have spread the health checks across the entire month so there is an even distribution of data to health check each day.
Why does Veeam not let you schedule health checks SEPARATE to the actual job? Anyway, that is an unrelated gripe.
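
As a rough illustration of spreading health checks evenly across a month, here is a minimal sketch that assigns the largest jobs first to whichever day currently has the least data to verify. The job names and sizes are hypothetical, the 28-day month is an assumption so the plan fits every calendar month, and nothing here talks to Veeam; it only produces a day-of-month plan that would still have to be entered in each job's settings by hand.

```python
import heapq


def spread_health_checks(job_sizes_gb, days=28):
    """Assign each job to a day of the month, balancing total GB per day.

    Greedy heuristic: take jobs from largest to smallest and give each one
    to the day with the smallest running total so far.
    """
    heap = [(0.0, day) for day in range(1, days + 1)]
    heapq.heapify(heap)
    schedule = {day: [] for day in range(1, days + 1)}
    by_size = sorted(job_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        load, day = heapq.heappop(heap)  # least-loaded day so far
        schedule[day].append(name)
        heapq.heappush(heap, (load + size, day))
    return schedule


# Hypothetical data: 82 copy jobs with made-up sizes between 100 and 250 GB.
jobs = {f"copy-{i:02d}": 100 + 25 * (i % 7) for i in range(1, 83)}
schedule = spread_health_checks(jobs)

for day in (1, 2, 3):
    total = sum(jobs[name] for name in schedule[day])
    print(f"day {day}: {len(schedule[day])} jobs, {total} GB to verify")
```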

Other than the potentially false positive 'Failed' message each day I don't find anything wrong with any of my copy jobs. Why am I getting the fail messages every day?



13/02/2018 12:01:03 PM :: Failed to start job [Customer Name - GFS Retention] Error: Another job process already in progress.
13/02/2018 12:01:10 PM :: Job Customer Name - GFS Retention cannot be started. Timeout: 24.2966094 sec
billcouper
Service Provider
Posts: 153
Liked: 34 times
Joined: Dec 18, 2017 8:58 am
Full Name: Bill Couper
Contact:

Re: I don't understand Backup Copy jobs

Post by billcouper »

Today, when the job cycle started again at 12pm, half my copy jobs actually ran and copied stuff.... that should have been copied last night... :(
I really don't understand copy jobs!
nmdange
Veteran
Posts: 528
Liked: 144 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: I don't understand Backup Copy jobs

Post by nmdange »

This isn't really obvious, but for best results, configure the backup copy jobs to start at the same time as the backup jobs. The copy job will start and then wait for the backup job to finish, and then once the backup job finishes, it will immediately start copying data.
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: I don't understand Backup Copy jobs

Post by foggy »

Right, this is the recommended approach. When the backup copy job starts the new interval, it looks for the new restore points created by the corresponding source backup job(s) since its last cycle and starts copying data. So the fact that half of the jobs started copying something means that the latest data available was not yet copied during the last copy interval. As for the false positive mentioned in your original post, I'd ask support engineers to review job debug logs to identify why another job process was active at that time and what particular process it was.
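
To make the idea of a copy interval concrete, here is a minimal sketch of how a daily interval is anchored to its start time. This is a toy model, not Veeam's code: the function name and the 9 PM restore point are assumptions chosen for illustration. It shows which interval a restore point falls into and how long the copy job sits idle before that point appears.

```python
from datetime import datetime, time, timedelta


def interval_window(moment: datetime, start: time) -> tuple[datetime, datetime]:
    """Return (begin, end) of the daily copy interval containing `moment`."""
    begin = datetime.combine(moment.date(), start)
    if moment < begin:
        # Before today's start time, so we are still in yesterday's interval.
        begin -= timedelta(days=1)
    return begin, begin + timedelta(days=1)


# Hypothetical restore point that lands at 9 PM on 13 Feb 2018.
restore_point = datetime(2018, 2, 13, 21, 0)

for label, start in [("12 PM start", time(12, 0)), ("6 PM start", time(18, 0))]:
    begin, end = interval_window(restore_point, start)
    waited = restore_point - begin
    print(f"{label}: interval {begin:%d/%m %H:%M} to {end:%d/%m %H:%M}, "
          f"job idle for {waited} before the point appeared")
```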
mcvosi
Enthusiast
Posts: 66
Liked: 8 times
Joined: Jun 14, 2011 1:55 pm
Full Name: Matthew Vaughan
Contact:

Re: I don't understand Backup Copy jobs

Post by mcvosi »

Can you elaborate on this? So, what you're saying is to have a standard backup job that starts at 8 PM, configure the copy job to also start at 8?
Zew
Veteran
Posts: 377
Liked: 86 times
Joined: Mar 17, 2015 9:50 pm
Full Name: Aemilianus Kehler
Contact:

Re: I don't understand Backup Copy jobs

Post by Zew » 2 people like this post

I'm not the first person to get confused by this. The fact that I have to read a complex user manual and understand pretty complex computer logic to configure a copy job has been my pet peeve with Veeam for a while.

I've asked many times to have a "dumb/simple" feature to simply select a backup job or file, pick a destination, and pick a scheduled time. Done. But nope, Veeam will simply state that the existing system supports far more complex scenarios.

While I don't deny that, for small businesses with very simple setups and requirements this becomes more painful than it needs to be.

Basically start here... https://helpcenter.veeam.com/docs/backu ... tml?ver=95 or https://helpcenter.veeam.com/docs/backu ... tml?ver=95

And read basically all the inner workings of a Backup Copy job to fully understand it, then maybe you'll get your backup copy jobs to work as intended. Hopefully.
billcouper
Service Provider
Posts: 153
Liked: 34 times
Joined: Dec 18, 2017 8:58 am
Full Name: Bill Couper
Contact:

Re: I don't understand Backup Copy jobs

Post by billcouper »

Thanks nmdange/foggy, I will adjust the start time for my copy jobs and see if they start copying the data when they should!
The reason I start the copy interval at 12pm is so that health checks can complete during business hours. I would much prefer that, than have the copy interval start at 8pm, health check for 8 hours, start copying at 4am and potentially still be copying into business hours. In the mornings I will be installing updates, rebooting servers and otherwise doing any required maintenance. Being able to separately schedule health checks would help in this regard, but I know that is not possible. Sigh.

I think I would much prefer the Tape Backup logic on a copy job. Tape Backup works so well!
Example: 41 primary jobs backup successfully. Tape backup starts. Tape backup puts job 1 onto tape. I manually run job 1 again, while the tape backup is still running. Tape backup will add the new backup files for job 1 onto tape during the same job - NOT the next day. This is the sort of logic I expected from copy jobs.

Logically, I don't understand why my copy jobs would be 'idle' for 12-14 hours, then only start copying last night's backup data when a new cycle starts. It is pretty absurd tbh. Any new restore point created during the copy interval 12pm-12pm should be copied immediately - NOT the next day. And if a job has to be re-run for any reason, it should be copied immediately too.
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: I don't understand Backup Copy jobs

Post by foggy »

mcvosi wrote:Can you elaborate on this? So, what you're saying is to have a standard backup job that starts at 8 PM, configure the copy job to also start at 8?
Correct. In this case the backup copy job will monitor the corresponding source backup job and start copying data once new restore points appear in the repository.
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: I don't understand Backup Copy jobs

Post by foggy »

billcouper wrote:Any new restore point created during the copy interval 12pm-12pm should be copied immediately - NOT the next day.
If you tie the copy interval to the source backup job start time, restore points will be copied immediately once they appear (just keep in mind that backup copy syncs a single VM state during each interval, so if a second restore point appears during the same interval, it will not be processed).
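
The one-state-per-interval rule can be sketched as a toy model. This is only an illustration of the behavior described here, not Veeam's implementation; the function name and timestamps are hypothetical. Within each interval the first restore point to appear is copied, and any later point in that same interval is skipped.

```python
from datetime import datetime, timedelta


def plan_copies(points, first_start, period=timedelta(days=1)):
    """Split restore points into (copied, skipped) under a one-per-interval rule."""
    copied, skipped = [], []
    used_intervals = set()
    for point in sorted(points):
        if point < first_start:
            continue  # created before the first interval; out of scope here
        index = (point - first_start) // period  # which interval this point is in
        if index in used_intervals:
            skipped.append(point)  # this interval already synced one state
        else:
            used_intervals.add(index)
            copied.append(point)
    return copied, skipped


# Hypothetical schedule: copy interval starts at 6 PM, same as the backup job.
first_start = datetime(2018, 2, 13, 18, 0)
points = [
    datetime(2018, 2, 13, 20, 30),  # nightly backup finishes
    datetime(2018, 2, 14, 9, 15),   # manual re-run the next morning
    datetime(2018, 2, 14, 20, 40),  # next nightly backup finishes
]

copied, skipped = plan_copies(points, first_start)
print("copied: ", [f"{p:%d/%m %H:%M}" for p in copied])
print("skipped:", [f"{p:%d/%m %H:%M}" for p in skipped])
```

In this model the morning re-run lands in the interval that already copied the 8:30 PM point, so it is the one that gets skipped.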
wwx500
Novice
Posts: 4
Liked: 1 time
Joined: Jul 11, 2017 8:57 pm
Full Name: William Wallhausser
Contact:

Re: I don't understand Backup Copy jobs

Post by wwx500 » 1 person likes this post

+1 Zew
billcouper
Service Provider
Posts: 153
Liked: 34 times
Joined: Dec 18, 2017 8:58 am
Full Name: Bill Couper
Contact:

Re: I don't understand Backup Copy jobs

Post by billcouper »

@foggy What is the logic behind only copying one restore point per session? If two restore points for a primary backup job are created during the copy interval, what happens to the second one? The next time the primary job runs it will CBT from the latest restore point, which isn't even in the GFS storage now, so what does the backup copy process? Does that mean the copy job falls a day behind? Will it ever catch up?
Let me explain what I am saying, because what I wrote is not very clear. With a once per 24hr backup copy schedule consider this:
1/1 6pm Primary backup
1/1 6pm Copy backup
2/1 11am Primary backup run again manually for whatever reason (CBT changes made since 1/1 6pm)
2/1 6pm Primary backup (CBT changes made since 2/1 11am)
2/1 6pm Copy backup
What does the Copy on 2nd jan process? And when would the Primary backup from 2nd jan 6pm be copied? The way I understand your reply, the copy job would be a day behind and never be able to catch up. If the primary job is run again manually it will just get even worse. This is why I said the tape backup logic is superior! IMO copy jobs should copy all restore points, not just one per session. That would negate any requirement to run multiple copy windows per day (it would negate the scheduling of the copy job entirely, it will just run all the time and process things as it sees them).

----------

Moving on.. so Microsoft released server 2016 february rollup last week. Gostev recently blogged that it will include an updated REFS driver, which fixes "all the issues". Sure enough, I haven't had a repo server lock up in the past week, and CPU isn't on 100% during block clone operations... however, 'fast clone' is no longer fast... what have Microsoft done? I see 'fast clone' operations taking many many many hours now, like this one which is running currently:
18/02/2018 6:41:25 PM :: Merging oldest restore point into full backup file [GFS] (33% done) [fast clone]
That has been running for 17.5 hours now and is only 33% done... :( This has basically destroyed our ability to keep any GFS retention (since the repo is busy no other jobs are processing during this 17.5 hour period). In 6 more hours the copy window will have expired and there is no way it will be finished. What happens to the merge operation when the 24hr window expires? I have 36 more copy jobs waiting patiently for resource availability :(
Now that CPU usage isn't pegged at 100% and the repo servers are not locking up, I can probably increase the simultaneous jobs per extent for the GFS storage, hopefully that improves things a little as far as actually getting incremental points copied to GFS storage - but it won't make fast clone any faster, if anything it will slow it down.
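
For a rough sense of scale, the figures above (17.5 hours elapsed, 33% done, about 6 hours left in the window) can be extrapolated with simple arithmetic. This sketch assumes the merge progresses at a constant rate, which may well not hold for fast clone operations, so treat the result as an order-of-magnitude estimate only.

```python
def remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Estimate hours left, assuming progress is linear in time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_hours / fraction_done
    return total - elapsed_hours


remaining = remaining_hours(elapsed_hours=17.5, fraction_done=0.33)
window_left = 6.0  # hours until the 24-hour copy interval expires

print(f"estimated remaining: {remaining:.1f} h, window left: {window_left:.1f} h")
print("finishes inside the window" if remaining <= window_left
      else "will overrun the window")
```

Under that assumption the merge would need roughly 35 more hours, several times what is left in the window.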

I'm also now seeing failures like this one pop up in the backup copy history:
18/02/2018 6:00:37 PM :: Failed to merge full backup file Error: Unable to find scale-out repository extent with previous backup files. (for storage [eea7038e-a164-40d3-aeb5-511d834ef87c])

And this one:
16/02/2018 10:39:01 PM :: Failed to merge full backup file Error: The system cannot find the file specified.
Failed to get attributes for file [C:\Mount\HDD02\Backups\CEO_1_-_GFS_Retention\CENT-A9-MB1.vm-168323D2018-02-09T120000.vbk]
Agent failed to process method {ReFs.IsIntegrityStreamSame}.

Could these be caused by failures to complete processing during the specified window? Or are they Microsoft issues? I certainly didn't delete that vbk.
Veeam is falling apart around me and I am starting to wish we hadn't migrated away from our old backup software. It was far more light-weight on cpu/ram usage and storage consumption. It didn't have fancy stuff but it worked. And tbh fancy stuff is useless if backups are unreliable.
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: I don't understand Backup Copy jobs

Post by tsightler »

billcouper wrote: 1/1 6pm Primary backup
1/1 6pm Copy backup
2/1 11am Primary backup run again manually for whatever reason (CBT changes made since 1/1 6pm)
2/1 6pm Primary backup (CBT changes made since 2/1 11am)
2/1 6pm Copy backup
What does the Copy on 2nd jan process? And when would the Primary backup from 2nd jan 6pm be copied? The way I understand your reply, the copy job would be a day behind and never be able to catch up. If the primary job is run again manually it will just get even worse. This is why I said the tape backup logic is superior! IMO copy jobs should copy all restore points, not just one per session. That would negate any requirement to run multiple copy windows per day (it would negate the scheduling of the copy job entirely, it will just run all the time and process things as it sees them).
Backup copy jobs copy one restore point per interval. The idea behind this is that, unlike with tape, which is typically connected locally to the repository, it's not uncommon for the backup copy target to be significantly slower than the primary, perhaps across a WAN or just slower storage in general, like a dedupe appliance. For example, many customers run a primary backup job every 4 hours, but only keep a daily restore point offsite because they don't have the bandwidth. If they did want to keep every restore point, they can set the interval of the BCJ to 4 hours, although there is currently no way to copy manually created points.

In your example, on 1/1 the backup copy will wait for the restore point from the primary backup job to be created and then copy it. When the primary creates a new point at 11AM on 2/1, the backup copy job will just ignore this because it's already copied one point during the interval. On 2/1 at 6PM the new backup copy interval will start, and it will wait for the primary backup job to complete its backup and then copy that restore point. Thus, on the primary you will have 3 restore points, from 1/1 at 6PM, 2/1 at 11AM, and 2/1 at 6PM, but on the backup copy job you'll only have two restore points, from 1/1 at 6PM and from 2/1 at 6PM. This is OK because backup copy jobs do not copy the backup files themselves, but rather the blocks needed for any restore point. The job simply requests all blocks that are needed to create the 2/1 restore point, even if some of those blocks are in the 11AM restore point on the primary.
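
The block-level point can be illustrated with a small sketch. It is a simplified model with hypothetical block contents, not how Veeam stores data internally: each restore point is a mapping of block number to content, and incrementals hold only the blocks that changed.

```python
def apply_increments(base, increments):
    """Return the disk state after layering increments (oldest first) on base."""
    state = dict(base)
    for increment in increments:
        state.update(increment)  # newer blocks overwrite older ones
    return state


# Primary chain (hypothetical block contents).
full_jan1_6pm = {0: "a0", 1: "b0", 2: "c0", 3: "d0"}
inc_jan2_11am = {1: "b1", 2: "c1"}  # manual re-run
inc_jan2_6pm = {2: "c2", 3: "d1"}   # scheduled run

primary_latest = apply_increments(full_jan1_6pm, [inc_jan2_11am, inc_jan2_6pm])

# The copy target only ever received the 1/1 6PM state.
target = dict(full_jan1_6pm)

# Blocks needed to bring the target to the 2/1 6PM state, wherever they
# happen to live on the primary.
needed = {n: data for n, data in primary_latest.items() if target.get(n) != data}
target.update(needed)

print("blocks sent:", needed)
print("target matches primary:", target == primary_latest)
```

Block 1 changed only in the 11AM incremental, yet it is still sent as part of the 2/1 6PM state, which is why skipping the 11AM point does not leave the copy a day behind.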

billcouper wrote:Moving on.. so Microsoft released server 2016 february rollup last week. Gostev recently blogged that it will include an updated REFS driver, which fixes "all the issues". Sure enough, I haven't had a repo server lock up in the past week, and CPU isn't on 100% during block clone operations... however, 'fast clone' is no longer fast... what have Microsoft done? I see 'fast clone' operations taking many many many hours now, like this one which is running currently:
18/02/2018 6:41:25 PM :: Merging oldest restore point into full backup file [GFS] (33% done) [fast clone]
That has been running for 17.5 hours now and is only 33% done... :( This has basically destroyed our ability to keep any GFS retention (since the repo is busy no other jobs are processing during this 17.5 hour period). In 6 more hours the copy window will have expired and there is no way it will be finished. What happens to the merge operation when the 24hr window expires? I have 36 more copy jobs waiting patiently for resource availability :(
Now that CPU usage isn't pegged at 100% and the repo servers are not locking up, I can probably increase the simultaneous jobs per extent for the GFS storage, hopefully that improves things a little as far as actually getting incremental points copied to GFS storage - but it won't make fast clone any faster, if anything it will slow it down.
Actually, the Feb CU did not include the ReFS fixes. The last I knew, there was going to be a one-off update released around Feb 22nd that could be installed manually, and hopefully it will be included in the March CU.

billcouper wrote:I'm also now seeing failures like this one pop up in the backup copy history:
18/02/2018 6:00:37 PM :: Failed to merge full backup file Error: Unable to find scale-out repository extent with previous backup files. (for storage [eea7038e-a164-40d3-aeb5-511d834ef87c])

And this one:
16/02/2018 10:39:01 PM :: Failed to merge full backup file Error: The system cannot find the file specified.
Failed to get attributes for file [C:\Mount\HDD02\Backups\CEO_1_-_GFS_Retention\CENT-A9-MB1.vm-168323D2018-02-09T120000.vbk]
Agent failed to process method {ReFs.IsIntegrityStreamSame}.

Could these be caused by failures to complete processing during the specified window? Or are they Microsoft issues? I certainly didn't delete that vbk.
Veeam is falling apart around me and I am starting to wish we hadn't migrated away from our old backup software. It was far more light-weight on cpu/ram usage and storage consumption. It didn't have fancy stuff but it worked. And tbh fancy stuff is useless if backups are unreliable.
It's really difficult to know. It sounds like the repositories are becoming unresponsive enough to be marked offline, perhaps during the merge, and that could be caused by the ReFS issue, but I'd strongly suggest opening a support case.