Discussions related to using object storage as a backup target.
collinp
Expert
Posts: 230
Liked: 13 times
Joined: Feb 14, 2012 8:56 pm
Full Name: Collin P
Contact:

Offload jobs to capacity tier hang at 99%

Post by collinp »

Ticket #07038856

We've had issues with the scale-out offloads from the performance tier to the capacity tier (Azure) hanging at 99%. Every day the scale-out offload job hangs at 99%, so in a 30-day month there will be 30 jobs hanging at 99%, one per day. It started when we upgraded to Veeam 12. We are on the latest version but haven't upgraded to 12.1 yet.

The last 2 lines of the job will read:

Object storage cleanup failed: Timed out waiting for the backup files to be released, cancelling the job
Waiting for required backup files to be released by another job: ScaleOutBackup2 Offload

We've tried switching to different object storage gateway servers. We've also tried changing the maximum number of uploads to Azure and the maximum concurrent tasks on the object repository. Is this some sort of bug? It appears that the Veeam software is trying to cancel the job, but the job never actually cancels. The job progress indicator in the top right corner of the offload job shows that all objects have been processed. We have to restart the Veeam services to clear these hung jobs. I'm not sure whether this is something we can safely ignore, and the hung jobs clutter the console.
Mildur
Product Manager
Posts: 8735
Liked: 2294 times
Joined: May 13, 2017 4:51 pm
Full Name: Fabian K.
Location: Switzerland
Contact:

Re: Offload jobs to capacity tier hang at 99%

Post by Mildur »

Hi Collin

Thank you for the case number. I checked the case and our internal bug tracking system.
It doesn't seem to be a known bug. I have asked if we can escalate the case to the next tier.

Best,
Fabian
Product Management Analyst @ Veeam Software

Re: Offload jobs to capacity tier hang at 99%

Post by collinp »

Can anyone confirm that Veeam is designed to allow 2 offload jobs to run at the same time for the same repository? So if the offload job doesn't finish within 4 hours, is the software designed to allow the next overlapping offload job to run at the same time? In every single case where they overlap, I get these error messages:

Object storage cleanup failed: Timed out waiting for the backup files to be released, cancelling the job
Waiting for required backup files to be released by another job: ScaleOutBackup2 Offload
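To check whether the hung sessions really line up with overlaps, the session times can be compared directly. The snippet below is only a minimal sketch: the `(name, start, end)` tuple layout, the session names, and the dates are made up for illustration and are not a Veeam export format.

```python
from datetime import datetime, timedelta


def overlapping_pairs(sessions):
    """Return (earlier, later) name pairs whose time ranges overlap.

    `sessions` is a list of (name, start, end) tuples. This layout is an
    assumption for illustration, not something Veeam produces.
    """
    ordered = sorted(sessions, key=lambda s: s[1])
    pairs = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start time, so nothing later can overlap
            pairs.append((name_a, name_b))
    return pairs


t0 = datetime(2024, 1, 1, 0, 0)  # arbitrary example date
sessions = [
    # Runs past the 4-hour mark.
    ("offload-1", t0, t0 + timedelta(hours=5)),
    # Starts while offload-1 is still running.
    ("offload-2", t0 + timedelta(hours=4), t0 + timedelta(hours=6)),
    # Starts after both earlier sessions have ended.
    ("offload-3", t0 + timedelta(hours=8), t0 + timedelta(hours=9)),
]
pairs = overlapping_pairs(sessions)
```

With this sample data only the first two sessions overlap, which matches the pattern where the error appears.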
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Offload jobs to capacity tier hang at 99%

Post by Gostev »

Confirming. However, do remember that, as always, there must be task slots available on the repository in question. If no task slots are available, additional jobs will not be able to start offloading at all.

Re: Offload jobs to capacity tier hang at 99%

Post by collinp »

The jobs hang at 99%, so they are essentially finished, and there are no maximum concurrent task limits set.

Re: Offload jobs to capacity tier hang at 99%

Post by Gostev »

Not necessarily finished. This can be confirmed with debug logs, but what you perceive as a "hang" is likely the checkpoint processing stage, when the retention policy is being applied. This is quite a "heavy" operation which can take a long time.

Re: Offload jobs to capacity tier hang at 99%

Post by collinp »

I haven't seen a single one complete. Prior to restarting the Veeam services, some had been running for 30 days. The activity graph showed no activity. I will upload the logs to the case today.

Re: Offload jobs to capacity tier hang at 99%

Post by Gostev »

Sounds good. By the way, if they confirm the job is indeed in the checkpoint processing stage, then upgrading to 12.1 should help a lot as there are tons of optimizations in this particular department.

Re: Offload jobs to capacity tier hang at 99%

Post by collinp »

I just wanted to clarify whether the scale-out offloads are designed to work this way. This has been my experience:

1) Two offload jobs aren't designed to run at the same time: if the first one doesn't complete in 4 hours, the second one runs and overlaps with the first, causing the error "Waiting for required backup files to be released by another job".
2) For the scale-out repository rescan to complete successfully, we first have to manually stop and disable all jobs and offloads targeting the scale-out repository (KB4303) to prevent the "performance tier is not synchronized with the capacity tier" error. So far we have been doing this, and the rescans have been running for 80 hours, but we don't know whether they will take days or weeks to finish, which puts us at risk while backups are disabled.
3) When offloads run, they go sequentially through tens of thousands of objects (in our case) with the message "....vib has already been offloaded, skipping". If the .vib has already been offloaded, why isn't that recorded in the database and indexed somewhere? Why does the software have to go out and check again for something that has already happened? It takes hours just to check for files that have already been offloaded successfully.


My question is: are there optimizations planned for scale-out offloads and rescans, so that the software will handle all of this automatically?
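To make point 3 concrete, here is a rough sketch of the kind of indexed lookup being asked about. It does not describe how Veeam tracks offloaded files; `plan_offload`, `offloaded_index`, and the file names are all hypothetical.

```python
def plan_offload(chain, offloaded_index):
    """Split a backup chain into files to upload and a count of skipped files.

    `offloaded_index` is a set of file names recorded as already offloaded.
    It stands in for a hypothetical local index, not a Veeam data structure.
    """
    # Set membership is a constant-time lookup on average, so no per-file
    # round trip to object storage is needed to decide what to skip.
    to_upload = [name for name in chain if name not in offloaded_index]
    skipped = len(chain) - len(to_upload)
    return to_upload, skipped


offloaded_index = {"vm1.vbk", "vm1_d1.vib", "vm1_d2.vib"}
chain = ["vm1.vbk", "vm1_d1.vib", "vm1_d2.vib", "vm1_d3.vib"]
to_upload, skipped = plan_offload(chain, offloaded_index)
```

In this toy example only the newest increment is left to upload and the other three files are skipped without being re-checked one by one.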

Re: Offload jobs to capacity tier hang at 99%

Post by Gostev »

I don't believe there's a notion of "per-job offload" to start with. Rather, every 4 hours (by default) all backup chains are analyzed for offload candidates, and all newly determined ones are added to a single pipeline of the offload process. So nothing really happens every 4 hours except that the offload queue is extended with additional backups. At least this is how it has worked since the inception of this functionality, and I don't believe this has ever changed. But @veremin would know for sure.
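The queue behavior described above can be pictured with a small toy model. This is a sketch for illustration only, not Veeam code: the class `OffloadPipeline`, its methods, and the file names are hypothetical.

```python
from collections import deque


class OffloadPipeline:
    """Toy model of a single offload queue (not Veeam's implementation)."""

    def __init__(self):
        self.queue = deque()  # restore points waiting to be offloaded
        self.known = set()    # everything that has ever been queued

    def scan(self, candidates):
        """Periodic scan: append only newly found candidates to the one queue."""
        added = 0
        for point in candidates:
            if point not in self.known:
                self.known.add(point)
                self.queue.append(point)
                added += 1
        return added

    def process_next(self):
        """Offload the oldest queued restore point, or return None if idle."""
        return self.queue.popleft() if self.queue else None


pipeline = OffloadPipeline()
first = pipeline.scan(["vm1.vbk", "vm1_d1.vib"])
# The next periodic scan sees the old restore points plus one new one.
second = pipeline.scan(["vm1.vbk", "vm1_d1.vib", "vm1_d2.vib"])
```

In this model a second scan never starts a competing job; it only extends the existing queue with the one new restore point.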
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: Offload jobs to capacity tier hang at 99%

Post by veremin »

Your understanding is correct: regular offload sessions should not interfere with each other and should instead add restore points to the processing queue.

I've briefly checked the case, and I'd recommend escalating the ticket to a higher tier for further investigation, since the behavior you are seeing does not seem expected.

Thanks!