-
- Expert
- Posts: 239
- Liked: 13 times
- Joined: Feb 14, 2012 8:56 pm
- Full Name: Collin P
- Contact:
Offload jobs to capacity tier hang at 99%
Ticket #07038856
We've had issues with scale-out offloads from the performance tier to the capacity tier (Azure) hanging at 99%. The scale-out offload job hangs at 99% every day, so a 30-day month leaves 30 jobs stuck at 99%, one per day. It started when we upgraded to Veeam 12. We are on the latest version but haven't upgraded to 12.1 yet.
The last 2 lines of the job will read:
Object storage cleanup failed: Timed out waiting for the backup files to be released, cancelling the job
Waiting for required backup files to be released by another job: ScaleOutBackup2 Offload
We've tried changing the object storage gateway servers to different ones. We've also tried changing max uploads to Azure and maximum concurrent tasks on the object repository. Is this some sort of bug? It appears that the Veeam software is trying to cancel the job but it's not cancelling. The top right corner of the offload job shows Job progress as all objects have been processed. We have to restart the Veeam services to clear these hung jobs. I'm not sure if this is something that we can safely ignore. It clutters the console with these hung jobs.
-
- Product Manager
- Posts: 9393
- Liked: 2502 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: Offload jobs to capacity tier hang at 99%
Hi Collin
Thank you for the case number. I checked the case and our internal bug tracking system.
It doesn't seem to be a known bug. I have asked if we can escalate the case to the next tier.
Best,
Fabian
Product Management Analyst @ Veeam Software
-
- Expert
- Posts: 239
- Liked: 13 times
- Joined: Feb 14, 2012 8:56 pm
- Full Name: Collin P
- Contact:
Re: Offload jobs to capacity tier hang at 99%
Can anyone confirm that Veeam is designed to allow 2 offload jobs to run at the same time for the same repository? So if the offload job doesn't finish within 4 hours, is the software designed to allow the next overlapping offload job to run at the same time? In every single case where they overlap, I get these error messages:
Object storage cleanup failed: Timed out waiting for the backup files to be released, cancelling the job
Waiting for required backup files to be released by another job: ScaleOutBackup2 Offload
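To check whether the errors really line up with overlapping sessions, the start and end times from the session history can be compared. Below is a minimal Python sketch of that comparison. The session names and timestamps are made up for illustration, and `find_overlaps` is a hypothetical helper, not part of any Veeam tooling.

```python
from datetime import datetime, timedelta

def find_overlaps(sessions):
    """Return pairs of session names whose time ranges overlap.

    `sessions` is a list of (name, start, end) tuples; end=None means the
    session is still running (for example, stuck at 99%).
    """
    far_future = datetime.max
    ordered = sorted(sessions, key=lambda s: s[1])
    overlaps = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        end_a = end_a or far_future
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later session can overlap
            overlaps.append((name_a, name_b))
    return overlaps

# Hypothetical sessions: one every 4 hours, the first one never finishes.
t0 = datetime(2023, 12, 1, 0, 0)
sessions = [
    ("offload-1", t0, None),
    ("offload-2", t0 + timedelta(hours=4), t0 + timedelta(hours=5)),
    ("offload-3", t0 + timedelta(hours=8), t0 + timedelta(hours=9)),
]
overlapping_pairs = find_overlaps(sessions)
print(overlapping_pairs)
```

With this sample data, the session that never ends overlaps with both later sessions, while the two later sessions do not overlap with each other.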
-
- Chief Product Officer
- Posts: 31525
- Liked: 7047 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Offload jobs to capacity tier hang at 99%
Confirming. However, do remember that, as always, there must be task slots available on the repository in question. If no task slots are available, additional jobs will not be able to start offloading in principle.
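As a rough illustration of the task slot idea, here is a toy Python model in which a repository has a fixed number of slots and a task can only start when one is free. This is only a sketch of the concept, not how Veeam implements it. The class name `RepositorySlots` and the slot count are assumptions made for the example.

```python
import threading

class RepositorySlots:
    """Toy model of a repository with a fixed number of task slots."""

    def __init__(self, task_slots):
        self._slots = threading.BoundedSemaphore(task_slots)

    def try_start(self):
        """Claim a slot without blocking; False means none are free."""
        return self._slots.acquire(blocking=False)

    def finish(self):
        """Release a slot claimed earlier by try_start()."""
        self._slots.release()

repo = RepositorySlots(task_slots=2)
results = [repo.try_start() for _ in range(3)]  # third task finds no free slot
repo.finish()                                   # one running task completes
retry = repo.try_start()                        # a slot is free again
print(results, retry)
```

In this model, a task that finds no free slot simply cannot start until another task releases one.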
-
- Expert
- Posts: 239
- Liked: 13 times
- Joined: Feb 14, 2012 8:56 pm
- Full Name: Collin P
- Contact:
Re: Offload jobs to capacity tier hang at 99%
The jobs hang at 99% so they are essentially finished and there are no maximum concurrent task limits set.
-
- Chief Product Officer
- Posts: 31525
- Liked: 7047 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Offload jobs to capacity tier hang at 99%
Not necessarily finished. This can be confirmed with debug logs, but what you perceive as a "hang" is likely the checkpoint processing stage, when the retention policy is being applied. This is quite a "heavy" operation which can take a long time.
-
- Expert
- Posts: 239
- Liked: 13 times
- Joined: Feb 14, 2012 8:56 pm
- Full Name: Collin P
- Contact:
Re: Offload jobs to capacity tier hang at 99%
I haven't seen a single one complete. Prior to restarting the Veeam services, some had been running for 30 days. The activity graph showed no activity. I will upload the logs to the case today.
-
- Chief Product Officer
- Posts: 31525
- Liked: 7047 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Offload jobs to capacity tier hang at 99%
Sounds good. By the way, if they confirm the job is indeed in the checkpoint processing stage, then upgrading to 12.1 should help a lot as there are tons of optimizations in this particular department.
-
- Expert
- Posts: 239
- Liked: 13 times
- Joined: Feb 14, 2012 8:56 pm
- Full Name: Collin P
- Contact:
Re: Offload jobs to capacity tier hang at 99%
I just wanted to clarify whether the scale-out offloads are designed to work this way. This has been my experience:
1) Two offload jobs aren't designed to run at the same time. If the first one doesn't complete in 4 hours, the second one will run and overlap with the first, causing the error "Waiting for required backup files to be released by another job".
2) For the scale-out repository rescan to complete successfully, we have to manually stop and disable all jobs and offloads targeting the scale-out repository first (KB4303), to prevent the "performance tier is not synchronized with the capacity tier" error. So far we have been doing this, and the scale-out rescans have been running for 80 hours, but we don't know whether they will take days or weeks to finish, which puts us at risk with backups disabled.
3) When offloads run, they sequentially go through tens of thousands of objects (in our case) with the message "....vib has already been offloaded, skipping". If the vib has already been offloaded, why isn't that recorded in the database and indexed somewhere? Why does the software have to go out and check again for something that has already happened? It takes hours just to check for files that have already been offloaded successfully.
My question is, are there optimizations planned for scale-out offloads and rescans, so that the software will handle all of this automatically?
-
- Chief Product Officer
- Posts: 31525
- Liked: 7047 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Offload jobs to capacity tier hang at 99%
I don't believe there's a notion of "per-job offload" to start with. Rather, every 4 hours (by default) all backup chains are analyzed for offload candidates, and all newly determined ones are added to a single pipeline of the offload process. So nothing really happens every 4 hours except that the offload queue is extended with additional backups. At least this is how it has worked since the inception of this functionality, and I don't believe this has ever changed. But @veremin would know for sure.
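To make the single pipeline idea concrete, here is a small Python sketch of a queue that periodic scans only ever append to. It is a conceptual model of the description above and not Veeam's actual code. The class `OffloadQueue` and the file names are invented for the example.

```python
from collections import deque

class OffloadQueue:
    """Toy model: one shared queue that periodic scans only append to."""

    def __init__(self):
        self.pending = deque()
        self.known = set()  # everything ever queued, to avoid duplicates

    def scan(self, candidates):
        """Queue newly found restore points; return how many were new."""
        added = 0
        for point in candidates:
            if point not in self.known:
                self.known.add(point)
                self.pending.append(point)
                added += 1
        return added

    def process_next(self):
        """Pop and return the oldest pending restore point, or None."""
        return self.pending.popleft() if self.pending else None

queue = OffloadQueue()
first_scan = queue.scan(["vm1.vbk", "vm1-mon.vib"])
# Four hours later the next scan sees the old candidates plus one new one.
second_scan = queue.scan(["vm1.vbk", "vm1-mon.vib", "vm1-tue.vib"])
print(first_scan, second_scan, list(queue.pending))
```

In this model the second scan does not start a competing job. It only adds the one restore point that was not queued before.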
-
- Product Manager
- Posts: 20307
- Liked: 2270 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Offload jobs to capacity tier hang at 99%
Your understanding is correct: the regular offload sessions should not interfere with each other and should add restore points to the processing queue instead.
I've briefly checked the case, and I'd recommend escalating the ticket to a higher tier for further investigation, as the behavior you are experiencing does not seem expected.
Thanks!