-
- Enthusiast
- Posts: 35
- Liked: 2 times
- Joined: Jun 23, 2011 3:11 pm
- Full Name: Jonathan Shapiro
- Contact:
Offloading Jobs stuck at 99% for Days
Hello:
I have a SOBR using Wasabi for the capacity tier. Recently, I swapped in a new storage bucket with object lock to set up immutability for my backups. The initial offload took about a week to get my most recent backup chains into the capacity tier. Now that I'm mostly caught up, I've noticed that general offloading jobs still kick off and progress nicely until they hit 99%. They seem essentially done, but they hang at 99% for days without transferring any more data, and I don't know what the system is doing. The jobs do close out eventually, but then more offloading jobs kick off and hang for a similar amount of time. I'm on v12. What is going on? What logs should I check? Should I cancel these long-running jobs?
-
- Product Manager
- Posts: 9848
- Liked: 2610 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Hi Jonathan
I strongly recommend opening a case with our customer support; we cannot solve such issues over the forum.
Without a case number this topic may be deleted by a moderator.
Best,
Fabian
Product Management Analyst @ Veeam Software
-
- Enthusiast
- Posts: 35
- Liked: 2 times
- Joined: Jun 23, 2011 3:11 pm
- Full Name: Jonathan Shapiro
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Thanks. I opened a case. Case #05968616
-
- Enthusiast
- Posts: 35
- Liked: 2 times
- Joined: Jun 23, 2011 3:11 pm
- Full Name: Jonathan Shapiro
- Contact:
Re: Offloading Jobs stuck at 99% for Days
5 days later, and all I heard was that my ticket would be sent to the object storage group. Nobody from that group contacted me.
-
- Service Provider
- Posts: 48
- Liked: 7 times
- Joined: Feb 20, 2023 9:28 am
- Full Name: Marco Glavas
- Contact:
Re: Offloading Jobs stuck at 99% for Days
All I can tell you right now is that we see similar things.
-
- Service Provider
- Posts: 48
- Liked: 7 times
- Joined: Feb 20, 2023 9:28 am
- Full Name: Marco Glavas
- Contact:
Re: Offloading Jobs stuck at 99% for Days
I think it's a grave oversight that some things, like checkpoint cleanups, are not displayed in the job transcripts. You only see a message when one fails. And since some of them take a day or more, they usually get interrupted by the next backup cycle.
I have no idea what that does to data integrity, but I assume we keep losing literal days of offloading to things like this.
-
- Product Manager
- Posts: 9848
- Liked: 2610 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Hi @jshapiro
I'm very sorry that you had to wait for two days.
Please let me know if it happens again.
You can also use the Escalate to Support Management option:
https://www.veeam.com/kb2320
Hi @EWMarco
Indeed. Session details don't display everything. There are a lot of background tasks which are only visible in our debug logs.
If you see the same issue, please open a support case and provide me with the case number. Thank you.
Best,
Fabian
Product Management Analyst @ Veeam Software
-
- Enthusiast
- Posts: 35
- Liked: 2 times
- Joined: Jun 23, 2011 3:11 pm
- Full Name: Jonathan Shapiro
- Contact:
Re: Offloading Jobs stuck at 99% for Days
I finally heard back from support, and their log analysis may have revealed something. They noticed that when I upgraded from version 11 to 12, the connection type to the capacity tier was set to Direct mode. Support told me to change that to "Connect through a gateway server" and then select my preferred gateway(s). In my case, I selected my storage server so it could offload directly to Wasabi over the Internet. I made this change late yesterday, when I had two offload jobs stuck at 99%. Both wrapped up during the night, and another one started at 2:00 AM this morning and finished very quickly. Maybe this was the issue. Right now, my Veeam server is idle. I will continue to watch it.
-
- Product Manager
- Posts: 9848
- Liked: 2610 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Hi Jonathan
Thanks for the update.
Best,
Fabian
Product Management Analyst @ Veeam Software
-
- Enthusiast
- Posts: 35
- Liked: 2 times
- Joined: Jun 23, 2011 3:11 pm
- Full Name: Jonathan Shapiro
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Seemed to be better for a couple of days, but now I have some offload jobs stuck at 99% again. I updated the ticket notes.
-
- Novice
- Posts: 5
- Liked: never
- Joined: Sep 25, 2014 9:40 pm
- Full Name: Kevin
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Same exact issue. Opening a ticket: Case #06003117.
-
- Product Manager
- Posts: 20439
- Liked: 2310 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Kindly post its number here, once it's opened. This way we can follow and assist the investigation. Thanks!
-
- Influencer
- Posts: 20
- Liked: 16 times
- Joined: Nov 07, 2022 4:48 pm
- Full Name: Nathan
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Been seeing the same thing here ever since upgrading to V12 and switching S3 repo to Gateway per case 05878826. Offloads stuck at 99%, usually takes 24hrs+ to remove a single checkpoint from S3. Everything eventually goes through but very very slow. Seems to be a toss-up whether or not the offloads will complete quickly or stall every night.
-
- Enthusiast
- Posts: 56
- Liked: 6 times
- Joined: Jun 18, 2009 2:27 pm
- Full Name: Yves Smolders
- Contact:
Re: Offloading Jobs stuck at 99% for Days
I've got the same going on with V11.
A small server is being offloaded to Wasabi, and the offload takes a long time for very small incrementals.
The deltas are truly only in the megabyte range (500 MB to a few GB at most). Usually the offload completes within minutes, but sometimes it takes up to half an hour, and one even took 10 hours.
In the logs I have repetitions of this:
[19.04.2023 10:54:51.040] < 17344> srv | Waiting for the next server command.
[19.04.2023 10:54:51.040] < 17344> srv | _______________________________________________________________________________
[19.04.2023 10:54:54.739] < 17344> srv | retrieved command: 154 (HandleRemoteArchClient(154))
[19.04.2023 10:54:54.739] < 17344> arh | Cleaning up storage blocks in archive
[19.04.2023 10:54:54.739] < 17344> arh | Using local client for archive repository '6996b00e-2a4b-43cd-8288-bfcb79831333'.
[19.04.2023 10:54:55.037] < 17344> srv | Command successfully processed, elapsed: 0.3020
[19.04.2023 10:54:55.037] < 17344> srv |
[19.04.2023 10:54:55.037] < 17344> srv | Waiting for the next server command.
[19.04.2023 10:54:55.037] < 17344> srv | _______________________________________________________________________________
[19.04.2023 10:54:57.184] < 17344> srv | retrieved command: 154 (HandleRemoteArchClient(154))
[19.04.2023 10:54:57.184] < 17344> arh | Cleaning up storage blocks in archive
[19.04.2023 10:54:57.184] < 17344> arh | Using local client for archive repository '6996b00e-2a4b-43cd-8288-bfcb79831333'.
[19.04.2023 10:54:57.353] < 17344> srv | Command successfully processed, elapsed: 0.1650
[19.04.2023 10:54:57.353] < 17344> srv |
[19.04.2023 10:54:57.353] < 17344> srv | Waiting for the next server command.
[19.04.2023 10:54:57.353] < 17344> srv | _______________________________________________________________________________
[19.04.2023 10:55:10.296] < 17344> srv | retrieved command: 154 (HandleRemoteArchClient(154))
[19.04.2023 10:55:10.296] < 17344> arh | Cleaning up storage blocks in archive
[19.04.2023 10:55:10.296] < 17344> arh | Using local client for archive repository '6996b00e-2a4b-43cd-8288-bfcb79831333'.
[19.04.2023 10:56:00.946] < 17344> srv | Command successfully processed, elapsed: 50.6590
[19.04.2023 10:56:00.946] < 17344> srv |
[19.04.2023 10:56:00.946] < 17344> srv | Waiting for the next server command.
[19.04.2023 10:56:00.946] < 17344> srv | _______________________________________________________________________________
[19.04.2023 10:56:01.942] < 17344> srv | retrieved command: 154 (HandleRemoteArchClient(154))
[19.04.2023 10:56:01.942] < 17344> arh | Cleaning up storage blocks in archive
[19.04.2023 10:56:01.942] < 17344> arh | Using local client for archive repository '6996b00e-2a4b-43cd-8288-bfcb79831333'.
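For what it's worth, the slow iterations in a log like the one above can be spotted by pulling out the "elapsed" values. This is just a hypothetical parsing sketch of my own, assuming the exact line format shown:

```python
import re

# Match the per-command timing lines, e.g. "Command successfully processed, elapsed: 50.6590"
ELAPSED = re.compile(r"elapsed:\s*([\d.]+)")

def slow_iterations(log_lines, threshold_sec=10.0):
    """Return elapsed times (in seconds) for commands slower than the threshold."""
    times = [float(m.group(1)) for line in log_lines for m in ELAPSED.finditer(line)]
    return [t for t in times if t > threshold_sec]

sample = [
    "[19.04.2023 10:54:55.037] < 17344> srv | Command successfully processed, elapsed: 0.3020",
    "[19.04.2023 10:56:00.946] < 17344> srv | Command successfully processed, elapsed: 50.6590",
]
print(slow_iterations(sample))  # -> [50.659]
```

In my log, most "Cleaning up storage blocks in archive" commands finish in well under a second, but some take 50+ seconds, which adds up over thousands of iterations.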
Edit: about to open a case
-
- Enthusiast
- Posts: 56
- Liked: 6 times
- Joined: Jun 18, 2009 2:27 pm
- Full Name: Yves Smolders
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Opened a case #06018563
-
- Influencer
- Posts: 22
- Liked: 4 times
- Joined: Dec 10, 2009 8:44 pm
- Full Name: Sam Journagan
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Same issue, using Wasabi as well. Guess I'll open a ticket...
-
- Service Provider
- Posts: 2
- Liked: never
- Joined: Jun 29, 2021 8:22 pm
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Same issue, but using Cloudian storage. I'll be opening a ticket as soon as I'm allowed to do so by Veeam, but in the meantime, watching this thread like a hawk...
-
- Enthusiast
- Posts: 35
- Liked: 2 times
- Joined: Jun 23, 2011 3:11 pm
- Full Name: Jonathan Shapiro
- Contact:
Re: Offloading Jobs stuck at 99% for Days
The Veeam engineer gave me some registry edits to apply to the Veeam server to optimize for Wasabi. After I applied them, offload jobs seemed to run better for a few weeks, but they are once again getting stuck at 99% for days and stacking up. I just opened another ticket. Here are the registry edits I had applied:
New-ItemProperty -Path 'HKLM:\SOFTWARE\Veeam\Veeam Backup and Replication\' -Name 'S3ConcurrentTaskLimit' -Value "10" -PropertyType DWORD -Force
New-ItemProperty -Path 'HKLM:\SOFTWARE\Veeam\Veeam Backup and Replication\' -Name 'S3RequestTimeoutSec' -Value "900" -PropertyType DWORD -Force
New-ItemProperty -Path 'HKLM:\SOFTWARE\Veeam\Veeam Backup and Replication\' -Name 'S3RequestRetryTotalTimeoutSec' -Value "9000" -PropertyType DWORD -Force
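For context on how these three values might relate (my own reading of the names, not documented Veeam behavior): the 9000-second total retry budget would allow roughly ten full-length 900-second request attempts, alongside 10 concurrent S3 tasks:

```python
# Values from the support-provided registry keys above.
s3_settings = {
    "S3ConcurrentTaskLimit": 10,            # parallel S3 tasks
    "S3RequestTimeoutSec": 900,             # per-request timeout
    "S3RequestRetryTotalTimeoutSec": 9000,  # total retry budget per request
}

# Rough upper bound on full-length attempts before the retry budget runs out.
max_attempts = (s3_settings["S3RequestRetryTotalTimeoutSec"]
                // s3_settings["S3RequestTimeoutSec"])
print(max_attempts)  # -> 10
```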
-
- Service Provider
- Posts: 19
- Liked: 2 times
- Joined: May 27, 2021 3:48 am
- Full Name: Dean Anderson
- Contact:
-
- Novice
- Posts: 3
- Liked: 2 times
- Joined: Apr 26, 2023 5:32 pm
- Contact:
Re: Offloading Jobs stuck at 99% for Days
I'm having this same issue. I applied the same registry keys, and the offloads do go through eventually, but they take an abnormally long time. The upload of the data takes 5-15 minutes, then it sits at 99% for all VMs for an hour or more.
-
- Lurker
- Posts: 1
- Liked: never
- Joined: May 17, 2023 11:28 am
- Contact:
Re: Offloading Jobs stuck at 99% for Days
I think your job stays blocked on "cleaning".
I had the same problem; it was solved in Case #04958240:
[HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication]
"StgIndexCleanupTaskSize"=hex(b):00,28,00,00,00,00,00,00
"StgIndexEnableCache"=dword:00000001
"StgIndexUploadTaskSize"=hex(b):00,28,00,00,00,00,00,00
StgIndexCleanupTaskSize : 2800
StgIndexEnableCache : 1
StgIndexUploadTaskSize : 2800
You need more RAM, and you can double the values to speed up the process.
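If I read the .reg fragment correctly, the hex(b) values are little-endian QWORDs, so the "2800" shown is the hexadecimal form (0x2800 = 10240 decimal). A quick sketch to decode the exported bytes:

```python
# Decode the little-endian QWORD bytes from the .reg export above,
# e.g. "StgIndexCleanupTaskSize"=hex(b):00,28,00,00,00,00,00,00
raw = bytes([0x00, 0x28, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])
value = int.from_bytes(raw, "little")
print(hex(value), value)  # -> 0x2800 10240
```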
-
- Enthusiast
- Posts: 35
- Liked: 2 times
- Joined: Jun 23, 2011 3:11 pm
- Full Name: Jonathan Shapiro
- Contact:
Re: Offloading Jobs stuck at 99% for Days
I stripped out one of the three registry tweaks the original Veeam engineer provided for Wasabi S3 storage: the one limiting the S3 concurrent task limit to 10. That setting caused offloading to run far too slowly, because not enough offload threads were running. I also opened another support ticket with Veeam because offload jobs were back to getting stuck at 99%. A job would get stuck on specific VMs within the backup job while deleting checkpoints. At that stage, it could take days for some of them to process, and there wasn't much visual feedback that anything was happening. Anyway, the solution was to update Veeam to 12.0.0.1420_20230412, which includes a number of fixes, including one for jobs taking a long time to delete checkpoints. With the S3 concurrent task limit tweak removed and the update applied, things have been good.
-
- Novice
- Posts: 6
- Liked: never
- Joined: May 21, 2023 4:36 pm
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Any updates on this issue for those of us on 11a?
-
- Product Manager
- Posts: 20439
- Liked: 2310 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Offloading Jobs stuck at 99% for Days
The R&D team believes that the original issue was caused by unoptimized enumeration logic existing in v12 prior to the latest build (12.0.0.1420 P20230412).
Some of the object storage repository operations (offload, rescan, checkpoint deletion, etc.) relied on that mechanism and experienced performance degradation as a result.
The build 12.0.0.1420 P20230412 improved the procedure dramatically and eliminated excessive requests in a few places.
We recommend updating to the latest product version and seeing whether it solves the problem.
Thanks!
-
- Product Manager
- Posts: 20439
- Liked: 2310 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Offloading Jobs stuck at 99% for Days
The issue reported in this thread is caused by code that did not exist in pre-v12 product versions. So even if the symptoms are similar, the causes must be completely different.
So I suggest you create your own ticket (and forum thread as well) and provide the debug logs to a support engineer for further investigation.
Thanks!
-
- Service Provider
- Posts: 19
- Liked: 2 times
- Joined: May 27, 2021 3:48 am
- Full Name: Dean Anderson
- Contact:
Re: Offloading Jobs stuck at 99% for Days
RonanD wrote: ↑May 17, 2023 11:38 am
I think your job stays blocked on "cleaning".
I had the same problem; it was solved in Case #04958240:
[HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication]
"StgIndexCleanupTaskSize"=hex(b):00,28,00,00,00,00,00,00
"StgIndexEnableCache"=dword:00000001
"StgIndexUploadTaskSize"=hex(b):00,28,00,00,00,00,00,00
StgIndexCleanupTaskSize : 2800
StgIndexEnableCache : 1
StgIndexUploadTaskSize : 2800
You need more RAM, and you can double the values to speed up the process.
Could you let us know what those numbers are based on? A specific amount of RAM? What should I configure in the case of 8 virtual cores and 16 GB of RAM?
-
- Novice
- Posts: 6
- Liked: never
- Joined: May 21, 2023 4:36 pm
- Contact:
Re: Offloading Jobs stuck at 99% for Days
veremin wrote: ↑May 24, 2023 3:35 pm
The R&D team believes that the original issue was caused by unoptimized enumeration logic existing prior to 12.0.0.1420 P20230412.
Some of the object storage repository operations (offload, rescan, checkpoint deletion, etc.) relied on that mechanism and experienced performance degradation as a result.
The build 12.0.0.1420 P20230412 improved the procedure dramatically and eliminated excessive requests in a few places.
We recommend updating to the latest product version and seeing whether it solves the problem.
Thanks!
So is 11a not supported now?
-
- Product Manager
- Posts: 20439
- Liked: 2310 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Offloading Jobs stuck at 99% for Days
Kindly read my latest response; I believe I said the opposite:
The issue reported in this thread is caused by code that did not exist in pre-v12 product versions. So even if the symptoms are similar, the causes must be completely different.
So I suggest you create your own ticket (and forum thread as well) and provide the debug logs to a support engineer for further investigation.
Thanks for understanding.
-
- Novice
- Posts: 6
- Liked: never
- Joined: May 21, 2023 4:36 pm
- Contact:
Re: Offloading Jobs stuck at 99% for Days
So will the "unoptimized enumeration logic existing prior to 12.0.0.1420 P20230412" be fixed in 11a, which is still supported? Or is it only a v12 issue? I kind of think some v11 users have reported the same thing with Wasabi.
-
- Chief Product Officer
- Posts: 31836
- Liked: 7328 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Offloading Jobs stuck at 99% for Days
As Vladimir explains, the issue discussed in this thread was first introduced in V12 and it is fixed in V12 P20230412. If you have a similar issue with V11a, please open a support case for investigation, as this would be something totally unrelated to OP's issue.