Stalling Jobs

Feb 26, 2025 3:21 pm

Hi All,

We are currently experiencing issues with Veeam Backup for Microsoft 365. We are running the latest version of 8.1, and the main controller server also runs PostgreSQL and NATS-SERVER. We have two proxy pools consisting of 4 Linux proxies each. We use one proxy pool for onboarding new customers and the other for BAU jobs. We split each tenant into 3 jobs: Exchange, OneDrive and SharePoint/Teams. Each customer also gets 3 Wasabi buckets (Exchange, OneDrive and SharePoint/Teams), which are used as an individual repo per job (i.e. Customer A Exchange Backup targets Customer A Exchange Repo, etc.).

We are seeing a mixture of slow and stalled jobs. The only "fix" we currently have is to stop the Veeam services, rename the streams folder in NATS and restart the controller server. This seems to give VBO a kick to start jobs again, but having to do this all the time is not great.

I have enabled monitoring in the NATS config file so that we can add extra monitoring to try to pinpoint what is causing these issues, but so far we are unsure what the cause is. I have noticed that we are getting API errors: approx. 5% of API requests are errors. I am very new to NATS, so I am not sure what is causing this.

We have a load of "out of the box" monitoring available, and we can also configure custom monitors, which I have been doing. Does anyone have any advice on what we should specifically be monitoring, especially with regard to NATS-SERVER and PostgreSQL, that I can look at adding? Or has anyone else experienced similar issues? If so, what have you done to resolve them?

**Edit** Monitoring of the controller server seems fine, no spikes in CPU/Memory. Also the resources on our Proxies are good too, averaging around 50% max usage on all proxies.
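For anyone wanting to track that API error percentage in a custom monitor, here is a minimal Python sketch. It assumes the NATS monitoring endpoint is reachable on `http://localhost:8222` (the port is an assumption, use whatever your NATS config sets) and that the `/jsz` response carries cumulative `api.total` and `api.errors` counters, which is my understanding of the NATS monitoring output. Whether the 5% figure above comes from these same counters is not confirmed. The function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def jetstream_api_error_rate(jsz: dict) -> float:
    """Return errors/total from a parsed /jsz document (0.0 if no requests yet)."""
    api = jsz.get("api", {})
    total = api.get("total", 0)
    errors = api.get("errors", 0)
    return errors / total if total else 0.0


def error_rate_between(previous: dict, current: dict) -> float:
    """Error rate for the interval between two /jsz samples.

    The counters are cumulative, so the difference between two samples
    says more about recent behaviour than the lifetime ratio does.
    """
    prev_api, curr_api = previous.get("api", {}), current.get("api", {})
    total = curr_api.get("total", 0) - prev_api.get("total", 0)
    errors = curr_api.get("errors", 0) - prev_api.get("errors", 0)
    return errors / total if total > 0 else 0.0


def fetch_jsz(base_url: str = "http://localhost:8222") -> dict:
    """Fetch and parse /jsz. The base URL is an assumption, adjust to your config."""
    with urlopen(f"{base_url}/jsz", timeout=5) as resp:
        return json.load(resp)
```

Sampling `fetch_jsz()` every few minutes and alerting on `error_rate_between` would show whether errors spike just before jobs stall or are constant background noise.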

We have a support request open (#07607359), but we are not getting anywhere very fast at the moment.

Thanks in advance!
Keiron

Feb 27, 2025 7:14 am

We have been having the exact same issues for about the last 1-2 weeks. That's also the only fix I'm aware of; I have had to do it a few times and I don't like it. We are also seeing low load, low proxy utilisation, slow backup speeds and stalling jobs we can't stop or interact with in any way.
I have been working with support on several different performance-related cases, so I will update if I hear anything. Things have really been going from bad to worse for me over the last few weeks with Veeam 365, and I'm getting quite frustrated tbh.

Edit: to provide further context, we are set up in a very similar way to you: 8 proxies using proxy pooling, and each tenant has 3 jobs going to 3 separate Wasabi repos.

Post by **keironbell** » Feb 27, 2025 9:57 am

It's good to know it's not only us in this situation. It also makes me feel better about our setup hearing yours is very similar haha!

Hopefully we get an update soon.

Post by **pat_ren** » Feb 27, 2025 3:27 pm

So far I have had some success by limiting the number of jobs running. I'm not sure how many jobs you have, but I found that once more than about 200 were running at once the issue happened much more regularly. We have over 700 jobs on this particular server.
I created a script that runs the jobs 3 times per day, ordered from oldest to newest most recent backup. It checks how many jobs are running first and only queues up new jobs when 100 or fewer are running. This does seem to have helped a bit as a workaround.
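A rough Python sketch of that selection logic, for anyone wanting to build something similar. This is not the script described above; the `Job` record and `jobs_to_start` are hypothetical names, and it assumes you can already list each job's last backup time and running state from your own tooling. Filling all free slots up to the cap, and putting never-backed-up jobs first, are also assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Job:
    name: str
    last_backup: Optional[datetime]  # None means the job has never completed
    running: bool = False


def jobs_to_start(jobs: List[Job], max_running: int = 100) -> List[Job]:
    """Pick idle jobs to start, oldest last backup first, up to the running cap."""
    running = sum(1 for job in jobs if job.running)
    free_slots = max(0, max_running - running)
    idle = [job for job in jobs if not job.running]
    # Jobs that have never run sort first (treated as infinitely old).
    idle.sort(key=lambda job: job.last_backup or datetime.min)
    return idle[:free_slots]
```

Running this on a timer and starting whatever it returns keeps the concurrent job count at or below `max_running` while always favouring the stalest backups.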

Post by **pat_ren** » Feb 28, 2025 3:20 am

To update my previous post: that hasn't really helped. Although I have jobs progressing very slowly (a few per hour), speed is extremely slow and I have a bunch of jobs which seem unresponsive and are just stuck in a running state without any ability to interact with them. I had roughly 12 good hours after resetting everything before the issue returned, which is pretty frustrating. I will collect some logs and open another support case for this.

Adding my case to hopefully get a fix soon:
Case #07619220

Feb 28, 2025 7:39 am

Yeah, we are still having issues too. I did notice that a new update became available yesterday: https://www.veeam.com/kb4711

We are going to apply this update and hopefully see some improvement. The last update I got from support on our issue was that it could potentially be NATS causing the issue, "but since v8 has so many moving parts, finding what causes service failures has become more complex". So, not sure where we go from here...

Will run the latest patch and monitor jobs then update my case.

Feb 28, 2025 8:07 am

Support will likely provide the advice below to implement for the stalling jobs. 9 times out of 10 this is also the solution, but it all remains quite fragile

1. Stop Veeam services on controller and remote proxies
2. Stop nats-server service
3. Delete C:\ProgramData\Veeam\Backup365\nats\jetstream
4. Start nats-server service
5. Start Veeam services on controller and remote proxies
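If you would rather rename the folder than delete it (so the old streams can still be handed to support), step 3 could be sketched in Python as below. This only covers the folder itself: it does not stop or start any services, so steps 1, 2, 4 and 5 still have to be done around it. The function name and the `.bak-<timestamp>` suffix are made up for illustration.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def archive_jetstream_dir(jetstream_dir, now: Optional[datetime] = None) -> Optional[Path]:
    """Rename the JetStream folder to '<name>.bak-<timestamp>' instead of deleting it.

    Returns the new path, or None if the folder does not exist.
    Only call this while the Veeam and nats-server services are stopped.
    """
    jetstream_dir = Path(jetstream_dir)
    if not jetstream_dir.is_dir():
        return None
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = jetstream_dir.with_name(f"{jetstream_dir.name}.bak-{stamp}")
    jetstream_dir.rename(target)
    return target


# Example call, using the path from the steps above:
# archive_jetstream_dir(r"C:\ProgramData\Veeam\Backup365\nats\jetstream")
```

Old `.bak-*` folders will pile up over time, so they need clearing out once support no longer needs them.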

Post by **pat_ren** » Feb 28, 2025 11:23 am

That's unfortunately a well known fix now.

So far I've had some success with my job manager script, limiting the max number of jobs to 50 and letting the script schedule everything. My backlog of jobs is slowly catching up.
I'm not keen to install the latest update yet, as I'm running quite a few private fixes and am not sure how it will affect those, but if you guys try it and have some good results please let me know. Thanks

Post by **aeph** » Mar 03, 2025 7:16 am

Since the update to 8.1.0.305 I have the problem that my largest job with approx. 13k objects is no longer completed. After about 5600-6600 objects the job hangs and does not continue.
Even deleting the Jetstream folder and restarting the job leads to the same problem again. I can always start the job but it hangs again and again.

I have now updated to 8.1.0.3503 but the same problem occurs. Does anyone here have a solution for how I can get the job to run again or how I can keep it running when it hangs? I always lose too much time by deleting the Jetstream folder and restarting. Is there a way to give this heart beat when stalling/ hanging?

It's really annoying!

UPDATE: The restart of all proxies seems to have given this job a kick. I hope that it now runs through and doesn't hang again.

Post by **Proact** » Mar 03, 2025 12:04 pm

Having the same issue for some months now which has been reported via Veeam case: #07616517

Our workaround is currently:
1. Restart any proxy server that has stalled jobs (i.e. no disk activity from the Veeam proxy service)
2. Stop the stalled jobs and wait until they have stopped
3. Start the jobs again

Post by **aeph** » Mar 03, 2025 3:00 pm

After about 400 objects my job is stalling again..

I can't cancel the job and start again because 1) I would lose too much time and 2) in this case it wouldn't help. I will open a high prio ticket.

Post by **pat_ren** » Mar 04, 2025 4:15 am

> aeph wrote: ↑Mar 03, 2025 7:16 am
> Since the update to 8.1.0.305 I have the problem that my largest job with approx. 13k objects is no longer completed. After about 5600-6600 objects the job hangs and does not continue.
> Even deleting the Jetstream folder and restarting the job leads to the same problem again. I can always start the job but it hangs again and again.
>
> I have now updated to 8.1.0.3503 but the same problem occurs. Does anyone here have a solution for how I can get the job to run again or how I can keep it running when it hangs? I always lose too much time by deleting the Jetstream folder and restarting. Is there a way to give this heart beat when stalling/ hanging?
>
> It's really annoying!
>
> UPDATE: The restart of all proxies seems to have given this job a kick. I hope that it now runs through and doesn't hang again.

What's the status of the job when it's stalling? Is it working on actual objects, or stuck on 'resolving objects: X' and not working on any specific data?
I've received a few private fixes for similar issues, though not so much for one specific job, more for many jobs struggling to complete.
I have also made significant changes to each proxy config to improve performance.
These may not be applicable to others, so I'd recommend just collecting logs, working with support, and escalating as needed to get some assistance.

Post by **aeph** » Mar 04, 2025 5:57 am

The result is: the job hasn't made any progress for hours (even the processing rate and the read and write rates stay the same). Stopping the job by pressing the stop button results in a "Queued" state that does not end until I delete the jetstream folder. The job has now been running for 40 hours but is nearly finished (I had to restart all the proxies twice because the job has hung twice so far). The restart gives the job a kick, and the job resumes about 5-10 minutes after the restart.

One more thing: the OneDrive job from the same organization is hanging in the "resolving objects" state (10+ hours) while this 13k-object SharePoint job is running.

All the other jobs have been very fast since 8.1.0.3503. Yesterday an Exchange-only job with 2500 objects took only 13 minutes.

Provided the support team a lot of logs.

My case id: #07621637

Post by **Polina** » Mar 04, 2025 10:48 am

Hi All,

First, it's highly recommended to upgrade to the latest patch, 8.1.0.3503, which includes several improvements to the processing logic.
Next, please continue working with Support. Your cooperation is important in helping R&D troubleshoot and resolve the problem faster.

Thanks!

Post by **t7MevELx0** » Mar 04, 2025 7:28 pm

I'm seeing these issues as well. It worked great on 8.05. As soon as I upgraded that environment to 8.1, all of my copy jobs stopped working. They haven't completed since. I have a lot of jobs that stall and I can't cancel or stop them; I have to stop the services and clear out the NATS jetstream directory. I have backup repos that have been stuck "indexing" since the first upgrade to 8.1.

I'm on 8.1.0.3503 and that hasn't fixed anything.

Right now support wants me to remove the copy jobs and the copy repo, then re-add them to see if that does anything.

I'm wondering if it's a PostgreSQL/NATS performance issue. I'm considering moving them off to dedicated VMs. I just find it odd that so many people have these issues.

Post by **Polina** » Mar 05, 2025 12:21 pm

If your copy jobs became unresponsive before upgrading to 8.1.0.3503, you still need to clean up the NATS streams. After that, the issue should be resolved and the root cause fixed.

Post by **aeph** » Mar 07, 2025 8:29 am

Had a video call with Veeam Support today.

It seems that the root cause has been identified and Veeam is already working on a hotfix (it should be available at the beginning of next week).
The issue happens mostly for SharePoint/Teams jobs with more than 5k objects and immutability active.

A workaround could be to split the SharePoint job into multiple jobs (teams, personal sites, ...).

As soon as I receive the hotfix, I will try it out.

Post by **pat_ren** » Mar 07, 2025 10:27 am

We aren't using immutability for any of our jobs. We do have SharePoint/Teams jobs over 5k objects, but not too many.
The issue just seems random in my experience: it works fine for 3 days in a row, then happens twice in the same day after that.

After too much frustration, I have built a restart function into my job manager so I can detect the issue, fix it automatically and then resume jobs without any manual intervention. It's just a workaround, but that's how I've been dealing with it for now.

I'm also working with support on several issues, this being one of them, but no root cause has been determined yet. I do have logs from my job manager script which pinpoints the exact times when the problem begins so I should hopefully be able to get to the bottom of it soon.
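For others wanting to detect the stall automatically, one possible approach (not necessarily how the job manager above does it) is to sample each running job's processed-object count on a timer and flag the job when the count has not moved for a set period. A minimal Python sketch, where `Sample`, `is_stalled` and the one-hour default are all hypothetical choices:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Sample:
    taken_at: datetime
    objects_processed: int


def is_stalled(samples: List[Sample], window: timedelta = timedelta(hours=1)) -> bool:
    """True if the processed-object count has been flat for at least `window`.

    `samples` must be ordered oldest to newest and belong to one running job.
    """
    if not samples:
        return False
    latest = samples[-1]
    oldest_unchanged = latest
    # Walk backwards while the counter still equals the latest value.
    for sample in reversed(samples):
        if sample.objects_processed != latest.objects_processed:
            break
        oldest_unchanged = sample
    return latest.taken_at - oldest_unchanged.taken_at >= window
```

The window needs to be long enough that a job which is legitimately slow (for example, still resolving objects) is not treated as stuck, so it is worth tuning against your own logs before wiring it to an automatic restart.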

Post by **t7MevELx0** » Mar 12, 2025 1:48 pm

This is still a major issue in our environment. I haven't seen a copy job process correctly since upgrading to 8.1. We are on version 8.1.0.3503. I've escalated the case to priority 2.

These jobs were running fine on version 8.05.

Post by **aeph** » Mar 12, 2025 3:29 pm

@t7MevELx0 same for me. I am still waiting for the private hotfix mentioned in my previous post.
Copy jobs of smaller organizations are working, but copy jobs of my larger organizations are failing.

Post by **t7MevELx0** » Mar 12, 2025 8:45 pm

Yeah, that's the same issue I'm having. I'm not getting any response about a hotfix from support though. My jobs have been broken for ages.

Post by **Polina** » Mar 13, 2025 8:37 am

@t7MevELx0

Please share your support case ID, I don't see it in the thread.

Thanks!

Mar 13, 2025 12:26 pm

@Polina Case #07583409

I've had this open since Jan 27th. I've been trying to escalate it.

Post by **aeph** » Mar 15, 2025 6:38 am

I got a hotfix, installed it and guess what: it still doesn't work. The job has now been hanging for 12 hours at exactly the same processing rate, objects processed and number of objects.

It is frustrating.

Post by **admcomputing** » Mar 19, 2025 7:59 am

We too are having similar problems, but only with the backup copy jobs, which appear to get stuck with either 1 or 0 items remaining. The jobs sit there for a day before we have to restart them manually. I wonder if it's due to all the copy jobs (<100) starting at the same time? We run 1 backup job per customer and 1 backup copy to immutable storage per customer. Would splitting them out help?

Post by **Polina** » Mar 19, 2025 4:30 pm

@admcomputing
Starting a large number of jobs simultaneously can put increased load on the database, which can cause the issue. Is there any possibility of spreading their start times out?
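As an illustration of spreading jobs out, here is a small Python sketch that assigns evenly spaced start times across a backup window. It only computes the times; applying them to the actual job schedules is left to whatever tooling you use. The function name and the example window are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def staggered_start_times(
    job_names: List[str], window_start: datetime, window_length: timedelta
) -> Dict[str, datetime]:
    """Spread job start times evenly across [window_start, window_start + window_length)."""
    if not job_names:
        return {}
    step = window_length / len(job_names)
    return {
        name: window_start + index * step
        for index, name in enumerate(job_names)
    }
```

For example, 4 jobs over a 2-hour window starting at 22:00 would be given starts at 22:00, 22:30, 23:00 and 23:30, so the database never sees them all begin at once.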

Post by **admcomputing** » Mar 19, 2025 4:39 pm

Thanks Polina.
I've just spent the last hour manually adjusting each job individually, so I will see how this goes.
I thought NATS was supposed to take care of this.

Mar 19, 2025 4:51 pm

I feel your pain. I wish everything always worked the way it is supposed to.

Jokes aside, R&D is investing a lot of effort these days in resolving issues such as stuck jobs and many others. Some fixes can already be delivered via support today; others are more complex and will be included in the next product updates.

Mar 26, 2025 8:22 am

I'm still getting this issue every few days. I have a custom script I use to manage our job start times, to limit how many jobs are running and only start X amount of jobs and it has helped a lot with the issue and consistency of backups as a whole.
I also built in a function to detect this issue when jobs are stalled and automatically fix it so it has saved me a lot of time from walking in after a weekend and finding all jobs stuck and a massive backlog of jobs to run.
I believe my case is with R&D too.

Mar 27, 2025 4:31 pm

I'm having the same issue. I have a ticket open for other issues (jobs hanging on public folder mailboxes; job history defaulting to showing jobs from September rather than current/recent jobs; missing restore points, although that part appears to be resolved, or at least jobs are generally running correctly). When this issue occurred on our server, I was given the "stop services, rename/delete the NATS stream folder and start services" instructions, and it ran for 3 weeks before happening again.

Single server, no additional proxies, backing up direct to object storage (Wasabi). No immutability on these repositories (though I do have one repo that is using immutability, which is unaffected so far). Running 8.1.0.305. Case #07538517
