keironbell
Service Provider
Posts: 33
Liked: 16 times
Joined: Mar 17, 2021 5:21 pm
Full Name: Keiron Bell
Contact:

Stalling Jobs

Post by keironbell » 2 people like this post

Hi All,

We are currently experiencing issues with Veeam Backup for Microsoft 365. We are running the latest version of 8.1; the main controller server also runs PostgreSQL and NATS-SERVER. We have two Proxy Pools consisting of 4 Linux Proxies each. We use one Proxy Pool for onboarding new customers and the other for BAU jobs. We split each tenant into 3 jobs: Exchange, OneDrive and SharePoint/Teams. Each customer also gets 3 Wasabi buckets (Exchange, OneDrive and SharePoint/Teams), each used as an individual repo per job (i.e. Customer A Exchange Backup targets Customer A Exchange Repo, etc.).

We are seeing a mixture of slow and stalled jobs. The only "fix" we currently have is to stop Veeam Services, rename the streams folder in NATS and restart the controller server. This seems to give VBO a kick to start jobs again, but having to do this all the time is not great.

I have enabled monitoring in the NATS config file to give us additional monitoring and try to pinpoint what is causing these issues, but so far we are unsure what is causing the problems. I have noticed that we are getting API errors: approx 5% of API requests are errors. I am very new to NATS so I am not sure what is causing this.

We have a load of "out of the box" monitoring available, and we also have the ability to configure custom monitors, which I have been doing. Has anyone got any advice on what we should specifically be monitoring, especially with regards to NATS-SERVER and PostgreSQL, that I can look at adding in? Or has anyone else experienced similar issues? If so, what have you done to resolve this?
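For anyone wanting to track the API error percentage mentioned above as a custom monitor, here is a minimal Python sketch. It assumes the NATS monitoring port is enabled and reachable on the controller (8222 is the conventional port, not something confirmed in this thread) and that the `/jsz` endpoint reports `api.total` and `api.errors` counters; check both against your NATS version. The function names are illustrative only.

```python
import json
from urllib.request import urlopen

# Assumption: monitoring is enabled in the NATS config and listens on 8222.
JSZ_URL = "http://localhost:8222/jsz"


def api_error_ratio(jsz: dict) -> float:
    """Return JetStream API errors as a fraction of total API requests."""
    api = jsz.get("api", {})
    total = api.get("total", 0)
    errors = api.get("errors", 0)
    return errors / total if total else 0.0


def fetch_jsz(url: str = JSZ_URL, timeout: float = 5.0) -> dict:
    """Fetch the JetStream summary from the NATS monitoring endpoint."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `api_error_ratio(fetch_jsz())` on a schedule and alerting above a threshold would give a trend to share with support. The counters appear to be totals rather than rates, so comparing two samples taken a few minutes apart is likely more useful than a single reading.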

**Edit** Monitoring of the controller server seems fine, no spikes in CPU/Memory. Also the resources on our Proxies are good too, averaging around 50% max usage on all proxies.

We have a support request open # 07607359 but we are not getting anywhere very fast at the moment.

Thanks in advance!
Keiron
pat_ren
Service Provider
Posts: 94
Liked: 16 times
Joined: Jan 02, 2024 9:13 am
Full Name: Pat
Contact:

Re: Stalling Jobs

Post by pat_ren » 1 person likes this post

Have been having the exact same issues for about the last 1-2 weeks. That's also the only fix I'm aware of; I have had to do it a few times and I don't like it. Also seeing low load, low proxy utilisation, slow backup speeds and stalling jobs we can't stop or interact with in any way.
I have been working with support on several different performance-related cases, so will update if I hear anything. Things have really been going from bad to worse for me the last few weeks with Veeam 365 and I'm getting quite frustrated tbh.

Edit: to provide further context, we are set up in a very similar way to you: 8 proxies using proxy pooling, each tenant has 3 jobs, to 3 separate Wasabi repos.

Re: Stalling Jobs

Post by keironbell »

It's good to know it's not only us in this situation. It also makes me feel better about our setup hearing yours is very similar haha!

Hopefully we get an update soon.

Re: Stalling Jobs

Post by pat_ren »

So far I've had some success by limiting the number of jobs running. I'm not sure how many jobs you have, but I found that once it got over about 200 running at once the issue happened much more regularly. We have over 700 jobs on this particular server.
I created a script that runs the jobs 3 times per day, ordered from oldest to newest most recent backup. It checks how many jobs are running first and only queues up new jobs when there are 100 or fewer running. This does seem to have helped a bit as a workaround.
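The throttling idea described above can be sketched roughly as follows. This is not the poster's script: the `Job` record and `jobs_to_start` function are hypothetical, and how you read job state from Veeam and start a job is left out entirely. The sketch only shows the selection logic, and it fills free slots up to the cap rather than waiting for the running count to drop below it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Job:
    name: str
    last_backup: datetime  # time of the most recent successful backup
    running: bool = False


def jobs_to_start(jobs: list[Job], max_running: int = 100) -> list[Job]:
    """Pick idle jobs, oldest backup first, without exceeding max_running."""
    running = sum(1 for j in jobs if j.running)
    free_slots = max(0, max_running - running)
    # Jobs with the oldest most recent backup are the most overdue.
    idle = sorted((j for j in jobs if not j.running), key=lambda j: j.last_backup)
    return idle[:free_slots]
```

Running this selection on a timer and starting only the returned jobs keeps the concurrent job count at or below the cap while working through the backlog in order of staleness.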

Re: Stalling Jobs

Post by pat_ren »

To update my previous post: that hasn't really helped. Although I have jobs very slowly progressing, like a few per hour, speed is extremely slow and I have a bunch of jobs which seem unresponsive and are just stuck in a running state without any ability to interact with them. I had a good 12 hours or so before the issue returned after resetting everything, which is pretty frustrating. Will collect some logs and open another support case for this.

Adding my case to hopefully get a fix soon
Case #07619220

Re: Stalling Jobs

Post by keironbell » 1 person likes this post

Yeah, we are still having issues too. I did notice yesterday that a new update is available: https://www.veeam.com/kb4711

We are going to apply this update and hopefully see some improvement. The last update I got from support on our issue was that it could potentially be NATS causing the issue "but since v8 has so many moving parts, finding what causes service failures has become more complex". So, not sure where we go from here...

Will run the latest patch and monitor jobs then update my case.
MaartenA
Service Provider
Posts: 106
Liked: 40 times
Joined: Oct 31, 2021 7:03 am
Full Name: maarten
Contact:

Re: Stalling Jobs

Post by MaartenA » 1 person likes this post

Support will likely provide the advice below for the stalling jobs. 9 times out of 10 this is also the solution, but it all remains quite fragile.

1. Stop Veeam services on controller and remote proxies
2. Stop nats-server service
3. Delete C:\ProgramData\Veeam\Backup365\nats\jetstream
4. Start nats-server service
5. Start Veeam services on controller and remote proxies
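The five steps above could be scripted along these lines for the controller. This is a hedged sketch only: the Veeam service names are deliberately left as a parameter because they are not given in this thread, `build_reset_plan` and `execute` are made-up helper names, and remote proxies (Linux in some setups here) would need their own handling. It defaults to a dry run so nothing is touched until you opt in.

```python
import shutil
import subprocess
from pathlib import Path

# Path and NATS service name as quoted in the steps above.
JETSTREAM_DIR = Path(r"C:\ProgramData\Veeam\Backup365\nats\jetstream")
NATS_SERVICE = "nats-server"


def build_reset_plan(veeam_services, nats_service=NATS_SERVICE,
                     jetstream_dir=JETSTREAM_DIR):
    """Return the ordered reset steps as (action, target) tuples."""
    plan = [("stop", s) for s in veeam_services]
    plan.append(("stop", nats_service))
    plan.append(("delete", str(jetstream_dir)))
    plan.append(("start", nats_service))
    plan += [("start", s) for s in veeam_services]
    return plan


def execute(plan, dry_run=True):
    """Run the plan on a Windows controller, or just print it when dry_run."""
    for action, target in plan:
        if dry_run:
            print(f"[dry-run] {action} {target}")
        elif action == "delete":
            shutil.rmtree(target, ignore_errors=True)
        else:
            # 'net stop' / 'net start' block until the service changes state.
            subprocess.run(["net", action, target], check=True)
```

Printing the plan first, for example `execute(build_reset_plan(["<your Veeam service names>"]))`, makes it easy to confirm the order before running it for real with `dry_run=False`.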

Re: Stalling Jobs

Post by pat_ren »

That's unfortunately a well known fix now.

So far I've had some success with my job manager script: limiting the max number of jobs to 50 and letting the script schedule everything. My backlog of jobs is slowly catching up.
Not keen to install the latest update yet as I'm running quite a few private fixes and I'm not sure how it will affect those, but if you guys try it and have some good results please let me know. Thanks
aeph
Enthusiast
Posts: 81
Liked: 9 times
Joined: Sep 26, 2024 11:02 am
Contact:

Re: Stalling Jobs

Post by aeph »

Since the update to 8.1.0.305 I have the problem that my largest job, with approx. 13k objects, no longer completes. After about 5600-6600 objects the job hangs and does not continue.
Even deleting the Jetstream folder and restarting the job leads to the same problem again. I can always start the job but it hangs again and again.

I have now updated to 8.1.0.3503 but the same problem occurs. Does anyone here have a solution for how I can get the job to run again, or how I can keep it running when it hangs? I always lose too much time by deleting the Jetstream folder and restarting. Is there a way to give the job a heartbeat when it is stalling/hanging?

It's really annoying!


UPDATE: The restart of all proxies seems to have given this job a kick. I hope that it now runs through and doesn't hang again.
Proact
Service Provider
Posts: 2
Liked: never
Joined: Jul 12, 2021 10:06 am
Contact:

Re: Stalling Jobs

Post by Proact »

Having the same issue for some months now which has been reported via Veeam case: #07616517

Our workaround is currently:
1. Restart the proxy servers which have stalled jobs, i.e. no disk activity from the Veeam proxy service
2. Stop the jobs which are stalled - and wait until the jobs are stopped
3. Start the jobs again

Re: Stalling Jobs

Post by aeph »

After about 400 objects my job is stalling again..

I can't cancel the job and start again because 1) I would lose too much time and 2) in this case it wouldn't help. I will open a high prio ticket.

Re: Stalling Jobs

Post by pat_ren »

aeph wrote: Mar 03, 2025 7:16 am Since the update to 8.1.0.305 I have the problem that my largest job with approx. 13k objects is no longer completed. After about 5600-6600 objects the job hangs and does not continue.
Even deleting the Jetstream folder and restarting the job leads to the same problem again. I can always start the job but it hangs again and again.

I have now updated to 8.1.0.3503 but the same problem occurs. Does anyone here have a solution for how I can get the job to run again or how I can keep it running when it hangs? I always lose too much time by deleting the Jetstream folder and restarting. Is there a way to give this heart beat when stalling/ hanging?

It's really annoying!


UPDATE: The restart of all proxies seems to have given this job a kick. I hope that it now runs through and doesn't hang again.

What's the status of the job when it's stalling? Is it working on actual objects or stuck on 'resolving objects: X' and not working on any specific data?
I've received a few private fixes for similar issues, but not so much for one specific job, more for many jobs struggling to complete.
I have also made significant changes to each proxy config to improve performance.
These may not be applicable to others, so I'd recommend just collecting logs, working with support and escalating as needed to get some assistance.

Re: Stalling Jobs

Post by aeph »

The result is: the job hasn't made any progress for hours (even the processing rate, read rate and write rate stay the same). Stopping the job by pressing the stop button results in a "Queued" state that does not end until I delete the jetstream folder. The job has now been running for 40 hours but is nearly finished (I had to restart all the proxies twice because the job has hung twice so far). The restart gives the job a kick and it resumes about 5-10 minutes after the restart.

One more thing: the OneDrive job from the same org is hanging in the "resolving objects" state (10+ hours) while this 13k-object SharePoint job is running.

All the other jobs are very fast since 8.1.0.3503. Yesterday an Exchange-only job with 2500 objects took only 13 minutes.

Provided the support team a lot of logs.

My case id: #07621637
Polina
Veeam Software
Posts: 3461
Liked: 830 times
Joined: Oct 21, 2011 11:22 am
Full Name: Polina Vasileva
Contact:

Re: Stalling Jobs

Post by Polina »

Hi All,

First, it's highly recommended to upgrade to the latest patch, 8.1.0.3503, which includes several improvements to the processing logic.
Next, please continue working with Support. Your cooperation is important so that R&D can troubleshoot and resolve the problem faster.

Thanks!
t7MevELx0
Service Provider
Posts: 63
Liked: 9 times
Joined: Feb 06, 2024 6:55 pm
Contact:

Re: Stalling Jobs

Post by t7MevELx0 »

I'm seeing these issues as well. It worked great on 8.05. As soon as I upgraded that environment to 8.1, all of my copy jobs stopped working. They haven't completed since. I have a lot of jobs that stall and I can't cancel or stop them; I have to stop the services and clear out the NATS jetstream dir. I have backup repos that have been stuck "indexing" since the first upgrade to 8.1.

I'm on 8.1.0.3503 and that hasn't fixed anything.

Right now support wants me to remove the copy jobs, copy repo, and re add them to see if that does anything.

I'm wondering if it's a PostgreSQL/NATS performance issue. I'm considering moving them off to dedicated VMs. I just find it odd that so many people have these issues.

Re: Stalling Jobs

Post by Polina »

If your copy jobs became unresponsive before upgrading to 8.1.0.3503, you still need to clean up the NATS streams. After that, the issue should be resolved and the root cause fixed.

Re: Stalling Jobs

Post by aeph » 1 person likes this post

Had a video call with Veeam Support today.

It seems that the root cause is identified and Veeam is already working on a hotfix (should be available at the beginning of next week).
Issue happens mostly for SharePoint/Teams Jobs with more than 5k Objects and immutability active.

Workaround could be to split SharePoint job into multiple jobs (teams, personal sites, ..).

As soon as I receive the hotfix, I will try it out.
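As a rough illustration of the splitting workaround mentioned above, the sketch below groups objects by type and then chunks each group so that no job exceeds a cap. The 5000 default comes from the "more than 5k objects" figure in this post; the `(name, kind)` input shape, the kind labels and the `split_into_jobs` name are assumptions for illustration, not anything Veeam provides.

```python
from collections import defaultdict


def split_into_jobs(objects, max_per_job=5000):
    """Group (name, kind) pairs by kind, then chunk each group under the cap."""
    by_kind = defaultdict(list)
    for name, kind in objects:
        by_kind[kind].append(name)

    jobs = {}
    for kind, names in by_kind.items():
        # Number the chunks from 1 so job names read "teams-1", "teams-2", ...
        for n, start in enumerate(range(0, len(names), max_per_job), start=1):
            jobs[f"{kind}-{n}"] = names[start:start + max_per_job]
    return jobs
```

The returned mapping of job name to object list can then be used as a checklist when creating the smaller jobs by hand or by script.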

Re: Stalling Jobs

Post by pat_ren »

We aren't using immutability for any of our jobs. We do have sharepoint/teams jobs over 5k objects, but not too many.
The issue just seems random from my experience, works fine for 3 days in a row, happens twice in same day after that.

After too much frustration, I have built a restart function into my job manager so I can detect the issue, fix it automatically and then resume jobs without any manual intervention. It's just a workaround, but that's how I've been dealing with it for now.

I'm also working with support on several issues, this being one of them, but no root cause has been determined yet. I do have logs from my job manager script which pinpoint the exact times when the problem begins, so I should hopefully be able to get to the bottom of it soon.
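One possible way to detect a stall automatically, based on the symptom reported earlier in the thread (processed objects and throughput counters not changing for hours), is to compare periodic snapshots of a job's progress. This is only a sketch and not the job manager script described above: the `Snapshot` fields and `is_stalled` name are hypothetical, and how the counters are collected from Veeam is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    objects_processed: int
    bytes_written: int


def is_stalled(history: list[Snapshot], min_samples: int = 6) -> bool:
    """True when the last min_samples snapshots are all identical."""
    if len(history) < min_samples:
        return False  # not enough data to judge yet
    recent = history[-min_samples:]
    return all(s == recent[0] for s in recent)
```

With one snapshot every 10 minutes, `min_samples=6` flags a job whose counters have not moved for about 50 minutes. Jobs sitting in a long but legitimate phase could trigger false positives, so the threshold needs tuning per environment.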

Re: Stalling Jobs

Post by t7MevELx0 »

This is still a major issue in our environment. I haven't seen a copy job process correctly since upgrading to 8.1. We are on version 8.1.0.3503. I've escalated the case to priority 2.

These jobs were running fine on version 8.05.

Re: Stalling Jobs

Post by aeph »

@t7MevELx0 same for me. I am still waiting for the private hotfix mentioned in my previous post.
Copy jobs of smaller organizations are working but copy jobs of my larger orgs are failing.

Re: Stalling Jobs

Post by t7MevELx0 »

Yeah, that's the same issue I'm having. I'm not getting any response about a hotfix from support though. My jobs have been broken for ages.

Re: Stalling Jobs

Post by Polina »

@t7MevELx0

Please share your support case ID, I don't see it in the thread.

Thanks!

Re: Stalling Jobs

Post by t7MevELx0 » 1 person likes this post

@Polina Case #07583409

I've had this open since Jan 27th. I've been trying to escalate it.

Re: Stalling Jobs

Post by aeph »

I got a hotfix, installed it and guess what: it still doesn't work. Job has now been hanging for 12 hours at exactly the same processing rate, objects processed and number of objects.

it is frustrating
admcomputing
Service Provider
Posts: 26
Liked: 4 times
Joined: Sep 27, 2010 11:01 am
Full Name: ADM Computing Ltd
Contact:

Re: Stalling Jobs

Post by admcomputing »

We too are having similar problems, but just for the backup copy jobs, which appear to get stuck on either 1 or 0 items remaining. The jobs sit there for a day before we must manually restart them. I wonder if it's due to all the copy jobs (<100) starting at the same time? We run 1 backup job per customer and 1 backup copy to immutable storage per customer. Would splitting them out help?

Re: Stalling Jobs

Post by Polina »

@admcomputing
Starting a large number of jobs simultaneously can put increased load on the database, causing the issue. Is there any possibility of distributing them over time?
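Spreading start times evenly across a window can be worked out with a few lines of Python instead of by hand. This is a sketch under assumptions: the `staggered_starts` helper is hypothetical, and applying the resulting times to each job's schedule is not covered.

```python
from datetime import datetime, timedelta


def staggered_starts(job_names, window_start: datetime, window: timedelta) -> dict:
    """Spread job start times evenly across [window_start, window_start + window)."""
    if not job_names:
        return {}
    step = window / len(job_names)
    return {name: window_start + i * step for i, name in enumerate(job_names)}
```

For example, 96 copy jobs spread over an 8-hour window would start 5 minutes apart rather than all at once.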

Re: Stalling Jobs

Post by admcomputing »

Thanks Polina.
I've just spent the last hour manually adjusting each job individually so I will see how this goes.
I thought NATS was supposed to take care of this :)

Re: Stalling Jobs

Post by Polina » 1 person likes this post

I feel your pain. I wish everything always worked the way it is supposed to :)

Jokes aside, R&D is investing a lot of effort these days to resolve issues such as stuck jobs and many others. Some fixes can already be delivered via support today; others are more complex and will be included in the next product updates.

Re: Stalling Jobs

Post by pat_ren » 2 people like this post

I'm still getting this issue every few days. I have a custom script I use to manage our job start times, to limit how many jobs are running and only start X amount of jobs and it has helped a lot with the issue and consistency of backups as a whole.
I also built in a function to detect this issue when jobs are stalled and automatically fix it so it has saved me a lot of time from walking in after a weekend and finding all jobs stuck and a massive backlog of jobs to run.
I believe my case is with R&D too.
dloseke
Service Provider
Posts: 72
Liked: 35 times
Joined: Jul 13, 2018 3:33 pm
Full Name: Derek M. Loseke
Location: Omaha, NE, US
Contact:

Re: Stalling Jobs

Post by dloseke » 1 person likes this post

I'm having the same issue. I have a ticket open for other issues (jobs hanging on public folder mailboxes, job history defaulting to showing jobs from September rather than current/recent jobs; I also had missing restore points, but that part appears to be resolved, or at least jobs are generally running correctly). When this issue occurred with our server, I was given the "stop services, rename/delete the NATS stream folder and start services" instructions and it ran for 3 weeks before happening again.

Single server, no additional proxies, backing up direct to object storage (Wasabi), no immutability on these repositories (though I do have one repo that is using immutability, which is unaffected so far). Running 8.1.0.305. Case #07538517
Derek M. Loseke, Senior Systems Engineer | Veeam Legend 2022-2024 | VMSP/VMTSP | VCP6-DCV | VSP/VTSP | CCNA | https://technotesanddadjokes.com | @dloseke