Copy job stuck 4 days with no failure notification

Post by **billeuze** » May 12, 2021 4:31 pm

Case # 04793751

I don't want to focus here on why my copy job got stuck. My point is that a daily scheduled backup job (which normally takes less than 15 minutes) was stuck for more than 4 days, and during that time:
- Veeam did not seem to notice there was a problem; at least, it did not inform me of any problem.
- While the job was stuck, no new daily jobs ran, and again I got no notification of these missed jobs.
- If I had not noticed this myself, the stuck condition would probably have continued unnoticed by Veeam forever.

Veeam should be able to detect such stuck jobs, but it appears that it doesn't.

Further notes:
When I discovered this "stuck" job I tried to stop it, but it would not stop. I waited some time and finally restarted the B&R server and the repository, which stopped the job. After the server restart I was able to run the job manually and it appeared to complete; at least it created the backup file. Later, though, I noticed it was stuck on "Performing backup files health check (3% done)". To be sure it wasn't just slow, I left it in this state for 5 hours and it did not advance beyond 3%. After the 5 hours I tried to stop the job. This time I waited one hour (in case it was slow to stop) before restarting the server. This all happened during the day when no other backups were running. In other words, the B&R server, proxies and repositories were not running any other jobs at the same time.

After that second server restart I manually ran the job in "active full" mode and it completed. Later that night the regularly scheduled incremental ran and this time successfully completed the monthly health check. The job has been running correctly ever since (other than failures due to the issue below).

I had a job fail because of "Error: All target backup proxies are offline". I checked the proxy and it was indeed showing a blue screen saying "Your PC ran into a problem and has to restart" and "Stop code: HAL INITIALIZATION FAILED". I restarted the proxy (a Server 2019 VM) and everything was OK for a couple of days, until the exact same thing happened. So I brought up a brand new server, enrolled it as a proxy and got rid of the faulty one. This was just yesterday, so I won't know for a while whether that will stop this from happening. But it is the same proxy that would have been running the undetected stuck job that is the subject of this post. It is possible that the proxy was in the early stages of failure but had not yet completely crashed.

Again, I am not trying here to solve why my job got stuck, but if this failing proxy is the reason, it might give the developers some information on what Veeam has to monitor in order to detect such stuck jobs and report them to the sysadmin.

Post by **Mildur** » May 12, 2021 4:55 pm

Veeam ONE can monitor unusual job durations and will therefore see a stuck job.
Also, for service providers, there is the Veeam Availability Console, which can detect stuck jobs too.

But you are right. There should be a notification from the standalone VBR server if a job doesn't finish by the next scheduled run time.
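
To make the idea concrete, here is a minimal sketch of such a check, assuming job runs are available as simple records. The `JobRun` fields, the `overrun_reasons` function and the `factor` threshold are all hypothetical names for illustration and are not part of any Veeam API. It flags a run that is still in progress after far longer than its typical duration, or past its next scheduled start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobRun:
    # Hypothetical record; none of these field names come from Veeam.
    name: str
    started: datetime
    finished: Optional[datetime]   # None while the run is still in progress
    interval: timedelta            # time between scheduled starts
    typical_duration: timedelta    # what a healthy run normally takes


def overrun_reasons(run: JobRun, now: datetime, factor: float = 4.0) -> list[str]:
    """Return the reasons a run looks stuck; an empty list means it looks fine."""
    if run.finished is not None:
        return []
    reasons = []
    elapsed = now - run.started
    # Check 1: running much longer than a healthy run normally takes.
    if elapsed > run.typical_duration * factor:
        reasons.append(f"running {elapsed}, over {factor}x the typical duration")
    # Check 2: still running when the next scheduled start has already passed.
    if now >= run.started + run.interval:
        missed = elapsed // run.interval
        reasons.append(f"still running past {missed} scheduled start(s)")
    return reasons
```

For a daily job that normally takes 15 minutes and has been running for more than 4 days, both checks would fire. The multiplier of 4 is an arbitrary starting point and would need tuning per job.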

Post by **foggy** » May 12, 2021 4:57 pm

Right, Veeam ONE allows getting a notification in case the job takes longer than expected or there was no successful backup for the VM during the specified RPO period. Thanks!

May 12, 2021 5:15 pm

Mildur wrote: May 12, 2021 4:55 pm
But you are right. There should be a notification from the standalone VBR server if a job doesn't finish by the next scheduled run time.

It's all good until the backup server itself hangs, or the monitoring process hangs or crashes... Think it's unlikely? <looks at the latest patch notes> tsk tsk tsk

Overall, it's a bit like having politicians monitor themselves for corruption and notify the country whenever they commit it.

Perhaps some corner cases can be addressed, but overall you're destined to fail sooner or later by taking this approach.

As a rule of thumb, no system out there can monitor itself very well. At best, it can detect some issues in some cases.
This is especially true in technology, where the monitoring system HAS to be something external => Veeam ONE.

Of course, none of this means that adding this capability makes no sense or that it is useless. Definitely not!
The only bad thing about it is that it will put more customers on the wrong track of not using Veeam ONE.
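
The external-monitoring argument above can be illustrated with a minimal heartbeat sketch. It assumes a watcher process on a separate machine that is told whenever a watched system reports in. The `HeartbeatWatcher` class and its methods are hypothetical names for illustration and do not represent how Veeam ONE works internally.

```python
from datetime import datetime, timedelta


class HeartbeatWatcher:
    """Runs on a separate machine from the systems being watched (hypothetical)."""

    def __init__(self, max_silence: timedelta):
        self.max_silence = max_silence
        self.last_seen: dict[str, datetime] = {}

    def record(self, source: str, at: datetime) -> None:
        # Called each time a watched system reports in, e.g. after a finished job.
        self.last_seen[source] = at

    def silent_sources(self, now: datetime) -> list[str]:
        # A hung or crashed source cannot report its own failure,
        # but its silence is still visible from the outside.
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )
```

The design point is that the alert is triggered by the absence of a signal, so it still fires when the backup server or its own notification process has stopped working.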

May 12, 2021 5:33 pm

No worries, Anton. I'm on your side.

We will be using Veeam ONE soon. So many good reports. I have already prepared a collection of the necessary reports so that our management board can decide on using Veeam ONE, which will make my life easier as a backup admin.

There are so many things to check manually if you don't have Veeam ONE installed. Manual checks are OK for small environments, but I find a monitoring tool very important if you have dozens or hundreds of objects to back up.

Post by **billeuze** » May 12, 2021 11:35 pm

Thanks for all the replies. So it seems like Veeam ONE is a good idea. I'll give it a try. I'm not sure if my socket-based "Veeam Backup Essentials Enterprise" provides a license for Veeam ONE, but I see that if we don't license it, it runs as Community Edition, so it's worth a try.

Post by **Mildur** » May 13, 2021 3:34 am

Veeam ONE is included with Veeam Backup Essentials.
The Community Edition of Veeam ONE is also available to monitor 10 VMs or agents for free.
