Agent-based backups for Windows and Linux, centralized agent management
hpadm
Influencer
Posts: 13
Liked: 1 time
Joined: May 18, 2021 1:55 pm
Location: Slovakia
Contact:

Windows gateway agent did not reconnect NFS after NAS restart

Post by hpadm »

After doing a firmware update on my Synology NAS and rebooting it, I discovered that all backups from then on failed with:
Failed to call RPC function 'NfsIsExist': The remote procedure call was cancelled. RPC function call failed. Function name: [DoRpcWithBinary]. Target machine: [***:6160].
When trying to query information about the repository through the UI (host discovery, job properties, backup infrastructure rescan), the NFS query would run into a 1-hour (3600 s) timeout, hanging the entire UI if the action involved a dialog window.
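
As an aside on the 1-hour hang described above: a quick TCP probe with a short timeout can tell within seconds whether the NAS is answering on the NFS port at all. The sketch below is a generic check, not part of Veeam; it assumes the NAS serves NFS on the standard port 2049, and the function name is made up for illustration. It only tests TCP reachability and says nothing about whether the export itself is usable.

```python
import socket

NFS_PORT = 2049  # standard NFS port (assumption: the NAS uses the default)


def nfs_port_reachable(host: str, port: int = NFS_PORT, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out, DNS failure).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `nfs_port_reachable("<NAS hostname>")` before a rescan would at least distinguish "NAS is still down" from "NAS is up but the gateway's connection is stale".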

The required manual step to recover from this state was to restart the "Veeam Agent for Microsoft Windows" service, and the "Veeam Installer Service". Then the next time there was an operation requiring NFS access, VeeamAgent.exe and/or VeeamDeploymentSvc.exe would successfully open the NFS connection. Restarting just the agent would cause repo discovery to succeed, but the subsequent backup job would still fail - as I learned when I tried to reproduce this failure.
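
The manual workaround above could be scripted roughly as follows. This is a minimal sketch, assuming an elevated prompt on the gateway host and that `net stop` / `net start` accept the two service display names quoted in the post; the function names are hypothetical.

```python
import subprocess
from typing import List, Sequence

# Display names as quoted in the post. Both are restarted, because
# restarting only the agent left backup jobs failing.
SERVICES = [
    "Veeam Agent for Microsoft Windows",
    "Veeam Installer Service",
]


def restart_commands(services: Sequence[str]) -> List[List[str]]:
    """Build the `net stop` / `net start` command lines, in execution order."""
    commands = []
    for name in services:
        commands.append(["net", "stop", name])
        commands.append(["net", "start", name])
    return commands


def restart_services(services: Sequence[str] = SERVICES, run=subprocess.run) -> None:
    """Restart each service; `run` is injectable so the logic can be dry-run."""
    for cmd in restart_commands(services):
        # A stop may fail if the service is already stopped, so only
        # a failed start is treated as an error.
        run(cmd, check=(cmd[1] == "start"))
```

Passing a stub as `run` prints or records the commands without touching any service, which is a safe way to check the order first.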

(I originally submitted this as free support case #05457428, but it timed out. I'm posting here for reference because searching for the above error message did not turn up any previously documented cases. The workaround is enough for now but is unacceptable for production use.)

Dima P.
Product Manager
Posts: 13485
Liked: 1314 times
Joined: Feb 04, 2013 2:07 pm
Full Name: Dmitry Popov
Location: Prague
Contact:

Re: Windows gateway agent did not reconnect NFS after NAS restart

Post by Dima P. »

Hello hpadm,

NFS is configured as a repository on the Veeam B&R server, right? Have you checked the repository state in the Veeam B&R console after the update was performed? Thanks!

Re: Windows gateway agent did not reconnect NFS after NAS restart

Post by hpadm »

If the repository reboots when the Veeam agent / gateway services are already running, B&R will report it as 'Unavailable', and the UI will hang for 1 hour whenever any access operation is attempted with the repository.

If the Veeam Agent service is then restarted and a rescan is performed, the repository becomes available immediately. However, a backup job will still fail with the same error as mentioned above. The Veeam Installer service also makes its own NFS connections and also has to be restarted for everything to work correctly.

From the observed behavior, it seems the NFS component was programmed to open a connection only at startup (or on first use) and keep it open forever. There is no recovery if the connection is dropped; there's not even higher-level error handling for this scenario. Meanwhile, I assume that if I had used SMB, it would handle outages without a problem. I assume this has something to do with the fact that Veeam uses its own custom NFS client (it does not require the 'NFS client' Windows Server OS feature to be installed), and that the code is not yet fully developed.
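
To illustrate the kind of recovery that appears to be missing, here is a generic reconnect-on-failure wrapper. It is a sketch of the general pattern only and says nothing about how Veeam's NFS client is actually built; `ReconnectingClient`, `connect`, and `operation` are all hypothetical names.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

C = TypeVar("C")  # connection type
R = TypeVar("R")  # result type of an operation


class ReconnectingClient(Generic[C]):
    """Opens a connection lazily and reopens it when an operation fails."""

    def __init__(self, connect: Callable[[], C], retries: int = 3, delay: float = 1.0):
        self._connect = connect
        self._retries = retries
        self._delay = delay
        self._conn: Optional[C] = None  # opened on first use

    def call(self, operation: Callable[[C], R]) -> R:
        last_error: Optional[Exception] = None
        for attempt in range(self._retries + 1):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return operation(self._conn)
            except ConnectionError as exc:
                # Drop the stale handle so the next attempt opens a fresh one.
                self._conn = None
                last_error = exc
                if attempt < self._retries:
                    time.sleep(self._delay)
        assert last_error is not None
        raise last_error
```

With a wrapper like this, a server reboot costs one failed call and a reconnect rather than leaving the process with a dead connection until it is restarted.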

Re: Windows gateway agent did not reconnect NFS after NAS restart

Post by Dima P. »

If NFS was added as a Veeam B&R repository, the backup agent will connect to the assigned NFS gateway server first. Open the NFS repository settings and go through the wizard, clicking Next on each step, to make sure that the NFS changes (i.e. the upgrade) are recognized and synced with the components installed on the gateway server. Then try to re-run the job from the backup agent side and let us know how it goes. Thank you!

Re: Windows gateway agent did not reconnect NFS after NAS restart

Post by hpadm »

There were no externally visible changes; the files are still there in the same location. No rescan was necessary and backups work fine.
I can reproduce the failure by just rebooting the NAS. Every time the NAS is rebooted, all backup jobs fail until the gateway's services are restarted.

(In my specific configuration to eliminate duplicated traffic, every host is effectively its own backup proxy (by using Agent mode), and every host is its own NAS gateway (by configuring a gateway on each host and setting its backup job to use that gateway). I assume the issue would be the same if the NAS gateway was configured on a dedicated host.)
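
Since every NAS reboot leaves each gateway needing a service restart, a stopgap could watch for the NAS coming back and trigger the restart automatically. The sketch below only shows the down-to-up edge detection; the function name is hypothetical, and where the reachability samples come from and what the callback does are left as assumptions.

```python
from typing import Callable, Iterable, Optional


def watch_for_recovery(states: Iterable[bool], on_recovered: Callable[[], None]) -> int:
    """Call `on_recovered` each time reachability flips from False to True.

    `states` is any iterable of reachability samples (True = NAS reachable),
    for example a generator that polls the NAS every few seconds.
    Returns the number of recoveries seen.
    """
    recoveries = 0
    previous: Optional[bool] = None  # unknown until the first sample arrives
    for reachable in states:
        if previous is False and reachable:
            on_recovered()
            recoveries += 1
        previous = reachable
    return recoveries
```

The first sample never fires the callback, so starting the watcher while the NAS is already up does not cause a needless restart.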

Re: Windows gateway agent did not reconnect NFS after NAS restart

Post by Dima P. »

hpadm,

After the NFS repository settings were 'renewed', did you get the same error on the agent side during backup? Have you tried re-sending the job configuration to the agent side (assuming you are using the 'managed by agent' job type to configure agent backup)? Thanks!

Re: Windows gateway agent did not reconnect NFS after NAS restart

Post by hpadm »

The master B&R server pushes agent policy updates every day at midnight and before every backup job, so it should be OK.
The same failure also happens for Hyper-V on-host backup, for the backup gateways configured for those hosts.

Well, now I'm confused. I tested the following, and it worked fine...
1. Reboot NAS
2. Rescan repositories
3. Manually start all backup jobs
- each storage gateway successfully opened its NFS connection, and the backups completed.

So what I'm going to test next is to just reboot the NAS and not touch B&R at all. Then tomorrow I'll check whether the periodic nightly backup jobs succeeded or not.

Re: Windows gateway agent did not reconnect NFS after NAS restart

Post by hpadm »

All overnight backup jobs failed. They all hang for 1 hour and then retry, in a loop.
The only NFS connection I see is from VeeamDeploymentSvc.exe on the B&R host, probably opened when I started B&R console.
Repository rescan operation hangs the UI for 1 hour, like before.
So... yeah. This is how to reproduce the failure.

Re: Windows gateway agent did not reconnect NFS after NAS restart

Post by Dima P. »

hpadm,

The issue is probably the long repository rescan interval; let me check with the QA team whether we can tune this somehow. Meanwhile, can I ask you to raise a support case and share the case ID with me? We will need your B&R debug logs to review the behavior. Thank you!

Re: Windows gateway agent did not reconnect NFS after NAS restart

Post by hpadm »

It is not the rescan interval. I just checked the 'system' logs, and they show automatic host discovery (+repo rescan) happening 3 times prior to the backup attempt. All of them failed to communicate with the repository with the same error as in the original post. So there is some strange inconsistency regarding reconnection when it's handled manually in B&R console vs scheduled activities.
EDIT: The first entry in the backup repositories section, which uses automatic host selection and is only used for storing configuration backups, rescans successfully. This is an oddity.

Also I do not think this should be a periodic rescan issue (= polling). The agents should immediately restore their dropped idle connection when they detect the state change. There should not be a window of delay where backup jobs could just fail like that. As I said before, I'm pretty sure SMB connections don't have this sort of downside.

The case number is in the first post. As you might have noticed, I'm struggling to properly understand this because it is an unhandled failure of internal state management.

One oddity I'm noticing in the system logs is the "Infrastructure Rescan" periodic job. The start/end times indicate that it often takes exactly 5 whole days to run, and I cannot double-click the entries to view the details.

Re: Windows gateway agent did not reconnect NFS after NAS restart

Post by Dima P. »

I must have missed that, sorry for the confusion. I will ask the support folks to review your logs.
