Slow job initialisation since moving to 9.5

Post by **ian0x0r** » Dec 20, 2016 10:13 am

This is already an open support case, case ID 02008794

I have noticed that there is a real delay in jobs starting when utilising multiple NICs and preferred networks in Veeam. Let me give you a breakdown of my current setup and what the issue is.

VEEAM01 (Management Server). IP address 172.16.10.15
VEEAMREPO01 (Proxy / ReFS repository). IP address 172.16.10.23, 10.0.99.25, 10.0.99.26
VEEAMVSAN1 (Proxy). IP address 172.16.10.14, 10.0.99.27

The 172.16.10.x network is a /16 and acts as the management network. The 10.0.99.x /24 network is the data network used for backup traffic, and it is not routable.

My assumption is that the VBR management server should be able to co-ordinate jobs on the proxy and repository servers to utilise the 10.0.99.x network without it needing to be able to talk to the 10.0.99.x network. Is this assumption correct?
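To make the layout above concrete, here is a minimal Python sketch that splits a target server's addresses into those the management server can reach and those it cannot. It rests on a simplifying assumption that is not from the post: an address counts as reachable only when the source has a NIC on the same network. The `172.16.0.0/16` prefix is one reading of "172.16.10.x is a /16", and the names `HOSTS`, `NETWORKS`, `network_of` and `split_by_reachability` are hypothetical.

```python
import ipaddress

# Addresses as listed in the post.
HOSTS = {
    "VEEAM01": ["172.16.10.15"],
    "VEEAMREPO01": ["172.16.10.23", "10.0.99.25", "10.0.99.26"],
    "VEEAMVSAN1": ["172.16.10.14", "10.0.99.27"],
}

# Assumed prefixes: a /16 management network and a /24 non-routable data network.
NETWORKS = [
    ipaddress.ip_network("172.16.0.0/16"),
    ipaddress.ip_network("10.0.99.0/24"),
]


def network_of(address):
    """Return the known network containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for net in NETWORKS:
        if ip in net:
            return net
    return None


def split_by_reachability(source, target):
    """Split the target's addresses into (reachable, unreachable) from the source.

    Assumption: reachable means the source has a NIC on the same network.
    """
    source_nets = {network_of(a) for a in HOSTS[source]}
    reachable = [a for a in HOSTS[target] if network_of(a) in source_nets]
    unreachable = [a for a in HOSTS[target] if network_of(a) not in source_nets]
    return reachable, unreachable


print(split_by_reachability("VEEAM01", "VEEAMVSAN1"))
```

Under that assumption, any connection attempt from VEEAM01 to a 10.0.99.x address has no path, which matches the address that appears in the errors reported in this post.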

What I have found is a ton of errors in the task log similar to the ones below:

.2016 18:33:27] <52> Error    Failed to connect to agent's endpoint '10.0.99.27:2505'. Host: 'VeeamVSAN1'.
[19.12.2016 18:33:27] <52> Error    A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.0.99.27:2505 (System.Net.Sockets.SocketException)
[19.12.2016 18:33:27] <52> Error       at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
[19.12.2016 18:33:27] <52> Error       at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
[19.12.2016 18:33:27] <52> Error       at Veeam.Backup.Common.CNetSocket.Connect(IPEndPoint remoteEp)
[19.12.2016 18:33:27] <52> Error       at Veeam.Backup.AgentProvider.CAgentEndpointConnecter.ConnectToAgentEndpoint(ISocket socket, IAgentEndPoint endPoint

This results in the job taking an absolute age to start.
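For anyone wanting to see which endpoints account for these errors, the following Python sketch counts the "Failed to connect to agent's endpoint" lines per host and address. The regular expression is derived only from the log lines quoted in this thread, so other log formats may not match, and the names `ENDPOINT_FAILURE` and `count_failed_endpoints` are hypothetical.

```python
import re
from collections import Counter

# Pattern based solely on the error lines quoted in this thread.
ENDPOINT_FAILURE = re.compile(
    r"Failed to connect to agent's endpoint '(?P<ip>[^':]+):(?P<port>\d+)'\. "
    r"Host: '(?P<host>[^']+)'"
)


def count_failed_endpoints(lines):
    """Count failed agent endpoint connections per (host, ip, port)."""
    counts = Counter()
    for line in lines:
        match = ENDPOINT_FAILURE.search(line)
        if match:
            key = (match["host"], match["ip"], int(match["port"]))
            counts[key] += 1
    return counts


# Sample lines copied from the excerpt above (the first timestamp is cut off in the post).
sample = [
    ".2016 18:33:27] <52> Error    Failed to connect to agent's endpoint '10.0.99.27:2505'. Host: 'VeeamVSAN1'.",
    "[19.12.2016 18:33:27] <52> Error       at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)",
]

print(count_failed_endpoints(sample))
```

Feeding a whole task log through `count_failed_endpoints` should show whether the failures are confined to addresses on the non-routable network.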

As a test I added an additional NIC to VEEAM01 (the management server) on the 10.0.99.x subnet and re-ran a job. Lo and behold, the job started pretty much instantly. A couple of test jobs I re-ran have run 80% quicker because the task log no longer contains errors like those above.

So the question is, WHY does the proxy need to establish a connection with the management server on the preferred network that is assigned for data moving?

A colleague of mine is having a very similar issue, case ID 02002661.

My environment is VMware, his is Hyper-V.

Thanks,

Ian

Post by **ian0x0r** » Dec 20, 2016 10:20 am

Just to add to this quickly: this has nothing to do with the NIC binding order as defined in this article https://technet.microsoft.com/en-us/lib ... 3eedb0322f, which is not even applicable in Server 2016, as discussed in this article https://blogs.technet.microsoft.com/net ... indows-10/

Ian

Post by **Cragdoo** » Dec 20, 2016 10:32 am

Hello, case ID 02002661 is my case, and I thought I'd add a little detail.

VBR1 is located on the 172.16.236.x subnet, and the HYPV hosts are all in the 172.21.80.x subnet. VBR1 only knows the HYPV hosts on 172.21.80.x (defined in DNS), and 172.21.84.x is not routable from 172.16.236.x.

What we are seeing, similar to Ian above, are entries in the logs where the VBR server appears to be trying to establish connections on the non-routable subnet:

Failed to connect to agent's endpoint '172.21.84.x:2503'. Host: 'hypv4'.
[12.12.2016 06:52:07] <65> Error    A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 172.21.84.104:2503 (System.Net.Sockets.SocketException)
[12.12.2016 06:52:07] <65> Error       at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
[12.12.2016 06:52:07] <65> Error       at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
[12.12.2016 06:52:07] <65> Error       at Veeam.Backup.Common.CNetSocket.Connect(IPEndPoint remoteEp)
[12.12.2016 06:52:07] <65> Error       at Veeam.Backup.AgentProvider.CAgentEndpointConnecter.ConnectToAgentEndpoint(ISocket socket, IAgentEndPoint endPoint)
[12.12.2016 06:52:07] <65> Info     [NetSocket] Connect

We do have network traffic rules defined in the VBR console, but I would have thought these only apply to data traffic, not management traffic?

Hope the extra info helps

Post by **gsroute** » Dec 20, 2016 10:33 am

Hi Ian,

Case ID 02002661 is one I've opened, thanks for including it as we do have similar problems.

In my own words, what we've seen since updating to 9.5 is:

Backup jobs are taking considerably longer to run. A job can be broken down into a few sections:
1. Initialisation, where the backup server connects to everything to set up all connections between hosts, proxies and repositories.
2. Data transfer
3. Cleanup

Sections 1 and 3 are now taking a very long time, but the data transfer times are normal. For example, one job that incrementally backs up 3 VMs used to take around 10 minutes from start to finish; it now runs for 55 minutes, of which only 6 minutes are data processing. That is 10 minutes in v9 versus 55 minutes in 9.5, which we've been running since 2nd December.
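As a quick check on those figures, this small Python sketch works out how much of the 9.5 run falls outside data transfer. It uses only the numbers quoted above (10 minutes total in v9, 55 minutes total and 6 minutes of data processing in 9.5); the function name `overhead` is hypothetical.

```python
def overhead(total_minutes, transfer_minutes):
    """Return (minutes outside data transfer, their share of the total run)."""
    non_transfer = total_minutes - transfer_minutes
    return non_transfer, non_transfer / total_minutes


# Figures quoted in the post for the 9.5 run.
non_transfer, share = overhead(55, 6)

# Total run time in 9.5 relative to the roughly 10 minutes reported for v9.
slowdown = 55 / 10

print(f"{non_transfer} min outside data transfer ({share:.0%} of the run)")
print(f"{slowdown:.1f}x the v9 run time")
```

On these numbers, initialisation and cleanup together account for 49 of the 55 minutes.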

During the initialisation period I can see in the logs that the backup server tries to connect to the Hyper-V hosts and move the Changed Block Tracking data to another host acting as a proxy, then repeats this every minute until it either works or fails. It is also picking up all IP addresses on the host. Two of these IPs cannot be reached from the backup server because they are on non-routable private networks used for SMB. The repos piggyback on that same network, but that part works OK.

There is a second issue where a few VMs will not back up at all. I've moved these to other volumes and they are starting to work better, although slowly.

Thanks
Graeme.

Post by **ian0x0r** » Dec 20, 2016 10:49 am

This may be the proof that the backup manager is using the data mover network.

Post by **Cragdoo** » Jan 09, 2017 12:20 pm

Anyone care to comment?

The latest update on either case is a discussion with QA about this behaviour.

Post by **PTide** » Jan 18, 2017 4:23 pm

Hi Craig,

Cragdoo wrote: "VBR1 is located on the 172.16.236.x subnet, and the HYPV hosts are all in the 172.21.80.x subnet. VBR1 only knows the HYPV hosts on 172.21.80.x (defined in DNS), and 172.21.84.x is not routable from 172.16.236.x."

I'm a little confused by your setup. As far as I can see from the picture, the repository is located in the 172.21.84.x subnet, which is not routable from the VBR subnet. This might be a dumb question, but how did you manage to add a repo from that network? The repo should be accessible from VBR.

Thanks

Jan 18, 2017 8:33 pm

The biggest part I'm struggling with is what changed with 9.5, because the preferred network settings have, as far as I know, always prioritized connections for any communications with the Veeam agent, and that included communications from the VBR server to the agents running on proxies/repos even in earlier versions. Out of curiosity, what version were you running previously? Do you have logs of backups from when this was working? It would be interesting to compare.

I'm guessing that the delay is being exacerbated by the fact that the firewall is probably configured to silently drop TCP traffic rather than actually reject it (i.e. send a proper ICMP response that the destination is unreachable). This causes every attempt to wait until the connection timeout instead of immediately retrying. Perhaps you could tweak the firewall rules to refuse, rather than silently drop, the traffic from the VBR server to the SMB network. We would still try and fail a lot of times, but instead of waiting 20 seconds (or whatever the default timeout is), each attempt should fail almost instantly. If you can't tweak the firewall rules themselves, then you should be able to add some outbound rules to the Windows firewall to immediately reject all traffic to the unreachable network. This hides the problem more than it fixes it, but it might be workable.
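The difference between a dropped and a rejected connection can be observed with a single timed TCP connection attempt. The Python sketch below is only an illustration of that idea, not how Veeam itself connects: the function name `probe` and the 5-second default timeout are assumptions, and the real timeout used by the product is not stated in this thread.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and return (outcome, seconds_elapsed).

    "timeout" suggests packets are being silently dropped, while "refused"
    or another immediate error suggests the attempt was actively rejected.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "open"
    except ConnectionRefusedError:
        outcome = "refused"
    except socket.timeout:
        outcome = "timeout"
    except OSError as exc:
        # Other immediate failures, e.g. no route or a local firewall block.
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - start
```

Run from the management server against one of the data network endpoints in the logs, for example `probe("10.0.99.27", 2505)`, a silent drop would be expected to take the full timeout, whereas a rejection should return almost immediately. Exactly how a rejection surfaces depends on how the firewall rejects the traffic.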

Post by **ian0x0r** » Jan 30, 2017 1:54 pm

Thanks guys,

Just wanted to confirm, for my use case at least, that setting an outbound firewall rule on the management server to reject traffic rather than silently drop it has worked around the issue. There is no further delay in job initialisation.

Thanks for your replies, all, and thanks to the guys in support for putting in the time and effort to look for a resolution.

Ian
