An existing connection was forcibly closed by the remote host

Blerrie-Backups · Mar 11, 2019 6:06 pm

Hello

Hoping someone can help me with a VERY strange one here. I have been struggling with this issue for weeks now, and even with a call logged with support, finding a resolution is proving very tedious.

We have a client with the below setup:

- 1x External SQL server hosted on a vSphere platform
- 1x VEEAM server hosted on the same vSphere platform (ONLY used for VEEAM operations)
- VEEAM SOR (Scale-Out Repository) with 2x extents, comprising:
  o 2x 97TB volumes hosted on a DellEMC Unity array (727GB of FAST Cache and 194TB of NL-SAS drives)
  o These extents are presented to 2x separate physical HPE BL460 Gen8 servers:
    - 2x 8-core CPUs and 64GB memory; dual 10Gb NICs and dual 8Gb HBAs
    - 2x 8-core CPUs and 96GB memory; dual 10Gb NICs and dual 8Gb HBAs
- 1x HPE MSL with 4x LTO6 drives, connected via SAN to the 2x repository servers above
- 3x backup proxy servers used for hot-add transport to the repository servers
- 10Gb low-latency link between the PRD and DR/BCP sites (simply a lit fibre, not traversing any firewall)
- 1x DataDomain at the DR/BCP site
- 1x backup proxy at the DR/BCP site, set as the mount and gateway server for the DataDomain unit hosted on this site

Now, all was going like a Boeing until we installed Update 4 for VEEAM B&R 9.5. Since this update, we are constantly getting the below error on ANY backup, replication and copy-to-tape job:

Error: write: An existing connection was forcibly closed by the remote host. Failed to download disk. Reconnectable protocol device was closed. Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}.

We have logged 2x calls and, to our frustration, they have never gotten anywhere. We currently have one open, still with a big question mark over it. The support guys keep reverting and saying that this must be network related. There is NO antivirus installed on any of the VEEAM servers in the equation, nor are there any firewalls enabled or configured between, or on, the above servers, thus ruling out the usual suspects.

We suspect this error is related to either the software, or a configuration on the VEEAM repository servers, thus we did the below:

- Checked and updated the firmware and drivers of all components on these 2x servers
- Checked for any outstanding Windows updates on all 6x servers in the VEEAM B&R environment
- Implemented the following KB and then, after a failure, reverted it: https://www.veeam.com/kb1781

The fact that this error pops up even when we copy data from the same repositories that the tape library is connected to, thus not even traversing the LAN, makes me think that support is looking in the wrong place for a solution to our issue.

What we have noticed on the replication and copy-to-tape jobs is that everything runs as it should at first, with 20 streams active and the rest in a pending/queued state. (We see this less on the regular backup jobs, but I think that is because they run much quicker and complete in less time.) This all works awesomely, copying and replicating at record speeds, but after about 2-4 hours it all goes haywire and we start getting the “connection was forcibly closed” error on running and even queued tasks, causing a mass failure.

When a scheduled job then tries to run, the job starts, the snapshots get created and all is ready to add the drives to the hot-add proxies, but then we encounter the connection issue. (Example below)

2019/03/11 4:20:35 PM :: Queued for processing at 2019/03/11 4:20:35 PM  
2019/03/11 4:20:35 PM :: Required backup infrastructure resources have been assigned  
2019/03/11 4:20:35 PM :: Using DAPVDR03-Extent-01 scale-out repository extent  
2019/03/11 4:20:40 PM :: VM processing started at 2019/03/11 4:20:40 PM  
2019/03/11 4:20:40 PM :: VM size: 3,3 TB  
2019/03/11 4:20:45 PM :: Getting VM info from vSphere  
2019/03/11 4:20:50 PM :: Production datastore VNX20-0103-ATP01-006 is getting low on free space (610,6 GB left), and may run out of free disk space completely due to open snapshots.  
2019/03/11 4:20:50 PM :: Creating VM snapshot  
2019/03/11 4:21:03 PM :: Saving [VNX20-0103-ATP01-006] DAPFIL04/DAPFIL04.vmx  
2019/03/11 4:21:03 PM :: Saving [VNX20-0103-ATP01-006] DAPFIL04/DAPFIL04.vmxf  
2019/03/11 4:21:04 PM :: Saving [VNX20-0103-ATP01-006] DAPFIL04/DAPFIL04.nvram  
2019/03/11 4:21:04 PM :: Using backup proxy dapvbr06.directaxis.co.za for disk Hard disk 1 [hotadd]  
2019/03/11 4:21:04 PM :: Using backup proxy dapvbr05.directaxis.co.za for disk Hard disk 2 [hotadd]  
2019/03/11 4:21:04 PM :: Using backup proxy dapvbr04.directaxis.co.za for disk Hard disk 3 [hotadd]  
2019/03/11 4:21:39 PM :: Hard disk 1 (0,0 B) 0,0 B read at 0 KB/s [CBT] 
2019/03/11 4:22:03 PM :: Hard disk 3 (0,0 B) 0,0 B read at 0 KB/s [CBT] 
2019/03/11 4:22:19 PM :: Hard disk 2 (0,0 B) 0,0 B read at 0 KB/s [CBT] 
2019/03/11 4:23:55 PM :: Removing VM snapshot  
2019/03/11 4:25:20 PM :: Error: write: An existing connection was forcibly closed by the remote host
Failed to download disk.
Reconnectable protocol device was closed.
Failed to upload disk.
Agent failed to process method {DataTransfer.SyncDisk}.
  
2019/03/11 4:25:20 PM :: Network traffic verification detected no corrupted blocks  
2019/03/11 4:25:20 PM :: Processing finished with errors at 2019/03/11 4:25:20 PM
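
For anyone comparing several failed sessions, a small script can pull the timestamps out of a job log like the one above and show how long each step took before the failure. This is only a sketch: it assumes the `YYYY/MM/DD h:mm:ss AM/PM :: message` layout seen in this excerpt, and the names `parse_job_log` and `step_gaps` are made up for illustration; they are not part of any VEEAM tooling.

```python
from datetime import datetime

# Timestamp layout assumed from the excerpt above, e.g. "2019/03/11 4:20:35 PM".
STAMP_FORMAT = "%Y/%m/%d %I:%M:%S %p"


def parse_job_log(text):
    """Return (timestamp, message) pairs; lines without a timestamp are
    treated as continuations of the previous message."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        stamp, sep, message = line.partition(" :: ")
        if sep:
            try:
                when = datetime.strptime(stamp, STAMP_FORMAT)
            except ValueError:
                when = None
            if when is not None:
                entries.append((when, message.strip()))
                continue
        if entries:
            prev_when, prev_msg = entries[-1]
            entries[-1] = (prev_when, prev_msg + " " + line)
    return entries


def step_gaps(entries):
    """Seconds elapsed between each entry and the one before it."""
    return [
        (int((later - earlier).total_seconds()), message)
        for (earlier, _), (later, message) in zip(entries, entries[1:])
    ]


sample = """\
2019/03/11 4:23:55 PM :: Removing VM snapshot
2019/03/11 4:25:20 PM :: Error: write: An existing connection was forcibly closed by the remote host
Failed to download disk.
2019/03/11 4:25:20 PM :: Processing finished with errors at 2019/03/11 4:25:20 PM
"""

for seconds, message in step_gaps(parse_job_log(sample)):
    print(f"+{seconds:>4}s  {message}")
```

On the three entries used here, the error is reported 85 seconds after the snapshot removal starts. Lining up that kind of gap across several failed sessions may help show whether the failure always lands at the same stage.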

The best part is how we managed to work around this issue to ensure some level of backups in the customer’s environment. Once this error is encountered, or before the scheduled backups are due to start, we simply reboot the 2x repository servers. Once these come back up, all backups run with no issues.

This makes me think that this is related to some sort of buffer that fills and then only gets cleared on a reboot of the server. And note, ONLY the 2x repository servers are rebooted, none of the other servers in the VEEAM estate.

Hopefully someone has recently run into this exact same issue, as ANY help would be appreciated at this time.

bdufour · Mar 11, 2019 6:32 pm

have you tried another transport mode? like network (just to test)? also, what are the bottleneck stats when things are working?

Blerrie-Backups · Mar 11, 2019 6:43 pm

Hello Bdufour

Network transport is not really an option, as it runs at ±180MB/s compared to 480MB/s plus with hot-add, thus 3-4 times less performance going that way.

The bottleneck stats are:

Load: Source 70% > Proxy 26% > Network 50% > Target 22%
Primary bottleneck: Source
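
To track these figures between reboots, the load line can be turned into numbers and logged after each run. A minimal sketch, assuming the `Load:` line always has the layout shown above; `parse_load` and `busiest_stage` are hypothetical helper names, not VEEAM functions.

```python
import re


def parse_load(line):
    """Turn 'Load: Source 70% > Proxy 26% > ...' into {'Source': 70, ...}."""
    return {name: int(pct) for name, pct in re.findall(r"(\w+)\s+(\d+)%", line)}


def busiest_stage(load):
    """Name of the stage with the highest busy percentage."""
    return max(load, key=load.get)


load = parse_load("Load: Source 70% > Proxy 26% > Network 50% > Target 22%")
print(load)
print("Primary bottleneck:", busiest_stage(load))
```

Collecting the Target value from each session would show whether it creeps up in the hours before the errors start.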

bdufour · Mar 11, 2019 8:53 pm

I meant just try network transport for testing purposes, not long term. That way you can narrow it down even further, as it seems you have done some good work already. It's really easy to switch between transport modes.

Your bottleneck stats don't look bad. It's weird that restarting the repo servers fixes it, as your target is the lowest in the mix. I suspect these stats are from after a fresh restart of the repo servers. If the jobs run for a few days after the restart, I'd also suspect that the target percentage will grow much higher, up until the servers need to be restarted.

Since restarting the repo servers seems to fix it, have a look on these servers before you restart and see if something obvious is eating up resources in resource manager (if they're Windows servers).

Blerrie-Backups · Mar 12, 2019 3:58 am

Hello

Yep, they are Windows servers, Windows Server 2016 to be exact.

We have checked resource usage when this issue is encountered, but nothing on these servers is under heavy strain. What I also noticed is that when you disable the replication job, this does not seem to happen as often. Keep in mind that the replication job puts more strain on the environment, as all recent backups need to be replicated to the DR/BCP site. But it is only set to replicate in non-backup hours, so the environment is not supposed to be under too much strain.

Will see if I can force this issue again and then test with network transport to see if we get the same issue.

Will report back soon.

Blerrie-Backups · Mar 14, 2019 11:30 am

Hello

OK, FINALLY got to the bottom of the thing causing my headaches. (Case ID: 03444709)

The issue was related to a rogue copy job that is configured to copy all backed-up data from the PRD site to the DR/BCP site on a daily basis (±130TB full and ±13TB incremental). The problem is that this copy job has not completed successfully for some time, due to the changes that were made on the SOR. This caused the streams to fill up on both the source and the destination. About 2-4 hours into the copy process, something happens and the job fails with the “connection was forcibly closed” error.

The failed job then does NOT release its streams or sessions on the repository servers. When a regular backup job runs, it gets the exact same “connection was closed by remote host” error. Rebooting the repository servers frees up these streams/sessions.
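
One rough way to look for sessions that are not being released, before resorting to a reboot, is to count TCP connections per state on a repository server. The sketch below parses text in the layout printed by `netstat -ano` on Windows. The port range is an assumption: 2500-5000 is my understanding of the default data transmission range for 9.5, so check it against your own configuration. `count_states` is a hypothetical helper, not a VEEAM tool.

```python
from collections import Counter


def count_states(netstat_text, port_range=(2500, 5000)):
    """Count TCP connection states for local ports inside port_range.

    port_range is an ASSUMPTION about the data transmission ports in use;
    check the actual range configured in your environment.
    """
    low, high = port_range
    states = Counter()
    for line in netstat_text.splitlines():
        parts = line.split()
        # TCP rows look like: TCP <local> <foreign> <state> <pid>
        if len(parts) < 4 or parts[0] != "TCP":
            continue
        port_text = parts[1].rsplit(":", 1)[-1]
        if port_text.isdigit() and low <= int(port_text) <= high:
            states[parts[3]] += 1
    return states


# Made-up sample rows in the layout printed by `netstat -ano` on Windows.
sample = """\
  Proto  Local Address          Foreign Address        State           PID
  TCP    10.0.0.5:2500          10.0.0.9:51234         ESTABLISHED     4321
  TCP    10.0.0.5:2501          10.0.0.9:51240         CLOSE_WAIT      4321
  TCP    10.0.0.5:2502          10.0.0.9:51241         CLOSE_WAIT      4321
  TCP    10.0.0.5:445           10.0.0.7:50001         ESTABLISHED     4
"""

print(dict(count_states(sample)))
```

If the counts keep climbing after all jobs have finished, that would be consistent with connections piling up on the server, although it would not by itself prove the cause.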

We have since recreated this copy job and are in the process of phasing the backups into this job and we have yet to run into this issue again.

I THINK this might be some sort of bug/issue that VEEAM might need to look at and fix in the near future.

shartma · Apr 18, 2019 4:07 pm

I have been seeing the EXACT same issues since 9.5 Update 4. Support has been less than helpful. I am going to try disabling all of my copy jobs, rebooting the servers, and seeing how it goes. BTW, 9.5 Update 4a does not fix the issue.

Thanks for your post!!!!!

foggy · Apr 18, 2019 4:40 pm

Could you please also share the support case number? Thanks!

shartma · Apr 22, 2019 2:02 pm

Here is the case # 03453850

Update: Disabling Backup Copy and Replication jobs DID NOT fix my issues. As stated by the original poster @Blerrie-Backups, "This makes me think that this is related to some sort of buffer that fills and then only gets cleared on a reboot of the server." These are my exact symptoms, with the same temporary fix.

jvlad · Jul 05, 2019 3:53 pm

I also did not have issues prior to Update 4a. I waited for 4a to make sure major bugs were worked out, but this one apparently still persists.

My backup copy jobs fail randomly with "existing connection was forcibly closed by the remote host". They keep pointing to my WAN, but it worked perfectly fine before.
For a while it looked like my WAN accelerator was the issue: I disabled it and things worked fine for a few days, and then the "connection was forcibly closed by remote host" errors started again.

If I reboot the server, it also seems to work fine for a bit and then starts again.

I too believe this is related to the update somehow.

veremin · Jul 05, 2019 4:20 pm

If you haven't solved your issue yet, then work with the support team directly on finding the root cause.

And remember to escalate the case whenever you feel unsatisfied with the level of support provided.

Thanks!

merku · Nov 06, 2022 3:48 pm

shartma wrote (Apr 22, 2019 2:02 pm): Here is the case # 03453850

Update: Disabling Backup Copy and Replication jobs DID NOT fix my issues. As stated by the original poster @Blerrie-Backups, "This makes me think that this is related to some sort of buffer that fills and then only gets cleared on a reboot of the server." These are my exact symptoms, with the same temporary fix.

Hello,
Did you ever find a solution? It seems I have the same problem on version 10.0.1.4854.
Sometimes I get errors like "An existing connection was forcibly closed by the remote host". After I reboot the backup server the errors disappear, then reappear after some time.

Thanks

jvlad · Nov 07, 2022 2:46 pm

In my case, a lot of these "An existing connection was forcibly closed by the remote host" errors seem to have gone away after I disabled backing up the DB transaction log files to a remote DRP site using a BackupCopy job.
Edit the BackupCopy job; under the "Object" section there is an option, "Include database transaction log backups (increased bandwidth usage)", which I have now unchecked.
For us, disabling this was not the end of the world, but for others it might be.

I disabled this in an attempt to fix an unrelated issue with random Citrix timeouts for the client at that remote site, which it did. That made me think the bandwidth limiter within Veeam does not limit these transaction log file copies across the WAN. As another benefit, I noticed I was also no longer getting "An existing connection was forcibly closed by the remote host" within my other backup copy jobs.

I too had to restart the Veeam servers once a week to temporarily resolve the "An existing connection was forcibly closed by the remote host" errors.

Anyway, disabling the BackupCopy DB transaction log backup to the DRP site fixed 2 different timeout issues for me, and it has been a couple of months now.
