Hoping someone can help me with this VERY strange one. I have been struggling with this issue for weeks now, and even with a call logged with support, finding a resolution is proving very tedious.
We have a client with the below setup:
- 1x External SQL server hosted on a vSphere platform
- 1x VEEAM server hosted on the same vSphere platform (ONLY used for VEEAM operations)
- VEEAM SOR (Scale Out Repository) with 2x extents, comprising:
  o 2x 97TB volumes hosted on a DellEMC Unity array (727GB of FAST cache and 194TB of NL-SAS drives)
  o These extents are presented to 2x separate physical HPE BL460 Gen8 servers:
      2x 8-core CPUs and 64GB memory; dual 10Gb NICs and dual 8Gb HBAs
      2x 8-core CPUs and 96GB memory; dual 10Gb NICs and dual 8Gb HBAs
- 1x HPE MSL with 4x LTO6 drives connected via SAN to the 2x repository servers above.
- 3x Backup proxy servers used for hot-add transports to the repository servers
- 10Gb low latency link between PRD and DR/BCP site (simply a lit fibre link, not traversing any firewall)
- 1x DataDomain at the DR/BCP site
- 1x Backup proxy at the DR/BCP site, set as the mount and gateway server for the DataDomain unit hosted on this site
Now, all was going like a Boeing till we installed Update 4 for VEEAM B&R 9.5. Since this update, we are constantly getting the below error on ANY backup, replication and copy to tape job:
Error: write: An existing connection was forcibly closed by the remote host. Failed to download disk. Reconnectable protocol device was closed. Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}.
We have logged 2x calls and, to our frustration, neither has gotten anywhere. We have one open at the moment, with a big question mark over it. The support guys keep on coming back and saying that this must be network related. There is NO antivirus installed on any of the VEEAM servers in the equation, nor are there any firewalls enabled or configured between, or on, the above servers, thus ruling out the usual suspects.
We suspect this error is related to either the software, or a configuration on the VEEAM repository servers, thus we did the below:
- Checked and updated the firmware and drivers of all components on these 2x servers
- Checked for any outstanding Windows updates on all 6x servers in the VEEAM B&R environment
- Implemented and then, after a failure, reverted the following KB: https://www.veeam.com/kb1781
The fact that this error pops up even when we copy data from the same repositories that the tape library is connected to, thus not even traversing the LAN, makes me think that support is looking in the wrong place for a solution to our issue.
What we have noticed on the replication and copy to tape jobs (not so much on the regular backup jobs, but I think that is because those run much quicker and complete in less time) is that everything runs as it should at first, with 20 streams active and the rest in a pending/queued state. This all works awesome, copying and replicating at record speeds, but after about 2-4 hours all goes haywire and we start getting the “connection was forcibly closed” error on running and even queued tasks, causing a mass failure.
When a scheduled job tries to run, the job starts, the snapshots get created and all is ready to add the drives to the hot-add proxies, but then we hit the connection issue (example below):
2019/03/11 4:20:35 PM :: Queued for processing at 2019/03/11 4:20:35 PM
2019/03/11 4:20:35 PM :: Required backup infrastructure resources have been assigned
2019/03/11 4:20:35 PM :: Using DAPVDR03-Extent-01 scale-out repository extent
2019/03/11 4:20:40 PM :: VM processing started at 2019/03/11 4:20:40 PM
2019/03/11 4:20:40 PM :: VM size: 3,3 TB
2019/03/11 4:20:45 PM :: Getting VM info from vSphere
2019/03/11 4:20:50 PM :: Production datastore VNX20-0103-ATP01-006 is getting low on free space (610,6 GB left), and may run out of free disk space completely due to open snapshots.
2019/03/11 4:20:50 PM :: Creating VM snapshot
2019/03/11 4:21:03 PM :: Saving [VNX20-0103-ATP01-006] DAPFIL04/DAPFIL04.vmx
2019/03/11 4:21:03 PM :: Saving [VNX20-0103-ATP01-006] DAPFIL04/DAPFIL04.vmxf
2019/03/11 4:21:04 PM :: Saving [VNX20-0103-ATP01-006] DAPFIL04/DAPFIL04.nvram
2019/03/11 4:21:04 PM :: Using backup proxy dapvbr06.directaxis.co.za for disk Hard disk 1 [hotadd]
2019/03/11 4:21:04 PM :: Using backup proxy dapvbr05.directaxis.co.za for disk Hard disk 2 [hotadd]
2019/03/11 4:21:04 PM :: Using backup proxy dapvbr04.directaxis.co.za for disk Hard disk 3 [hotadd]
2019/03/11 4:21:39 PM :: Hard disk 1 (0,0 B) 0,0 B read at 0 KB/s [CBT]
2019/03/11 4:22:03 PM :: Hard disk 3 (0,0 B) 0,0 B read at 0 KB/s [CBT]
2019/03/11 4:22:19 PM :: Hard disk 2 (0,0 B) 0,0 B read at 0 KB/s [CBT]
2019/03/11 4:23:55 PM :: Removing VM snapshot
2019/03/11 4:25:20 PM :: Error: write: An existing connection was forcibly closed by the remote host
Failed to download disk.
Reconnectable protocol device was closed.
Failed to upload disk.
Agent failed to process method {DataTransfer.SyncDisk}.
2019/03/11 4:25:20 PM :: Network traffic verification detected no corrupted blocks
2019/03/11 4:25:20 PM :: Processing finished with errors at 2019/03/11 4:25:20 PM
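To compare several failed sessions, it can help to measure how long each one ran before the first "forcibly closed" error. Below is a minimal Python sketch, not a Veeam tool, that assumes the session log has been copied out as plain text in the same "timestamp :: message" layout as above. The function names and the `SAMPLE` text are made up for illustration.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above: "2019/03/11 4:20:35 PM :: message"
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{1,2}:\d{2}:\d{2} [AP]M) :: (.*)$")
TS_FORMAT = "%Y/%m/%d %I:%M:%S %p"


def parse_job_log(text):
    """Return a list of (timestamp, message) tuples.

    Lines without a timestamp (the continuation lines of a multi-line
    error) are appended to the message of the previous entry.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), TS_FORMAT)
            entries.append((stamp, match.group(2)))
        elif entries:
            stamp, message = entries[-1]
            entries[-1] = (stamp, message + " " + line)
    return entries


def seconds_until_first_error(entries, needle="forcibly closed"):
    """Seconds from the first log line to the first message containing needle."""
    if not entries:
        return None
    start = entries[0][0]
    for stamp, message in entries:
        if needle in message:
            return (stamp - start).total_seconds()
    return None


# Shortened copy of the session log above, used as sample input.
SAMPLE = """\
2019/03/11 4:20:35 PM :: Queued for processing at 2019/03/11 4:20:35 PM
2019/03/11 4:20:50 PM :: Creating VM snapshot
2019/03/11 4:23:55 PM :: Removing VM snapshot
2019/03/11 4:25:20 PM :: Error: write: An existing connection was forcibly closed by the remote host
Failed to download disk.
Agent failed to process method {DataTransfer.SyncDisk}.
2019/03/11 4:25:20 PM :: Processing finished with errors at 2019/03/11 4:25:20 PM
"""

if __name__ == "__main__":
    print(seconds_until_first_error(parse_job_log(SAMPLE)))
```

Running this over every failed session since the last repository reboot would show whether the failures really cluster around the 2-4 hour mark or are spread more randomly.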
This makes me think that this is related to some sort of buffer that fills up and then only gets cleared on a reboot of the server. And note, ONLY the 2x repository servers are rebooted, none of the other servers in the VEEAM estate.
Hopefully someone has recently run into this exact same issue, as ANY help would be appreciated at this time.
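One rough way to gather evidence for or against this hunch is to sample the TCP connection table on the repository servers every few minutes and watch whether any state keeps growing until the failures start. The Python sketch below only counts states in text captured from the standard Windows `netstat -an` command. It is an assumption that the "buffer" is connection or port related at all; nothing above confirms that. The function name and the sample addresses are made up for illustration.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state in `netstat -an` style output.

    Expected columns for TCP rows: Proto, Local Address, Foreign Address,
    State. Header lines and UDP rows (which carry no state) are skipped.
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3]] += 1
    return states


# Hypothetical sample in the layout Windows netstat prints; addresses are invented.
SAMPLE_NETSTAT = """\
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
  TCP    10.0.0.5:49712         10.0.0.9:2500          ESTABLISHED
  TCP    10.0.0.5:49713         10.0.0.9:2501          ESTABLISHED
  TCP    10.0.0.5:49714         10.0.0.9:2502          TIME_WAIT
  UDP    0.0.0.0:123            *:*
"""

if __name__ == "__main__":
    for state, count in sorted(count_tcp_states(SAMPLE_NETSTAT).items()):
        print(state, count)
```

If the counts for one state climb steadily over the 2-4 hours before a mass failure and drop back after a reboot, that would be something concrete to hand to support. If they stay flat, the connection table can probably be ruled out.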