-
- Veteran
- Posts: 319
- Liked: 49 times
- Joined: Apr 07, 2015 1:53 pm
- Full Name: James Wilmoth
- Location: Kannapolis, North Carolina, USA
- Contact:
Socket has been forcibly closed
Veeam Backup & Replication 9.5.0.1038
Background:
We were using a copy job to send backups across a 100/100 VPN connection to a NAS.
The VMs were copying OK, but the diff merge kept taking far too long. I think the job was attempting to pull the data back across the tunnel to perform the merge locally.
So two days ago:
We deployed an Ubuntu Server 16.04.3 LTS server (a VM) in our COLO.
I mounted the NAS share via NFS.
I added the Linux server in Veeam and created a Linux repo mapping to that NFS path.
I removed the copy jobs from the configuration database, rescanned the new repo, and then mapped the job to the rescanned backups.
Issue:
The job runs flawlessly for 15 minutes.
Then the copying VMs fail with "Error: Socket has been forcibly closed. Failed to recover connection."
The pending VMs eventually fail as well (without copying any data).
Things I have tried:
Dropping from 4 simultaneous tasks to 1; no change in symptom.
Throttling; no change in symptom.
Increasing RAM in Linux VM from 4 to 8GB; no change in symptom.
Monitored a continuous ping test across the VPN to the Linux VM and to the NAS repo, and also monitored the NFS mount; no issues, even as the symptom occurred consistently.
I opened a ticket with Veeam and submitted logs. Still waiting for a solid indication of the root cause and resolution.
Any ideas?
This is my NFS mount entry from /etc/fstab, just in case someone asks:
192.168.5.20:/volume1/Veeam /media/backups nfs hard,intr,nfsvers=3,tcp,bg,_netdev,auto,nolock 0 0
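For anyone unpacking that entry: it is a standard six-field fstab line. A small sketch that splits it into its fields and options (a hypothetical helper for illustration only, not part of Veeam or the NAS tooling):

```python
# Parse an /etc/fstab line into its six fields and its option list.
# Hypothetical helper for illustration; not part of any Veeam tooling.

def parse_fstab_line(line):
    device, mountpoint, fstype, options, dump, fsck = line.split()
    return {
        "device": device,               # NFS export, e.g. 192.168.5.20:/volume1/Veeam
        "mountpoint": mountpoint,       # local path the Linux repo maps to
        "fstype": fstype,               # nfs
        "options": options.split(","),  # hard, intr, nfsvers=3, tcp, ...
        "dump": int(dump),
        "fsck_order": int(fsck),
    }

entry = parse_fstab_line(
    "192.168.5.20:/volume1/Veeam /media/backups nfs "
    "hard,intr,nfsvers=3,tcp,bg,_netdev,auto,nolock 0 0"
)
assert "hard" in entry["options"] and "tcp" in entry["options"]
```

The relevant bits for this thread: `hard` makes NFS operations retry indefinitely rather than error out, and `tcp` forces TCP transport, which matters once a stateful firewall is timing out idle TCP flows.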
-
- Veteran
- Posts: 298
- Liked: 85 times
- Joined: Feb 16, 2017 8:05 pm
- Contact:
Re: Socket has been forcibly closed
Have you checked firewall(s)?
Have you looked in the OS log files for additional clues?
-
- Veeam Software
- Posts: 21164
- Liked: 2148 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Socket has been forcibly closed
Hi James, please provide the support case ID for this issue. Thanks.
-
- Veteran
- Posts: 319
- Liked: 49 times
- Joined: Apr 07, 2015 1:53 pm
- Full Name: James Wilmoth
- Location: Kannapolis, North Carolina, USA
- Contact:
Re: Socket has been forcibly closed
The support case number is 02368996.
The source-side firewall is a Dell SonicWall. The destination is a Cisco router. Today I combed through the SonicWall GUI's VPN settings and tweaked a few things based on information I found online. Unfortunately, that didn't affect the symptom, so I set the variables back to what they were. I also reviewed the logs and noticed something there that I brought to our network engineer's attention, possibly related to the two routers negotiating.
The odd thing is that when I run a continuous ping test from the B&R server to the Linux VM and also to the repo at the destination, the ping tests are virtually flawless: low latency, maybe one packet dropped every couple hundred.
Historically, the symptom has occurred right at about 15 minutes. But once today it continued to about 30 minutes and then did the same thing.
-
- Veteran
- Posts: 319
- Liked: 49 times
- Joined: Apr 07, 2015 1:53 pm
- Full Name: James Wilmoth
- Location: Kannapolis, North Carolina, USA
- Contact:
Re: Socket has been forcibly closed
I ran the job again this morning while monitoring the Linux agent logs. The job was running fine for about 15 minutes and then the connection died and this showed up:
What I don't understand is why Veeam B&R says, "11/8/2017 8:30:13 AM :: Processing [vm name xyz] Error: Socket has been forcibly closed. Failed to recover connection," when, if I just right-click > Sync Now and start a new cycle, it runs fine for another 15 minutes or so. Why does the job just hang? What is it waiting for?
Code:
Every 1.0s: tail -n 5 Agent.Copy_job.Target_cli.log Wed Nov 8 08:31:09 2017
[08.11.2017 08:23:46] <140394612381440> cli| - 14%, workload rps: 99/1/0, client: {381a4508-bd7f-5a43-9228-a15d876ffc84}
[08.11.2017 08:23:57] <140394612381440> cli| - 14%, workload rps: 99/1/0, client: {381a4508-bd7f-5a43-9228-a15d876ffc84}
[08.11.2017 08:24:08] <140394612381440> cli| - 14%, workload rps: 99/1/0, client: {381a4508-bd7f-5a43-9228-a15d876ffc84}
[08.11.2017 08:24:17] <49617024> cli| WARN|Input stream was closed.
[08.11.2017 08:24:17] <49617024> cli| ERR |Signal [1] received. Raising signal for default handling.
-
- Veteran
- Posts: 298
- Liked: 85 times
- Joined: Feb 16, 2017 8:05 pm
- Contact:
Re: Socket has been forcibly closed
I wonder if it could be a port issue on one of the machines. See this link for Veeam port usage: https://helpcenter.veeam.com/docs/backu ... tml?ver=95
Have you checked the event logs on the Windows VBR server? What about the firewall on that same server? I've had somewhat similar issues with some machines in my environment, and one thing I've learned is to always check the machine's firewall.
You can use an interesting tool called CryPing; it lets you ping a specific TCP port and runs on Windows. I've found it to be useful.
Hope this helps.
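For anyone reading later without CryPing handy, the core of that check (can I complete a TCP handshake on a given port?) can be sketched in a few lines of Python; the host and ports below are the ones from this thread, so substitute your own:

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Attempt a full TCP handshake; return True if the port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage (substitute your own host):
#   tcp_port_open("192.168.5.21", 22)    # SSH to the Linux repo
#   tcp_port_open("192.168.5.21", 2500)  # the negotiated Veeam data port
```

Unlike ICMP ping, this exercises the same stateful TCP path through the firewall that the backup traffic uses, which is exactly why it caught what the ping test missed.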
-
- Veteran
- Posts: 319
- Liked: 49 times
- Joined: Apr 07, 2015 1:53 pm
- Full Name: James Wilmoth
- Location: Kannapolis, North Carolina, USA
- Contact:
Re: Socket has been forcibly closed
CryPing was actually VERY helpful and a new tool in my toolbox--thanks!!
So here is what I did:
- Configured Notepad++ to refresh automatically
- Started CryPing.exe -p 22 -t 192.168.5.21 > results22.txt
- Started another CryPing.exe -p 2500 -t 192.168.5.21 > results2500.txt (since I could see in the log below that 2500 was the negotiated port)
- Opened these two files and C:\ProgramData\Veeam\Backup\Copy_job\Job.Copy_job.BackupSync.log in Notepad++ (with silent refresh on etc)
- SSH'd to Linux backup repo server and enabled firewall and added rules to allow port 22 and 2500-5000/tcp
- On same, used Screen to split and monitor /var/log/ufw.log and Top
- Kicked off job
Results:
- Ping to SSH was always successful and remained successful throughout.
- Ping to 2500 was only successful once the job started. I'm assuming this is because the Linux agent only listens after it is activated by the SSH command from B&R.
- The job ran flawlessly at about 15MB/s for approximately 15 minutes. Then the connection died. See log excerpt below.
- Ping to 2500 started timing out at the same time.
- UFW on Linux backup repo did not report any denies or drops.
Log excerpt from C:\ProgramData\Veeam\Backup\Copy_job\Job.Copy_job.BackupSync.log. This excerpt starts right after one VM successfully copied and the next one started. Identifying characters have been replaced with XYZ etc.:
Code:
[08.11.2017 15:33:19] <01> Info [RemoteBackupTaskBuilder] Skipping VM XYZ-DC1 processing: completed
[08.11.2017 15:33:20] <01> Info [NewRemoteOibAction] About to process source Oib (id='dd3539d0-564f-4b92-8f22-59b8dde294c2', CreationTime='11/7/2017 10:33:29 PM', Point='\\XYZ-SYN-001\Veeam\Backup of Host1\Backup of Host1D2017-11-07T220047.vbk') into the target storage: '50879984-6da7-441e-9bf5-f1436b3e090b'
[08.11.2017 15:33:20] <01> Info [ViTaskBuilder] Found source Oib (Storage: '\\XYZ-SYN-001\Veeam\Backup of Host1\Backup of Host1D2017-11-07T220047.vbk', CreationTime: '11/7/2017 10:33:29 PM') for VM 'XYZ-Print'
[08.11.2017 15:33:20] <01> Info Task total size: 60.0 GB
[08.11.2017 15:33:20] <01> Info Task progress: '0%', '0' of '64,424,509,440' bytes, object '0' of '0'
[08.11.2017 15:33:20] <01> Info Task total size: 60.0 GB
[08.11.2017 15:33:20] <01> Info Task progress: '0%', '0' of '64,424,509,440' bytes, object '0' of '0'
[08.11.2017 15:33:20] <01> Info [RemoteBackupTaskBuilder] Preparing new task for execution. XYZ-Print
[08.11.2017 15:33:20] <01> Info [IdleStateVmController] Stopping idle
[08.11.2017 15:33:20] <01> Info [TaskExecutor] Preparing task XYZ-Print
[08.11.2017 15:33:20] <01> Info [TaskRepositoryResourceHelper] Adding target repository resource request to task XYZ-Print
[08.11.2017 15:33:20] <01> Info [Id: 1db0ba92-3466-4569-990a-d034a75679c9, Type: vbk, State: Existing, IsSlotNeeded: True, Access: Read, ObjectId: <no object id>]
[08.11.2017 15:33:20] <01> Info [TaskRepositoryResourceHelper] Adding source repository resource request to task XYZ-Print
[08.11.2017 15:33:20] <01> Info [Id: 1db0ba92-3466-4569-990a-d034a75679c9, Type: vbk, State: Existing, IsSlotNeeded: True, Access: Read, ObjectId: <no object id>]
[08.11.2017 15:33:20] <01> Info [TaskRepositoryResourceHelper] Adding file sessions resource request for 43 files to task XYZ-Print
[08.11.2017 15:33:20] <01> Info Job size: 16.8 GB (16.8 GB provisioned).
[08.11.2017 15:33:20] <01> Info [TaskExecutor] Task XYZ-Print prepared
[08.11.2017 15:33:20] <01> Info [ParallelTaskExecutor] Starting task XYZ-Print execution
[08.11.2017 15:33:20] <01> Info Waiting for backup infrastructure resources availability
[08.11.2017 15:33:28] <07> Info [BackupMetaUpdater] Generating backup meta because it was requested by the following threads:
[08.11.2017 15:33:28] <07> Info [BackupMetaUpdater] Thread <24> CPL-DSM
[08.11.2017 15:33:28] <07> Error [JobSession] LastUserCryptoSpec is null. Cannot find the user key by the last storage.
[08.11.2017 15:33:28] <07> Info Created mutex [Global\BackupMeta[0e5a6d41-f687-43d8-ad41-ea05f214fbbe]]
[08.11.2017 15:33:28] <07> Info Acquired mutex [Global\BackupMeta[0e5a6d41-f687-43d8-ad41-ea05f214fbbe]]
[08.11.2017 15:33:28] <07> Info [IncrementalMetaGenerator] Generating meta for backup [88a5f8f3-7411-4a96-820d-fbde0cb7680a], usn [2342835]
[08.11.2017 15:33:28] <07> Info [BackupMeta] Saving backup meta to '|media|backups|seed|Seed job for COLO.vbm'
[08.11.2017 15:33:28] <07> Info [Ssh] Check existing '/media/backups/seed'
[08.11.2017 15:33:28] <07> Info [Ssh] Saving binary to '/media/backups/seed/Seed job for COLO.vbm_4_tmp'.
[08.11.2017 15:33:29] <07> Info [Ssh] Check existing '/media/backups/seed/Seed job for COLO.vbm'
[08.11.2017 15:33:29] <07> Info [Ssh] Check existing '/media/backups/seed/Seed job for COLO.vbm'
[08.11.2017 15:33:29] <07> Info [Ssh] Deleting "/media/backups/seed/Seed job for COLO.vbm"
[08.11.2017 15:33:29] <07> Info [Ssh] Check existing '/media/backups/seed/Seed job for COLO.vbm_4_tmp'
[08.11.2017 15:33:29] <07> Info [Backup] SetMetaUpdateTimeAndVersion, old value: [11/8/2017 3:33:13 PM,3021] new value: [11/8/2017 3:33:29 PM,3022]
[08.11.2017 15:33:29] <07> Info Releasing mutex [Global\BackupMeta[0e5a6d41-f687-43d8-ad41-ea05f214fbbe]]
[08.11.2017 15:33:29] <07> Info [BackupMetaUpdater] Backup meta generated
[08.11.2017 15:33:43] <40> Info [AP] (3202) command: 'Invoke: Network.UpdateThrottling\n{\n (EInt32) KbpsWriteBandwidth = 5760\n}\n'
[08.11.2017 15:33:43] <29> Info [AP] (3202) output: <VCPCommandResult result="true" exception="" />
[08.11.2017 15:33:43] <31> Info [AP] (3202) output: <VCPCommandArgs />
[08.11.2017 15:33:43] <31> Info [AP] (3202) output: >
[08.11.2017 15:39:46] <15> Error (SSH Transport) error: ChannelError: An existing connection was forcibly closed by the remote host
[08.11.2017 15:39:46] <15> Error [AP] (Client) error: An existing connection was forcibly closed by the remote host
[08.11.2017 15:39:46] <15> Error (Client) error: ChannelError: An existing connection was forcibly closed by the remote hostAn existing connection was forcibly closed by the remote hostChannelError: An existing connection was forcibly closed by the remote host
[08.11.2017 15:39:47] <48> Error [ReconnectableSocket][PacketAsyncReceiver] Exception on [b93e3bab-647e-4d36-8a0c-f3b6aae7eb29].
[08.11.2017 15:39:47] <48> Error Unexpected socket shutdown. (System.Exception)
[08.11.2017 15:39:47] <48> Error at Veeam.Backup.Common.Reconnect.CPacketAsyncReceiver.PushData(CSocketAsyncEventArgsEx e)
[08.11.2017 15:39:47] <48> Info [ReconnectableSocket] Nonretryable error on [b93e3bab-647e-4d36-8a0c-f3b6aae7eb29].
[08.11.2017 15:39:47] <48> Info [ReconnectableSocket] Reconnecting [b93e3bab-647e-4d36-8a0c-f3b6aae7eb29] (shutdown notification)...
[08.11.2017 15:39:47] <48> Info [NetSocket] Connect
[08.11.2017 15:39:47] <44> Error [ReconnectableSocket][PacketAsyncReceiver] Exception on [aac8507e-72bf-4412-9d64-3e490eaa53f7].
[08.11.2017 15:39:47] <44> Error Unexpected socket shutdown. (System.Exception)
[08.11.2017 15:39:47] <44> Error at Veeam.Backup.Common.Reconnect.CPacketAsyncReceiver.PushData(CSocketAsyncEventArgsEx e)
[08.11.2017 15:39:47] <44> Info [ReconnectableSocket] Nonretryable error on [aac8507e-72bf-4412-9d64-3e490eaa53f7].
[08.11.2017 15:39:47] <44> Info [ReconnectableSocket] Reconnecting [aac8507e-72bf-4412-9d64-3e490eaa53f7] (shutdown notification)...
[08.11.2017 15:39:47] <44> Info [NetSocket] Connect
[08.11.2017 15:39:48] <48> Error Failed to connect to agent's endpoint '192.168.5.21:2500'. Host: '192.168.5.21'.
[08.11.2017 15:39:48] <48> Error No connection could be made because the target machine actively refused it 192.168.5.21:2500 (System.Net.Sockets.SocketException)
[08.11.2017 15:39:48] <48> Error at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
[08.11.2017 15:39:48] <48> Error at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
[08.11.2017 15:39:48] <48> Error at Veeam.Backup.Common.CNetSocket.Connect(IPEndPoint remoteEp)
[08.11.2017 15:39:48] <48> Error at Veeam.Backup.AgentProvider.CAgentEndpointConnecter.ConnectToAgentEndpoint(ISocket socket, IAgentEndPoint endPoint)
[08.11.2017 15:39:48] <44> Error Failed to connect to agent's endpoint '192.168.5.21:2500'. Host: '192.168.5.21'.
[08.11.2017 15:39:48] <44> Error No connection could be made because the target machine actively refused it 192.168.5.21:2500 (System.Net.Sockets.SocketException)
[08.11.2017 15:39:48] <44> Error at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
[08.11.2017 15:39:48] <44> Error at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
[08.11.2017 15:39:48] <44> Error at Veeam.Backup.Common.CNetSocket.Connect(IPEndPoint remoteEp)
[08.11.2017 15:39:48] <44> Error at Veeam.Backup.AgentProvider.CAgentEndpointConnecter.ConnectToAgentEndpoint(ISocket socket, IAgentEndPoint endPoint)
[08.11.2017 15:39:49] <44> Warning [ReconnectableSocket] Failed to recover [aac8507e-72bf-4412-9d64-3e490eaa53f7].
[08.11.2017 15:39:49] <44> Error Reconnect operation was aborted. (Veeam.Backup.Common.CRegeneratedTraceException)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.CExceptionUtil.RegenTraceExc(Exception originalExc, String formatString, Object[] args)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.Reconnect.CNetworkReconnectContext.WaitOnRetry(TimeSpan interval, Exception firstChanceEx)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.Reconnect.CReconnectEngine.RunConnectCycle()
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.Reconnect.CReconnectEngine.DoReconnect()
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.Reconnect.CReconnectEngine.PerformReconnectSafe()
[08.11.2017 15:39:49] <44> Error No connection could be made because the target machine actively refused it 192.168.5.21:2500 (System.Net.Sockets.SocketException)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.AgentProvider.CAgentEndpointConnecter.ConnectToAgentEndpoint(ISocket socket, IAgentEndPoint endPoint)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.AgentProvider.CAgentEndpointConnecter.Connect(IPEndPoint& connectedEndpoint)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.AgentProvider.CReconnectableSessionStarter.Reconnect(CAgentEndpointConnecter connecter, Boolean redirectOutput)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.Reconnect.CReconnectEngine.RunConnectCycle()
[08.11.2017 15:39:49] <48> Warning [ReconnectableSocket] Failed to recover [b93e3bab-647e-4d36-8a0c-f3b6aae7eb29].
[08.11.2017 15:39:49] <44> Error Reconnect operation was aborted. (Veeam.Backup.Common.CRegeneratedTraceException)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.CExceptionUtil.RegenTraceExc(Exception originalExc, String formatString, Object[] args)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.Reconnect.CErrorPort.TryRegenExc(Exception ex)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.Reconnect.CInterlockedReconnectEngine.UnlockAfterAccess()
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.Reconnect.CPacketAsyncReceiver.PushData(CSocketAsyncEventArgsEx e)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.Reconnect.CPacketAsyncReceiver.OnRawDataReceived(Object sender, CSocketAsyncEventArgsEx e)
[08.11.2017 15:39:49] <44> Error Unexpected socket shutdown. (System.Exception)
[08.11.2017 15:39:49] <44> Error at Veeam.Backup.Common.Reconnect.CPacketAsyncReceiver.PushData(CSocketAsyncEventArgsEx e)
[08.11.2017 15:39:49] <48> Error Reconnect operation was aborted. (Veeam.Backup.Common.CRegeneratedTraceException)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.CExceptionUtil.RegenTraceExc(Exception originalExc, String formatString, Object[] args)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.Reconnect.CNetworkReconnectContext.WaitOnRetry(TimeSpan interval, Exception firstChanceEx)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.Reconnect.CReconnectEngine.RunConnectCycle()
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.Reconnect.CReconnectEngine.DoReconnect()
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.Reconnect.CReconnectEngine.PerformReconnectSafe()
[08.11.2017 15:39:49] <48> Error No connection could be made because the target machine actively refused it 192.168.5.21:2500 (System.Net.Sockets.SocketException)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.AgentProvider.CAgentEndpointConnecter.ConnectToAgentEndpoint(ISocket socket, IAgentEndPoint endPoint)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.AgentProvider.CAgentEndpointConnecter.Connect(IPEndPoint& connectedEndpoint)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.AgentProvider.CReconnectableSessionStarter.Reconnect(CAgentEndpointConnecter connecter, Boolean redirectOutput)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.Reconnect.CReconnectEngine.RunConnectCycle()
[08.11.2017 15:39:49] <48> Error Reconnect operation was aborted. (Veeam.Backup.Common.CRegeneratedTraceException)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.CExceptionUtil.RegenTraceExc(Exception originalExc, String formatString, Object[] args)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.Reconnect.CErrorPort.TryRegenExc(Exception ex)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.Reconnect.CInterlockedReconnectEngine.UnlockAfterAccess()
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.Reconnect.CPacketAsyncReceiver.PushData(CSocketAsyncEventArgsEx e)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.Reconnect.CPacketAsyncReceiver.OnRawDataReceived(Object sender, CSocketAsyncEventArgsEx e)
[08.11.2017 15:39:49] <48> Error Unexpected socket shutdown. (System.Exception)
[08.11.2017 15:39:49] <48> Error at Veeam.Backup.Common.Reconnect.CPacketAsyncReceiver.PushData(CSocketAsyncEventArgsEx e)
[08.11.2017 15:39:49] <15> Info [AP] (1309) state: closed
-
- Veteran
- Posts: 319
- Liked: 49 times
- Joined: Apr 07, 2015 1:53 pm
- Full Name: James Wilmoth
- Location: Kannapolis, North Carolina, USA
- Contact:
Re: Socket has been forcibly closed
Still getting the same symptom.
-
- Veteran
- Posts: 298
- Liked: 85 times
- Joined: Feb 16, 2017 8:05 pm
- Contact:
Re: Socket has been forcibly closed
It looks like the socket closure happened about six minutes into that log excerpt. You might find "connection reset" in one of the Linux log files; you can grep for it. If you find that string, look at the nearby entries for any indication of what is causing the issue.
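On the Linux side that's just `grep -i -B2 -A2 'connection reset' /var/log/syslog`; if you'd rather script the search across several log files, a small Python equivalent (illustrative, with hypothetical names):

```python
def find_with_context(path, needle, context=2):
    """Return (line_number, context_lines) for each log line containing
    `needle` (case-insensitive), with `context` lines on either side."""
    with open(path, errors="replace") as f:
        lines = [l.rstrip("\n") for l in f]
    hits = []
    for i, line in enumerate(lines):
        if needle.lower() in line.lower():
            hits.append((i + 1, lines[max(0, i - context): i + context + 1]))
    return hits

# e.g.: for num, ctx in find_with_context("/var/log/syslog", "connection reset"):
#           print(num, *ctx, sep="\n")
```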
Any word from Support?
-
- Veteran
- Posts: 319
- Liked: 49 times
- Joined: Apr 07, 2015 1:53 pm
- Full Name: James Wilmoth
- Location: Kannapolis, North Carolina, USA
- Contact:
Re: Socket has been forcibly closed
RESOLVED!
We increased the SonicWall TCP connection timeout from 20 minutes to 60 minutes and ran the job. It ran for a full hour before timing out! That coincided with our SonicWall support renewal, so today our network engineer jumped on a call with SonicWall support, who explained how the firewall applies its TCP connection timeout. Apparently, even though data is streaming across port 2500, the SonicWall is not seeing a keepalive packet, so it considers the connection dead and drops it when the timeout expires. We increased the TCP timeout value to 1440 minutes, and the job is still running at 3 1/2 hours.
I really appreciate everyone who has contributed to figuring this out. Veeam support was right in the end regarding the source of the problem, but I am thankful that everyone was willing to examine and consider other possibilities.
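For context on why the fix works: stateful firewalls drop idle TCP flows once their timeout expires, and raising the timeout is one workaround. The complementary approach is having the endpoints send keepalive probes more often than the firewall's timeout. A sketch of what that looks like at the socket level (the TCP_KEEP* constants are Linux-specific, and Veeam's agents manage their own sockets internally, so this is illustrative only):

```python
import socket

def enable_keepalive(sock, idle=300, interval=60, probes=5):
    """Ask the kernel to probe an idle connection every so often, so a
    stateful firewall keeps seeing traffic. idle/interval are seconds."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; guarded because they don't exist on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock

s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
assert s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

On Linux the system-wide defaults live in sysctls like net.ipv4.tcp_keepalive_time (7200 seconds by default, far longer than a 20-minute firewall timeout), which is consistent with the SonicWall never seeing a keepalive during the transfer.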
-
- Veteran
- Posts: 298
- Liked: 85 times
- Joined: Feb 16, 2017 8:05 pm
- Contact:
Re: Socket has been forcibly closed
Good news! Thanks for posting back, good information.