We have been running about a dozen replication jobs over a high-speed WAN connection for the past three months. Until last week these jobs ran without a hitch.
Last week we started seeing numerous retries on the jobs, with several resulting in complete failures. A review of the logs indicates the jobs fail due to a connection error with the vCenter Server. Sample error:
12/30/2016 10:23:33 AM :: Processing DBnode2 Error: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.0.100.24:443
What is odd is that the jobs actually do connect to the vCenter Server, often stay connected for an extended period of time, and even process data (I have watched the activity on both the Veeam backup server and the target vSphere environment). Some jobs get to almost 100% complete during disk processing before failing. Even more perplexing, every job contains multiple virtual machines, and some of the machines within a job will fail while others succeed. The issue is getting progressively worse, with more and more jobs failing completely.
I have ticket # 02023509 open with Veeam. They are pointing to the vCenter Server as the culprit, so I have checked the vCenter logs and general health, and even rebooted it a few times, with no luck. They are also suggesting network instability; however, we are not having any issues with the 20 or so backup copy jobs we also process across this same WAN to the same target vSphere environment using the same vCenter Server.
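To help rule network instability in or out, something like the following could be left running on the Veeam backup server while a replication job is active. This is only a rough sketch: the function names `probe` and `monitor` are made up for illustration, and a plain TCP connect to port 443 only shows whether the port is reachable, not whether the vSphere API session behind it is healthy.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=5.0):
    """Try one TCP connection; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


def monitor(host, port, count, interval=10.0, timeout=5.0):
    """Probe host:port `count` times, log each result, return failure count."""
    failures = 0
    for attempt in range(count):
        ok, elapsed, err = probe(host, port, timeout)
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        if ok:
            print(f"{stamp} OK   {host}:{port} connect took {elapsed:.3f}s")
        else:
            failures += 1
            print(f"{stamp} FAIL {host}:{port} after {elapsed:.3f}s: {err}")
        # Sleep between attempts, but not after the last one.
        if attempt < count - 1:
            time.sleep(interval)
    return failures
```

For example, calling monitor("10.0.100.24", 443, count=360) would probe the vCenter address from the error above every 10 seconds for roughly an hour. If the FAIL lines line up with the times the replication jobs drop, that would point toward the network path; if the probe stays clean while jobs still fail, the problem is more likely above the TCP layer.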
There have been no changes made to the target vSphere environment or vCenter Server in over a year. We are running vSphere 6.0 in the target environment and 5.5 U3d in the production environment. The only change made in the past three months has been an upgrade from Veeam Availability Suite v9.0 Enterprise Plus to v9.5.
Does anyone have any ideas, or is anyone else experiencing this same problem?