-
- Expert
- Posts: 158
- Liked: 8 times
- Joined: Jul 23, 2011 12:35 am
During replication host disconnects
Our backup jobs use the storage network on both the source and destination sides and run great.
Our replication jobs use the storage network on the source side but send data through the management interface on our ESXi host on the destination side. In other words, Direct SAN on the source side and network mode on the destination side.
During replication we quite often get this error: Failed to open VDDK disk
After further investigation we found that during replication a host would become disconnected from vCenter for a short period of time. I opened a ticket with VMware and they pointed me towards this article:
https://kb.vmware.com/s/article/1005757
It basically says that if the management interface gets congested, the heartbeats from the host to vCenter cannot get through, the host becomes disconnected, and the Veeam replication job fails.
My storage network AND my management network interfaces are all 10 Gb. I monitored the management network interfaces during replication and their utilization never went beyond 5%. Any ideas on what the root cause of this problem is?
-
- Product Manager
- Posts: 5797
- Liked: 1215 times
- Joined: Jul 15, 2013 11:09 am
- Full Name: Niels Engelen
- Contact:
Re: During replication host disconnects
Finding the root cause of this problem requires insight into the logs, most likely the VMware logs and maybe the Veeam ones. Did you already contact Veeam or VMware support about the matter? It may be a known bug or an issue related to the VMware build.
Personal blog: https://foonet.be
GitHub: https://github.com/nielsengelen
-
- Expert
- Posts: 158
- Liked: 8 times
- Joined: Jul 23, 2011 12:35 am
Re: During replication host disconnects
I did open a ticket with Veeam back when I saw replication failing for certain VMs with this error shown:
Error: Failed to open VDDK disk
They had me try several different things that did not fix the problem. This was before I realized it was a host disconnect that was causing the failures. I recently opened a ticket with VMware and they sent me this article:
https://kb.vmware.com/s/article/1005757
They want me to increase the timeout limit in vCenter per that article. There are two problems with that approach:
1. Entering the value in Advanced Settings for vCenter cannot be reversed.
2. The VMware article itself states: "Note: Increasing the timeout is a short-term solution until the network issues can be resolved."
I would rather find the root cause than just mask the issue with a longer timeout. Between the Veeam proxy server and the host is a Juniper switch. It could be a switchport misconfiguration like flow control, an issue on the host, an issue with the NICs on the host, an issue with the NIC on the proxy server, and the list goes on. A tough one to track down.
I was hoping VMware, by looking at the log files, could point me in the right direction. When I look at the Juniper switch ports during replication there are dropped packets; during backup there are no dropped packets. Backup uses Direct SAN for both source and destination. Replication uses Direct SAN for the source and network mode through the host management port for the destination.
-
- Product Manager
- Posts: 5797
- Liked: 1215 times
- Joined: Jul 15, 2013 11:09 am
- Full Name: Niels Engelen
- Contact:
Re: During replication host disconnects
This can indeed be a tough one to resolve. From my experience in the past, we usually tried to go piece by piece: leverage another NIC (if possible) and double-check the switch configuration (jumbo frames were fun back in the day). The main problem is that if VMware sees nothing wrong and we can't find anything in our logs, there has to be an external impact from the network, as explained. Have you considered basic things like running a long ping into a log file to see how stable the connectivity is?
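For instance, a small probe along these lines (an illustration only; the host name is a placeholder, and TCP 902 is the ESXi port the network-mode/NFC traffic uses, so probing it exercises the same path as the replication job) logs every slow or failed connection attempt with a timestamp:

```python
# Hypothetical long-running connectivity probe against the ESXi management
# interface. Replace HOST with the real management address before use.
import socket
import time
from datetime import datetime

HOST = "esxi-mgmt.example.local"  # placeholder, not a real host
PORT = 902                        # ESXi authd/NFC port; use 443 to probe the API instead

with open("mgmt-probe.log", "a") as log:
    while True:
        start = time.monotonic()
        try:
            # Time a full TCP connect; the socket is closed immediately after.
            with socket.create_connection((HOST, PORT), timeout=5):
                status = "ok %.1f ms" % ((time.monotonic() - start) * 1000)
        except OSError as exc:
            status = "FAIL %s" % exc
        log.write("%s %s\n" % (datetime.now().isoformat(), status))
        log.flush()
        time.sleep(1)
```

A few hours of that log lined up against the job sessions should show whether the drops only happen while replication traffic is flowing.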
Is there anything else running over this connection which may cause the issue?
Personal blog: https://foonet.be
GitHub: https://github.com/nielsengelen
-
- Veeam Software
- Posts: 21139
- Liked: 2141 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: During replication host disconnects
Not sure how it will affect the performance, but you could switch to hotadd on target.
-
- Expert
- Posts: 158
- Liked: 8 times
- Joined: Jul 23, 2011 12:35 am
Re: During replication host disconnects
I have had an open ticket with Veeam, Case # 03414881, for about a month on this issue. I also opened a case with VMware about the same issue, and so far we cannot solve it. I first looked at everything I could think of, including DAC cables, using different network ports, flow control, speed and duplex on switch ports, and vSwitch and VMkernel settings (teaming, etc.). I also increased the timeout between hosts and vCenter to 120 seconds per VMware support, and that changed nothing. I just ran an interesting test using iPerf3.
I started iPerf as a server (listening) on the ESXi host. I then ran iPerf on my physical proxy server. By sending data from the proxy server to the host via its management interface using iPerf, I can mimic the traffic flow during a Veeam replication job. iPerf was pushing 7.5 Gbps on a 10 Gbps interface, or about 75% usage. I ran it for 30 minutes and there were no host disconnects. During a Veeam replication job, the management interface carries about 600 Mbps, or 6% usage, and I get disconnects. So the fact that an ESXi host loses connection to vCenter during a replication job (and the job then fails) has nothing to do with the sheer volume of traffic being sent to the management interface. What I cannot explain is how a Veeam replication job differs from the testing I did with iPerf.
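One difference that stands out: iPerf pushes its 7.5 Gbps over a handful of long-lived TCP streams, while network-mode replication opens a separate NFC session per disk, so the host is stressed on session count rather than raw throughput. A rough way to test that hypothesis (my own sketch, untested here; the host name is a placeholder) is to open many connections at once instead of one fat stream:

```python
# Open many concurrent sessions to the management interface, mimicking the
# per-disk NFC connections of a network-mode replication job.
import socket
from concurrent.futures import ThreadPoolExecutor

HOST = "esxi-mgmt.example.local"   # placeholder; replace with your host
PORT = 902                         # ESXi authd/NFC port used by network (NBD) mode
SESSIONS = 32                      # mirror the proxy's Max Concurrent Tasks setting

def open_session(i: int) -> str:
    """Open one connection and read the authd greeting to prove it worked."""
    try:
        with socket.create_connection((HOST, PORT), timeout=10) as conn:
            banner = conn.recv(128)  # authd normally greets with a "220 ..." line
        return "session %d: ok (%r)" % (i, banner[:20])
    except OSError as exc:
        return "session %d: FAIL %s" % (i, exc)

with ThreadPoolExecutor(max_workers=SESSIONS) as pool:
    for result in pool.map(open_session, range(SESSIONS)):
        print(result)
```

If single-stream iPerf passes but a burst of 32 parallel sessions triggers the disconnect, that points at a per-session limit on the host rather than at the network itself.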
-
- Expert
- Posts: 158
- Liked: 8 times
- Joined: Jul 23, 2011 12:35 am
Re: During replication host disconnects
We are getting closer to the root cause of this. VMware took a long look at the log files and determined that during replication jobs there are between 70 and 80 requests hitting the VCSA appliance, causing the vpxa service to crash. Here is one line from the log file that shows the issue:
2019-03-20T10:09:33.065Z warning vpxa[3069707] [Originator@6876 sub=Libs] [NFC ERROR] Sending Nfc error 12: NfcFssrvrOpen: Failed to open '[VeeamReplica] vCenter2_replica/vCenter2_11-000002.vmdk': Too many file pairs specified
My Veeam physical proxy server has 2 CPUs with 16 cores each, so I have Max Concurrent Tasks set to 32 for this proxy. I am replicating 14 VMs every hour on the hour, and somehow this is opening too many files/connections to the VCSA, causing this issue.
Has anyone else encountered this? I do have an open ticket with Veeam about this, ticket number 03414881. I asked VMware if increasing vCPU or RAM on the VCSA would help and they said no; this needs to be solved from the Veeam side.
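For what it's worth, a back-of-the-envelope count (the per-VM disk figures below are assumptions for illustration, not data from this environment) shows how 32 task slots can plausibly reach the 70-80 simultaneous requests VMware saw:

```python
# Back-of-the-envelope model only: real NFC accounting (sessions, file pairs,
# vpxa memory limits) is internal to ESXi/VDDK and version-dependent, and the
# disk counts below are assumptions, not values from this environment.
concurrent_tasks = 32   # proxy "Max Concurrent Tasks"
vms = 14                # VMs replicated every hour on the hour
disks_per_vm = 2        # assumed average number of disks per VM
files_per_disk = 2      # base VMDK plus active snapshot/redo log (assumption)

# With enough queued work the proxy keeps its task slots busy, so the target
# host can see roughly this many simultaneous NFC file opens:
peak_opens = min(concurrent_tasks, vms * disks_per_vm) * files_per_disk
print(peak_opens)  # 56 under these assumptions -- the same order of magnitude
                   # as the 70-80 requests VMware reported from the vpxa logs
```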
-
- Service Provider
- Posts: 327
- Liked: 23 times
- Joined: Oct 09, 2012 2:30 pm
- Full Name: Maso
- Contact:
Re: During replication host disconnects
Hi HendersonD
We have the exact same problem. I also had a long support case with VMware about this issue when our replication destination environment was at version 6.5. The root cause of the failed replications was that the vpxa service crashed on the ESXi hosts because of a lack of available RAM for the service: there was a lot of free RAM on the ESXi hosts, but the hard limit for the vpxa service was too low. Our workaround then was to split the load across more ESXi hosts; that worked better. VMware told me this should be fixed in a later version, but could not say which one. Now, after upgrading to vSphere 6.7 U1b, it is really bad again. Not sure what to do now. Maybe I will try using hotadd instead of NBD, but in theory I can't see how that would help. It may even make things worse, because I think there could be even more requests to the vpxa service than with NBD.
Veeam case: 03804284
\Masonit
-
- Expert
- Posts: 158
- Liked: 8 times
- Joined: Jul 23, 2011 12:35 am
Re: During replication host disconnects
I am now running ESXi 6.7 Update 3 and still seeing this issue. The only workaround I have found is to limit "Max concurrent tasks" on my Veeam proxy server to 10. The Veeam proxy has 32 cores, so I am certainly not using the full capability of this server, and having it set to 10 slows down my replication and backup jobs.
-
- Expert
- Posts: 158
- Liked: 8 times
- Joined: Jul 23, 2011 12:35 am
Re: During replication host disconnects
masonit,
I am on the newest versions of VCSA and ESXi. Did VMware indicate when this might be fixed? I plan on opening another ticket with VMware this week about this issue. My only workaround at the moment is to set "Max concurrent tasks" on the Veeam proxy server to 10. My proxy server has 32 cores, so by lowering the limit to 10 I am slowing down all of my replication and backup jobs.
Dave
-
- Service Provider
- Posts: 327
- Liked: 23 times
- Joined: Oct 09, 2012 2:30 pm
- Full Name: Maso
- Contact:
Re: During replication host disconnects
Hi
VMware told me it should be "fixed" in newer versions, but could not say which version. I have tried using hotadd on the target proxy, but it did not solve the problem. The only way to minimize the disconnects is, as you said, to limit concurrent tasks, but then everything runs much slower and it is not possible to run jobs as often as we want.
\Magnus
-
- Certified Trainer
- Posts: 1025
- Liked: 448 times
- Joined: Jul 23, 2012 8:16 am
- Full Name: Preben Berg
- Contact:
Re: During replication host disconnects
We’re facing the same issue on 6.7U3 hosts. Case #03899451 pointed us to this post.
@Magnus: How many tasks did you find to be optimal from a stability point of view?
Currently we have configured 10 tasks across two proxies and it’s still failing almost every session. We also followed the VMware KB and increased the timeout to 120 seconds.
For these particular VMs we have an RPO target of 30 minutes, but currently a session takes about 1.5 hours (when it is successful), and even longer when sessions fail due to retries.
-
- Service Provider
- Posts: 327
- Liked: 23 times
- Joined: Oct 09, 2012 2:30 pm
- Full Name: Maso
- Contact:
Re: During replication host disconnects
Hi Preben
It was with vSphere 6.5 that we managed to control it pretty well. I think we had similar limits to what you have now, but after the upgrade to 6.7 it got even worse: we had to set the limit so low that it took forever to get anything done. In 6.5 we replicated to 2 hosts; now with 6.7 we use 5 hosts as replication targets. It's working, but there are some retries and we struggle to reach the RPO target.
Has anyone got any info from VMware on when they plan to fix it?
\Masonit