Host-based backup of VMware vSphere VMs.
NeedsMoreRGB
Novice
Posts: 3
Liked: never
Joined: Jul 25, 2019 1:47 pm
Full Name: David
Contact:

Random Replication Failures

Post by NeedsMoreRGB »

Support Case#: 03838548 - multiple log submissions
2 sites in an active/active arrangement
NetApp HCI - 3 compute nodes, 5 storage nodes, all flash, Mellanox switches
Backup & Replication 9.5.4.2866 - proxy / target is a Windows Server 2016 VM at each site
Primary backup storage is a NetApp E-Series iSCSI target with ReFS as the file system; the E-Series arrays sync site to site
VMs are running in vSphere 6.7u3

An MPLS link (20 Mbps) and a dedicated replication circuit (200 Mbps) connect these sites. Hosts file overrides on each Veeam B&R machine force it to use the replication subnet range to reach the other site's Veeam server / ESXi hosts / vCenter. The NetApp E-Series arrays use the same replication circuit.
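
For context, the overrides are nothing fancy - just hosts file entries on each Veeam B&R / proxy VM along the lines of the sketch below (the names and 10.99.99.x addresses are placeholders for illustration, not our actual ones):

Code:

# C:\Windows\System32\drivers\etc\hosts on the primary-site Veeam B&R server
# Placeholder names/addresses - the real entries point the other site's
# Veeam server, ESXi hosts and vCenter at their replication-subnet IPs.
10.99.99.10    dr-vcenter01.name.domain    # remote vCenter
10.99.99.21    dr-esxi01.name.domain       # remote ESXi host
10.99.99.22    dr-esxi02.name.domain       # remote ESXi host
10.99.99.30    dr-veeam01.name.domain      # remote Veeam B&R / proxy VM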


We currently have very consistent backup jobs in each of our vSphere / Veeam / E-Series environments - we almost never see failures. Replication, on the other hand, is a complete mess. Out of the 40+ machines that we replicate from our primary to our backup data center, we will almost always see one or two machines, usually grouped together, throw an error like the one below. That is to say, the other 38 or so will successfully replicate, have their old snapshots / restore points removed per the replication job policy, and generally behave as expected.

Code:

12/10/2019 10:17:07 AM :: Processing VMname Error: Failed to open VDDK disk [[NetApp-HCI-Datastore-02] VMname_replica/VMname.vmdk] ( is read-only mode - [false] )
Logon attempt with parameters [VC/ESX: [ws-vcenter01.name.domain];Port: 443;Login: [vsphere.local\administrator];VMX Spec: [moref=vm-18196];Snapshot mor: [snapshot-18197];Transports: [nbd];Read Only: [false]] failed because of the following errors:
Failed to download disk.
Shared memory connection was closed.
Failed to upload disk.
Agent failed to pro  
This error seems to hit various machines with no rhyme or reason that I can discern. When the Veeam failure occurs, there appears to be some kind of network overload or reset happening: the NetApp E-Series boxes fire off email alerts stating that they have lost contact with their partner E-Series. I could start to suspect a network issue between our sites, but with one exception, the ONLY time I get these email alerts is during a Veeam replication job. The exception was during a known, scheduled outage.
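
For what it's worth, to line those E-Series alerts up against the replication window, a dumb connectivity probe run from the proxy would probably do - something like the rough Python sketch below (the 10.99.99.x targets are placeholders for the remote E-Series / ESXi addresses on the replication subnet; it just timestamps any ping that goes unanswered so drops can be matched against the job session log):

Code:

# Rough connectivity probe: ping the remote replication-subnet endpoints once a
# second and log a timestamp whenever one stops answering, so link drops can be
# lined up against the Veeam replication job session times.
# The addresses below are placeholders, not the real environment's IPs.
import subprocess
import time
from datetime import datetime

TARGETS = [
    "10.99.99.10",   # placeholder: remote E-Series management IP
    "10.99.99.21",   # placeholder: remote ESXi host on the replication subnet
]

def ping_once(host: str) -> bool:
    """Return True if a single ping to host succeeds (Windows ping syntax)."""
    result = subprocess.run(
        ["ping", "-n", "1", "-w", "1000", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    with open("replication_link_probe.log", "a") as log:
        while True:
            for host in TARGETS:
                if not ping_once(host):
                    log.write(f"{datetime.now().isoformat()} lost ping to {host}\n")
                    log.flush()
            time.sleep(1)

Run on the Windows proxy during a replication job, that should at least show whether the replication circuit itself is dropping at the moment the VDDK errors fire.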

In an attempt to further troubleshoot permission, individual-VM, and proxy resource overload issues, our main replication job was removed and a set of individual (tedious) per-VM replication jobs was created. These have been a scatter plot of mostly successful replications, with several failures in a limited number of attempts. The individual replication jobs run one at a time in a chained manner. Some VMs had a single disk, others had multiple disks; failures happen to both varieties. Guest operating systems have been Windows 2008 R2, 2012 R2, 2016, and Linux (I forget the specifics at the moment).

Thoughts? Suggestions? Need more details?
Thanks,
David
soncscy
Veteran
Posts: 643
Liked: 312 times
Joined: Aug 04, 2019 2:57 pm
Full Name: Harvey
Contact:

Re: Random Replication Failures

Post by soncscy »

Hey David,

If I can ask, are you using NBD or hotadd? VMware has some pretty severe limits on NBD in the form of the NFC protocol, and these are compounded in the most recent releases. If you're going over NBD, I'd switch to hotadd and remove the NFC aspect altogether -- almost every issue we've had in our shop and at our clients has been around concurrent tasks + NBD.

Since it's the replica disk that the VDDK library fails to open, I'm gonna assume maybe your host just freaks out, or the VDDK proxy just cannot handle the connection (memory maybe?), but we always had a much better time with hotadd.

Remember, with replicas you cannot use any fancy SAN storage stuff except for the first run, because VMware doesn't let you write to snapshots with DirectSAN, so if you're expecting DirectSAN performance on replicas, forget it. It's impossible.
bdufour
Expert
Posts: 206
Liked: 41 times
Joined: Nov 01, 2017 8:52 pm
Full Name: blake dufour
Contact:

Re: Random Replication Failures

Post by bdufour »

Also check the repository where you are writing your replica metadata, since the issue only seems to happen with replicas.

Also, what do the bottleneck stats for these jobs look like when they fail?