2 sites in an active/active arrangement
NetApp HCI - 3 compute, 5 storage nodes, all flash, Mellanox switches
Backup & Replication V 9.5.4.2866 - Proxy / Target is a Windows 2016 VM at both sites
Primary backup storage is a NetApp E-Series iSCSI target formatted with ReFS; the E-Series arrays sync site to site.
VMs are running in vSphere 6.7u3
MPLS (20 Mbps) and a dedicated replication circuit (200 Mbps) connect these sites. Hosts file overrides on each Veeam B&R machine force it to use the replication subnet range to communicate with the other site's Veeam / ESXi hosts / vSphere. The NetApp E-Series arrays use the same replication circuit.
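A minimal sketch of how the hosts file overrides could be sanity-checked from each Veeam B&R machine: it resolves a list of remote names through the system resolver and reports whether each address falls inside the replication subnet. The subnet `10.20.0.0/24` and the host list are placeholders (assumptions, not the real values from this environment), and `in_subnet` / `check_hosts` are hypothetical helper names.

```python
import ipaddress
import socket

# Hypothetical values: replace with the real replication subnet and remote host names.
REPLICATION_SUBNET = "10.20.0.0/24"
REMOTE_HOSTS = ["ws-vcenter01.name.domain"]


def in_subnet(ip, cidr):
    """Return True if the IP address falls inside the CIDR range."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


def check_hosts(hostnames, cidr, resolve=socket.gethostbyname):
    """Map each name to (resolved_ip, is_in_subnet); the IP is None if lookup fails."""
    results = {}
    for name in hostnames:
        try:
            ip = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = (None, False)
            continue
        results[name] = (ip, in_subnet(ip, cidr))
    return results


if __name__ == "__main__":
    for name, (ip, ok) in check_hosts(REMOTE_HOSTS, REPLICATION_SUBNET).items():
        status = "OK" if ok else "NOT on replication subnet"
        print(f"{name}: {ip} {status}")
```

Running this on both sites would show whether any name still resolves to an address reachable only over the MPLS link rather than the replication circuit.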
We currently have very consistent backup jobs from each of our vSphere / Veeam / E-Series environments - we almost never see failures. Replication, on the other hand, is a complete mess. Of the 40+ machines that we replicate from our primary to our backup data center, we almost always see one or two, usually grouped together, throw an error like the one below. The other 38 or so replicate successfully, have their old snapshots / restore points removed to match the replication job policy, and generally behave as expected.
12/10/2019 10:17:07 AM :: Processing VMname Error: Failed to open VDDK disk [[NetApp-HCI-Datastore-02] VMname_replica/VMname.vmdk] ( is read-only mode - [false] )
Logon attempt with parameters [VC/ESX: [ws-vcenter01.name.domain];Port: 443;Login: [vsphere.local\administrator];VMX Spec: [moref=vm-18196];Snapshot mor: [snapshot-18197];Transports: [nbd];Read Only: [false]] failed because of the following errors:
Failed to download disk.
Shared memory connection was closed.
Failed to upload disk.
Agent failed to pro
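When comparing failures across many VMs, it may help to pull the logon parameters out of each error line so they can be lined up side by side. For example, the excerpt above shows `Transports: [nbd]`, meaning network transport mode was used. The following is a rough sketch only; `parse_logon_params` is a hypothetical helper, and it assumes the line format matches the excerpt above.

```python
import re


def parse_logon_params(line):
    """Extract 'Key: [value]' pairs from a 'Logon attempt with parameters [...]' line."""
    match = re.search(r"parameters \[(.*)\] failed", line)
    if not match:
        return {}
    params = {}
    for part in match.group(1).split(";"):
        key, sep, value = part.partition(": ")
        if sep:
            # Values are usually wrapped in brackets; Port is not.
            params[key.strip()] = value.strip().strip("[]")
    return params


# Sample line copied from the error excerpt above.
SAMPLE = (
    r"Logon attempt with parameters [VC/ESX: [ws-vcenter01.name.domain];Port: 443;"
    r"Login: [vsphere.local\administrator];VMX Spec: [moref=vm-18196];"
    r"Snapshot mor: [snapshot-18197];Transports: [nbd];Read Only: [false]] "
    "failed because of the following errors:"
)

if __name__ == "__main__":
    for key, value in parse_logon_params(SAMPLE).items():
        print(f"{key} = {value}")
```

Collecting these dictionaries per failed VM makes it easier to spot whether failures share a transport mode, a vCenter, or a particular snapshot pattern.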
In an attempt to further troubleshoot permission, individual VM, and proxy resource overload issues, we removed our main replication job and created a (tedious) set of individual VM replication jobs for a subset of machines. Results have been scattered: mostly successful replications, but several failures within a limited number of attempts. These individual jobs run one at a time, chained together. Some VMs have a single disk, others have multiple disks, and failures occur with both. Guest operating systems have been Windows 2008 R2, 2012 R2, 2016, and Linux (I forget the specifics at the moment).
Thoughts? Suggestions? Need more details?
Thanks,
David