-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Replication of big VMs fails
case 01687344
We are having some issues replicating big VMs.
This is the third VM where we are seeing this. The first two finally succeeded after many retries.
This one is 4 TB, and this time the initial seed (we seed the replicas from backup) failed after 13 hours while replicating the second disk.
The error is: "An existing connection was forcibly closed by the remote host (DataTransfer.SyncDisk)". Has anyone seen this before? Smaller VMs replicate just fine.
We have tried hot-add and also switching to NBD, but we get the same error after some time...
Connectivity is all local, but I was thinking of trying with WAN accelerators enabled; not sure if it's more stable that way...
-
- Veeam Software
- Posts: 21139
- Liked: 2141 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Replication of big VMs fails
Guido, WAN accelerators are not designed to work on fast links. Doesn't your case look similar to this one?
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
Foggy,
thanks... somehow I missed that post when searching.
Windows Firewall is disabled on all hosts, and NTP looks fine for the VMs & ESXi hosts.
I just disabled all offloading on all physical and virtual NICs and will rerun the job. Thanks for now!
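For reference, on the Windows side this change can be scripted instead of clicked through per-adapter properties. A minimal sketch in Python, assuming the proxies/hosts run Windows Server 2012 or later (where the NetAdapter PowerShell cmdlets Get-NetAdapterAdvancedProperty, Disable-NetAdapterLso and Disable-NetAdapterChecksumOffload are available) and that you really want offloading off on every adapter:

```python
import subprocess

def ps(command: str) -> str:
    """Run a PowerShell command locally and return its stdout.
    Assumes Windows Server 2012+ where the NetAdapter cmdlets exist."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# List the offload-related advanced properties of every adapter first,
# so there is a record of what was enabled before the change.
print(ps('Get-NetAdapterAdvancedProperty -Name "*" | '
         'Where-Object { $_.DisplayName -match "Offload" } | '
         'Format-Table Name, DisplayName, DisplayValue -AutoSize'))

# Disable large send offload and checksum offload on all adapters.
ps('Disable-NetAdapterLso -Name "*"')
ps('Disable-NetAdapterChecksumOffload -Name "*"')
```

The same settings can later be switched back on one at a time with the matching Enable-NetAdapterLso / Enable-NetAdapterChecksumOffload cmdlets to narrow down which one actually causes the resets.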
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
Short update.
Replication is still running after 10 hours (1.8 TB / 30% done), which is a good sign.
However, I noticed we had 4 VMs with error messages in the daily backup, 3 of which are DCs:
- failed to perform post-backup application-aware processing steps.
- removing snapshot warning (I checked the VMs in vSphere; the Veeam snapshots were actually deleted in 3-4 seconds).
Also, I had 3 VMs that needed consolidation after the backup, even though the Veeam jobs showed successful. I could consolidate all of them by hand in a few seconds...
It's been a long time since we had any backup jobs failing or showing errors like this, so it seems connected to either the replication job or the disabling of offloading (backups are done in Direct SAN mode over FC).
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
2.9 TB transferred and still going strong...
Now to find out which of the offloading settings is the one doing harm (TCP, checksum, large send offload...).
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
Finished 4.5 TB in 29 hours without a glitch...
I will now try to replicate 5 big VMs in parallel and see if that's also stable, and whether I can saturate 2x 1 Gb links (see the quick throughput check below). But I'm 99.99% sure offloading was the culprit.
Thanks again @foggy, you should work in 1st line support.
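For a rough sense of the numbers: 4.5 TB in 29 hours works out to only about 43 MB/s, so a single job is nowhere near filling even one 1 Gb link, and several parallel jobs would indeed be needed to get close to 2x 1 Gb. A quick back-of-the-envelope check (assuming decimal terabytes; with binary TiB the result comes out roughly 10% higher):

```python
# Back-of-the-envelope throughput for the 4.5 TB seed above.
size_tb = 4.5          # data replicated, in TB (assumed 10^12 bytes per TB)
hours = 29.0           # elapsed time

bytes_total = size_tb * 1e12
seconds = hours * 3600
mbytes_per_s = bytes_total / seconds / 1e6
mbit_per_s = mbytes_per_s * 8

print(f"{mbytes_per_s:.0f} MB/s  ~= {mbit_per_s:.0f} Mbit/s")
# -> about 43 MB/s (~345 Mbit/s), well under a single 1 Gbit link.
```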
-
- Veteran
- Posts: 257
- Liked: 40 times
- Joined: May 21, 2013 9:08 pm
- Full Name: Alan Wells
- Contact:
Re: Replication of big VMs fails
Check your storage latency. Almost any time I had issues replicating large VMs, it was either network or storage related. Mine was mostly storage because we had very slow old NetApps. After upgrading to super-fast SANs we have seen no issues replicating.
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
Thanks... Pretty sure it was offloading... We are on all-flash, so latency is mostly right around 0, and we were replicating from backups (seeding) anyway. Everything is OK now; TCP offloading was the issue...
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
Hmm, it doesn't seem to be good after all. Over the weekend, 2 out of 5 replication jobs and their retries failed again. The biggest VM started with an error after 30 minutes (failed to open VDDK disk), got an "existing connection was closed" error on the next run (after 90 minutes), and could not connect to the DR vCenter (local network) on the next. The 4th run has now been running for 27 hours, and apparently it had to recalculate digests for all disks after the errors.
Even worse, 50% of our backup jobs also failed for multiple reasons, especially during snapshot creation/removal, and all of them seem linked to the connection to vCenter being gone for a few seconds. I also sometimes lose ping when editing a backup job while it queries vCenter, but only when these big replication jobs are running. We didn't have any issue with backup jobs for years before we started replicating these bigger VMs, but since then it really seems unstable. Dealing with support on this hasn't been very good: usually it takes more than a day to get an answer, and since these jobs run for 20-30 hours, it takes 3 to 4 days to get a simple answer. But so far there hasn't been a single answer that could even remotely be relevant here... Sorry for sounding a bit frustrated.
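To help confirm whether vCenter connectivity really drops while the big replication jobs run, a simple connectivity logger can be left running alongside the job. A minimal sketch, assuming a hypothetical vCenter hostname (vcenter.example.local) and that a successful TCP connect to port 443 is a good-enough reachability check:

```python
import socket
import time
from datetime import datetime

VCENTER = "vcenter.example.local"   # hypothetical hostname, replace with yours
PORT = 443                          # vCenter web services port
INTERVAL = 5                        # seconds between checks

while True:
    start = time.monotonic()
    try:
        # A successful TCP connect to 443 is used as a cheap "vCenter is
        # reachable" signal; nothing is sent over the connection.
        with socket.create_connection((VCENTER, PORT), timeout=3):
            pass
    except OSError as exc:
        # Log only the failures with a timestamp, so drops are easy to
        # correlate with the replication job errors afterwards.
        print(f"{datetime.now().isoformat()} vCenter check failed: {exc}")
    # Keep the check interval roughly constant regardless of connect time.
    time.sleep(max(0.0, INTERVAL - (time.monotonic() - start)))
```

Timestamps of any failures can then be lined up against the job session logs to see whether the connection drops coincide with the snapshot creation/removal errors.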