Delo123
Veteran
Posts: 361
Liked: 109 times
Joined: Dec 28, 2012 5:20 pm
Full Name: Guido Meijers
Contact:

Replication of big vm's fail

Post by Delo123 »

case 01687344

We are having some issues replicating big VMs.
This is the third VM where we are seeing this. The first two finally succeeded after many attempts.
This one is 4TB, and this time the initial seed (we seed the replicas from backup) failed after 13 hours while replicating the 2nd disk.
The error is: an existing connection was forcibly closed by the remote host (DataTransfer.SyncDisk). Has anyone seen this before? Smaller VMs replicate just fine.
We have tried hot-add and also switched to NBD, but that gives us the same error after some time...
Connectivity is all local, but I was thinking of trying with WAN accelerators enabled; not sure if it's more stable like that...
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Replication of big vm's fail

Post by foggy »

Guido, WAN accelerators are not designed to work on fast links. Doesn't your case look similar to this one?
Delo123

Re: Replication of big vm's fail

Post by Delo123 »

Foggy,

thanks... somehow I missed that post when doing a search :(
Windows Firewall is disabled on all hosts, and NTP seems to be ok for VMs & ESX hosts.
I just disabled all offloading on all physical and virtual NICs and will rerun the job, thx for now!
Delo123

Re: Replication of big vm's fail

Post by Delo123 »

Short update.
Replication is still running after 10 hours (1.8TB / 30% done), which is a good sign.
However, I noticed we had 4 VMs with error messages in the daily backup, 3 of which are DCs:
- failed to perform post backup application-aware processing steps.
- removing snapshot warning (I checked the VMs in vSphere; the Veeam snapshots were actually deleted in 3-4 seconds)

I also had 3 VMs which needed consolidation after the backup, even though the Veeam jobs showed as successful. I could consolidate all of those VMs by hand in a few seconds...

It's been a long time since we had any backup jobs failing or showing errors like this, so it seems connected to either the replication job or the disabling of offloading (backups are done in direct-SAN mode over FC).
Delo123

Re: Replication of big vm's fail

Post by Delo123 »

2.9TB transferred and still going strong... :)
Now to find out which of the offloading settings is the one doing harm (TCP, checksum, large offload...)
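
One way to narrow this down is to re-enable a single offload feature per test run while keeping the others disabled. The sketch below only generates that test plan; it does not change any NIC settings. The feature labels are illustrative names taken from the list above (assumptions, not actual driver or OS setting names), and `isolation_plan` is a hypothetical helper.

```python
from typing import Dict, List

# Illustrative labels only (assumption): these mirror the features named in
# the post and are not real driver or OS setting names.
FEATURES = ["tcp_offload", "checksum_offload", "large_send_offload"]


def isolation_plan(features: List[str]) -> List[Dict[str, bool]]:
    """Return one test run per feature, with only that feature re-enabled."""
    plan = []
    for candidate in features:
        # True = re-enabled for this run, False = stays disabled.
        plan.append({name: (name == candidate) for name in features})
    return plan


if __name__ == "__main__":
    for number, run in enumerate(isolation_plan(FEATURES), start=1):
        enabled = [name for name, on in run.items() if on]
        print(f"run {number}: re-enable {enabled[0]}, keep the rest disabled")
```

If the replication fails in exactly one of those runs, that feature is the likely suspect. If it only fails with several features enabled together, a one-at-a-time plan like this will not catch it.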
Delo123

Re: Replication of big vm's fail

Post by Delo123 » 2 people like this post

Finished :) 4,5TB in 29 hours without a glitch....
I will now try to replicate 5 big VMs in parallel to see if that's also stable, and whether I can saturate the 2x1GB links. But I'm 99,99% sure offloading was the culprit :)
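
For context on the saturation question, here is a rough back-of-the-envelope calculation of the average rate of that run. It assumes decimal terabytes (1 TB = 10**12 bytes) and that "2x1GB links" means two 1 Gbit/s links; both are assumptions, and the function names are made up for this sketch.

```python
def average_throughput_mbit(size_tb: float, hours: float) -> float:
    """Average transfer rate in Mbit/s, assuming 1 TB = 10**12 bytes."""
    bits = size_tb * 10**12 * 8
    seconds = hours * 3600
    return bits / seconds / 10**6


def link_utilisation(rate_mbit: float, links: int = 2, link_mbit: float = 1000.0) -> float:
    """Fraction of the combined link capacity used (assumes 2 x 1 Gbit/s)."""
    return rate_mbit / (links * link_mbit)


if __name__ == "__main__":
    rate = average_throughput_mbit(4.5, 29)
    print(f"average rate: {rate:.0f} Mbit/s ({rate / 8:.0f} MB/s)")
    print(f"share of 2 x 1 Gbit/s: {link_utilisation(rate):.0%}")
```

Under those assumptions the single job averaged roughly 345 Mbit/s, or about 17% of the combined capacity, so there is plenty of headroom for parallel jobs before the links become the bottleneck.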

Thx again @foggy you should work in 1st line support ;)
nunciate
Expert
Posts: 247
Liked: 39 times
Joined: May 21, 2013 9:08 pm
Full Name: Alan Wells
Contact:

Re: Replication of big vm's fail

Post by nunciate »

Check your storage latency. Almost every time I had issues replicating large VMs, it was either network or storage related. Mine was mostly storage, because we had very slow old NetApps. After upgrading to super fast SANs we have seen no issues replicating.
Delo123

Re: Replication of big vm's fail

Post by Delo123 »

Thanks... Pretty sure it was offloading... We are on all-flash, so latency is mostly right around 0 :) and we are replicating from backups (seeding). Everything is ok now; TCP offloading was the issue...
Delo123

Re: Replication of big vm's fail

Post by Delo123 » 1 person likes this post

Hmm, doesn't seem to be good after all. During the weekend 2 out of 5 replication jobs and their retries failed again. The biggest VM failed after 30 minutes on the first run (failed to open VDDK disk), got an "existing connection was closed" error on the next run (after 90 minutes), and was not able to connect to the DR vCenter (local network) on the one after that. The 4th run has now been running for 27 hours, and apparently it had to calculate digests for all disks after the errors :(
Even worse, 50% of our backup jobs also failed for multiple reasons, especially during snapshot creation/removal, and all seem linked to vCenter being gone for a few seconds. I also sometimes lose ping while editing a backup job, when the job queries vCenter. However, this only happens while these big replication jobs are running. We didn't have any issues with backup jobs for years before we started replicating these bigger VMs, but since then this really seems unstable. Dealing with support on this hasn't been very good. Usually it takes more than a day to get an answer, and since these jobs run for 20-30 hours it takes 3 to 4 days to get a simple answer. So far there hasn't been a single answer which could even remotely be related to the issue... Sorry for sounding a bit frustrated :)