-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Replication of big VMs fails
case 01687344
We are having some issues replicating big VMs.
This is the third VM where we are seeing this. The first two finally succeeded after many retries.
This one is 4 TB, and this time the initial seed (we seed the replicas from backup) failed after 13 hours while replicating the second disk.
The error is: "An existing connection was forcibly closed by the remote host (DataTransfer.SyncDisk)". Has anyone seen this before? Smaller VMs replicate just fine.
We have tried hot-add and also switching to NBD, but we get the same error after some time...
Connectivity is all local, but I was thinking of trying with WAN accelerators enabled; not sure if it's more stable that way...
-
- Veeam Software
- Posts: 21139
- Liked: 2141 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Replication of big VMs fails
Guido, WAN accelerators are not designed to work on fast links. Doesn't your case look similar to this one?
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
Foggy,
thanks... somehow I missed that post when searching.
Windows Firewall is disabled on all hosts, and NTP looks fine for the VMs & ESXi hosts.
I just disabled all offloading on all physical and virtual NICs and will rerun the job. Thanks for now!
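For reference, on the Windows side this change can be scripted instead of clicked through per-adapter properties. A minimal sketch in Python, assuming the proxies/hosts run Windows Server 2012 or later (where the NetAdapter PowerShell cmdlets Get-NetAdapterAdvancedProperty, Disable-NetAdapterLso and Disable-NetAdapterChecksumOffload are available) and that you really want offloading off on every adapter:

```python
import subprocess

def ps(command: str) -> str:
    """Run a PowerShell command locally and return its stdout.
    Assumes Windows Server 2012+ where the NetAdapter cmdlets exist."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# List the offload-related advanced properties of every adapter first,
# so there is a record of what was enabled before the change.
print(ps('Get-NetAdapterAdvancedProperty -Name "*" | '
         'Where-Object { $_.DisplayName -match "Offload" } | '
         'Format-Table Name, DisplayName, DisplayValue -AutoSize'))

# Disable large send offload and checksum offload on all adapters.
ps('Disable-NetAdapterLso -Name "*"')
ps('Disable-NetAdapterChecksumOffload -Name "*"')
```

The same settings can later be switched back on one at a time with the matching Enable-NetAdapterLso / Enable-NetAdapterChecksumOffload cmdlets to narrow down which one actually causes the resets.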
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
Short update.
Replication is still running after 10 hours (1.8 TB / 30% done), which is a good sign.
However, I noticed we had 4 VMs with error messages in the daily backup, 3 of which are DCs:
- failed to perform post-backup application-aware processing steps.
- removing snapshot warning (I checked the VMs in vSphere; the Veeam snapshots were actually deleted in 3-4 seconds).
Also, I had 3 VMs that needed consolidation after the backup, even though the Veeam jobs showed successful. I could consolidate all of them by hand in a few seconds...
It's been a long time since we had any backup jobs failing or showing errors like this, so it seems connected to either the replication job or the disabling of offloading (backups are done in Direct SAN mode over FC).
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
2.9 TB transferred and still going strong...
Now to find out which of the offloading settings is the one doing harm (TCP, checksum, large send offload...).
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
Finished 4.5 TB in 29 hours without a glitch...
I will now try to replicate 5 big VMs in parallel and see if that's also stable, and whether I can saturate 2x 1 Gb links (see the quick throughput check below). But I'm 99.99% sure offloading was the culprit.
Thanks again @foggy, you should work in 1st line support.
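For a rough sense of the numbers: 4.5 TB in 29 hours works out to only about 43 MB/s, so a single job is nowhere near filling even one 1 Gb link, and several parallel jobs would indeed be needed to get close to 2x 1 Gb. A quick back-of-the-envelope check (assuming decimal terabytes; with binary TiB the result comes out roughly 10% higher):

```python
# Back-of-the-envelope throughput for the 4.5 TB seed above.
size_tb = 4.5          # data replicated, in TB (assumed 10^12 bytes per TB)
hours = 29.0           # elapsed time

bytes_total = size_tb * 1e12
seconds = hours * 3600
mbytes_per_s = bytes_total / seconds / 1e6
mbit_per_s = mbytes_per_s * 8

print(f"{mbytes_per_s:.0f} MB/s  ~= {mbit_per_s:.0f} Mbit/s")
# -> about 43 MB/s (~345 Mbit/s), well under a single 1 Gbit link.
```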
-
- Veteran
- Posts: 257
- Liked: 40 times
- Joined: May 21, 2013 9:08 pm
- Full Name: Alan Wells
- Contact:
Re: Replication of big VMs fails
Check your storage latency. Almost any time I had issues replicating large VMs, it was either network or storage related. Mine was mostly storage because we had very slow old NetApps. After upgrading to super-fast SANs we have seen no issues replicating.
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
Thanks... Pretty sure it was offloading... We are on all-flash, so latency is mostly right around 0, and we were replicating from backups (seeding) anyway. Everything is OK now; TCP offloading was the issue...
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: Replication of big VMs fails
Hmm, it doesn't seem to be good after all. Over the weekend, 2 out of 5 replication jobs and their retries failed again. The biggest VM started with an error after 30 minutes (failed to open VDDK disk), got an "existing connection was closed" error on the next run (after 90 minutes), and could not connect to the DR vCenter (local network) on the next. The 4th run has now been running for 27 hours, and apparently it had to recalculate digests for all disks after the errors.
Even worse, 50% of our backup jobs also failed for multiple reasons, especially during snapshot creation/removal, and all of them seem linked to the connection to vCenter being gone for a few seconds. I also sometimes lose ping when editing a backup job while it queries vCenter, but only when these big replication jobs are running. We didn't have any issue with backup jobs for years before we started replicating these bigger VMs, but since then it really seems unstable. Dealing with support on this hasn't been very good: usually it takes more than a day to get an answer, and since these jobs run for 20-30 hours, it takes 3 to 4 days to get a simple answer. But so far there hasn't been a single answer that could even remotely be relevant here... Sorry for sounding a bit frustrated.
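To help confirm whether vCenter connectivity really drops while the big replication jobs run, a simple connectivity logger can be left running alongside the job. A minimal sketch, assuming a hypothetical vCenter hostname (vcenter.example.local) and that a successful TCP connect to port 443 is a good-enough reachability check:

```python
import socket
import time
from datetime import datetime

VCENTER = "vcenter.example.local"   # hypothetical hostname, replace with yours
PORT = 443                          # vCenter web services port
INTERVAL = 5                        # seconds between checks

while True:
    start = time.monotonic()
    try:
        # A successful TCP connect to 443 is used as a cheap "vCenter is
        # reachable" signal; nothing is sent over the connection.
        with socket.create_connection((VCENTER, PORT), timeout=3):
            pass
    except OSError as exc:
        # Log only the failures with a timestamp, so drops are easy to
        # correlate with the replication job errors afterwards.
        print(f"{datetime.now().isoformat()} vCenter check failed: {exc}")
    # Keep the check interval roughly constant regardless of connect time.
    time.sleep(max(0.0, INTERVAL - (time.monotonic() - start)))
```

Timestamps of any failures can then be lined up against the job session logs to see whether the connection drops coincide with the snapshot creation/removal errors.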