Problems with making replication work across a WAN/VPN?

Post by **mrpackethead** » Mar 31, 2009 10:26 pm

Hi,

I set up an environment with two ESX 3.5 servers and wanted to replicate machines from one to the other. Initially I had both machines on the same site, just a single hop away from each other through a layer 3 switch. The point of doing this was that we were able to keep all names and IP addressing the same. The initial replication worked fine, and we did a subsequent replication, which created a small delta file and applied it. All was good and appeared to have worked.

We then picked up the 'DR' ESX server and moved it 600 km north to the DR site. We reconfigured the network so that the servers could once again talk to each other, and started the replication again. It gets so far and then just stops. There is a 10 Mb/s connection between the sites.

The data travels across an IPsec VPN, which is configured between two firewalls. The quality of the network is very good: it has about 15 ms latency and there is very little packet loss. I have tried adjusting the MTU size of the IPsec tunnel, but that made no difference. The firewalls are not dropping any traffic; all traffic is passing.

Has anyone else experienced such a problem? We are trying to diagnose this with Veeam support at the moment, but as yet we don't have a resolution.

Could someone confirm something for me:

In a replication task, does the data move directly from one ESX server's console port to the other ESX server's console port? And what ports/protocols does it use?

I am desperately trying to ensure that the network or any elements are not causing the failure.

Any suggestions will be greatly appreciated.

Post by **Gostev** » Apr 01, 2009 11:35 am

Hello, what is your support case number? I would like to take a look, since it is not clear from your description what the exact issue/error is.

To answer your question: we do not use the service console port for data transfer. Instead, we use a different port (2500 and above, one per job), essentially a TCP/IP socket-to-socket connection using a proprietary data transfer protocol. So my guess is that you do not have the required ports open in the firewall.
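One quick way to rule the firewall in or out is to attempt a plain TCP connection to the data port from the other side. A minimal sketch in Python, assuming something is actually listening on the target port at the time of the check; the function name `tcp_port_open` is made up for illustration, and 2500 is simply the first port in the range mentioned above:

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

It could be called as `tcp_port_open("<remote-IP-address>", 2500)`. Keep in mind that a successful handshake only exchanges small packets, so it says nothing about whether full-size packets make it across the link.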

We did have plans to improve replication for high-latency/unreliable links down the road, including supporting initial replication over physical storage, but from your description it sounds like you should not have any issue given the quality link you have.

Thank you!

Post by **mrpackethead** » Apr 01, 2009 8:08 pm

My apologies, I should have said "console interface to console interface". I am aware that the data is moved on ports between 2500 and 5000. I wanted to know whether that is UDP or TCP. Right now my firewalls are not filtering anything, and ANYTHING is allowed to transit the network.

The response I got from Veeam support was pretty poor. [Ticket#505505]

> I noticed 2 ping timeouts during this run, and that
> definitely signifies network dropouts for the connection. I
> am very sorry for the problem does seem to reside in the
> reliability of this WAN connection. Do you notice any issue
> at all when replicating onsite locally?

No network will be perfect! Especially when you have an application like Veeam consuming all available bandwidth. Something has to give. We can very successfully and reliably move data between these ESX servers using SCP.

Does Veeam require a 'perfect' network that never drops a single packet in order to make replication work? If *any* of the data gets lost in transit, will this break the replication? We were able to replicate the system when it was local, but now that it's at the other end of the WAN, it fails.
I'm really after a simple answer here. Am I using Veeam in a way that simply can't work? I need to make a call very shortly on whether to achieve the replication in another manner or continue to try to resolve the issue. I have had my team working on this for some time now.

Kind Regards

Post by **Gostev** » Apr 02, 2009 10:51 am

Minor packet loss is not a problem, since it is handled at the TCP/IP level. As long as you have low latency and your VPN connection does not break, it should be good enough for us. Looking at the logs, your problem looks to be related to the SSH connection timing out for some reason (this probably relates to target ESX SSH server settings such as keep-alive timeout, number of simultaneous client connections and so on).

Post by **mrpackethead** » Apr 07, 2009 2:57 am

Well, we are close to giving up. We've now installed another replication product and it appears to be working well. It's a pity, because we've spent a bunch of $$$ on Veeam, and now we will probably have to spend a bunch of $$$$ on another product.

It worked fine in a LAN environment, but in the WAN it just doesn't cut it for me.

Post by **mrpackethead** » Apr 10, 2009 7:13 am

Well, finally some good news. We have got replication going. We are currently 50% through replicating several hundred gigabytes across a VPN on the internet.

I spent 15-odd hours watching packets in and out of the ESX servers across the WAN, trying to figure out what was going on. It was way less than obvious! The good news (at least for the folks at Veeam) is that it is not a problem with Veeam.

Along with the ports that Veeam specifies as required for replication, it's VERY VERY important that ICMP "can't fragment" messages can get back to the hosts. Many networks, particularly those with VPNs running across the internet, have ICMP filters. ICMP is an integral part of the IP stack, and if it's filtered out you can potentially (as we did) have major problems. In our case, we had ICMP echo allowed, but that's all.

The ICMP "can't fragment" message couldn't get back to the source ESX server, so it never knew that the packets it was sending were too big to fit through the segment with the smallest MTU on the path, which in our case was the IPsec VPN tunnel. So our firewalls were fragmenting the packets in order to get them down the link. Packet fragmentation is *VERY* CPU intensive, and eventually the firewalls got a few packets out of order and bad stuff happened (they dropped the sessions).
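To get a feel for what the firewalls were doing to every full-size packet, here is a minimal sketch of IPv4 fragmentation arithmetic. It assumes plain IPv4 with a 20-byte header and no options; the tunnel MTU of 1400 is purely illustrative and not a value measured in this thread, and the function name is made up:

```python
from typing import List


def ipv4_fragment_sizes(packet_len: int, mtu: int, header_len: int = 20) -> List[int]:
    """Return the total size of each fragment an IPv4 packet is split into."""
    if packet_len <= mtu:
        return [packet_len]  # fits as-is, no fragmentation needed
    payload = packet_len - header_len
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - header_len) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any payload")
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(chunk + header_len)  # each fragment gets its own IP header
        payload -= chunk
    return sizes


print(ipv4_fragment_sizes(1500, 1400))  # [1396, 124]
```

Under these assumptions every 1500-byte packet turns into two, one of them tiny, and the far end has to reassemble them before the data is usable.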

This was a NIGHTMARE to debug. Firstly, the Veeam logs don't give any clue as to what's going on. Secondly, the network never dropped: you could run a ping at the same time and no packets were ever lost. You could send variable amounts of data up and down the wire via scp/sftp, and sometimes it would work and other times it wouldn't. The situation was confusing because there was seemingly reliable connectivity between the hosts, but every so often it just hung!

You could fix this by setting the MTU of the interface to be smaller than that of the smallest link on your network, but that's a pretty ugly hack. What you need to ensure is that the ICMP messages that carry the fragmentation information can get back to the hosts.

That over, I think I can sit back and relax for a while.

My suggestion to the Veeam people is that you include this information in your documentation. I should perhaps have picked it up earlier, but this VPN has been running reliably for many months without issue.

Post by **Gostev** » Apr 10, 2009 10:42 am

All I can say is "WOW"... unbelievable troubleshooting effort... kudos for letting us know!

Post by **b.thoele** » Jan 11, 2010 8:14 pm

We were having similar issues with long transfers across WAN connections that would break. What tools did you use to troubleshoot this? I was using a ping graph program called Ping Plotter Standard. In my case my ping graphs looked like my ISP's connection was dropping, so I stopped looking at the transfer. I was not using a site-to-site VPN. I was using Veeam network replication mode, which allows you to specify host names so that routing across the Internet is possible. I have since switched to vStorage API replication with ESX 4.0 to try to dodge a network drop by reducing my WAN footprint.

The thing is, I haven't been able to get vStorage API to work because I haven't been able to figure out how to tell my remote backup client how to connect to the Veeam.exe host, since it tries to connect to a private IP 192.168.1.X. With Veeam network replication the connection is point to point with the use of host names or IPs. With vStorage API network mode Veeam.exe routes traffic through a third point. I'm not sure why, but the Veeam software doesn't allow you to specify the network identity that the remote host will try to connect to. Can someone tell me how to do this? If I moved Veeam.exe to the client network, would I just be dealing with the server host trying to make the same request to connect to a private IP across the Internet?

[11.01.2010 14:01:45] <09> Info (Client) Service error: Cannot connect to server [192.168.1.5:2500].\n--tr:Client failed to process the command. Command: [connect].
[11.01.2010 14:01:45] <09> Info (Client) Service: closed

Any ideas?

Post by **Gostev** » Jan 11, 2010 8:27 pm

I am by no means a network guru, but maybe modifying the hosts file on the Veeam Backup server would help to connect to the remote host with the required IP?

Jan 11, 2010 10:08 pm

While you can use all the fancy packet capture tools to troubleshoot such problems, it's normally trivial to troubleshoot PMTU issues using ping. Ping can help you discover the maximum MTU along the path between two hosts, and can even be used to detect a "black hole" situation like the one the original poster experienced, where a clueless/overzealous "security" person blocks ICMP arbitrarily without understanding its impact.

To use ping from an ESX console for PMTU troubleshooting, you can use a simple command like the following:

ping -c 5 -M do -s 1472 <remote-IP-address>

This tells the system to send 5 ping packets with a payload of 1472 bytes, the largest that will fit within a default MTU of 1500 without requiring fragmentation (1472 bytes of data plus the 8-byte ICMP header and the 20-byte IP header comes to exactly 1500). The "magic" in this command is the "-M do", which tells ping to set the "Don't Fragment" bit in the packet. Normally, routers will simply break packets that are too big into smaller parts and send them along in pieces (thus the term "fragment"). However, packets with the DF (Don't Fragment) bit set are not allowed to be fragmented, so the proper response from a router is to return an ICMP Type 3 packet (Destination Unreachable) with the Code field set to 4 ("Fragmentation needed and Don't Fragment set"). Of course, if your friendly network guru mistakenly blocks ICMP Type 3 messages, that message will never make it back to the sending host and it will happily keep on sending big packets into the black hole.
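The relationship between the `-s` value and the MTU is just header arithmetic. A small sketch, assuming IPv4 without IP options (20-byte header) and a standard 8-byte ICMP echo header; the function names are made up for illustration:

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_payload_for_mtu(mtu: int) -> int:
    """Largest `ping -s` value that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def mtu_for_ping_payload(payload: int) -> int:
    """Packet size on the wire (at the IP layer) for a given `ping -s` value."""
    return payload + IPV4_HEADER + ICMP_HEADER


print(ping_payload_for_mtu(1500))  # 1472
print(mtu_for_ping_payload(1400))  # 1428
```

So a ping that only succeeds up to `-s 1400` suggests a path MTU of 1428, not 1400.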

So basically, the response to the above "ping" command could be one of three types:

Successful Replies:

PING 10.1.1.2 (10.1.1.2) 1472(1500) bytes of data.
1480 bytes from 10.1.1.2: icmp_seq=1 ttl=64 time=21.5 ms
1480 bytes from 10.1.1.2: icmp_seq=2 ttl=64 time=21.3 ms

If you see this, you've got an MTU of 1500 all the way through, thus you shouldn't see MTU problems.

Reduced MTU in path:

PING 10.1.1.2 (10.1.1.2) 1473(1501) bytes of data.
From 10.2.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.2.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.2.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.2.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.2.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1500)

--- 10.1.1.2 ping statistics ---
0 packets transmitted, 0 received, +5 errors

You've got some device/path in the link that needs a lower MTU, but the router is properly sending ICMP Fragment messages so everything should still be good. You can play with the sizes (the -s parameter on the ping command) until the pings succeed to determine the actual MTU of the link but it really shouldn't be necessary.

No Response:

PING 10.1.1.2 (10.1.1.2) 1472(1500) bytes of data.
--- 10.1.1.2 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4008ms

This is the worst case and the one most likely to cause you problems. Basically, any packet that requires fragmentation is just being eaten, because the ICMP messages sent by the router or VPN device are being blocked and are not making it back to the host, so the host does not know that it needs to reduce the MTU size. The ideal fix is to find the clueless network/security admin and use a "corrective device" to make them see the light. However, if this isn't practical, you can play with the packet size using the "-s" parameter in the ping command until you find a size that makes it through. In most cases a size of 1400 will get through almost all VPNs, but there are a few exceptions where something like 512 will be required. Once you know the size, add 28 bytes for the IP and ICMP headers to get the MTU, set that on the ESX hosts' console interfaces, and life should be good.
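Rather than guessing sizes by hand, the search can be automated. A minimal sketch, assuming a Linux `ping` that accepts the same `-c`, `-M do` and `-s` options used above, and assuming that if one size gets through then every smaller size does too; `find_max_payload` and `ping_probe` are hypothetical helper names:

```python
import subprocess
from typing import Callable


def find_max_payload(probe: Callable[[int], bool], low: int = 0, high: int = 1472) -> int:
    """Binary-search the largest payload size for which probe(size) is True.

    Returns -1 if even the smallest size fails.
    """
    if not probe(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def ping_probe(host: str) -> Callable[[int], bool]:
    """Build a probe that sends one ping with DF set and the given payload size."""
    def probe(size: int) -> bool:
        try:
            result = subprocess.run(
                ["ping", "-c", "1", "-M", "do", "-s", str(size), host],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=5,
            )
        except (subprocess.TimeoutExpired, OSError):
            return False
        return result.returncode == 0
    return probe
```

Something like `find_max_payload(ping_probe("<remote-IP-address>"))` would then return the largest payload that passes. Each size is tried only once, so a single lost packet on a flaky link can skew the answer; it is worth repeating the run before settling on a value.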

Another option, if you don't feel like playing with MTU settings and such on the interfaces, is to enable Linux support for Blackhole Router Detection, or MTU Probing. This feature can be enabled from the Linux console with the following command:

echo 2 > /proc/sys/net/ipv4/tcp_mtu_probing

This will force Linux to follow RFC 2923 and attempt to discover and work around black hole routers (routers whose ICMP Fragment packets are blocked or disabled). You'd need to enable this on both ends, and it will work its magic and discover the MTU of a path even when the ICMP fragment packets are blocked. If this fixes your issue, you can make it persistent across reboots by adding the following line to the /etc/sysctl.conf file on your ESX host:

net.ipv4.tcp_mtu_probing = 2
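To confirm what the setting currently is, the value can be read back from /proc. A small sketch, assuming the kernel on the host exposes this setting at all (older kernels do not); the value descriptions follow the Linux kernel's ip-sysctl documentation, and the function name is made up:

```python
from pathlib import Path

# Meaning of each value, per the Linux kernel's ip-sysctl documentation.
TCP_MTU_PROBING_MODES = {
    0: "disabled",
    1: "disabled by default, enabled when an ICMP black hole is detected",
    2: "always enabled",
}


def read_tcp_mtu_probing(path: str = "/proc/sys/net/ipv4/tcp_mtu_probing") -> str:
    """Read the current tcp_mtu_probing value and describe what it means."""
    value = int(Path(path).read_text().strip())
    return TCP_MTU_PROBING_MODES.get(value, f"unknown value {value}")
```

Value 1 is the gentler choice, since probing only kicks in once a black hole is suspected.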

Obviously any changes would need to be made and tested on ESX servers on both ends of the link (it's possible that ICMP fragment packets are blocked only on one side of the link, but unlikely).

As far as which IP address Veeam connects to, this should be easily manipulated by hosts tables, DNS entries (a special DNS/domain for the private addresses), or simply adding the host to Veeam via IP address. I'm not even sure I understand the problem you're having; perhaps you can explain your architecture better as far as where hosts are located and how they access each other (public or private address).

Post by **Gostev** » Jan 11, 2010 10:11 pm

Tom, do you really know... everything?

Post by **huberw** » Feb 10, 2010 11:55 pm

mrpackethead,

I have a few questions for you, because I *think* I have a similar issue.

I have 2 ESX hosts connected via a site to site VPN between 2 sonicwall firewalls. Site A (production) has a 20/20mbit internet connection, and site B (recovery site) has a 100Mbit ethernet handoff to the internet. I am trying to replicate 2 VMs through the VPN, total 300GB of data.

It seems that my replication jobs get so far, then stop, similar to how you describe. Sometimes I get 10GB through, sometimes I get 200GB through. It's weird.

I understand what you are saying in your explanation on how you solved this problem, I'm just having trouble making sense of Sonicwall's terminology of this feature/setting, and where I might find it.

Also, What type of equipment are you using? What types of transfer speeds are you getting on your replication jobs using your connection? Can you offer any advice as to how I could go about fixing this? Has it been reliable since you have fixed it?

Thanks in advance!

Post by **jbeunel** » Mar 04, 2010 7:07 pm

Hello,

Same problem for me on a 20Mb/20Mb connection for my Veeam backup.
I get only 300KB/s per backup job.
I'll change the MTU of my Veeam server tomorrow.

Regards

Post by **Frosty** » Mar 04, 2010 9:15 pm

This is bringing back long-distant memories. I once had to troubleshoot an FTP authentication problem. It ended up being an MTU issue too. During the FTP authentication exchange of user+password between client and server, the password was being split across two packets, and this was throwing the server off and causing it to reject the connection. I ended up finding this by using a packet sniffer to trap all the packets relating to the exchange of data and then going through them manually with a fine-tooth comb. Changing the MTU on our gateway router shifted things enough so that the password was all in its own packet, and then things started working normally.

But that's nothing compared to the efforts in the posts above. I takes my hat off to you!

Post by **icebun** » Mar 05, 2010 4:37 pm

Guys,

I have got a site to site VPN (10MB Production, 6MB DR).

While just replicating a single VM (around 20GB), I am getting very poor transfer rate speeds. It's around 380kb/s.

The replication is configured as a Virtual Appliance using the vStorage API method.

I am using a pair of Juniper SS140 firewalls and am keen to know if there are any adjustments that need to be made to improve throughput.

Is it the server I need to modify or the Firewalls?

Post by **jbeunel** » Mar 08, 2010 9:17 am

Hi,

I changed the MTU on my Veeam server to 1400, but the speed is still limited to 300KB/s per job.

This is the sniffer capture:

any ideas?
regards

Post by **TMassa** » Mar 08, 2010 11:02 pm

(This doesn't appear to be your specific problem, but for anyone who may stumble upon this thread, hopefully it will save you the frustration that I incurred.)
I experienced an issue where a secured DMZ AD domain was configured using the high-security template from Microsoft. While troubleshooting an application problem, I discovered that PMTU discovery was disabled on all of the servers, thereby forcing all packet sizes to be 576. Needless to say, it caused pitiful performance. Once I disabled the domain policy that was disabling PMTU discovery and rebooted the environment, normal traffic speeds were achieved.

http://support.microsoft.com/kb/900926

Good luck.

Post by **donikatz** » Mar 18, 2010 4:39 pm

icebun wrote:

> Guys,
>
> I have got a site to site VPN (10MB Production, 6MB DR).
> While just replicating a single VM (around 20GB), I am getting very poor transfer rate speeds. It's around 380kb/s.
> The replication is configured as a Virtual Appliance using the vStorage API method.
> I am using a pair of Juniper SS140 firewalls and am keen to know if there are any adjustments that need to be made to improve throughput.
> Is it the server I need to modify or the Firewalls?

From your post in another thread, looks like this is still a problem for you. Gotta watch the bits & bytes -- I assume you're replicating over a 6 Mb link? What kind of transfer speeds do you get over that link without Veeam, like FTP? What sort of replication speeds do you get locally to that same destination? Gotta work through the different potential factors. Thanks
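The bits-versus-bytes point is easy to check with a little arithmetic. A rough sketch, assuming the reported 380 figure is kilobytes per second, the slower side of the link is 6 megabits per second, and decimal units throughout (the function names are made up for illustration):

```python
def kilobytes_per_s_to_megabits_per_s(kb_per_s: float) -> float:
    """Convert kilobytes/s to megabits/s (decimal units: 1 KB = 1000 bytes)."""
    return kb_per_s * 8 / 1000


def link_utilisation(rate_mbit: float, link_mbit: float) -> float:
    """Fraction of the link capacity that the given rate consumes."""
    return rate_mbit / link_mbit


rate = kilobytes_per_s_to_megabits_per_s(380)
print(rate)                                   # 3.04
print(round(link_utilisation(rate, 6.0), 2))  # 0.51
```

Under those assumptions the job is already using about half of the link. If the 380 figure were actually kilobits per second, the same link would be only about 6% used, which points to a very different problem.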

Post by **sjutras** » Apr 08, 2010 2:54 pm

In our case, the ICMP/MTU config was fine, but the problem was due to the Veeam server being on the DR site.

We monitored traffic and found that even with compression activated on the Veeam server, data is compressed only once the traffic has reached the Veeam server, so the .vrb appears to be 149MB but in reality around 607MB of traffic was generated.
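To put those two numbers in perspective, here is a rough calculation. The 149MB and 607MB figures come from the measurements above; the 10 Mbit/s link speed is purely an assumption for illustration, as is the function name:

```python
def transfer_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Ideal transfer time for size_mb megabytes, ignoring protocol overhead."""
    return size_mb * 8 / link_mbit_per_s


ASSUMED_LINK_MBIT = 10  # hypothetical WAN speed, not taken from the thread

print(round(607 / 149, 1))                       # 4.1
print(transfer_seconds(607, ASSUMED_LINK_MBIT))  # 485.6
print(transfer_seconds(149, ASSUMED_LINK_MBIT))  # 119.2
```

In other words, roughly four times as much data crosses the WAN as ends up in the .vrb file, so the job takes about four times longer than it would if the data were compressed before leaving the production site.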

I will back up the Veeam server and place it at the HQ site and re-test, but I am pretty sure this is our issue.

Post by **ahahum** » Jul 29, 2010 9:07 pm

Am I supposed to be changing the MTU on my firewalls responsible for the IPSec Tunnel? Or on my ESX hosts at each site? Or my Veeam server? Or a combination of these?

I've narrowed it down to anything over 1330 gets fragmented, I'm just unsure where to make the changes.

Post by **pesgit** » Jun 15, 2011 5:57 pm

Check MTU along a path using the tracepath program on Linux. Example: myserver$ tracepath -n 192.168.1.100
