joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

iSCSI connection failures during extreme high load NBD

Post by joergr »

A little brainteaser for you guys. I will discuss this tomorrow with some guys over at VMware (because it is almost certainly a VMware issue), but let's see if anyone here has hit this before:

When doing a pure NBD backup from a physical Veeam server against an ESXi 4.1 host, backing up a VM located on an EqualLogic SAN (ALL equipped with 10 GbE interfaces: the Veeam server, the ESXi 4.1 host and the EqualLogic system), sometimes THIS appears in the EqualLogic logs:

INFO 02.12.10 16:39:38 10eql2 iSCSI session to target '172.16.150.234:3260, iqn.2001-05.com.equallogic:0-8a0906-cd5e5a007-ed2000000524c8f7-10eql1esxsata1' from initiator '172.16.150.35:59312, iqn.1998-01.com.vmware:esx12-27bd5df6' was closed. iSCSI initiator connection failure. Connection was closed by peer.

Four to six seconds later it reconnects:

INFO 02.12.10 16:39:43 10eql2 iSCSI login to target '172.16.150.234:3260, iqn.2001-05.com.equallogic:0-8a0906-cd5e5a007-ed2000000524c8f7-10eql1esxsata1' from initiator '172.16.150.35:60326, iqn.1998-01.com.vmware:esx12-27bd5df6' successful using standard-sized frames. NOTE: More than one initiator is now logged in to the target.

Now, this only happens during extremely high-bandwidth operations, e.g. when about 40% of the 10 GbE link is used. It seems the ESXi 4.1 software iSCSI initiator can't take any more and fails for a very short period of time.

Any thoughts? And please: this is research. Don't tell me to use SAN mode - I am curious why this is happening.

Best regards,
Joerg
tsightler
VP, Product Management
Posts: 6009
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by tsightler »

Have you investigated the ESX server logs? My suspicion is simply that some I/O operation timed out due to the reasonably high load (~500 MB/sec). We used to see similar behavior with ESX 3.5, and even on some of our busy Linux hosts, because the timeouts were set fairly low to allow quick path failover. A command would time out, triggering a hard reset of the iSCSI link, while in the meantime traffic continued to flow over the other iSCSI links. We saw it pretty regularly in the Veeam 3.x days before CBT, because we would run multiple jobs simultaneously and push 300-400 MB/sec out of our EqualLogic arrays, but we rarely see it anymore.
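
If you want to hunt for it, something along these lines from the console should surface the relevant entries. Treat it as a starting point rather than an official procedure - the log location differs between classic ESX (/var/log/vmkernel) and ESXi, where the vmkernel messages end up in /var/log/messages:

# pull iSCSI/SCSI-related vmkernel entries that mention a timeout, abort or reset
# (on classic ESX, grep /var/log/vmkernel instead of /var/log/messages)
grep -iE 'iscsi|scsi' /var/log/messages | grep -iE 'timeout|abort|reset'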
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by joergr »

Hi Tom,

first of all, THANKS a lot - I will examine the ESXi logs tomorrow and check it out. I have never seen this with SAN mode, only with NBD mode. It is a vanilla ESXi software iSCSI initiator out of the box - no round robin, no multipathing at all - so it won't fail over to another pNIC. But maybe the vSwitch is initiating the hard reset; I don't know and will try to find out. Do you by any chance know in which log of the huge standard diag package I would find timed-out I/O operations? vCenter Server alarms report nothing, by the way - not even an event (as it would show when an iSCSI LUN disappears for more than 60 seconds) - nothing at all. Without taking a close look at the EQL events I would never have seen it. SAN HeadQuarters also reports no problems, and Veeam Monitor 5 reports nothing.

Best regards,
Joerg
tsightler
VP, Product Management
Posts: 6009
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by tsightler »

Actually, I think the timeout would be at the iSCSI layer, probably a no-op timeout, which initiators use to determine whether a link is still alive. The initiator sends a no-op command and expects to see a response within a given time; if the queue on the array is very full, the response may not make it back in time, so the initiator assumes the link is dead and performs an iSCSI reset, which effectively forces a logout and a fresh login.

That being said, I started thinking about your issue a little more and realized it might be something completely different. It could be occurring because your EqualLogic array is attempting to "load balance" the traffic across its Ethernet ports. EqualLogic arrays use ARP redirects to move traffic from one port to another to equalize the load between links. If ARP redirect isn't enabled on the iSCSI initiator, this can cause a disconnect and reconnect, and perhaps it does even when it is enabled. The EqualLogic arrays perform this "load balancing" on a schedule, so if the breaks occur at a roughly even interval, that might be the issue. Of course, this assumes you have at least two active links on the EQL side (although I think the EQL actually performs the load balancing anyway, which is kind of weird).

The ARP redirect configuration is well documented in the EqualLogic documents and in VMware's iSCSI SAN Configuration Guide, so you have probably already set it, but I still thought it was worth mentioning.
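
For anyone else following along: as far as I know, this setting only applies to hardware iSCSI HBAs (the software initiator is supposed to handle the redirects by itself). On a hardware HBA it can be checked and enabled roughly like this - the adapter name vmhba33 is just a placeholder for whatever your HBA is called:

# show the current ARP redirect setting for the HBA (hardware iSCSI only)
esxcfg-hwiscsi -l vmhba33
# allow ARP redirection so the EQL can move the session between ports
esxcfg-hwiscsi -a allow vmhba33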
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by joergr »

Hi Tom,

thanks a lot for your feedback. Yeah, a load-balancing request from the EQL was my first thought, too. But when examining the EQL logs and talking to Nashua, I found out that a load-balancing request is marked and explained as such in the EQL logs. Thus I am quite sure it's the ESXi 4.1 machine, or the ixgbe driver on it, that is causing these short drops. I will investigate further and keep you updated.

Does anybody have any other ideas? Another question for Veeam (one that came up from the VMware guys): when using NBD mode, does Veeam strictly use standard vStorage API calls, or did you implement additional tweaks? (I guess the answer is no tweaks, because it is an ESXi machine.) So I guess, again, VMware has to solve this ;-)

Best regards,
Joerg
mcwill
Enthusiast
Posts: 64
Liked: 10 times
Joined: Jan 16, 2010 9:47 am
Full Name: Iain McWilliams
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by mcwill »

We had a similar-sounding problem with VMware & EQL a year ago, just after vSphere was launched. See http://communities.vmware.com/thread/215039 for a long thread that tracked the problem for nine months before VMware fixed it.
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by joergr »

Follow-up: I MAY have found the solution. It is not 100% verified yet (that needs much more time), but the latest ixgbe driver (2.0.84.9) for Intel 10 GbE modules on ESXi 4.1 SEEMS to solve the drop problem.
Now, what I always find very frustrating with ESXi 4.1: these driver updates have to be installed manually via the vSphere CLI or vMA; there is no way to get them installed via Update Manager, and they are also NOT included in ESXi 4.1 firmware updates (not even in the very latest one from two days ago). Can't understand that ;-)

But what is nice (we checked this): if you install the new ixgbe driver and then do an ESXi 4.1 firmware upgrade with Update Manager, the new driver stays in place.

BTW: if you don't know how to get the ixgbe driver version on your ESXi 4.1 machine, just enable local or remote TSM, log in and enter

vsish -e get /net/pNics/vmnic0/properties

(just replace the 0 with the number of the NIC you are looking for). Scroll up (use PuTTY, it is much nicer) and check the driver version. Driver too old or vanilla? You can update it with the vSphere CLI via

vihostupdate.pl --server [IP address] --username root --install --bundle [CD/DVD]:\offline-bundle\INT-intel-lad-blablabla-offline_bundle-blablabla.blablabla

NOW ;-) if anyone has a way to read out the ixgbe driver version with the vSphere Client, or even UPDATE it from there, THAT would be more than great.
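
In the meantime, two command-line shortcuts that should do the job (not the vSphere Client, I know; the vmnic number and IP address are placeholders to adapt, and the first one assumes ethtool is available in your TSM shell):

# from the TSM shell: driver name, version and firmware in one go
ethtool -i vmnic0
# from the vSphere CLI: list the bulletins/driver packages installed on the host
vihostupdate.pl --server [IP address] --username root --query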

best regards,
Joerg