joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

iSCSI connection failures during extreme high load NBD

Post by joergr »

A little brainteaser for you guys. I will discuss this tomorrow with some people over at VMware (because it is certainly a VMware issue), but let's see whether anyone here has run into this before:

When doing a pure NBD backup from a physical Veeam server against an ESXi 4.1 host, backing up a VM located on an EqualLogic SAN (ALL equipped with 10 GbE interfaces: the Veeam server, the ESXi 4.1 host, and also the EqualLogic system), sometimes THIS appears in the EqualLogic logs:

INFO 02.12.10 16:39:38 10eql2 iSCSI session to target '172.16.150.234:3260, iqn.2001-05.com.equallogic:0-8a0906-cd5e5a007-ed2000000524c8f7-10eql1esxsata1' from initiator '172.16.150.35:59312, iqn.1998-01.com.vmware:esx12-27bd5df6' was closed. iSCSI initiator connection failure. Connection was closed by peer.

Exactly four to six seconds later it reconnects.

INFO 02.12.10 16:39:43 10eql2 iSCSI login to target '172.16.150.234:3260, iqn.2001-05.com.equallogic:0-8a0906-cd5e5a007-ed2000000524c8f7-10eql1esxsata1' from initiator '172.16.150.35:60326, iqn.1998-01.com.vmware:esx12-27bd5df6' successful using standard-sized frames. NOTE: More than one initiator is now logged in to the target.

Now, this only happens during extremely high bandwidth operations, e.g. when about 40% of the 10 GbE link is used. It seems the ESXi 4.1 software iSCSI initiator can't take more and fails for a very short period of time.
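
Just for the record, the rough arithmetic behind that 40% figure (line rate only, ignoring any protocol overhead), as a quick shell one-liner:
echo $(( 10000 * 40 / 100 / 8 )) MB/s    # 40% of a 10 Gbit/s link = 4000 Mbit/s = roughly 500 MB/s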

Any thoughts? And please: this is research. Don't tell me to use SAN mode; I am curious why this is happening here.

Best regards,
Joerg
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by tsightler »

Have you investigated the ESX server logs? My suspicion is simply that some I/O operation timed out due to the reasonably high load (~500MB/sec). We used to see similar behavior with ESX 3.5, and even on some of our busy Linux hosts, because the timeouts were set fairly low to allow quick path failover. A command would time out and the initiator would perform a hard reset of the iSCSI link, but in the meantime traffic would continue to flow over the other iSCSI links. We used to see it pretty regularly in the Veeam 3.x days before CBT, because we would run multiple jobs simultaneously and push 300-400MB/sec out of our EqualLogic arrays, but we rarely see it anymore.
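
If you want a quick look before pulling a full diagnostic bundle, something along these lines from the host console should surface the aborts and resets. I am going from memory here, so treat the paths and patterns as a starting point only; if I recall correctly ESXi 4.x sends vmkernel output to /var/log/messages, while classic ESX uses /var/log/vmkernel:
grep -iE "iscsi|abort|reset|timed out" /var/log/messages | tail -n 50    # ESXi 4.x
grep -iE "iscsi|abort|reset|timed out" /var/log/vmkernel | tail -n 50    # classic ESX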
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by joergr »

Hi Tom,

first of all, THANKS a lot - I will examine the ESXi logs tomorrow and check it out. I have never seen this with SAN mode, only with NBD mode. It is a vanilla ESXi software iSCSI initiator out of the box, no round robin, no multipathing at all, so it won't fail over to another pNIC. But maybe the vSwitch is initiating the hard reset. I don't know - I will try to find out. Do you by any chance know in which log of the huge standard diag package I would find the timed-out I/O operations? vCenter Server alarms report nothing, by the way, not even an event (as it would show if an iSCSI LUN disappeared for more than 60 seconds) - nothing at all. Without taking a close look at the EQL events I would never even have seen it. SAN HeadQuarters also reports no problems, and Veeam Monitor 5 reports nothing.
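
In case it helps anyone else, here is roughly what I plan to grep for once I have pulled and unpacked the vm-support bundle tomorrow (the bundle and directory names below are just placeholders, and the message wording surely varies by build, so take the patterns as a starting point only):
tar xzf esx-support-bundle.tgz    # placeholder name for the diagnostic bundle
grep -riE "iscsi|abort|reset|timed out" esx-support-bundle/var/log/ | less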

Best regards,
Joerg
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by tsightler »

Actually, I think the timeout would be at the iSCSI layer, probably the no-op timeout that initiators use to determine whether a link is still alive. The initiator sends a no-op command and expects to see a response within a given time; if the queue on the array is very full, however, the response may not make it back in time, so the initiator assumes the link is dead and performs an iSCSI reset, which effectively forces a fresh login.
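
If it really is the no-op path, the vmkernel log on the host should show it right before each session reset. The exact message wording varies by build, so consider these patterns a guess rather than gospel:
grep -iE "nop|timed out|session.*(reset|closed)" /var/log/messages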

That being said, I started thinking about your issue a little more and realized it might be a completely different problem. It could be occurring because your EqualLogic array is attempting to "load balance" the traffic across its Ethernet ports. The EqualLogic arrays use ARP redirects to move traffic from one port to another in an attempt to equalize the load between links. If ARP redirect isn't enabled on the iSCSI initiator, this can cause a disconnect and reconnect, and perhaps it does even when it is enabled. The EqualLogic arrays perform this "load balancing" on a schedule, so if the drops seem to come at a fairly even interval, that might be the issue. Of course, this is assuming you have at least two active links on the EQL side (although I think the EQL actually performs the load balancing anyway, which is kind of weird).
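
One quick way to check: if you can export the array's event log to a plain text file (the file name below is just a placeholder), pulling out the timestamps shows whether the drops fall on a fixed interval (which would point at the load balancing) or only line up with the heavy backup traffic. With the log format you quoted above, field 2 is the date, field 3 the time, and field 6 tells you whether it was a session close or a login:
grep -E "iSCSI (session to target|login to target)" eql-events.txt | awk '{print $2, $3, $6}'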

The ARP redirect configuration is well documented in the EqualLogic documentation and in VMware's iSCSI SAN Configuration Guide, so you've probably already set that, but I still thought it was worth mentioning.
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by joergr »

Hi Tom,

thanks a lot for your feedback. Yeah, a load-balancing request from the EQL was my first thought, too. When examining the EQL logs and talking to Nashua, I found out that a load-balancing request is marked and explained as such in the EQL logs, so I am quite sure it is the ESXi 4.1 machine, or the ixgbe driver on that machine, that is causing these short drops. I will investigate further and keep you updated.

Anybody have any other ideas? Another question for Veeam (it came up from the VMware guys): when using NBD mode, does Veeam strictly use standard vStorage API calls, or did you implement additional tweaks? (I guess the answer is no, because it is an ESXi machine.) So I guess, again, VMware has to solve this ;-)

Best regards,
Joerg
mcwill
Enthusiast
Posts: 64
Liked: 10 times
Joined: Jan 16, 2010 9:47 am
Full Name: Iain McWilliams
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by mcwill »

We had a similar-sounding problem with VMware & EQL a year ago, just after vSphere was launched. See http://communities.vmware.com/thread/215039 for a long thread that tracked the problem for nine months before VMware fixed it.
joergr
Veteran
Posts: 391
Liked: 39 times
Joined: Jun 08, 2010 2:01 pm
Full Name: Joerg Riether
Contact:

Re: iSCSI connection failures during extreme high load NBD

Post by joergr »

Follow-up: I MAY have found the solution. It is not 100% verified (it needs much more time), but the latest ixgbe driver (2.0.84.9) for the Intel 10 GbE adapters on ESXi 4.1 seems to solve the drop problem.
Now, what I always find very frustrating with ESXi 4.1: these driver updates have to be installed manually via the vSphere CLI or vMA; there is no way to get them installed via Update Manager, and they are also NOT included in the ESXi 4.1 firmware updates (not even in the very latest one from two days ago). Can't understand that ;-)

But what is nice (we checked this): if you install the new ixgbe driver and then do an ESXi 4.1 firmware upgrade with Update Manager, the new driver will stay in place.

BTW: if you don't know how to get the version of the ixgbe driver on your ESXi 4.1 machine, just enable local or remote TSM, log in and enter
vsish -e get /net/pNics/vmnic0/properties
(just replace the 0 with the number of the NIC you are looking for). Scroll up through the output (use PuTTY, it is much nicer) and check the driver version. Driver too old or still the vanilla one? You can update it with the vSphere CLI via
vihostupdate.pl --server [IP address] --username root --install --bundle [CD/DVD]:\offline-bundle\INT-intel-lad-blablabla-offline_bundle-blablabla.blablabla

NOW ;-) If anyone has a way to read out the ixgbe driver version with the vSphere Client, or even to UPDATE it that way, THAT would be more than great.
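
In the meantime, two more CLI options that may help with checking (not verified on every build: I am assuming ethtool is present in the TSM busybox shell, and that the --query switch lists installed bulletins as documented):
ethtool -i vmnic0                                                 # driver name and version for the NIC, shorter than vsish
vihostupdate.pl --server [IP address] --username root --query     # lists the bulletins/bundles installed on the host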

best regards,
Joerg