The word from Gostev on network traffic corruption

Oct 08, 2012 7:31 am

Now I want to know which NIC vendor Gostev is talking about... Maybe Broadcom?

"Traffic corruption is typically caused by malfunctioning network equipment - for example, a router with a bad RAM module with a sticky bit. And the more hops network traffic has to pass – such as in case of backup or replication over WAN – the bigger the chance of corruption is. Now, as you probably already know, TCP specification does include a checksum to mitigate the risks of errors being introduced into a TCP segment during its travel across the network, and this is what should be catching any data corruption - at least in theory. However, during our investigation, we have found that NICs from one very well-known hardware vendor were passing TCP packets with corrupted payload onto the OS, instead of rejecting them - despite those packets having invalid checksums! This was absolutely shocking finding for the team (less shocking for me, as I have already heard all sorts of bad feedback on this vendor’s networking hardware before – including on Veeam forums)."

Marco
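For anyone who wants to see what that TCP checksum actually covers, here is a minimal Python sketch of the RFC 1071 one's-complement checksum applied to a TCP segment plus its IPv4 pseudo-header. It is only an illustration of the arithmetic, not how any particular NIC or driver implements it. The helper names (`internet_checksum`, `pseudo_header`, `fill_checksum`, `segment_is_valid`) and the addresses in the usage note are made up for this example.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """IPv4 pseudo-header that is summed together with the TCP segment."""
    return (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_TCP, tcp_length)
    )


def fill_checksum(src_ip: str, dst_ip: str, segment: bytes) -> bytes:
    """Return the segment with its checksum field (bytes 16-17) filled in."""
    zeroed = segment[:16] + b"\x00\x00" + segment[18:]
    csum = internet_checksum(pseudo_header(src_ip, dst_ip, len(zeroed)) + zeroed)
    return zeroed[:16] + struct.pack("!H", csum) + zeroed[18:]


def segment_is_valid(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """A receiver sums everything, checksum included; the result must be 0."""
    return internet_checksum(pseudo_header(src_ip, dst_ip, len(segment)) + segment) == 0
```

To try it, build a 20-byte TCP header with `struct.pack("!HHIIBBHHH", ...)`, append a payload, run it through `fill_checksum`, then flip a single payload bit: `segment_is_valid` should go from `True` to `False`. A receiver (or a NIC doing checksum offload) is supposed to drop the segment at that point, which is exactly what the hardware in Gostev's story failed to do.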

Post by **Jamie Pert** » Oct 08, 2012 11:35 am

Name and shame, for the benefit of anyone in the Veeam community who may currently be diagnosing a problem where this is the cause!

Post by **Gostev** » Oct 08, 2012 3:22 pm

Marco, yep - you guessed it. But wasn't it easy to guess?

Post by **m.novelli** » Oct 08, 2012 4:12 pm

That's a shame... we sell Dell hardware, and all the servers are equipped with Broadcom NICs.
Some months ago we spent some time trying to get Broadcom NICs working with jumbo frames on an iSCSI network... without luck.
Enabling jumbo frames caused the ESXi hypervisor to freeze.

Marco

Post by **Gostev** » Oct 08, 2012 5:26 pm

Well, we cannot know whether all Broadcom NIC models are affected by this issue, or just certain ones - those that happened to be used in that specific environment. And I don't know which models those were anyway.

Oct 08, 2012 6:25 pm

Over the years there have been a myriad of known issues with various NICs and TCP offload features such as TSO, which move the segmentation (and thus checksumming) down to the hardware layer. It's likely that the problem was specific to some combination of NIC hardware/firmware, hypervisor version, and perhaps even a specific OS/driver combination.

For example, quite a few years ago on our Dell R610s we saw major issues with TCP checksums when using the VMXNET3 driver but not with E1000, and only on Linux systems. We could not reproduce the behavior on Windows boxes, and we could resolve the issue by disabling TSO within the Linux OS using ethtool. That being said, the problem was eventually corrected with a firmware update to the onboard NIC.
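For reference, the TSO workaround comes down to one `ethtool` call per interface. Below is a small Python sketch that only builds and runs the standard commands (`ethtool -k` to list offload settings, `ethtool -K <iface> tso off` to disable TSO). The function names and the interface name `eth0` are assumptions for illustration, and the exact commands used on those R610 guests are not recorded here.

```python
import subprocess
from typing import List


def show_offload_command(interface: str) -> List[str]:
    """Command that lists the current offload settings (lowercase -k)."""
    return ["ethtool", "-k", interface]


def tso_command(interface: str, enabled: bool) -> List[str]:
    """Command that switches TCP segmentation offload on or off (uppercase -K)."""
    return ["ethtool", "-K", interface, "tso", "on" if enabled else "off"]


def set_tso(interface: str, enabled: bool) -> None:
    """Apply the setting; needs root and the ethtool binary on the PATH."""
    subprocess.run(tso_command(interface, enabled), check=True)


# Hypothetical usage on a Linux guest whose interface is named eth0:
#   set_tso("eth0", False)
```

Keep in mind that a setting applied this way does not survive a reboot on its own, so it has to be reapplied from whatever network configuration mechanism the distribution uses.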

Oct 08, 2012 8:48 pm

I've recently seen dramatic improvements between firmware releases on Broadcom NICs, and as Tom said, even weird problems literally disappear after applying those firmware updates. There's no real way to narrow the results down to specific ESXi/VM/NIC versions. What I can say for sure is that Intel chipsets (in vSphere at least) are much more stable and predictable.
Sadly, HP and SuperMicro servers (the ones we use in our datacenter) also come with Broadcom onboard...

Luca.
