Intermittent Network Connectivity with ESXi 5.0 and DL380p Gen8 Servers with 331FLR NICs

So, I was recently troubleshooting some interesting intermittent network connectivity. Here’s the layout: a problem cluster made up entirely of new Gen8 DL380p servers running ESXi 5.0.0.

Upon examining the environment, the following issues presented on three hosts in the cluster:

  • CDP was not functioning on two vSwitch uplinks that were untrunked and plugged into physical switch ports in access mode
  • CDP was sporadically functioning on portgroups with VLAN tagging
  • The observed IP address ranges on the network adapters kept changing every 10 minutes or so
  • VM traffic was intermittently failing entirely (without a presenting failure of the network link)
  • The VMkernel log was filled with messages pertaining to the NetQ feature
  • No changes were being made on the physical network switches, and the switch configuration was verified as correct
  • The tg3 driver version on the 331FLR was 3.123b.v50.1-1OEM.500.0.0.472560 (see the commands after this list for how we checked)
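
For reference, this is roughly how we confirmed the driver version and spotted the NetQueue chatter from the ESXi shell. A minimal sketch; vmnic0 is a placeholder for whichever uplink is backed by a 331FLR port on your host:

    # Identify the host's uplinks (vmnic0 below is an example name)
    esxcli network nic list

    # Show driver name and version for a given uplink
    esxcli network nic get -n vmnic0

    # Look for NetQueue-related messages in the VMkernel log
    grep -i netq /var/log/vmkernel.log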

Articles of Note:

These symptoms exactly describe the problem highlighted in the article, and they would be especially pronounced in a VMware cluster with high network utilization, which is exactly the case in the environment I was troubleshooting.

We followed the article’s recommendation and performed the following workaround (the exact commands we used are sketched after this list):

  1. Disabled NetQ on the network adapter on one of the hosts, following the official KB
  2. Verified from the command line that NetQ was disabled
  3. Verified that the vmkernel log was no longer filling with messages pertaining to the NetQ feature
  4. Tested ping traffic to a test VM on the host where we had disabled NetQ; there were no ping drops
  5. vMotioned the test VM to a host that, at the time, still had NetQ enabled, and saw several ping drops
  6. vMotioned the test VM back and saw no dropped packets once the vMotion completed
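
On ESXi 5.x, disabling NetQueue is a VMkernel setting change followed by a host reboot. A minimal sketch of the commands, assuming the netNetqueueEnabled kernel setting documented in the KB:

    # Disable NetQueue at the VMkernel level (requires a reboot to take effect)
    esxcli system settings kernel set --setting="netNetqueueEnabled" --value="FALSE"

    # After the reboot, confirm the value is FALSE
    esxcli system settings kernel list --option="netNetqueueEnabled"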

The question I had for VMware support was: does the updated version of the driver linked above (tg3-3.123b.v50.1-682322.zip) permanently address this issue?

The response was:

Since the release notes for the updated driver (tg3-3.123b.v50.1-682322.zip) make no mention of fixing this specific issue, we have to assume that the updated driver does not permanently remediate the situation. This means the only method of addressing the issue is to do exactly what we have already done in the documented workaround. VMware still recommended updating the driver simply for the sake of being on the latest version, but said that it will not specifically address the issue.
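
If you do update the driver anyway, the offline bundle installs like any other VIB depot. A sketch, assuming the bundle has been uploaded to a datastore (the path below is an example); put the host into maintenance mode first and reboot afterwards:

    # Install the updated tg3 offline bundle, then reboot the host
    esxcli software vib install -d /vmfs/volumes/datastore1/tg3-3.123b.v50.1-682322.zip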

Release notes:

v3.123b (April 03, 2012)
========================
    Fixes
    -----
        1) Problem: (CQ62172) 5720 hangs when running software iSCSI with
                    TSO enabled.
           Cause  : TSO packets with total size equal to or bigger than 64K
                    are transmitted in this setup.  Hardware has a limitation
                    that individual BDs cannot approach 64K in TSO.
           Change : Set dma_limit to 32K for all TSO capable chips when
                    compiled for VMWare.  Driver will break up any BDs bigger
                    than 32K.  Linux limits TSO size to less than 64K, so no
                    workaround is needed.
           Impact : All TSO capable chips under VMWare.

    Enhancements
    ------------
        1) Change : Added psod_on_tx_timeout parameter for VMWare debugging.
           Impact : None.
