So, I was recently troubleshooting some interesting intermittent network connectivity issues. Here’s the situation:
Upon examination of the environment, the following issues presented on three hosts in the problem cluster, which consisted entirely of new Gen8 DL380p servers running ESXi 5.0.0:
- CDP was not functioning on two vSwitch uplinks that were not trunked and were plugged into physical switch ports in access mode
- CDP was functioning only sporadically on portgroups with VLAN tagging
- The observed IP address ranges on the network adapters kept changing every 10 minutes or so
- VM traffic was intermittently failing entirely (without any apparent failure of the network link)
- The VMkernel log was filled with messages pertaining to the NetQ feature
- No changes were being made on the physical network switches, and the switches were configured properly
- The tg3 driver version on the 331FLR was 3.123b.v50.1-1OEM.500.0.0.472560 (see the command sketch just below)
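For reference, here is a rough sketch of how this sort of thing is checked from the ESXi 5.0 shell, assuming vmnic0 is one of the 331FLR ports (adjust the vmnic name to suit):

# Full VIB version string of the installed tg3 driver
esxcli software vib list | grep tg3
# Driver name and version as reported by the NIC itself
ethtool -i vmnic0
# Watch the VMkernel log for NetQueue-related messages
tail -f /var/log/vmkernel.log | grep -i netq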
Articles of Note:
- The official KB: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2035701
- An excellent WordPress write-up from a fellow consultant: http://rcmtech.wordpress.com/2012/08/15/vmware-esxesxi-issues-with-broadcom-5719-quad-port-gb-nic/
- The updated (4/13/2012) driver: https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXi50-Broadcom-tg3-3123bv501&productId=229&download=true&fileId=2501008005&secureParam=&downloadType=#product_downloads
- The quickspecs on the 331FLR: http://h18000.www1.hp.com/products/quickspecs/14214_na/14214_na.pdf
These symptoms exactly match the problem described in the KB article, and they would be especially pronounced in a VMware cluster with high network utilization, which was precisely the case in the environment I was troubleshooting.
We followed the article’s recommendation and performed the following workaround (a rough command sketch follows the steps):
- Disabled NetQ on the network adapter on one of the hosts, following the official KB
- Verified from the command line that NetQ was disabled
- Verified that the VMkernel log was no longer filling with messages pertaining to the NetQ feature
- Tested ping traffic on a test VM on the host where we had disabled NetQ; there were no ping drops
- vMotioned the test VM to a host that, at the time, still had NetQ enabled, and saw several ping drops
- vMotioned the test VM back and saw no dropped packets once the vMotion completed
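The authoritative procedure is in the official KB linked above; from memory, on ESXi 5.0 the NetQueue disable and verification amount to roughly the following (the kernel setting name is from my notes, so defer to the KB if they differ, and the change requires a host reboot to take effect):

# Disable NetQueue at the VMkernel level (reboot required)
esxcli system settings kernel set --setting="netNetqueueEnabled" --value="FALSE"
# After the reboot, confirm the setting took
esxcli system settings kernel list | grep -i netqueue
# Confirm the VMkernel log is no longer filling with NetQueue messages
tail -f /var/log/vmkernel.log | grep -i netq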
The question I had for VMware support was: does the updated version of the driver linked above (tg3-3.123b.v50.1-682322.zip) permanently address this issue?
The response was:
Since the release notes for the updated driver (tg3-3.123b.v50.1-682322.zip) above make no mention of fixing this specific issue, we have to assume that the updated driver does not permanently remediate the situation. This means the only method of addressing this issue is to do exactly what we have already done in the documented workaround. VMware still recommended updating the driver simply for the sake of being on the latest version, but noted that it will not specifically address the issue.
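For completeness, installing the updated driver from the offline bundle is straightforward. A rough sketch, with the host in maintenance mode first and the datastore path below being only an example:

# Install the offline bundle (path is an example; use wherever the zip was uploaded)
esxcli software vib install -d /vmfs/volumes/datastore1/tg3-3.123b.v50.1-682322.zip
# Reboot the host, then confirm the new driver version is active
esxcli software vib list | grep tg3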
Release notes:
v3.123b (April 03, 2012)
========================
Fixes
-----
1) Problem: (CQ62172) 5720 hangs when running software iSCSI with
TSO enabled.
Cause : TSO packets with total size equal to or bigger than 64K
are transmitted in this setup. Hardware has a limitation
that individual BDs cannot approach 64K in TSO.
Change : Set dma_limit to 32K for all TSO capable chips when
compiled for VMWare. Driver will break up any BDs bigger
than 32K. Linux limits TSO size to less than 64K, so no
workaround is needed.
Impact : All TSO capable chips under VMWare.
Enhancements
------------
1) Change : Added psod_on_tx_timeout parameter for VMWare debugging.
Impact : None.
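As an aside, if VMware support ever asks for the psod_on_tx_timeout debug parameter mentioned in the enhancements above, tg3 module parameters on ESXi 5.x are set along these lines (the value of 1 here is my assumption for "enabled", so use whatever value support actually specifies):

# Set the tg3 debug parameter (value shown is an assumption), then reboot
esxcli system module parameters set -m tg3 -p "psod_on_tx_timeout=1"
# Confirm the parameter is set
esxcli system module parameters list -m tg3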