Kernel panic/crash, bnx2 flow control flooding and network outages

Sven Ulland sveniu at opera.com
Thu Oct 27 06:56:57 CDT 2011


Over the last 6 months or so, we have seen 6-7 occasions where
a crashed M610 blade managed to take down some or all of the other
blades in the same enclosure. We have finally found the culprit to be
a combination of a kernel panic, a bnx2 driver/fw flaw and silly
switch behaviour.

I'm posting this here in case anyone else has run into, or will run
into, the same problem. Hopefully it will save some pulling of hair.
Big thanks to Leszek Urbanski, whose detailed blog post (see below)
saved us a lot of blood, sweat and tears!


Summary
=======

Symptom
-------
Typically after 200+ days of uptime, the Linux kernel (at least
2.6.26-amd64 and 2.6.32-amd64) running on Dell M610 hardware has
a small chance of crashing with a kernel panic. If a PowerConnect
M6220 blade switch (or similar, see below for details) is used, this
crash (or any other, really) triggers a partial or full outage for all
blades connected to the switch, until the crashed blade is reset or
powered off. Stacking multiple switches increases the potential impact
of the problem. Neighbouring switches with flow control tx+rx enabled
could amplify the problem.

Cause
-----
Upon kernel crashes/panics (problem 1), the Broadcom bnx2 driver for
the M610's BCM5709S NICs ends up in a state where it floods its uplink
with 802.3x flow control PAUSE frames (problem 2). The M6220 switch
(and friends), having flow control receive and transmit enabled
globally by default, is prone to propagating the PAUSE frames towards
sources that try to reach the crashed blade. This can, for example,
block a trunked uplink carrying multiple VLANs, which in turn results
in a partial or full enclosure outage (problem 3).

Resolution
----------
The kernel crash is likely fixed in recent releases, but this is not
yet confirmed, and we do not know which version introduced the fix.

The bnx2 flooding is most likely fixed in Broadcom's NXII driver
version v2.0.17j (with the corresponding firmware) -- see below. It is
confirmed through testing that it is fixed *at least* in driver
version 2.0.21 (with corresponding firmware 6.2.1 or 6.4.4), on Debian
Squeeze with the 2.6.38 backported kernel. It is also confirmed to be
fixed in Debian Wheezy with driver 2.1.6 (fw 6.4.4) on kernel 3.0.

Alternatively, disabling bnx2 flow control tx using "ethtool -A eth0
tx off" will also avoid the PAUSE flood on kernel crash.

The switch can be configured to disable flow control globally using
the "no flowcontrol" command in the global configuration scope.


Details
=======

Kernel crash
------------
We haven't gotten much closer to what might cause the kernel crash in
the first place. The exact type of crash doesn't seem to matter,
though. Triggering crashes with oopses or NMIs will also reproduce the
problem.

Interesting links:
* <URL:http://web.archive.org/web/20101219061056/https://bugzilla.kernel.org/show_bug.cgi?id=16991>
* <URL:http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=636797>
* <URL:http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=592497>
* <URL:https://bugzilla.redhat.com/show_bug.cgi?id=612861>
* <URL:https://bugzilla.redhat.com/show_bug.cgi?id=520888>
* <URL:https://bugs.launchpad.net/ubuntu/+source/linux/+bug/824304>
* <URL:https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/614853>
* <URL:http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a2df00aa33f741096e977456573ebb08eece0b6f>
* <URL:https://lkml.org/lkml/2011/4/28/87>


bnx2 flow control flood
-----------------------
See Leszek Urbanski's post [1] for details on BCM5709 and BCM56xxx
flow control behaviour. His post was the one getting us on the right
track as to what's causing the network outages, so huge thanks are in
order!

Broadcom's release notes [2] for the NXII driver contain the
following:

"""
bnx2 v2.0.18c (Sep 13, 2010) cnic 2.2.6c (Sep 13, 2010)
=========================================================
Fixes
-----
1. Problem: (CQ49832) bnx2 flow control not working

    Cause: Mistakenly disabled in firmware in 2.0.17j to fix CQ46393.

    Change: Re-enabled RV2P flow control with additional fixes to for
            a number of odd flow control issues.  New firmware versions
            are 6.0.15 for 5706/5708 and 6.0.17 for 5709.

    Impact: 5706/5708/5709.

[...]

bnx2 v2.0.17j (Aug 15, 2010) cnic 2.2.5j (Aug 15, 2010)
=========================================================
Fixes
-----
[...]
4. Problem: (CQ46393) 5706/08/09 NC-SI Traffic Stops After Host Kernel Panic

    Cause: RV2P firmware was set to drop input packets at a rate which
           is slower than input line rate when the host stops posting
           buffers. This has caused the rxp ftq to backup which
           eventually led to the rxp ftq hw to assert PAUSE and flood
           the network.

    Change: RV2P firmware was modified to disable any waiting before
            dropping the input packet when the host doesn't post
            buffers.

    Impact: 5706/08/09.
"""

In general, the Linux kernel version vs bnx2 driver version looks like this:
* 3.1-rc1: 2.1.11
* 2.6.39: 2.1.6
* 2.6.38: 2.0.21
* 2.6.37: 2.0.18 (should be fixed here)
* 2.6.36: 2.0.17 (July 18 2010 version, which should be v2.0.17b?)
* 2.6.35: 2.0.15
* 2.6.34: 2.0.9
* 2.6.33: 2.0.3
* 2.6.32: 2.0.2
* 2.6.30: 2.0.1
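
To check a given host against the list above, the driver version
strings need to be compared numerically rather than as text ("2.0.9"
is older than "2.0.18"). The following is only a sketch in Python: the
function names are made up for illustration, and the 2.0.18 threshold
is an assumption taken from the "should be fixed here" note above, not
a confirmed boundary.

```python
import re

def parse_bnx2_version(version):
    """Split a bnx2 version such as '2.0.17j' into a comparable tuple."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)([a-z]?)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised bnx2 version: {version!r}")
    major, minor, patch, suffix = match.groups()
    return (int(major), int(minor), int(patch), suffix)

def at_least(version, minimum):
    """True if 'version' is the same as or newer than 'minimum'."""
    return parse_bnx2_version(version) >= parse_bnx2_version(minimum)

# Assumption: threshold taken from the "should be fixed here" note.
ASSUMED_FIXED_IN = "2.0.18"

# Kernel/driver pairs copied from the list above.
for kernel, driver in [("2.6.32", "2.0.2"), ("2.6.37", "2.0.18"), ("2.6.38", "2.0.21")]:
    print(kernel, driver, at_least(driver, ASSUMED_FIXED_IN))
```

The letter suffix Broadcom uses (as in 2.0.17j) is compared last, so
2.0.17j still sorts below 2.0.18.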

Interesting links:
* <URL:http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1021612>
* <URL:https://support.mayfirst.org/ticket/3772>
   * <URL:https://lists.mayfirst.org/pipermail/service-advisories/2010-November/000220.html>
   * <URL:https://lists.mayfirst.org/pipermail/service-advisories/2011-January/000226.html>
* <URL:http://pg8873..com/2010/05/dell-710-broadcom-nic-running-rh-kernel.html>

[1]: Flow control flaw in Broadcom BCM5709 NICs and BCM56xxx switches
<URL:http://monolight.cc/2011/08/flow-control-flaw-in-broadcom-bcm5709-nics-and-bcm56xxx-switches/>

[2]: Broadcom NXII driver version 6.2.23
<URL:http://www.broadcom.com/support/license.php?file=NXII/linux-6.2.23.zip>


Switch flow control handling
----------------------------
As Leszek points out, the switch's PAUSE frame handling looks to be an
unfortunate default in the switch firmware, and seems to remain
unchanged *at least* from 3.1.5.2 through 4.1.1.9. Simply disabling
flow control altogether would work around the problem with bnx2 PAUSE
flooding, but with only a global on/off setting, being forced to
disable it could impact systems and protocols that need or recommend
flow control.

It would definitely help if Dell/Broadcom introduced more granular
configuration of the flow control feature, for example allowing
per-port configuration of rx, tx and autonegotiation advertisement of
flow control support.

While the BCM56xxx series switch firmware has support for
priority-based flow control (802.1Qbb), the 4.1.1.9 release notes
state that only the 8024, 8024F and M8024-k support it, leaving out
the following models: 7024, 7048, 7024P, 7048P, 7024F, 7048R,
7048R-RA, M6220, M6348, M8024.

It should be noted that Cisco disables 802.3x flow control by default
on their switches, and allows per-port rx/tx settings [3].

[3]: Nexus 5000 Series NX-OS Software Configuration Guide
<URL:http://www.cisco.com/en/US/docs/switches/datacenter/nexus5000/sw/configuration/guide/cli/QoS.html#wp1138522>


Test results
============
The following tests describe our approach to troubleshooting the flow
control problem. They are done iteratively, so each test builds on the
previous.

Test 1
------
* Summary: PAUSE frames observed on 100Mbps uplink
* Switch fw 3.1.5.2
* M6220 uplink through 3com Megahertz 574B pcmcia, 100Mbps
* First PAUSE frame number: 340
* Last PAUSE frame number: 518

Laptop set up as bridge between core network and M6220, bridging
multiple vlans (trunking). Old 3com Megahertz 574B 100Mbps pcmcia card
connected to M6220, to minimize the chance that it eats up the flow
control frames before libpcap sees them. Negotiation disabled on M6220
uplink port:

   interface ethernet 1/g17
   no negotiation

..which activates flow control (as it was disabled during
autonegotiation):

Port   Type             Duplex  Speed  Neg Link Flow Control
-----  ---------------  ------  -----  --- ---- ------------
1/g17  Gigabit - Level  Full    100    Off Up   Active

PAUSE frames generated on blade 1 using "flow-ctrl -p 65535 -i eth0
-r". Around 8 seconds after starting, PAUSE frames are observed on the
uplink, on the bridging laptop. They have pause_time=0xffff, and are
sent every ~252ms. On a 100Mbps link, per the 802.3x spec, we have
0xffff * 1/(100 megabits per second / 512 bits) = ~336ms, so the
switch probably sends these frames well in advance of the timer
expiring, as long as it observes congestion. These frames are not (and
cannot be) tagged, so they block the entire link.
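
The pause timer arithmetic can be checked with a few lines of Python.
This is just a sketch: the function name is made up for illustration,
and the only inputs are the 512-bit pause quantum from 802.3x and the
numbers from this test.

```python
PAUSE_QUANTUM_BITS = 512  # one pause quantum is 512 bit times (802.3x)

def pause_duration_ms(pause_time, link_bps):
    """Milliseconds that 'pause_time' quanta last on a link of 'link_bps'."""
    return pause_time * PAUSE_QUANTUM_BITS / link_bps * 1000

expiry_ms = pause_duration_ms(0xFFFF, 100e6)
print(f"pause_time=0xffff at 100 Mbps: {expiry_ms:.1f} ms")

# The observed refresh interval of ~252 ms as a fraction of the timer.
print(f"observed refresh / expiry: {252 / expiry_ms:.2f}")
```

At 100Mbps the timer works out to roughly 335.5ms, so the observed
~252ms refresh arrives at about three quarters of the pause time.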

Stopping the PAUSE flood makes the switch send the last PAUSE frame
with pause_time=0 (aka xon).

PAUSE frames are captured using tcpdump:
   tcpdump -envXXi eth0 'ether[12:2] == 0x8808'
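
To make it clearer what that filter matches: bytes 12-13 of an
Ethernet frame hold the EtherType, and 0x8808 is MAC Control. A PAUSE
frame is a MAC Control frame with opcode 0x0001, followed by the
16-bit pause_time. Below is a rough Python sketch that builds and
decodes such a frame. The helper names and the source MAC are made up
for illustration, and the frame is built without the FCS.

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001

def build_pause_frame(src_mac, pause_time):
    """Build a PAUSE frame (without FCS), padded to the 60-byte minimum."""
    header = PAUSE_DST + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH", PAUSE_OPCODE, pause_time)
    return (header + payload).ljust(60, b"\x00")

def parse_pause_time(frame):
    """Return pause_time if 'frame' is a PAUSE frame, otherwise None."""
    ethertype, opcode, pause_time = struct.unpack("!HHH", frame[12:18])
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PAUSE_OPCODE:
        return None
    return pause_time

# Hypothetical, locally administered source address.
src = bytes.fromhex("020000000001")
xoff = build_pause_frame(src, 0xFFFF)  # "stop sending" for the maximum time
xon = build_pause_frame(src, 0)        # "resume", as seen when the flood stops
print(hex(parse_pause_time(xoff)), parse_pause_time(xon))
```

Note that the tcpdump filter only checks the EtherType, so it would
also match any other MAC Control opcode, not just PAUSE.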


Test 2
------
* Summary: PAUSE frames observed on 1Gbps uplink
* Switch fw 3.1.5.2
* M6220 uplink through Intel 82567LM [8086:10f5] rev 3, 1Gbps
* First PAUSE frame number: 447
* Last PAUSE frame number: 927

Laptop set up as bridge between core network and M6220, bridging
multiple vlans (trunking). Recent e1000e Intel 82567LM 1Gbps connected
to M6220, to test if it eats up the PAUSE frames or not.
Autonegotiation enabled on M6220. It turns out that the PAUSE frames
are visible, regardless of enabling or disabling them on the e1000e --
as long as the autonegotiation with the switch convinces the M6220
that the other side supports it (ref "show interfaces status").

PAUSE frames generated on blade 1 using "flow-ctrl -p 65535 -i eth0
-r". Around 9 seconds after starting, PAUSE frames are observed on the
uplink, on the bridging laptop. They have pause_time=0xffff, and are
sent every ~25.2ms. On a 1Gbps link, we have 2 * 0xffff * 1/(1
gigabits per second / 512 bits) = ~67ms [4]. As before, the switch
probably sends these frames well in advance of the timer expiring (and
probably ignores multiplying with 2, just in case?), as long as it
observes congestion. These frames are not (and cannot be) tagged, so
they block the entire link.

Stopping the PAUSE flood makes the switch send the last PAUSE frame
with pause_time=0 (aka xon).

[4]: 802.3-2008 part 3 annex 31B section 3.7 pp 747:
"At an operating speed of 1000Mb/s, a station shall not begin to
transmit (new) frame more than two pause_quantum bit times after the
reception of a valid PAUSE frame [...]"
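
As a quick sanity check of the guess that the switch ignores the
factor of 2, the observed refresh intervals from tests 1 and 2 can be
compared with the computed timers. This is a sketch in Python with
made-up names. It only shows that the numbers are consistent with that
guess, not how the switch firmware actually decides.

```python
QUANTUM_BITS = 512  # bits per pause quantum (802.3x)

def quanta_to_ms(quanta, link_bps):
    """Convert a number of pause quanta to milliseconds at 'link_bps'."""
    return quanta * QUANTUM_BITS / link_bps * 1000

single_1g = quanta_to_ms(0xFFFF, 1e9)   # pause_time taken at face value
doubled_1g = 2 * single_1g              # with the two-quantum allowance from [4]
single_100m = quanta_to_ms(0xFFFF, 100e6)

print(f"1 Gbps:   single {single_1g:.2f} ms, doubled {doubled_1g:.2f} ms")
print(f"1 Gbps:   25.2 ms / single  = {25.2 / single_1g:.2f}")
print(f"1 Gbps:   25.2 ms / doubled = {25.2 / doubled_1g:.2f}")
print(f"100 Mbps: 252 ms / single   = {252 / single_100m:.2f}")
```

Measured against the undoubled timer, the refresh lands at about 0.75
of the pause time at both link speeds, which fits a switch that
refreshes at a fixed fraction of pause_time and does not apply the
factor of 2.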


Test 3
------
* Summary: Forced blade crash causes PAUSE frame flood
* Switch fw 3.1.5.2
* Broadcom NXII fw: 5.0.13 (via iDRAC -- what's this?)
* Broadcom NXII fw: 5.0.11 (via OS)
* Broadcom NXII driver: 2.0.2
* M6220 uplink through Intel 82567LM [8086:10f5] rev 3, 1Gbps
* First PAUSE frame number: 4121
* Last PAUSE frame number: 6773

Laptop as bridge, like previous test.

Instead of using flow-ctrl, blade 1 is crashed by first enabling panic
on unknown NMI, "sysctl kernel.unknown_nmi_panic=1", then issuing an
NMI from the blade's iDRAC power control. After some time (unclear
whether it depends on elapsed time or on how full the blade 1 input
queue is), PAUSE frames presumably start flooding from blade 1. The
switch propagates this to the uplink, and they are visible on the
bridge. As before, the inter-frame time is ~25.2ms.

Issuing a CMC "serveraction -m server-1 hardreset" immediately fixes
the problem. Configuring the switch with "no flowcontrol" fixes the
issue.

Re-enabling flowcontrol on the switch doesn't seem to let the problem
trigger again, oddly enough. Even after sending a lot of packets to
the crashed blade, no PAUSE frames are observed. This could be due to
me disabling autonegotiation of flow control on the bridge, a change
that would only take effect at the next autonegotiation.


Test 4
------
* Summary: Switch fw 4.1.1.9 shows the same behaviour

Continuing with the same hardware setup as the previous test (1Gbps
uplink, etc).

The switch firmware is upped to 4.1.1.9, and the behaviour is the same
as before: PAUSE frames are observed on the uplink, regardless of
using the flow-ctrl or NMI crash approaches.


Test 5
------
* Summary: New NXII firmware, but same behaviour
* Broadcom NXII fw: 6.4.4 (via iDRAC -- what's this?)
* Broadcom NXII fw: 5.2.3, NCSI 2.0.11 (via OS)
* Broadcom NXII driver: 2.0.2 (unchanged)

Upgrading Broadcom NXII firmware from 5.0.13 to 6.4.4 in the iDRAC
lifecycle management controller. Is this firmware the boot code of the
NIC, or something else? After the update, the OS/Linux reports
firmware version 5.2.3 (NCSI 2.0.11), up from 5.0.11. The driver
version is still 2.0.2, as expected.

Forcing NMI crash: Same behaviour as before, with PAUSE frames being
sent on the uplink.


Test 6
------
* Summary: New NXII driver (and firmware) fixes the flooding
* Broadcom NXII fw: 6.4.4 (via iDRAC -- what's this?)
* Broadcom NXII fw: 6.4.4 bc 5.2.3 NCSI 2.0.11 (via OS)
* Broadcom NXII driver: 2.1.6

Upgrading to Debian Wheezy with kernel 3.0.0, providing a more recent
driver which also is set to use a newer firmware.

Forcing NMI crash: It seems that the server does *not* start flooding
PAUSE frames. This seems to indicate that the bug in the bnx2
driver/fw has been resolved at some point between the previous and
this version. It will take some time to track down the exact version
to verify that it's indeed fixed in v2.0.17j, like the Broadcom
release notes state.

Using flow-ctrl: Same problematic behaviour as before, but this is to
be expected, as the switch firmware is the same as in test 4.


Test 7
------
* Summary: Debian Squeeze 2.6.38-bpo and bnx2 driver 2.0.21: no flood
* Broadcom NXII fw: 6.4.4 (via iDRAC -- what's this?)
* Broadcom NXII fw: 6.4.4 bc 5.2.3 NCSI 2.0.11 (via OS)
* Broadcom NXII driver: 2.0.21

Reinstalling Debian Squeeze, then upgrading the kernel to
2.6.38-bpo.2-amd64. The bnx2 driver there is version 2.0.21, and it
requires the firmware files bnx2-mips-09-6.2.1.fw and
bnx2-mips-06-6.2.1.fw. Those were installed from the
firmware-bnx2_0.32~bpo60 Lenny backported package, as they have not
yet been made available to Squeeze backports [5].

Forcing NMI crash: No PAUSE frame flooding, so this also builds
confidence that the problem was fixed in v2.0.17j.

[5]: Re: Problem with 2.6.38 backport and bnx2 card
<URL:http://lists.debian.org/debian-backports/2011/07/msg00063.html>


Test 8
------
* Summary: Disabling bnx2 flow control tx: no flood
* Broadcom NXII driver: 2.0.2

Booting Squeeze on 2.6.32 again, with driver 2.0.2 (fw 5.2.3), and
disabling tx flow control using "ethtool -A eth0 tx off".

Disabling tx flow control, at least with driver 2.0.2, sometimes takes
a few tries, possibly having to disable flow control autonegotiation
first. Also, "ethtool -a eth0" by default reports "Autonegotiate:
off", but the switch has registered flow control as enabled on that
port, so I'm not sure the ethtool status can be trusted.

Forcing NMI crash: No PAUSE frame flooding, so disabling tx flow
control on the bnx2 is an alternative solution.
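
Since the setting sometimes takes a few tries, it may be worth
checking the reported state after each attempt. Below is a rough
Python sketch that parses "ethtool -a" style output. The sample text
is only an illustration of the usual output layout, not something
captured from our blades, and the function name is made up. Given the
doubts above about how far the ethtool status can be trusted with
driver 2.0.2, the switch's own view ("show interfaces status") should
be checked as well.

```python
def parse_pause_params(text):
    """Parse 'ethtool -a' style output into a dict of booleans."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # The "Pause parameters for ethX:" header has no value; skip it.
        if sep and value.strip():
            params[key.strip().lower()] = value.strip().lower() == "on"
    return params

# Illustrative sample only (assumed layout of "ethtool -a eth0" output).
SAMPLE = """Pause parameters for eth0:
Autonegotiate:\toff
RX:\t\ton
TX:\t\toff
"""

state = parse_pause_params(SAMPLE)
print(state)
if state.get("tx"):
    print("tx flow control still reported as on, try again")
```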


Reference docs
==============
* <URL:http://www.broadcom.com/collateral/pg/NetXtremeII-PG203-R.pdf>
* <URL:http://standards.ieee.org/about/get/802/802.3.html>
* <URL:http://support.dell.com/support/edocs/network/PCM6220/en/cli/PDF/cli.zip>
* <URL:http://homepage.cem.itesm.mx/raulm/pub/mimicry/opodis06.pdf> (section 5)

best regards,
Sven Ulland



More information about the Linux-PowerEdge mailing list