[Linux-PowerEdge] R720xd intermittent NIC failure

Lars Hecking lhecking at users.sourceforge.net
Tue Apr 8 05:54:51 CDT 2014

 We have been observing intermittent NIC failures on a number of R720xd
 servers. They are all running CentOS 6.4 and use either the ixgbe driver
 that ships with this OS, 3.9.15-k, or a newer version from Intel, 3.18.7.

 These machines have four builtin NICs on the system board, 2x Intel
 X540-AT2 10Gb (8086:1528) and 2x Intel I350 1Gb (8086:1521). The 1Gb
 interfaces are unused, and the 10Gb interfaces are bonded into Nexus
 switches in 802.3ad (LACP) mode.

 This happens intermittently:

Apr  5 06:30:55 hn kernel: ixgbe 0000:01:00.0: em1: NIC Link is Down
Apr  5 06:30:55 hn kernel: bonding: bond0: link status definitely down for interface em1, disabling it
Apr  5 06:30:59 hn kernel: ixgbe 0000:01:00.0: em1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Apr  5 06:30:59 hn kernel: bond0: link status definitely up for interface em1, 10000 Mbps full duplex.

 There have been around 10 such events in the past four weeks, at irregular
 intervals. The outage always lasts four seconds, without exception.
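
 For anyone wanting to compare against their own logs, here is a rough
 sketch of how the outage durations could be pulled out of syslog. It only
 assumes the ixgbe message format quoted above; the function names and the
 default year are my own choices (syslog timestamps carry no year), not
 part of any tool.

```python
import re
from datetime import datetime

# Matches the ixgbe link messages in the format quoted above, e.g.
# "Apr  5 06:30:55 hn kernel: ixgbe 0000:01:00.0: em1: NIC Link is Down"
LINK_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+ixgbe\s+\S+\s+"
    r"(?P<iface>\w+):\s+NIC Link is (?P<state>Up|Down)"
)


def parse_ts(ts, year=2014):
    # Assumption: all entries fall in the given year, since syslog omits it.
    normalized = " ".join(ts.split())  # collapse the padded day field
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def link_outages(lines, year=2014):
    """Pair each 'Link is Down' with the next 'Link is Up' per interface.

    Returns a list of (iface, down_time, up_time, seconds) tuples.
    """
    down_since = {}
    outages = []
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue  # bonding messages and unrelated lines are skipped
        ts = parse_ts(m.group("ts"), year)
        iface = m.group("iface")
        if m.group("state") == "Down":
            # Keep the first Down if several are logged in a row.
            down_since.setdefault(iface, ts)
        elif iface in down_since:
            start = down_since.pop(iface)
            outages.append((iface, start, ts, (ts - start).total_seconds()))
    return outages
```

 Feeding it the lines of /var/log/messages should yield one tuple per
 down/up pair, which makes it easy to check whether every event really is
 four seconds long.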

 Occasionally, perhaps once every three to four months, the second bonded
 interface goes down while the first is still down. The total loss of
 connectivity for the host is no longer than four seconds, but even such a
 brief outage has a detrimental effect on the storage cluster software
 running on those machines, and the virtual disks need to be checked and
 re-added into the cluster manually.
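
 To spot the cases where both bond members were down at once, something
 along these lines could be used. This is only a sketch: it takes
 (interface, down_time, up_time) tuples as input, and the function name is
 hypothetical. The interface name "em2" and the timestamps in the example
 are made up for illustration, as the log excerpt above shows only em1.

```python
from datetime import datetime


def overlapping_outages(outages):
    """Return sorted (start, end) windows where two different interfaces
    were down at the same time.

    `outages` is a list of (iface, down_time, up_time) tuples.
    """
    windows = []
    for i, (if_a, start_a, end_a) in enumerate(outages):
        for if_b, start_b, end_b in outages[i + 1:]:
            if if_a == if_b:
                continue  # same interface, not a loss of both links
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:
                windows.append((start, end))
    return sorted(windows)


# Hypothetical example data, not taken from real logs.
example = [
    ("em1", datetime(2014, 4, 5, 6, 30, 55), datetime(2014, 4, 5, 6, 30, 59)),
    ("em2", datetime(2014, 4, 5, 6, 30, 57), datetime(2014, 4, 5, 6, 31, 1)),
]
for start, end in overlapping_outages(example):
    print(start, end, (end - start).total_seconds())
```

 With only two members in the bond, every window returned corresponds to a
 complete loss of connectivity for the host.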

 Out of a total deployment of 16 R720xd, around 6 have logged such messages in
 the past four weeks. Has anyone seen this before?
