[Linux-PowerEdge] R720xd intermittent NIC failure
lhecking at users.sourceforge.net
Tue Apr 8 05:54:51 CDT 2014
We have been observing intermittent NIC failures on a number of R720xd
servers. They are all running CentOS 6.4 and use either the ixgbe driver
that ships with this OS (3.9.15-k) or a newer version from Intel (3.18.7).
These machines have four builtin NICs on the system board, 2x Intel
X540-AT2 10Gb (8086:1528) and 2x Intel I350 1Gb (8086:1521). The 1Gb
interfaces are unused, and the 10Gb interfaces are bonded into Nexus
switches in 802.3ad (LACP) mode.
This happens intermittently:
Apr 5 06:30:55 hn kernel: ixgbe 0000:01:00.0: em1: NIC Link is Down
Apr 5 06:30:55 hn kernel: bonding: bond0: link status definitely down for interface em1, disabling it
Apr 5 06:30:59 hn kernel: ixgbe 0000:01:00.0: em1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Apr 5 06:30:59 hn kernel: bond0: link status definitely up for interface em1, 10000 Mbps full duplex.
There have probably been around 10 such events in the past four weeks, at
irregular intervals. The outage always lasts four seconds, without exception.
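To tally these events and confirm the four-second duration across hosts,
the Down/Up pairs can be extracted from syslog. The following is only a rough
sketch: it assumes the message format shown above, and since syslog
timestamps carry no year, the year is passed in as a parameter. The names
link_outages and parse_stamp are made up for illustration.

```python
import re
from datetime import datetime

# Matches the ixgbe link messages in the format quoted in this post.
LINK_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"ixgbe\s+\S+\s+(?P<iface>\S+):\s+NIC Link is (?P<state>Up|Down)"
)


def parse_stamp(stamp, year):
    # Syslog timestamps have no year, so the caller supplies one (assumption).
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def link_outages(lines, year=2014):
    """Return (iface, down_time, up_time, seconds) for each Down/Up pair."""
    down_since = {}  # iface -> time of the first unmatched "Down"
    outages = []
    for line in lines:
        m = LINK_RE.match(line)
        if not m:
            continue
        when = parse_stamp(m["stamp"], year)
        iface = m["iface"]
        if m["state"] == "Down":
            down_since.setdefault(iface, when)
        elif iface in down_since:
            start = down_since.pop(iface)
            outages.append((iface, start, when, (when - start).total_seconds()))
    return outages


if __name__ == "__main__":
    # /var/log/messages is the CentOS default; adjust the path as needed.
    with open("/var/log/messages") as log:
        for iface, start, end, secs in link_outages(log):
            print(f"{iface}: down {start} up {end} ({secs:.0f}s)")
```

Fed the four log lines quoted above, this should report a single em1 outage
of 4 seconds.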
Occasionally, perhaps once every three to four months, the second bonded
interface goes down while the first is still down. The host then loses
connectivity entirely, for no longer than four seconds, but even such a
brief outage has a detrimental effect on the storage cluster software running
on those machines, and the virtual disks need to be checked and re-added to
the cluster manually.
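The double failure described above amounts to an overlap between outage
windows on the two bonded interfaces. A minimal sketch for finding such
overlaps follows, assuming the outage windows are already available as
(interface, start, end) tuples. The function name, the em2 interface name
and the timestamps in the example are hypothetical and not taken from the
logs above.

```python
from datetime import datetime
from itertools import combinations


def overlapping_outages(outages):
    """Return (start, end, seconds) for windows where two different
    interfaces were down at the same time.

    outages: list of (iface, start, end) tuples with datetime values.
    """
    overlaps = []
    for (if_a, s_a, e_a), (if_b, s_b, e_b) in combinations(outages, 2):
        if if_a == if_b:
            continue  # two outages on the same interface are not a total loss
        start = max(s_a, s_b)
        end = min(e_a, e_b)
        if start < end:
            overlaps.append((start, end, (end - start).total_seconds()))
    return overlaps


if __name__ == "__main__":
    # Hypothetical example data, for illustration only.
    example = [
        ("em1", datetime(2014, 4, 5, 6, 30, 55), datetime(2014, 4, 5, 6, 30, 59)),
        ("em2", datetime(2014, 4, 5, 6, 30, 57), datetime(2014, 4, 5, 6, 31, 1)),
    ]
    for start, end, secs in overlapping_outages(example):
        print(f"both links down from {start} to {end} ({secs:.0f}s)")
```

Any window this reports is a period in which the bond had no active slave,
which is when the storage cluster would lose connectivity.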
Of the 16 R720xd servers deployed, around 6 have logged such messages in
the past four weeks. Has anyone seen this before?