my take on lockups on 2550s, 2650s, and 66xx systems was RE: PERC3/Di failure workaround hypothesis

Steve_Boley@Dell.com Steve_Boley at Dell.com
Fri May 21 20:03:00 CDT 2004


Yeah, there have been 2 prevailing things causing lockups,
and if it's not the aacraid, tailing the logs usually won't
show any output.  A running tail will usually keep printing
to the screen after the disk subsystem is lost, so you have
to leave a terminal on that screen to see what output was on
it when the box locked.

1. Network drivers (tg3 and older versions of bcm5700)
The fix depends on the load on the system.  My observations
have been that the default bcm5700 in AS 2.1 locks under load
with large memory, and older versions of tg3 just locked regardless.
Newer tg3 versions seem to hold up until under a very extreme load,
and then the aggressive timings that Jeff Garzik put in the code
seem to be more than the 57xx chipsets can take and they start
locking.  For the latest 7.x version from Broadcom's website, a
kernel maintainer was brought in to clean up and debug the code,
and I believe it was also written to work with both mii-tool and
ethtool.  So far this has held up best under load, but teaming it
with the BASP teaming utility seems to negate that advantage.

My suggestion is to get the 7.x version and, if you are going to
team, use the native bonding module.  That gives you 2 advantages:
a module that holds up better under load, and no tainted-kernel
stigma if you need support.
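
To see which of the drivers discussed above are actually loaded on
a box, something like the following could be used.  This is only a
rough sketch: it assumes the modules show up in /proc/modules under
the names tg3, bcm5700, basp, and bonding, which may not match every
install, and the function names are made up for illustration.

```python
# Hypothetical helper, not a Dell or Broadcom tool.  The module names
# in WATCHED are assumptions about how the drivers are listed.
WATCHED = ("tg3", "bcm5700", "basp", "bonding")


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated field is the module name.
            names.add(fields[0])
    return names


def network_report(proc_modules_text):
    """Map each watched module name to True if it is listed as loaded."""
    loaded = loaded_modules(proc_modules_text)
    return {name: name in loaded for name in WATCHED}
```

On a live system you would feed it the contents of /proc/modules,
e.g. network_report(open("/proc/modules").read()).  Seeing basp
loaded alongside bcm5700 would point at the teaming combination
described above.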

2. The aacraid timeout issues mentioned before.  One generally
effective fix is to get the massaged aacraid code that Mark Salyzyn
worked on, which changes some timings and includes a little other
code work as well, and build and run it.  Disabling
hyperthreading at either the kernel level or the BIOS level has
also helped by reducing some load on the SCSI subsystem and
therefore eliminating timeouts.  The last fix is what Matt
mentioned: disabling all read and write caching on the controller,
which totally eliminates any cache flushes, which have purportedly
been identified as the issue.

My suggestion is to try the code build first, and then, if
the lockups continue and you are sure it isn't the network
issue above or something else, disable the caching.
This at least gives you better aacraid code and gets you
ready for when a firmware fix is released to deal with this
issue.  You would then simply need to flash the firmware and
re-enable the caching options in the CLI.

Now in the 66xx series the network issue above applies but
not the aacraid one.  Since almost all of those ship with
LSI-based controllers, the advice for all RHEL releases is to
use megaraid_2002 on errata that predate the newer megaraid2,
and megaraid2 on later errata that include it.  For 2.1 that
means e38 and above, and it is in RHEL 3 from the get-go.
Make sure the latest firmware is flashed on the perc3xc and
perc4xc controllers when moving from megaraid to megaraid_2002
or megaraid2.
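
The driver advice above can be written down as a small lookup.
Treat this as a sketch under assumptions: it guesses that AS 2.1
errata kernels report a `uname -r` string like 2.4.9-e.38 and that
RHEL 3 kernels report something like 2.4.21-15.EL, and the function
name is made up.  It also does not check whether a very early
errata kernel ships megaraid_2002 at all.

```python
import re


def advised_megaraid_driver(kernel_release):
    """Return the driver name suggested for a `uname -r` style string.

    Assumed formats (not taken from the post): '2.4.9-e.NN' for
    AS 2.1 errata kernels and '2.4.21-...' for RHEL 3 kernels.
    Returns None for anything that matches neither pattern.
    """
    m = re.match(r"2\.4\.9-e\.(\d+)", kernel_release)
    if m:
        # 2.1: megaraid2 is included from e38 onward, per the post.
        return "megaraid2" if int(m.group(1)) >= 38 else "megaraid_2002"
    if kernel_release.startswith("2.4.21-"):
        # RHEL 3 has megaraid2 from the start.
        return "megaraid2"
    return None
```

For example, advised_megaraid_driver("2.4.9-e.34smp") gives
"megaraid_2002", while a kernel string it does not recognize gives
None rather than a guess.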

This in no way covers all locking issues, but it does cover the
great majority of the ones we've seen on the front lines of support.
Steve (Just a tech hangin in there) Boley



