Hard lockups on PE 6600

Rechenberg, Andrew arechenberg at shermfin.com
Thu Nov 7 09:23:00 CST 2002

I've just put a PowerEdge 6600 running Red Hat 7.3 into production and it
has locked up hard two times in the last three days.  As you can see:

Nov  7 00:05:19 cinshrcub01 ftpd[1307]: wu-ftpd - TLS settings: control
allow, client_cert allow, data allow
Nov  7 00:05:19 cinshrcub01 ftpd[1307]: FTP session closed
Nov  7 00:05:25 cinshrcub01 telnetd[1308]: ttloop: peer died: EOF
Nov  7 07:19:51 cinshrcub01 syslogd 1.4.1: restart.
Nov  7 07:19:51 cinshrcub01 syslog: syslogd startup succeeded
Nov  7 07:19:51 cinshrcub01 kernel: klogd 1.4.1, log source = /proc/kmsg
Nov  7 07:19:51 cinshrcub01 kernel: dress[0xfec02000]

the syslog shows nothing except a syslog restart from the reboot.  I've
checked all the logs I know of (messages, secure, cron) and all of them
stop at or around 12:05am with no errors.  There was no panic message on
the console, and the console did not respond at all to keyboard or mouse.
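
A minimal sketch of how the last entry written before each restart can be
pulled out of a log. The function name last_before_restart is made up for
illustration, and the pattern assumes the restart line looks like the
"syslogd 1.4.1: restart." line in the excerpt above.

```shell
# Hypothetical helper: print the line that immediately precedes each
# "syslogd <version>: restart." entry in the given log file.
last_before_restart() {
    awk '/syslogd [0-9.]+: restart\./ { if (prev != "") print prev }
         { prev = $0 }' "$1"
}

# Example (paths are the usual Red Hat log locations):
#   for f in /var/log/messages /var/log/secure /var/log/cron; do
#       echo "== $f"; last_before_restart "$f"
#   done
```

Run against each log, this shows the final timestamp recorded before the
box went down.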

There seems to be no correlation in time or system activity between the
two hangups.  The first one appeared to happen around 2:30am and last
night's happened around 12:05am.

This system is running the RDBMS UniVerse version from
IBM/Ardent/Informix.  It is running the latest errata kernel from Red
Hat (2.4.18-17.7.xbigmem).  The system is a quad Xeon 1.4GHz,
HyperThreading enabled, with 8GB RAM, one PERC3/DC controlling internal
disk arrays and one PERC3/QC controlling two external PowerVault 220S.
The only kernel changes I've made are runtime via /proc:

# This will set the Shared Memory Maximum to 64MB
/bin/echo "67108864" > /proc/sys/kernel/shmmax

# Set the semaphore kernel params (now in /proc)
/bin/echo "250  32767  32  256" > /proc/sys/kernel/sem

# Set kupdated to run every 0.6 sec and the kernel to flush
# dirty buffers every 3 seconds.  This change should improve
# interactive performance.  See
# http://www-106.ibm.com/developerworks/linux/library/l-fs8/
/bin/echo "40 10 0 0 60 512 60 0 0" > /proc/sys/vm/bdflush
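
A quick sanity check of the values above in more readable units, as a
sketch only. It assumes HZ=100 (so one jiffy is 10 ms) and that the fifth
and sixth bdflush fields are the kupdated interval and the buffer age, as
in the 2.4 sysctl layout; the helper name bdflush_summary is made up for
illustration.

```shell
# Assumption: 100 jiffies per second (2.4 kernel on i386).
HZ=100

# Hypothetical helper: convert fields 5 (interval) and 6 (age_buffer)
# of a nine-field bdflush string from jiffies to seconds.
bdflush_summary() {
    printf '%s\n' "$1" | awk -v hz="$HZ" \
        '{ printf "interval=%.2fs age_buffer=%.2fs\n", $5 / hz, $6 / hz }'
}

bdflush_summary "40 10 0 0 60 512 60 0 0"

# shmmax is given in bytes; convert to MiB.
echo "shmmax=$((67108864 / 1024 / 1024))MiB"
```

Under those assumptions the values work out to a 0.6-second interval, a
buffer age of about 5 seconds, and a 64 MiB shared memory maximum.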

I run procallator and Orca, so I do have some information about load
average, context switches, network and disk activity up to the crash.
UDP traffic increases dramatically before both crashes.
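
For reference, the raw UDP counters can also be read directly, as a rough
sketch. It assumes the usual /proc/net/snmp layout, where a "Udp:" header
line with the field names is followed by a "Udp:" line with the values;
the helper name udp_counters is made up for illustration.

```shell
# Hypothetical helper: print name=value pairs for the Udp counters in an
# snmp-style file (defaults to /proc/net/snmp).
udp_counters() {
    awk '$1 == "Udp:" {
             if (!seen) {
                 # First Udp: line holds the counter names.
                 for (i = 2; i <= NF; i++) name[i] = $i
                 seen = 1
             } else {
                 # Second Udp: line holds the matching values.
                 for (i = 2; i <= NF; i++) print name[i] "=" $i
             }
         }' "${1:-/proc/net/snmp}"
}

# Example: sample once a minute and append to a file of your choosing.
#   while :; do date; udp_counters; sleep 60; done >> /tmp/udp.log
```

Comparing successive samples gives the datagram rate leading up to a hang.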

Another interesting fact is that Red Hat 7.3 detected the Broadcom NICs
in the box and used the tg3 driver instead of the bcm5700 driver.
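
A small sketch for confirming which driver each interface is mapped to.
It assumes the stock Red Hat 7.x setup where /etc/modules.conf holds
"alias ethN <driver>" lines; the helper name nic_drivers is made up for
illustration.

```shell
# Hypothetical helper: list "interface driver" pairs from a
# modules.conf-style file (defaults to /etc/modules.conf).
nic_drivers() {
    awk '$1 == "alias" && $2 ~ /^eth[0-9]+$/ { print $2, $3 }' \
        "${1:-/etc/modules.conf}"
}

# Example:
#   nic_drivers                       # what the config says
#   /sbin/lsmod | grep -E 'tg3|bcm5700'   # what is actually loaded
```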

Has anyone experienced any issues with the 6600 and 7.3 with the latest
bigmem kernel?  Is there anything I can do (maybe kernel profiling, or
the like) to get some more information about what is causing these
lockups?  Does anyone think that changing the NIC driver to bcm5700 will
make any difference?

Any help or suggestions are appreciated.

Andrew Rechenberg
Infrastructure Team, Sherman Financial Group
arechenberg at shermanfinancialgroup.com
Phone: 513.707.3809
Fax:   513.707.3838
