Hard lockups on PE 6600 (SOLVED - maybe) [also PE2x50 lockups and pe2605 console hang]

Rechenberg, Andrew arechenberg at shermfin.com
Mon Nov 11 09:14:01 CST 2002


It appears that the fix for my system lockups was to change the Broadcom
NIC driver.  Dell seems to configure its pre-installed Red Hat systems
with the bcm5700 kernel module, whereas a fresh Red Hat 7.3 install uses
the tg3 NIC driver.

On my particular system the lockups would occur during times of high
disk activity along with high network activity.  We do network-based
backups, so backups would trigger the problem in our case.

On Friday I switched to the bcm5700 driver (version 2.2.26) that comes
with the 2.4.18-17.7.xbigmem Red Hat kernel.  The box has stayed up all
weekend and has been backed up successfully twice, with no issues.

If your box is experiencing problems with full lockups (no network
access, console locked up), try using the bcm5700 driver instead of the
tg3 driver.  I believe that changing the driver is the fix, but I'll be
more sure when the box has been up for a week with no problems :)
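
For anyone who wants to try the same change, here is a rough sketch of
how the swap might be done.  It assumes the Broadcom ports are selected
by "alias ethN tg3" lines in /etc/modules.conf; the function name and
the backup suffix are only illustrative, so check your own configuration
before running anything like this.

```shell
#!/bin/sh
# Sketch only: point every "alias ethN tg3" line at bcm5700 instead.
# Assumes the aliases live in /etc/modules.conf (pass another path to
# override).  The original file is kept as <file>.tg3-backup.
switch_nic_alias() {
    conf="${1:-/etc/modules.conf}"
    cp "$conf" "$conf.tg3-backup" || return 1
    sed 's/^\(alias[[:space:]][[:space:]]*eth[0-9][0-9]*[[:space:]][[:space:]]*\)tg3[[:space:]]*$/\1bcm5700/' \
        "$conf.tg3-backup" > "$conf"
}

# Typical use, from the local console because the network will drop:
#   switch_nic_alias /etc/modules.conf
#   ifdown eth0; rmmod tg3; modprobe bcm5700; ifup eth0
#   lsmod | grep -E 'tg3|bcm5700'    # confirm which module is loaded
```

Lines that do not mention tg3 are left untouched, and copying the backup
file over the edited one puts the old driver back.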

Good luck,
Andy.


-----Original Message-----
From: Rechenberg, Andrew 
Sent: Thursday, November 07, 2002 10:23 AM
To: linux-poweredge at dell.com
Subject: Hard lockups on PE 6600



I've just put a PowerEdge 6600 running Red Hat 7.3 into production and it
has locked up hard twice in the last three days.  As you can see:

Nov  7 00:05:19 cinshrcub01 ftpd[1307]: wu-ftpd - TLS settings: control allow, client_cert allow, data allow
Nov  7 00:05:19 cinshrcub01 ftpd[1307]: FTP session closed
Nov  7 00:05:25 cinshrcub01 telnetd[1308]: ttloop: peer died: EOF
Nov  7 07:19:51 cinshrcub01 syslogd 1.4.1: restart.
Nov  7 07:19:51 cinshrcub01 syslog: syslogd startup succeeded
Nov  7 07:19:51 cinshrcub01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov  7 07:19:51 cinshrcub01 kernel: dress[0xfec02000] global_irq_base[0x20])

the syslog shows nothing except a syslog restart from the reboot.  I've
checked all the logs that I know of (messages, secure, cron) and all of
them stop at or around 12:05am with no errors.  There was no panic
message on the console and no response at all at the console from
keyboard or mouse input.
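
A quick way to pull out the last entry logged before each restart is
sketched below.  It assumes the usual /var/log/messages location and the
"syslogd <version>: restart." wording seen in the excerpt above; the
function name is made up for illustration.

```shell
#!/bin/sh
# Print the last line logged before each "syslogd ...: restart." entry,
# i.e. the last thing syslog recorded before the box went down.
last_before_restart() {
    awk '/syslogd [0-9.]+: restart\./ { if (prev != "") print prev }
         { prev = $0 }' "${1:-/var/log/messages}"
}

# Example:
#   last_before_restart /var/log/messages
```

The timestamp on each printed line gives a rough lower bound on when the
hang happened, which can then be lined up against the Orca graphs.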

There seems to be no correlation in time or system activity between the
two hangups.  The first one appeared to happen around 2:30am and last
night's happened around 12:05am.

This system is running the RDBMS UniVerse version 9.6.2.4 from
IBM/Ardent/Informix.  It is running the latest errata kernel from Red
Hat (2.4.18-17.7.xbigmem).  The system is a quad Xeon 1.4GHz,
HyperThreading enabled, with 8GB RAM, one PERC3/DC controlling internal
disk arrays and one PERC3/QC controlling two external PowerVault 220S.
The only kernel changes I've made are runtime via /proc:

# This will set the Shared Memory Maximum to 64MB
/bin/echo "67108864" > /proc/sys/kernel/shmmax

# Set the semaphore kernel params (now in /proc)
/bin/echo "250  32767  32  256" > /proc/sys/kernel/sem

# Set kupdated to run every 0.6 sec and the kernel to flush
# dirty buffers every 3 seconds.  This change should improve
# interactive performance.  See
# http://www-106.ibm.com/developerworks/linux/library/l-fs8/
#
/bin/echo "40 10 0 0 60 512 60 0 0" > /proc/sys/vm/bdflush
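
Since these are runtime-only changes, it can be worth reading the values
back to confirm they took effect.  The helper below is only a sketch (the
function name is hypothetical); it compares a /proc file with the
intended value and ignores whitespace differences, because the kernel
prints tabs between the fields.

```shell
#!/bin/sh
# Compare a /proc tunable with the value we meant to set.
# Unquoted expansions are deliberate: word splitting collapses any mix
# of spaces, tabs and newlines into single spaces before comparing.
check_tunable() {
    have=$(echo $(cat "$1"))
    want=$(echo $2)
    if [ "$have" = "$want" ]; then
        echo "ok   $1"
    else
        echo "DIFF $1: have '$have', want '$want'"
        return 1
    fi
}

# Example, using the values set above (64MB = 64 * 1024 * 1024 bytes):
#   check_tunable /proc/sys/kernel/shmmax "$((64 * 1024 * 1024))"
#   check_tunable /proc/sys/kernel/sem    "250  32767  32  256"
#   check_tunable /proc/sys/vm/bdflush    "40 10 0 0 60 512 60 0 0"
```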

I run procallator and Orca, so I do have some information about load
average, context switches, network and disk activity up to the crash,
and the UDP traffic before both crashes increases dramatically.  

Another interesting fact is that Red Hat 7.3 detected the Broadcom NICs
in the box and used the tg3 driver instead of the bcm5700 driver.

Has anyone experienced any issues with the 6600 and 7.3 with the latest
bigmem kernel?  Is there anything I can do (maybe kernel profiling, or
the like) to get some more information about what is causing these
lockups?  Does anyone think that changing the NIC driver to bcm5700 will
make any difference?

Any help or suggestions are appreciated.

Regards,
Andrew Rechenberg
Infrastructure Team, Sherman Financial Group
arechenberg at shermanfinancialgroup.com
Phone: 513.707.3809
Fax:   513.707.3838
