Severe Reliability & Performance Problems with a PE4600

Mark Cuss mcuss at cdlsystems.com
Mon May 17 12:57:00 CDT 2004


Hi All

I'm running a PE4600 as the main Linux applications server for our small
software company (about 30 users).  Over the past few weeks this normally
very reliable machine has been giving trouble.  It becomes very slow to
respond to network pings (a response can take up to a second or two instead
of microseconds) and even to a direct console login: it takes a minute or
two just to get a prompt after logging in as root at the console (not in X,
just good old runlevel 3).

Hardware and software details:  PE4600 with dual 2.2 GHz Xeons and 2 GB RAM,
one 18 GB hard drive and one HP DLT SCSI tape drive running off a 29160
SCSI card (the onboard one is disabled).  The network card is a D-Link
DGE-550T gigabit Ethernet card; I've disabled the onboard one.  The OS is
RH8 with a custom-built 2.4.25 kernel.

This happened last week and the machine completely croaked before I could do
any postmortem, so I had to power cycle it.  It's up to its old tricks again
today, and I have gathered the following information, which is hopefully of
help:

1)  Process listing:
Here are the first few lines of output from top:

11:29am  up 2 days, 22:35, 38 users,  load average: 36.10, 32.19, 25.71
283 processes: 281 sleeping, 2 running, 0 zombie, 0 stopped
CPU0 states:  0.0% user, 99.0% system,  0.0% nice,  0.1% idle
CPU1 states:  4.0% user,  6.0% system,  0.0% nice, 88.1% idle
CPU2 states:  0.0% user, 16.0% system,  0.0% nice, 83.0% idle
CPU3 states:  0.0% user,  4.0% system,  0.0% nice, 95.0% idle
Mem:  2068768K av, 1917860K used,  150908K free,       0K shrd,   12756K buff
Swap: 2040244K av,   30308K used, 2009936K free                 1578100K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
    3 root      20  19     0    0     0 RWN  71.7  0.0  12:25 ksoftirqd_CPU0
 9938 raymond    9   0 23380  22M 18780 S    32.1  1.1   6:39 protocolsynthpa
 9944 raymond    9   0 12072  11M  9228 S     7.3  0.5   4:36 protocolsynthdl
 4191 root      19   0  1204 1204   844 R     6.4  0.0   0:00 top



You'll see that "ksoftirqd_CPU0" is pretty much pinning one CPU.  I noticed
this just before the machine died last week as well.  As I understand it,
the system runs one of these processes for each CPU, so this machine has 4
(2 hyperthreaded CPUs).  I've never seen these processes use more than 0%
CPU, so I'm thinking this is pretty fishy.  Any comments?
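
A busy ksoftirqd thread may point at one device generating far more
interrupts than usual, so it could help to see which IRQ line is growing
fastest on CPU0.  The following is only a rough sketch, assuming the usual
/proc/interrupts layout (a header row of CPU names, then one row per IRQ)
and a Python interpreter on the box.  The function names are made up for
illustration.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: (device, [per-CPU counts])}."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        per_cpu = fields[:ncpus]
        # Rows such as ERR carry a single total rather than per-CPU counts; skip them.
        if len(per_cpu) < ncpus or not all(f.isdigit() for f in per_cpu):
            continue
        device = " ".join(fields[ncpus:])     # controller type and device name(s)
        table[label.strip()] = (device, [int(f) for f in per_cpu])
    return table


def interrupt_deltas(before, after, cpu=0):
    """Return (delta, irq_label, device) tuples for one CPU, largest delta first."""
    deltas = []
    for label, (device, counts) in after.items():
        if label in before:
            deltas.append((counts[cpu] - before[label][1][cpu], label, device))
    deltas.sort(reverse=True)
    return deltas


def sample_cpu(cpu=0, seconds=5, path="/proc/interrupts"):
    """Read the interrupt table twice, `seconds` apart, and return the deltas."""
    with open(path) as f:
        first = parse_interrupts(f.read())
    time.sleep(seconds)
    with open(path) as f:
        second = parse_interrupts(f.read())
    return interrupt_deltas(first, second, cpu)
```

Calling sample_cpu(0) while the machine is misbehaving and looking at the
first few entries should show whether the gigabit card, the SCSI card, or
something else accounts for most of the interrupts on CPU0.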



2)  Kernel Log:

This system accesses all of its data via NFS from another server (our file
server, called "hal" in the log below).  The two machines are connected to
each other through a gigabit switch.  The file server is a PE2650 with a
single 3.06 GHz CPU, connected to a PV220 disk array.  An excerpt from the
PE4600 log file:



May 17 11:19:05 locutus kernel: nfs: server hal not responding, still trying
May 17 11:19:06 locutus kernel: nfs: server hal OK
May 17 11:19:52 locutus pam_rhosts_auth[4137]: denied to ge@stinger as ge: access not allowed
May 17 11:21:52 locutus kernel: nfs: server hal not responding, still trying
May 17 11:22:32 locutus kernel: nfs: server hal OK
May 17 11:22:58 locutus kernel: nfs: server hal not responding, still trying
May 17 11:23:31 locutus last message repeated 2 times
May 17 11:23:40 locutus kernel: nfs: server hal OK
May 17 11:23:43 locutus kernel: nfs: server hal OK
May 17 11:24:31 locutus kernel: nfs: server hal not responding, still trying
May 17 11:24:34 locutus pam_rhosts_auth[4152]: denied to root@hal as root: access not allowed
May 17 11:24:34 locutus rlogin(pam_unix)[4152]: authentication failure; logname= uid=0 euid=0 tty=rlogin ruser=root rhost=hal  user=root
May 17 11:24:37 locutus in.rlogind[4152]: PAM authentication failed for in.rlogind
May 17 11:24:39 locutus gconfd (root-3988): Received signal 1, shutting down cleanly
May 17 11:24:39 locutus gconfd (root-3988): Exiting
May 17 11:24:42 locutus kernel: nfs: server hal OK
May 17 11:24:43 locutus kernel: nfs: server hal not responding, still trying
May 17 11:24:49 locutus kernel: nfs: server hal not responding, still trying
May 17 11:25:01 locutus kernel: nfs: server hal OK
May 17 11:26:10 locutus kernel: nfs: server hal not responding, still trying
May 17 11:27:08 locutus kernel: nfs: server hal not responding, still trying
May 17 11:27:29 locutus kernel: nfs: server hal OK
May 17 11:27:51 locutus kernel: nfs: server hal not responding, still trying
May 17 11:27:56 locutus kernel: nfs: server hal not responding, still trying
May 17 11:28:25 locutus kernel: nfs: server hal OK
May 17 11:28:51 locutus kernel: nfs: server hal not responding, still trying
May 17 11:29:11 locutus kernel: nfs: server hal not responding, still trying
May 17 11:29:54 locutus kernel: nfs: server hal OK
May 17 11:29:56 locutus last message repeated 2 times
May 17 11:30:07 locutus login(pam_unix)[3894]: session closed for user root
May 17 11:30:15 locutus login(pam_unix)[4195]: session opened for user root by LOGIN(uid=0)
May 17 11:30:15 locutus  -- root[4195]: ROOT LOGIN ON tty1
May 17 11:30:32 locutus kernel: nfs: server hal OK
May 17 11:30:52 locutus su(pam_unix)[4244]: session opened for user mark by root(uid=0)
May 17 11:31:32 locutus kernel: nfs: server hal OK
May 17 11:31:32 locutus kernel: nfs: server hal OK
May 17 11:34:19 locutus login(pam_unix)[939]: session opened for user root by LOGIN(uid=0)
May 17 11:34:19 locutus  -- root[939]: ROOT LOGIN ON tty2
May 17 11:34:41 locutus modprobe: modprobe: Can't locate module nls_cp437
May 17 11:34:41 locutus modprobe: modprobe: Can't locate module nls_iso8859-1



As you can see, there are a lot of NFS access problems here.  I'm not sure
whether they are related to the instability of the system or not.
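
To get a feel for how long each NFS stall lasts, the "not responding" and
"OK" messages can be paired up and timed.  This is a minimal sketch,
assuming one syslog entry per line in the format shown in the excerpt
(wrapped lines would need to be joined first).  The function name is
hypothetical, and "last message repeated" lines are simply ignored.

```python
import re
from datetime import datetime

# Matches kernel NFS client messages such as:
#   May 17 11:19:05 locutus kernel: nfs: server hal not responding, still trying
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: nfs: server (\S+) (not responding|OK)"
)


def nfs_outages(lines):
    """Return (server, start, end, seconds) for each not-responding -> OK interval."""
    down_since = {}
    outages = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        # Syslog timestamps carry no year; strptime fills in 1900, which is
        # fine for computing differences within a single log.
        stamp = datetime.strptime(m.group(1), "%b %d %H:%M:%S")
        server, state = m.group(2), m.group(3)
        if state == "not responding":
            # Keep the earliest unanswered report as the start of the stall.
            down_since.setdefault(server, stamp)
        elif server in down_since:
            start = down_since.pop(server)
            outages.append((server, start, stamp, (stamp - start).total_seconds()))
    return outages
```

Feeding it the lines of the system log (on Red Hat that is normally
/var/log/messages, which is an assumption about this particular setup)
gives a list of stalls that can be lined up against the times when
ksoftirqd was busy.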



Other points of note:

1)  All of my servers (the 4600, the 2650, two 1400's, a Sun Enterprise 250,
a Sun Blade 1000, and a PV132T tape library) are plugged into the same
8-port gigabit switch (a Linksys EF3508 unmanaged switch).  The lights blink
like mad on this thing all the time.  It's definitely busy, but I don't know
whether it could be causing network problems.

2)  All of the machines listed above (except for the Sun Enterprise 250) are
connected to an 8-port rack-mount KVM switch (the KVM1080 shown at
http://www.cablesbyacc.com/catalog3lvl.cfm?product=KVMRM&category=KVM%20Switches).
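
Since the switch is unmanaged and offers no statistics of its own, one
rough way to look for network trouble from the server side is to check the
error and drop counters the kernel keeps for each interface.  The sketch
below assumes the 16-column /proc/net/dev layout (8 receive counters
followed by 8 transmit counters).  The counter names and function names are
labels chosen for this example.

```python
RX_NAMES = ("rx_bytes", "rx_packets", "rx_errs", "rx_drop",
            "rx_fifo", "rx_frame", "rx_compressed", "rx_multicast")
TX_NAMES = ("tx_bytes", "tx_packets", "tx_errs", "tx_drop",
            "tx_fifo", "tx_colls", "tx_carrier", "tx_compressed")


def parse_net_dev(text):
    """Parse /proc/net/dev text into {interface: {counter_name: value}}."""
    stats = {}
    for line in text.splitlines()[2:]:       # first two lines are column headers
        if ":" not in line:
            continue
        # Split on the colon: large counters can follow it with no space.
        name, _, rest = line.partition(":")
        values = [int(v) for v in rest.split()]
        if len(values) != len(RX_NAMES) + len(TX_NAMES):
            continue                         # unexpected layout, skip the row
        stats[name.strip()] = dict(zip(RX_NAMES + TX_NAMES, values))
    return stats


def suspicious_counters(stats, names=("rx_errs", "rx_drop", "rx_fifo", "rx_frame",
                                      "tx_errs", "tx_drop", "tx_colls", "tx_carrier")):
    """Return only the interfaces and counters that are non-zero."""
    flagged = {}
    for iface, counters in stats.items():
        nonzero = {k: counters[k] for k in names if counters[k] > 0}
        if nonzero:
            flagged[iface] = nonzero
    return flagged
```

Running suspicious_counters(parse_net_dev(open("/proc/net/dev").read()))
on both the PE4600 and the file server, a few minutes apart, would show
whether any of these counters are climbing.  Counters that stay at zero do
not rule out the switch, but growing ones would be a strong hint.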



I'd like to rule out a hardware problem by running some diags, but the CDs
that came with the server don't seem to have the diags on them.  I
downloaded Dell's 32-bit diags, but the tool that writes the floppy disks
in Windows keeps crashing.



I know this is a lot of information, but hopefully one of you experts can
point me in the right direction...  I'm not really a slouch when it comes to
Linux, but this is my first crack at setting up a system of this scale.  Any
help would be greatly appreciated!



Thanks!



Mark



Mark Cuss, B. Sc.
Real Time Systems Analyst
System Administrator
CDL Systems Ltd
Suite 230
3553 - 31 Street NW
Calgary, AB, Canada

Phone: 403 289 1733 ext 226
Fax: 403 282 1238
www.cdlsystems.com





More information about the Linux-PowerEdge mailing list