Severe Reliability & Performance Problems with a PE4600

Peter Mueller pmueller at sidestep.com
Mon May 17 15:13:00 CDT 2004


Hello,

> Hardware and Software details:  PE4600 with dual 2.2 GHz Xeons & 2 GB Ram,
> one 18 gig hard drive and one HP DLT SCSI tape drive running off a 29160
> SCSI Card (on board one disabled).  Using a DLink DGE-550T gigabit ethernet

I am not certain the DGE-550T is a good card.  I don't know if it has
NAPI (reduces interrupts under load) or if it is a poor hardware design.
Is the system pushing out lots of load?  Does anyone have experience
with this card?  Google isn't very helpful.
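
One way to see whether the card is flooding the box with interrupts is to
sample /proc/interrupts twice and take the difference.  Below is a rough
sketch of that idea, not a tested tool.  The function name irq_total and
the device name eth0 are my own placeholders, and it assumes the usual
/proc/interrupts layout (IRQ number, one counter per CPU, then the
controller type and device names).

```shell
#!/bin/bash
# Sketch: estimate interrupts/s for one device from /proc/interrupts.
# irq_total and the "eth0" device name are illustrative assumptions.

# Sum the per-CPU counts on every /proc/interrupts line that mentions the
# device.  Reads the table on stdin so it can be tried on saved output.
irq_total() {
    awk -v dev="$1" '
        index($0, dev) {
            # Field 1 is the IRQ number ("24:"); the per-CPU counters that
            # follow are the only purely numeric fields on the line.
            for (i = 2; i <= NF; i++)
                if ($i ~ /^[0-9]+$/) sum += $i
        }
        END { printf "%.0f\n", sum }
    '
}

# Only sample the live system when a device name is given as an argument.
if [ $# -ge 1 ]; then
    before=$(irq_total "$1" < /proc/interrupts)
    sleep 5
    after=$(irq_total "$1" < /proc/interrupts)
    echo "$1: $(( (after - before) / 5 )) interrupts/s"
fi
```

Saved as, say, irqrate.sh and run as "./irqrate.sh eth0", it prints one
averaged figure for a 5 second window.  If the card shares its IRQ line
with another device, the count includes that device's interrupts too.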

On important systems you should regularly record per-process CPU usage
and bandwidth counters.  You can pipe this data into MRTG for history and
nice graphs.  If you want an example snmpd.conf that allows this safely,
message me.

> 11:29am  up 2 days, 22:35, 38 users,  load average: 36.10, 32.19, 25.71

Ouch.. That's a crazy load.  In my experience an i386 box with a load
average over 20 or 30 quickly becomes unresponsive.

>   PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
> 
>     3 root      20  19     0    0     0 RWN  71.7  0.0  12:25 ksoftirqd_CPU0

> You'll see that "ksoftirqd_CPU0" is pretty much pinning one CPU.  I noticed
> this just before the machine died last week as well.  As I understand it,
> the system runs one of these processes for each CPU, so this machine has 4
> (2 hyperthreaded CPUs).  I've never ever seen these processes higher than 0
> %, so I'm thinking this is pretty fishy - any comments?

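ksoftirqd is the per-CPU kernel thread that processes soft interrupts
(softirqs), which includes network packet handling, when they pile up
faster than the kernel can handle them inline.  Do you have any idea how
much bandwidth is being pushed through this server?  Can you install a
program such as ntop (network top) or MRTG counters to find out?  If
this is not an option, you can try using a simple shell script: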

#!/bin/bash
# Print the eth0 RX/TX counter lines every 5 seconds, 30 times.
COUNTER=0
while [ "$COUNTER" -lt 30 ]; do
        ifconfig eth0 | grep -E 'RX|TX'
        COUNTER=$((COUNTER + 1))
        sleep 5
done

Something like that.  I'm a sysadmin, not a programmer.  So no flaming
me! ;-).  The goal is to find the bandwidth this server is pushing.
This script will print out the network counters every 5 seconds.  Take
the delta of the byte counters between two samples, divide by 5 to get
an averaged bytes/s, then multiply by 8 for bits/s.  If the rate is
> 30 megabit/s you are likely running into network card related issues.
If it is less than that, it is something else, unless the card really
sucks.  Unfortunately I don't know the answer.. List?! :)
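
If you would rather have the machine do the arithmetic, here is a rough
sketch that reads the byte counters from /proc/net/dev and prints the
averaged rate in Mbit/s.  The function names read_bytes and rate_mbit are
made up for this example.  It also assumes the counters are 32-bit and
wrap at most once between samples, which I believe is the case on a 2.4
i386 kernel, so treat that part as a guess.

```shell
#!/bin/bash
# Sketch: average RX/TX rate in Mbit/s from /proc/net/dev byte counters.
# Function names and the 32-bit wrap handling are assumptions.

# Print "<rx_bytes> <tx_bytes>" for an interface.  Reads /proc/net/dev
# text on stdin; the colon after the name is turned into a space because
# some kernels print "eth0:12345" with no gap.
read_bytes() {
    sed 's/:/ /' | awk -v ifc="$1" '$1 == ifc { print $2, $10 }'
}

# Convert two byte-counter samples taken $3 seconds apart into Mbit/s.
rate_mbit() {
    awk -v prev="$1" -v cur="$2" -v secs="$3" 'BEGIN {
        delta = cur - prev
        if (delta < 0) delta += 4294967296   # assume a 32-bit counter wrapped once
        printf "%.2f\n", delta * 8 / secs / 1000000
    }'
}

# Only sample the live system when an interface name is given as an argument.
if [ $# -ge 1 ]; then
    interval=5
    read -r rx1 tx1 < <(read_bytes "$1" < /proc/net/dev)
    sleep "$interval"
    read -r rx2 tx2 < <(read_bytes "$1" < /proc/net/dev)
    echo "$1 RX $(rate_mbit "$rx1" "$rx2" "$interval") Mbit/s, TX $(rate_mbit "$tx1" "$tx2" "$interval") Mbit/s"
fi
```

As a sanity check on the math: 18,750,000 bytes in 5 seconds is
18,750,000 * 8 / 5 = 30,000,000 bits/s, so "rate_mbit 0 18750000 5"
prints 30.00, right at the threshold mentioned above.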

If the network card is the reason, you will need to replace that
DGE-550T.  Most people on the list seem to have great results with the
Intel cards.  Intel is NAPI-enabled by default.  This means you will get
reduced interrupts, be able to push > 100,000 packets-per-second, and
not have to use a special driver.  You will want to be certain to place
the card in a 64-bit slot.  Alternatively if you are poor you can
re-enable the onboard card and try the 2.4.27-pre2 kernel and tg3
driver.  The downside to this approach is you will be a bit of a guinea
pig.  The upside is you can then report to us if 2.4.27-pre2 tg3 patches
from D.Miller make the onboard card stable under load.

Cheers,

P



