Severe Reliability & Performance Problems with a PE4600

Seth Mos seth.mos at
Mon May 17 14:36:01 CDT 2004

Mark Cuss wrote:
> Hi All
> I'm running a PE4600 as the main Linux applications server for our small
> software company (about 30 users).  Over the past few weeks this normally
> very reliable machine has been giving trouble.  It becomes very unresponsive
> to network pings (a response can take up to a second or two instead of
> microseconds) or even a direct console login - it takes a minute or two just
> to get a prompt after logging in as root to the console (not in X, just good
> old runlevel 3).

Sounds a bit like something is stuck in disk I/O.
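
One way to check that guess is to look for processes sitting in
uninterruptible sleep (state "D"), since Linux counts those in the load
average along with runnable ones. Below is a minimal Python sketch that
reads /proc/<pid>/stat for every process. The function names are made up
for illustration, and it assumes a Linux /proc filesystem.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the LAST ')'.
    left, right = line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_text), comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, command) for every process currently in state 'D'."""
    found = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, name, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue  # process exited or entry unreadable; skip it
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for pid, comm in blocked_processes():
            print(pid, comm)
```

If most of the load turns out to be processes in state D, that would
point at I/O (local disk or NFS) rather than CPU.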

> Hardware and Software details:  PE4600 with dual 2.2 GHz Xeons & 2 GB Ram,
> one 18 gig hard drive and one HP DLT SCSI tape drive running off a 29160
> SCSI Card (on board one disabled).  Using a DLink DGE-550T gigabit ethernet
> card - I've disabled the onboard one.  OS is RH8 with a custom-built kernel
> 2.4.25.

I'd replace the D-Link card with an Intel Pro 1000MT, either the server
or the desktop version; take the server card if you have the cash. The
server has decent 64-bit PCI slots, which is a definite advantage there.

> 1)  Process listing:
> The following is the first few lines listed from top:
> 11:29am  up 2 days, 22:35, 38 users,  load average: 36.10, 32.19, 25.71

36! Something is very stuck indeed. How much data actually comes off the
NFS server? That is, are parts of the OS itself on NFS-mounted partitions?

And can you show us the fstab entry?
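
If it helps, here is a small Python sketch that pulls only the NFS lines
out of an fstab and splits the option field, so that settings such as
hard/soft or rsize/wsize are easy to read off. The function name is
hypothetical, and it assumes the usual whitespace-separated fstab layout
(device, mount point, type, options, ...).

```python
def nfs_entries(fstab_text):
    """Return (device, mount_point, [options]) for each NFS line."""
    entries = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        device, mount_point, fs_type, options = fields[:4]
        if fs_type.startswith("nfs"):
            entries.append((device, mount_point, options.split(",")))
    return entries


if __name__ == "__main__":
    try:
        with open("/etc/fstab") as fh:
            for device, mount_point, options in nfs_entries(fh.read()):
                print(device, mount_point, " ".join(options))
    except OSError:
        pass  # no readable /etc/fstab on this machine
```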

> This system accesses all of its data from another server (our file server)
> via NFS (called "hal") in the log below.  Both of these machines are
> connected to each other via a gigabit switch.  The file server is a PE2650
> single 3.06 GHz CPU connected to a PV220 disk array.  An excerpt from the
> PE4600 log file:

What RAID controller and RAID configuration are you using on the 2650?

If you have a Perc3/DC or QC you can get a Perc4, although for a hefty 

> May 17 11:19:05 locutus kernel: nfs: server hal not responding, still trying
> May 17 11:19:06 locutus kernel: nfs: server hal OK

I see the NFS server is having performance problems. Better check whether
the 2650 has enough I/O bandwidth. You are probably overwhelming the 2650
with gigabit. It's not that the 2650 itself can't handle it, but rather
that the RAID controller cannot handle that much load. The Perc3/DC series
is not exactly known for its stellar performance.
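
To put rough numbers on that, here is the raw line-rate arithmetic as a
short Python sketch. It ignores all protocol overhead, and it says nothing
about what the controller or the array can actually sustain; that figure
is unknown here and would have to be measured.

```python
def line_rate_mb_per_s(megabits_per_second):
    """Convert a link speed in Mbit/s to its raw ceiling in MB/s."""
    return megabits_per_second / 8.0  # 8 bits per byte, no overhead counted


if __name__ == "__main__":
    for label, mbit in (("100 Mbit", 100), ("gigabit", 1000)):
        print(label, "ethernet: at most",
              line_rate_mb_per_s(mbit), "MB/s per direction")
```

So a gigabit client can ask for roughly ten times the data per second
that a 100 Mbit client can.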

I bet that if you go to 100Mbit ethernet the stability problems will 



