AW: 2650 + new BIOS + 2.6.10-ac11 and it *still* crashes

Matthias Pigulla mp at webfactory.de
Tue Mar 15 17:15:42 CST 2005


Hey all,
 
> Yes, we have been having problems with 2650's and RHEL.  
Good to know it's RHEL :)

> The server remains alive to ICMP pings.  If you portscan it, 
> it shows that ports are open.  But if you attempt to connect 
> to any of the services running on the machine, you get 
> nothing. 

For us, the server becomes totally unresponsive, no pongs.

> You can't get a console prompt, either through a monitor, or 
> through a serial connection.

... nor through the ERA remote console, though that is basically a
monitor :)

> The kicker has been that absolutely nothing is ever logged.  
> No oops, panic, or warning.  When you reboot the server 
> (thank goodness for RAC cards), and go back and comb through 
> the logs, there's absolutely nothing logged to indicate a 
> problem. 

Same for us, most of the time. Only in rare cases do the well-known "scsi
... timeout ... hang..." messages make it from the box to a remote
loghost (via network!).

> Nothing in ESM either.

We find "Event: Drive [0, 2 or 3] drive slot sensor drive fault
detected" in the ERA log, ERA also generates such e-mails to the admin
address.

> We are running RHEL 3, Update 4. 

Debian woody here.

> Several common items 
> include most servers are attached to a Dell/EMC CX300 SAN, 
> with single Qlogic 2340 HBAs, they are running iptables, we 
> are using the bonding driver with the tg3 NIC driver, in 
> active-fallback mode.  Kernel is either 2.4.21-27.0.1smp or 
> 2.4.21-27.0.2smp and OpenManage is installed.  BIOS and FW 
> are current.

Nothing special here; BIOS and firmware are up to date. Kernel is a
standard 2.4.27 with aacraid 1.1.5 (from Adaptec).

This is a single-CPU box; disabling hyperthreading did not help. Crashes
almost always (I can remember only one exception) occur while Legato
Networker is making backups. Mounting filesystems with noatime improved
the situation a little (crashes are less frequent); I presume that's
simply because it reduces I/O load.
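
For anyone wanting to check which filesystems are still mounted without
noatime, a minimal Python sketch follows. It only parses text in the
/proc/mounts format; the function name and the default list of
filesystem types are assumptions for illustration, not part of any setup
described in this thread.

```python
def mounts_without_noatime(mounts_text, fstypes=("ext2", "ext3")):
    """Return mount points of the given filesystem types that are
    not mounted with the noatime option.

    mounts_text is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mountpoint, fstype, options = fields[1], fields[2], fields[3]
        # Compare whole options so "nodiratime" is not mistaken for "noatime".
        if fstype in fstypes and "noatime" not in options.split(","):
            missing.append(mountpoint)
    return missing


# Made-up sample input, not taken from a real machine.
sample = """/dev/sda1 / ext3 rw,noatime 0 0
/dev/sda2 /var ext3 rw 0 0
proc /proc proc rw 0 0
"""
print(mounts_without_noatime(sample))  # ['/var']
```

On a live box the same function could be fed the contents of
/proc/mounts; a listed filesystem can then be remounted with noatime or
have the option added to its /etc/fstab entry.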

Mark Salyzyn from Adaptec mentioned that disk make/firmware might play a
role; we have 4 x QUANTUM ATLAS10K3_18_SCA Rev 120G here. Maxtor support
says there is no more recent firmware.

Exchanging four disks is equivalent to exchanging the controller (in terms
of $$$, time and work needed to rebuild the array); however, the latter
probably has a greater effect... You are not alone :)

Best regards,
Matthias




More information about the Linux-PowerEdge mailing list