PE2850 stability problem

Jason Kau bubbafat at speakeasy.net
Wed Apr 12 02:22:41 CDT 2006


We have three Dell PowerEdge 2850s, each with a PowerVault 220S.  The three PE2850 + PV220S pairs are configured almost identically--RHEL4 U3 AS was installed on each using the same RHEL4 U3 AS KickStart configuration.  Only the network settings differ, depending on which network each server resides on.
 
Every 30 to 60 days, the servers become largely unresponsive with these symptoms:

1) We're unable to log into the servers via SSH.  Sometimes the SSH login succeeds but the session never gets past the login banner/motd; sometimes the server does not respond on the network at all (no ping replies), so we can't even attempt an SSH login.

2) We're unable to log into the servers via the DRAC4 virtual console.  Although the mouse works, the virtual console keyboard input does not, so we cannot type in the username field.

3) Dell OpenManage Server Administrator 4.5 (https://<server_ip>:1311) is unresponsive--we never get the login page (even if the server is responding to pings on the network).

4) Apache Tomcat (our application server hosted on these PE2850s) becomes unresponsive.

5) On the DRAC virtual console or the physical console we see a stream of megaraid and hdf errors (/dev/hdf being the VIRTUAL CDROM drive from the DRAC).  Please see the screenshot that shows these errors on the console:
 
http://www.speakeasy.org/~jkau/hdf_megaraid_errors.jpg

When we power cycle a PE2850 in this state, the DRAC virtual media and PERC4e/Di are not working:

1) The VIRTUAL FLOPPY and VIRTUAL CDROM drive are not detected by the ATA/133 Controller.

2) The PERC4e/Di card (which controls the internal RAID1 array with the / and /boot partitions) no longer says "1 Logical Drive(s) handled by BIOS"--instead it says "0 Logical Drive(s) handled by BIOS".  This results in a failure to boot with "strike F1 to retry boot, F2 for setup utility".  Please see these screenshots that show these two problems:

http://www.speakeasy.org/~jkau/virtual_media_perc4edi_problems.jpg
http://www.speakeasy.org/~jkau/f1_to_retry_boot.jpg

When we power cycle the PE2850 a second time, the DRAC virtual media is always detected, the PERC4e/Di always says "1 Logical Drive(s) handled by BIOS", and RHEL4 AS boots up just fine (after recovering the ext3 journals).  Please see this screenshot of the successful POST:

http://www.speakeasy.org/~jkau/virtual_media_perc4edi_success.jpg

In the case of the most recent PE2850 to show this problem, it became unresponsive around midnight.  After the second reboot, we were able to inspect /var/log/messages.

It has dozens of:

Apr 11 00:17:30 ngqt kernel: hdf: status error: status=0x7f { DriveReady DeviceFault SeekComplete DataRequest CorrectedError Index Error }
Apr 11 00:17:30 ngqt kernel: hdf: status error: error=0x7fIllegalLengthIndication EndOfMedia Aborted Command MediaChangeRequested LastFailedSense 0x07
Apr 11 00:17:30 ngqt kernel: hdf: drive not ready for command
Apr 11 00:17:30 ngqt kernel: hdf: ATAPI reset complete
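
For anyone decoding the status byte in the lines above, here is a minimal Python sketch. It assumes the standard ATA status register bit layout and uses the flag names exactly as they appear in the log; the name decode_status is illustrative and not part of any kernel or Dell tool.

```python
# Flag names as they appear in the hdf log lines, keyed by ATA status bit.
BUSY = 0x80
STATUS_BITS = [
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the flag names set in an ATA status byte.

    When the Busy bit is set the remaining bits are not meaningful,
    so only "Busy" is reported, matching what the log lines show.
    """
    if status & BUSY:
        return ["Busy"]
    return [name for bit, name in STATUS_BITS if status & bit]


if __name__ == "__main__":
    for value in (0x7F, 0x80, 0xC1):
        print(hex(value), "{", " ".join(decode_status(value)), "}")
```

A status of 0x7f therefore means every bit except Busy is set at once, which is the full flag list printed in the first log line.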

Followed by this:

Apr 11 00:17:31 ngqt kernel: cdrom: This disc doesn't have any tracks I recognize!

Followed by this:

Apr 11 00:17:42 ngqt kernel: 7

Followed by this:

Apr 11 00:17:45 ngqt kernel: hdf: status error: status=0x80 { Busy }
Apr 11 00:17:45 ngqt kernel: hdf: status error: error=0x80LastFailedSense 0x08
Apr 11 00:17:45 ngqt kernel: hdf: drive not ready for command
Apr 11 00:17:45 ngqt kernel: irq 193: nobody cared! (screaming interrupt?)
Apr 11 00:17:45 ngqt kernel: irq 193: Please try booting with acpi=off and report a bug
Apr 11 00:17:45 ngqt kernel:  [<c01074d6>] __report_bad_irq+0x3a/0x77
Apr 11 00:17:45 ngqt kernel:  [<c010774d>] note_interrupt+0xea/0x115
Apr 11 00:17:45 ngqt kernel:  [<c01079f9>] do_IRQ+0x143/0x1ae
Apr 11 00:17:45 ngqt kernel:  [<c02d3014>] common_interrupt+0x18/0x20
Apr 11 00:17:45 ngqt kernel:  [<c01040e8>] mwait_idle+0x33/0x42
Apr 11 00:17:45 ngqt kernel:  [<c01040a0>] cpu_idle+0x26/0x3b
Apr 11 00:17:45 ngqt kernel: handlers:
Apr 11 00:17:45 ngqt kernel: [<c0240459>] (ide_intr+0x0/0x11e)
Apr 11 00:17:45 ngqt kernel: [<c0258fb8>] (usb_hcd_irq+0x0/0x4b)
Apr 11 00:17:45 ngqt kernel: Disabling IRQ #193
Apr 11 00:17:55 ngqt kernel: usb 2-1: new full speed USB device using address 3
Apr 11 00:17:56 ngqt kernel: input: USB HID v1.10 Keyboard [Dell DRAC4] on usb-0000:00:1d.0-1
Apr 11 00:17:56 ngqt hal.hotplug[6750]: DEVPATH is not set
Apr 11 00:17:56 ngqt kernel: input: USB HID v1.10 Mouse [Dell DRAC4] on usb-0000:00:1d.0-1
Apr 11 00:17:56 ngqt hal.hotplug[6808]: DEVPATH is not set
Apr 11 00:18:42 ngqt kernel: hdf: irq timeout: status=0xc1 { Busy }
Apr 11 00:18:42 ngqt kernel: hdf: irq timeout: error=0xa0LastFailedSense 0x0a

Followed by 11 of these:

Apr 11 00:18:47 ngqt kernel: hdf: status timeout: status=0xc1 { Busy }
Apr 11 00:18:47 ngqt kernel: hdf: status timeout: error=0x84Aborted Command LastFailedSense 0x08
Apr 11 00:18:47 ngqt kernel: hdf: drive not ready for command
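
To tally messages like these across a whole /var/log/messages file, a rough Python sketch along the following lines may help. It assumes the syslog line format shown in the excerpts above; the regular expressions and the summarise function are illustrative only, and the sample lines are copied from the excerpts.

```python
import re
from collections import Counter

# e.g. "... kernel: hdf: status error: status=0x7f { ... }"
HDF_RE = re.compile(r"kernel: (hd[a-z]): ([a-z ]+): status=(0x[0-9a-f]{2})")
# e.g. "... kernel: Disabling IRQ #193"
IRQ_OFF_RE = re.compile(r"kernel: Disabling IRQ #(\d+)")
# Handler lines have the symbol in parentheses; stack-trace lines do not.
HANDLER_RE = re.compile(r"kernel: \[<[0-9a-f]+>\] \((\w+)\+0x")

SAMPLE = """\
Apr 11 00:17:45 ngqt kernel: hdf: status error: status=0x80 { Busy }
Apr 11 00:17:45 ngqt kernel: hdf: drive not ready for command
Apr 11 00:17:45 ngqt kernel: irq 193: nobody cared! (screaming interrupt?)
Apr 11 00:17:45 ngqt kernel:  [<c01074d6>] __report_bad_irq+0x3a/0x77
Apr 11 00:17:45 ngqt kernel: handlers:
Apr 11 00:17:45 ngqt kernel: [<c0240459>] (ide_intr+0x0/0x11e)
Apr 11 00:17:45 ngqt kernel: [<c0258fb8>] (usb_hcd_irq+0x0/0x4b)
Apr 11 00:17:45 ngqt kernel: Disabling IRQ #193
Apr 11 00:18:42 ngqt kernel: hdf: irq timeout: status=0xc1 { Busy }
Apr 11 00:18:47 ngqt kernel: hdf: status timeout: status=0xc1 { Busy }
""".splitlines()


def summarise(lines):
    """Count IDE status messages and collect IRQ handlers and disabled IRQs."""
    events = Counter()   # (device, message kind, status byte) -> count
    handlers = []        # handler symbols listed after "handlers:"
    disabled = []        # IRQ numbers the kernel disabled
    for line in lines:
        m = HDF_RE.search(line)
        if m:
            events[m.groups()] += 1
            continue
        m = HANDLER_RE.search(line)
        if m:
            handlers.append(m.group(1))
            continue
        m = IRQ_OFF_RE.search(line)
        if m:
            disabled.append(int(m.group(1)))
    return events, handlers, disabled


if __name__ == "__main__":
    events, handlers, disabled = summarise(SAMPLE)
    for (dev, kind, status), count in sorted(events.items()):
        print(f"{dev} {kind} status={status}: {count}")
    print("handlers on disabled IRQ:", handlers)
    print("disabled IRQs:", disabled)
```

To run it against a real log, pass the lines of an open file to summarise instead of SAMPLE.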

Software/Firmware versions on these servers:

[root@ngqt ~]# omreport system version | grep -v Updateable
Version Report

---------------------
Main System Chassis
---------------------

Name       : BIOS
Version    : A04


Name       : DRAC4
Version    : 1.35


Name       : BMC
Version    : 1.40


Name       : Primary BP
Version    : 1.00

----------
Software
----------

Name       : Red Hat Enterprise Linux AS
Version    : release 4 (Nahant Update 3) Kernel 2.6.9-34.ELsmp (i686)


Name       : Dell Server Administrator
Version    : 4.5.0


[root@ngqt ~]# omreport storage controller
List of Controllers in the system

Controllers
ID                                : 0
Status                            : Ok
Name                              : PERC 4e/Di
Slot ID                           : Embedded
State                             : Ready
Firmware Version                  : 521X
Driver Version                    : Not Applicable
Minimum Required Firmware Version : Not Applicable
Minimum Required Driver Version   : Not Applicable
Number of Channels                : 2
Rebuild Rate                      : 30%
Alarm State                       : Not Applicable
Cluster Mode                      : Not Applicable
SCSI Initiator ID                 : 7

ID                                : 1
Status                            : Ok
Name                              : PERC 4/DC
Slot ID                           : PCI Slot 3
State                             : Ready
Firmware Version                  : 351X
Driver Version                    : Not Applicable
Minimum Required Firmware Version : Not Applicable
Minimum Required Driver Version   : Not Applicable
Number of Channels                : 2
Rebuild Rate                      : 30%
Alarm State                       : Enabled
Cluster Mode                      : Not Applicable
SCSI Initiator ID                 : 7

[root@ngqt1 ~]# omreport storage vdisk
List of Virtual Disks in the System

Controller PERC 4e/Di (Embedded)
ID           : 0
Status       : Ok
Name         : Virtual Disk 0
State        : Ready
Progress     : Not Applicable
Layout       : RAID-1
Size         : 68.24 GB (73274490880 bytes)
Device Name  : /dev/sda
Read Policy  : Adaptive Read Ahead
Write Policy : Write Back
Cache Policy : Direct I/O
Stripe Size  : 64 KB

Controller PERC 4/DC (Slot 3)
ID           : 0
Status       : Ok
Name         : Virtual Disk 0
State        : Ready
Progress     : Not Applicable
Layout       : RAID-5
Size         : 546.48 GB (586783129600 bytes)
Device Name  : /dev/sdb
Read Policy  : Adaptive Read Ahead
Write Policy : Write Back
Cache Policy : Direct I/O
Stripe Size  : 64 KB

[root@ngqt ~]# omreport storage enclosure
List of Enclosures in the System

Enclosure(s) on Controller PERC 4e/Di (Embedded)
ID                    : 0
Status                : Ok
Name                  : Backplane
State                 : Ready
Channel               : 0
Target ID             : 6
Configuration         : Not Applicable
Firmware Version      : Not Applicable
Service Tag           : Not Applicable
Asset Tag             : Not Applicable
Asset Name            : Not Applicable
Backplane Part Number : Not Applicable
Split Bus Part Number : Not Applicable
SCSI Rate             : Ultra 320M SCSI
Enclosure Alarm       : Not Applicable

Enclosure(s) on Controller PERC 4/DC (Slot 3)
ID                    : 0
Status                : Ok
Name                  : PV220S / PV221S
State                 : Ready
Channel               : 0
Target ID             : 6
Configuration         : Joined
Firmware Version      : E.18
Service Tag           : GVTP481
Asset Tag             :
Asset Name            :
Backplane Part Number : 0X6156A00
Split Bus Part Number : 0W0764A02
SCSI Rate             : Ultra 320M SCSI
Enclosure Alarm       : Disabled
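
Since the three servers are meant to be configured identically, it can be useful to compare their version reports mechanically. The Python sketch below assumes the "Name : ... / Version : ..." layout shown in the omreport system version output above. The function names are illustrative, and the second report in the example is hypothetical (its BIOS value is made up purely to show a difference).

```python
REPORT = """\
Name       : BIOS
Version    : A04

Name       : DRAC4
Version    : 1.35

Name       : BMC
Version    : 1.40
"""


def parse_versions(report):
    """Pair each 'Name : X' line with the 'Version : Y' line that follows it."""
    versions = {}
    name = None
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # separators, headings and blank lines
        key, value = key.strip(), value.strip()
        if key == "Name":
            name = value
        elif key == "Version" and name is not None:
            versions[name] = value
            name = None
    return versions


def diff_versions(a, b):
    """Return {component: (version_in_a, version_in_b)} where the two differ."""
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(set(a) | set(b))
        if a.get(k) != b.get(k)
    }


if __name__ == "__main__":
    first = parse_versions(REPORT)
    # Hypothetical second host, differing only in BIOS for illustration.
    second = dict(first, BIOS="A05")
    print(first)
    print(diff_versions(first, second))
```

An empty result from diff_versions would indicate that the two reports list the same components at the same versions.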





More information about the Linux-PowerEdge mailing list