PE2650 / Perc 3Di crash

Matthias Pigulla mp at webfactory.de
Tue Aug 5 03:39:53 CDT 2003


Hello everyone,

Tonight I lost one of my PowerEdge boxes to a kernel panic. I'm
running a PERC 3/Di, RAID10, on Debian woody with a custom 2.4.19
kernel. I'll try to provide all the information I can collect, and I
hope someone can help me track this issue down. Please bear with me,
even if it's long :)

First, the last lines I could retype from the remote console:

scsi: aborting command due to timeout: pid ..., scsi0, channel 0, id 0,
lun 0 Read (10) ...somehexnumbers...
scsi: aborting command due to timeout: pid ..., scsi0, channel 0, id 0,
lun 0 Write (10) ...somehexnumbers...
scsi: aborting command due to timeout: pid ..., scsi0, channel 0, id 0,
lun 0 Read (10) ...somehexnumbers...
SCSI host 0 abort (pid ...) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
scsi: aborting command due to timeout : pid ..., scsi0, channel 0, id 0,
lun 0 Read (10) ...somehexnumbers...
SCSI host 0 abort (pid ...) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
scsi: aborting command due to timeout : pid ..., scsi0, channel 0, id 0,
lun 0 Read (10) ...somehexnumbers...
SCSI host 0 abort (pid ...) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
aacraid: <...repeats 2 more times>    [<-- literally like this]
aacraid:ID(0:02:0) - IO failed, Cmd[0x2a]
Kernel panic: scsi_free:Bad offset
In interrupt handler - not syncing

I tried a reset via the ERA. On power-up, the system spent several
minutes trying to initialize the controller without making progress, so
I performed a remote power cycle. After that, I got:

Waiting for Array Controller #0 to start...
Array Controller started
[...]
Following containers have missing members and are degraded:
Container#0-Stripe
Container#62-Mirror

On boot, the filesystem turned out to be corrupted and had to be checked
manually. After that, it took one more reboot, which went through
without further error messages.

The system is now up again. Unfortunately, kern.log starts with the
reboot, and everything logged since the last rotation is missing. (I
assume the above errors were written to the file but could not be
synced, so I lost the file due to the filesystem inconsistencies?) One
entry shortly after the reboot is:

kernel: aacraid:Container 62 completed REBUILD task:
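
In case someone wants to check their own logs for the same pattern,
here is a minimal sketch of a filter for the kinds of messages quoted
in this mail. The function names are made up for illustration, the
patterns are only the strings visible in the output above, and the log
path in the usage note is an assumption (the usual Debian location).

```python
import re

# Patterns copied from the console output and kern.log entry quoted in
# this mail; they are illustrative, not a complete list of aacraid errors.
PATTERNS = re.compile(
    r"aacraid"
    r"|aborting command due to timeout"
    r"|timed out - resetting"
    r"|SCSI bus is being reset"
    r"|Kernel panic"
)


def relevant_lines(lines):
    """Return the lines that contain any of the patterns above."""
    return [line.rstrip("\n") for line in lines if PATTERNS.search(line)]


def scan_file(path):
    """Read a log file and return its matching lines (hypothetical helper)."""
    # errors="replace" keeps the scan going if the file has damaged bytes.
    with open(path, errors="replace") as log:
        return relevant_lines(log)
```

Calling scan_file("/var/log/kern.log") (path assumed) and printing the
result would list only the controller and SCSI related entries.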

The ERA embedded system management (esm) log shows:
Di Aug 05 01:30:55 2003   Drive 2 drive slot sensor drive error/removed 

Any hints on how I should proceed? Unfortunately, I'm not yet familiar
with the afacli tool (although I have it installed and it works). Which
diagnostic output would help?

Best regards,
Matthias Pigulla



