another 2.4.12 + aacraid + SuSE failure.

Chris Pascoe c.pascoe at itee.uq.edu.au
Wed Nov 7 01:37:00 CST 2001


> > An strace of a hanging e2fsck, hangs at the opening of the device...

> Running 2.4.9-13smp (ext2) and 2.4.14 with 2.4.13-aacraid patch posted
> yesterday (ext2) on a PERC2/Si (PE2400) as /dev/sdb now.  I'm not seeing any
> troubles.  fdisk, mke2fs, fsck, mounting, copying files, unmounting, fsck,
> repeat...  all working fine.

Perc3/Di on PE4400 with the root filesystem on ext2 is the combination
that fails for me, and it is the one most people who previously reported
problems have.  As I said previously, it ran fine for several runs, but now fails
every time.  I've just tried the 2.4.9-13smp kernel as opposed to the
enterprise one and it fails in exactly the same way.  A uniprocessor
kernel always seems to boot fine, though.  The server is running BIOS A08,
Perc 3/Di Firmware A05 and ESM Firmware A46.  It has Dual 1GHz Xeons, 1GB
of RAM, Intel e1000 and an Adaptec 3940 installed - with two external SCSI
disks and a tape drive attached.

> To be fair, I've not got /dev/sdb mounted at system startup.  I'm loading
> the aacraid driver and mounting it later.  That *shouldn't* be a problem.

Yes, it "shouldn't" be a problem - but it does seem to make a difference
as to whether things work or not.  I think the problem doesn't rear its
head as much if there is other disk activity taking place at the same time,
though - probably why Hendrik's sysrq-sync made it come back to life for
him.  I have booted off external disks attached to the 3940 fine, and fsck
runs happily.  It's when booting off the aacraid array that the problems
always seem to arise (for me at least).

In other news, I've changed LOGLEVEL to 8 in /etc/sysconfig/init, to stop
the kernel masking all the messages that were coming up.  This reveals
some more details that were probably happening all along:

Loading linux-2.4.9-13e............................
Linux version 2.4.9-13enterprise (bhcompile at stripples.devel.redhat.com)
(gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #1 SMP Tue Oct 30 19:34:18 EST 2001

[...]

Mounting proc filesystem:  [  OK  ]
Unmounting initrd:  [  OK  ]
Configuring kernel parameters:  [  OK  ]
Setting clock  (utc): Wed Nov  7 18:01:23 EST 2001 [  OK  ]
Activating swap partitions:  Adding Swap: 2096440k swap-space (priority -1)
[  OK  ]
Setting hostname toy:  [  OK  ]
Checking root filesystem
AAC:        NMI ISR: NMI_DMA_0_ERROR		** after 2-3 seconds


Then we start getting SCSI errors:

scsi : aborting command due to timeout : pid 0, scsi0, channel 0, id 0, lun 0 Read (10) 00 00 04 14 de 00 00 18 00
aacraid:0 ABORT
interrupt_status = 0

[..repeats, with different codes on RHS..]

SCSI host 0 abort (pid 0) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
percraid:0 RESET
scsi : aborting command due to timeout : pid 0, scsi0, channel 0, id 0, lun 0 Read (10) 00 00 04 14 76 00 00 10 00
aacraid:0 ABORT
interrupt_status = 0

[..repeats, as above..]

SCSI host 0 abort (pid 0) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
percraid:0 RESET
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 18000000
sd08:02: old sense key None
Non-extended sense class 0 code 0x0
 I/O error: dev 08:02, sector 171096
scsi : aborting command due to timeout : pid 0, scsi0, channel 0, id 0, lun 0 Read (10) 00 00 04 14 76 00 00 10 00
aacraid:0 ABORT
interrupt_status = 0

[.. repeats on different sectors ..]
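
For anyone wanting to read the timeout lines above, here is a minimal Python
sketch that decodes them.  It assumes the kernel prints the opcode name
followed by the remaining nine bytes of the READ(10) CDB (flags, 4-byte
big-endian LBA, group, 2-byte transfer length, control), and that the
"return code" packs driver/host/message/status into one byte each, high to
low.  The function names are my own, not anything from the driver.

```python
def decode_read10(tail_hex):
    """Decode the nine bytes printed after "Read (10)" in the log.

    Assumption: the opcode byte itself is not printed, so tail[0] is
    CDB byte 1 (flags) and the LBA starts at tail[1].
    """
    tail = bytes.fromhex(tail_hex)  # whitespace between bytes is accepted
    if len(tail) != 9:
        raise ValueError("expected 9 bytes after the opcode")
    lba = int.from_bytes(tail[1:5], "big")     # CDB bytes 2-5
    blocks = int.from_bytes(tail[6:8], "big")  # CDB bytes 7-8
    return lba, blocks


def split_result(code):
    """Split a 32-bit SCSI return code into its four bytes.

    Assumption: driver byte in bits 24-31, host byte in 16-23,
    message byte in 8-15, status byte in 0-7.
    """
    return {
        "driver": (code >> 24) & 0xFF,
        "host": (code >> 16) & 0xFF,
        "msg": (code >> 8) & 0xFF,
        "status": code & 0xFF,
    }


if __name__ == "__main__":
    # CDB tails copied from the log lines above.
    for tail in ("00 00 04 14 de 00 00 18 00", "00 00 04 14 76 00 00 10 00"):
        lba, blocks = decode_read10(tail)
        print("LBA %d, %d blocks" % (lba, blocks))
    print(split_result(0x18000000))
```

Under those assumptions the first timed-out command is a 24-block read at
LBA 267486 and the second a 16-block read at LBA 267382, and the return
code 18000000 has only its driver byte (0x18) set.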

We get different errors reported as time passes - all, I assume, as a
result of the card no longer processing Fibs after the DMA error above. (If
you make the initrd load the module with aacraid_options="message_level:4"
set, you can see the AacHba_ReadCallback messages stop at this point).

After ~20 minutes, the machine panics trying to allocate memory (a memory
leak in the error paths, perhaps?).  I'm putting up the complete console log
of one of these boots at http://www.itee.uq.edu.au/~chrisp/log3.txt

Comments, anyone?  Firmware bug?  Something up with my PE4400's
motherboard?

Chris



