PERC3/Di failure workaround hypothesis

Matt Domsch Matt_Domsch at dell.com
Fri May 21 22:32:00 CDT 2004


On Fri, May 21, 2004 at 04:00:08PM -0700, Robert L Mathews wrote:

> Is this the problem in which the machine can lock up without any errors 
> being visible on the console/logs/LED?

Yes.

> If you could provide some more detailed information, that would help
> some of us tell if we will be able to provide useful feedback on the
> same problem you're testing.
> 
> In other words, I have no idea whether my problem is due to SCSI command 
> timeouts -- I just know the symptoms, which are the same on two machines 
> (a 2650 and a 2550 with Perc3/Di):
> 
>  - no error message on console/logs/LED
>  - machine still pingable
>  - network services that don't touch the disk, such as named,
>    still running fine
>  - everything else that requires disk access is locked up
>  - all disk activity has stopped
>  - no orange lights on the disks
>  - problem persists even with the latest released Perc firmware and
>    aacraid driver
>  - problem persists even if ethernet is disabled, so it's not the tg3
>    driver
> 
> Is this the problem that you are investigating?

Yes.
 
> In general, it would make me feel better about it if you could tell us 
> what you're doing and what you've found. Things like, Do you have a 
> reproducible test case? What exactly are the symptoms of the issue you're 
> working on, so we can tell if it's the same as our issue? etc.

Fair enough.

Essentially, what we see happen is that the aacraid controller (not
the driver, but the firmware) stops responding for long periods of
time.  It may eventually start responding again (between 10 and 210
seconds later), but by that time the SCSI mid-layer has already
started timing out commands, generally at the 60-second mark.  When
this happens, it gets into the SCSI error handling death spiral: you
can't abort commands already issued to the firmware, the firmware
isn't completing commands already issued to it, and the mid-layer
eventually decides that the logical drive needs to be marked offline
so no more commands are issued to it.  Since this is the ROMB, and so
likely holds your root file system, swap, etc., any further accesses
to these file systems fail, including ext3 journal commits, syslog
writes, etc.

This has been seen on a variety of kernels and with a variety of
drivers, with firmware versions including 2.7 and the most recent 2.8
firmware (possibly other older ones as well), with at least the PERC
3/Di controllers, though other controllers in the Adaptec ROMB family
may be affected as well.  RAID 1 seems to be the most common
occurrence, but not necessarily the only one.

It was difficult to reproduce in the lab.  Eventually we captured a
system from a customer that could reproduce it at will by running
chkrootkit.  Other reports seem to indicate failure "overnight", likely
the 4am updatedb run from cron.  Both of these apps read a lot of
files very quickly.
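
For anyone who wants to try generating a similar load, here is a
minimal sketch in Python.  It is not what chkrootkit or updatedb
actually do; it only assumes that "read a lot of files very quickly"
is the part that matters.  The function name and the max_files limit
are made up for illustration.

```python
import os


def read_many_files(root, max_files=100000):
    """Walk `root` and read one byte from each regular file.

    Returns the number of files successfully read.
    """
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, device nodes and the like.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    f.read(1)  # one small read per file is enough here
            except OSError:
                continue  # unreadable file; move on
            count += 1
            if count >= max_files:
                return count
    return count
```

Calling read_many_files("/usr") on a test box (never on a production
machine you care about) walks the tree once and returns how many files
it touched.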

(Thinking out loud here)
By default ext3 file systems are mounted with the 'atime' (access
time) parameter, so each such read requires a modification of the
file inode.  That is treated as a metadata read/modify/write, and
these get batched together by the ext3 journal code and written to the
disk every 5 seconds or so.  This seems to be an access pattern which
causes the firmware cache routines much grief.  This would also partially
explain why other OSs and likely other file systems don't exhibit
the failure.  Personally, I have never used the atime data for
anything, so I mount my file systems with 'noatime' whenever possible -
that may help alleviate the problem.

Disabling both the read and write caches via afacli definitely
prevents the firmware cache flush routine from running, which is where
the firmware gets stuck.  Using the 'noatime' parameter in /etc/fstab
may help also.
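
To see which ext3 entries in /etc/fstab do not yet carry 'noatime',
something like the following sketch may help.  It assumes the usual
whitespace-separated fstab layout (device, mount point, type, options,
dump, pass); the function name and the sample entries are hypothetical.

```python
def ext3_entries_without_noatime(fstab_text):
    """Return mount points of ext3 entries whose options lack 'noatime'."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        device, mount_point, fs_type, options = fields[:4]
        if fs_type == "ext3" and "noatime" not in options.split(","):
            missing.append(mount_point)
    return missing


# Hypothetical entries, for illustration only.
SAMPLE = """\
LABEL=/      /      ext3  defaults          1 1
LABEL=/var   /var   ext3  defaults,noatime  1 2
/dev/sda2    swap   swap  defaults          0 0
"""

print(ext3_entries_without_noatime(SAMPLE))  # prints ['/']
```

On a real system you would pass in open("/etc/fstab").read() and then
add 'noatime' to the options column of whatever it lists.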

If you're able to reproduce the apparent hang, please connect a serial
console to your server such that you can capture the error messages
reported by the kernel, and send those to the list.  The super-brief
HOWTO:

Connect a null modem cable between your server serial port 1 and a
capture system.  Run minicom, ttywatch, or similar on the capture
system, at 115200 N81.  On the server, add these kernel command line
parameters in grub.conf or lilo.conf:
"console=ttyS0,115200 console=tty0"

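After rebooting, you can confirm the parameters took effect by looking
at /proc/cmdline.  A small sketch, assuming the ttyS0 and 115200
values given above; the function names and the sample command line are
made up for illustration.

```python
import re


def console_args(cmdline):
    """Return the values of all console= parameters, in order."""
    prefix = "console="
    return [tok[len(prefix):] for tok in cmdline.split()
            if tok.startswith(prefix)]


def serial_console_configured(cmdline, device="ttyS0", baud="115200"):
    """True if some console= parameter names `device` at `baud`."""
    for value in console_args(cmdline):
        dev, _, opts = value.partition(",")
        # Options may carry parity/bits after the speed, e.g. 115200n8.
        speed = re.match(r"\d+", opts)
        if dev == device and speed and speed.group(0) == baud:
            return True
    return False


# Hypothetical command line; on a real system read /proc/cmdline.
example = "ro root=LABEL=/ console=ttyS0,115200 console=tty0"
print(console_args(example))               # prints ['ttyS0,115200', 'tty0']
print(serial_console_configured(example))  # prints True
```
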
I'd expect to see messages from the SCSI midlayer saying it was timing
out commands, messages from the aacraid driver saying SCSI bus hung,
the midlayer marking the drive offline, and the ext3 jbd layer
complaining about inode writes.
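
If the capture is long, a rough filter can pull out the interesting
lines before you send them.  This is only a sketch: the exact wording
of the kernel messages varies by kernel and driver version, so the
keyword patterns below are my guesses at what to look for, not the
actual message text.

```python
import re

# Assumed keywords, one pattern per kind of message described above.
KEYWORDS = {
    "timeout": re.compile(r"tim(e|ed|ing)[ -]?out", re.I),
    "aacraid": re.compile(r"aacraid", re.I),
    "offline": re.compile(r"off-?line", re.I),
    "journal": re.compile(r"\b(jbd|ext3|journal)\b", re.I),
}


def classify_lines(lines):
    """Group log lines by which keyword categories they match."""
    hits = {name: [] for name in KEYWORDS}
    for line in lines:
        for name, pattern in KEYWORDS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits
```

Feeding it the capture file, e.g. classify_lines(open("capture.log")),
gives a dict of matching lines per category.  Please still send the
full capture if you can; the filter is only for a quick look.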

Hopefully disabling the caches solves it for now, and if so we'll work
to get a firmware out that will let you re-enable the caches.


Thanks,
Matt

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com