Okay, this is so not cool..

Andrew Mann amann at mythicentertainment.com
Wed Oct 15 20:31:01 CDT 2003

    I don't want to speak for anyone else here, but I believe Adaptec 
has a pretty good handle on a problem with aacraid-based cards under 
Linux in general.  Somehow or another (it seems to vary from one 
chip/firmware/drive/system hardware/OS combination to another) a 
command to one of the devices on the SCSI bus times out, a reset is 
attempted, the reset fails, and the entire device (from the OS point of 
view) is taken offline.  If you only have 1 container, that means every 
bit of your disk-based system (including swap) is now unavailable.  I 
don't know whether it will pull multiple containers offline at once.  
Anyway, the visible effect of the problem is a few messages like "SCSI 
bus reset requested.  SCSI hang?" followed by a whole lot of IO failure 
messages that spam the console and never stop.  You can't log in or 
really do anything other than reboot.  I'd expect Adaptec's fixes to 
make their way into the mainline Linux and hopefully RedHat Linux 
kernels soon.
    Solely on our PE2550s this situation, even with a fixed driver, 
appears to bring out some nastiness in the memory manager in RedHat's 
supplied kernel 2.4.20-20.9-smp (i686 build).  I've tested with write 
cache on and off, a whole slew of different aacraid drivers, etc.  It's 
a P3-based system, so there is no hyperthreading.  I'm not even sure 
that the crash I'm seeing (a kernel panic in anything that attempts to 
allocate memory, which is basically everything) is related to the 
aacraid driver.  I seem to be able to kick it off with about the same 
frequency that I can kick off the aacraid timeouts, but that may just be 
coincidence.  I have rolled back to a stock 2.4.20 kernel from 
kernel.org and found that it does not demonstrate the same problem.  I 
haven't yet tried the newer 2.4 kernels from kernel.org, but I intend 
to.
    The symptoms in this situation (and again, I've only been able to 
see it on PE2550s, and I've seen it on every PE2550 I've tried) are a 
kernel panic referencing a BUG() call in the memory manager.  2 or 3 
different locations come up, though rmqueue() comes up most often.  The 
system still responds to pings, but usually does not accept TCP 
connections or any console input.  Once I had a half dozen processes 
crash out and the system "recover" to the point where I could remotely 
log in, but several partitions would freeze any program attempting to 
access them.  Nothing gets logged.  If screen blanking is enabled then 
you'll most likely just find a blank screen that's unresponsive.  I 
don't know of a solution to this one, other than that the kernel.org 
2.4.20 seems not to suffer from the problem while RedHat's 
2.4.20-20.9smp does.  I don't know if the latest RH8 kernel does.  I 
don't know how well or poorly the kernel.org kernel would play with the 
rest of RedHat 9.
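
    If you manage to capture the console text (serial console, or typed 
up from the screen), something like the rough Python sketch below could 
help tell the two failure modes apart.  Only the "SCSI bus reset 
requested" message and the BUG()/rmqueue() references come from what I 
described above; the function name, the labels, and the "kernel BUG" 
wording it matches are assumptions for illustration, and real console 
output may well be formatted differently.

```python
import re

# Message quoted in the post for the aacraid timeout case. Whitespace is
# matched loosely because captured console text is often wrapped.
SCSI_RESET = re.compile(r"SCSI\s+bus\s+reset\s+requested", re.IGNORECASE)

# Assumed wording for the memory-manager case: the post only says the panic
# references a BUG() call, most often in rmqueue().
KERNEL_BUG = re.compile(r"kernel\s+BUG|\bBUG\(\)|\brmqueue\b")


def classify_console(text):
    """Return a rough label for captured console text (hypothetical helper)."""
    saw_reset = bool(SCSI_RESET.search(text))
    saw_bug = bool(KERNEL_BUG.search(text))
    if saw_reset and saw_bug:
        return "both"
    if saw_reset:
        return "scsi-timeout"
    if saw_bug:
        return "memory-manager-bug"
    return "unknown"


if __name__ == "__main__":
    # The only sample used here is the message quoted in the post.
    print(classify_console("SCSI bus reset requested.  SCSI hang?"))
```

    A "both" result would at least hint that the two problems show up 
together on that box, which is the part I'm still unsure about.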

    As always, if you can provide a reproducible test case for system 
instability on a 2550, 2650 or 1750, I'm willing to let you know whether 
I can reproduce it here, and if I can, I'm probably interested in 
finding a solution :) 

    If you're finding your system crashed with a blank screen, I'd 
disable screen blanking by adding "setterm -blank 0"  to 
/etc/rc.d/rc.local.  At least that way after things blow up you might be 
able to read a little bit about why they blew up.
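
    If you have several machines to touch, a small Python sketch like 
the one below adds that line only when it isn't already there, so it is 
safe to run more than once.  The helper name ensure_line is made up for 
this example; the only things taken from above are the "setterm -blank 
0" command and the /etc/rc.d/rc.local path, and you would need root to 
write to the real file.

```python
import os

BLANK_OFF = "setterm -blank 0"  # command suggested in the post


def ensure_line(path, line=BLANK_OFF):
    """Append `line` to the file at `path` unless an identical line exists.

    Returns True if the file was changed, False if the line was already
    present. Hypothetical helper, not part of any existing tool.
    """
    existing = ""
    if os.path.exists(path):
        with open(path) as fh:
            existing = fh.read()
    if any(l.strip() == line for l in existing.splitlines()):
        return False
    with open(path, "a") as fh:
        # Make sure the new command starts on its own line.
        if existing and not existing.endswith("\n"):
            fh.write("\n")
        fh.write(line + "\n")
    return True

# Example (as root, and ideally on a copy first):
#   ensure_line("/etc/rc.d/rc.local")
```

    The change only takes effect the next time rc.local runs, so in 
practice that means after the next reboot.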


Cris Rhea wrote:

>>Clearly I'm part of a mass hallucination of people who think their 2650's crash all of 
>>the time.  I agree it should be rock-solid.  Sadly, it ain't.  I have 6 that crash, and 
>>I seem to have my fellow travelers.
>>Still, could be interesting...  Cris says he is just using Raid 1 with 2 drives.  I'm 
>>using Raid 0 across 3 drives on my machines.  I suspect most people are using 0 or 5.  
>>Is anyone getting crashes who is using Raid 1?
>(I haven't followed this thread that closely, so if I miss something that was previously
>discussed, just slap me...)
>It's not about "mass hallucination"- it's about identifying what things are different
>between the "rock solid" and the "crash all the time" systems.   Once you find the
>difference(s), the solution is usually obvious....
>I had a similar experience with a PE2550- spent almost a month working with Dell Support 
>to diagnose why it kept hanging.  Once I found this list, my problem was solved in a day!
>For the purposes of this discussion, pick ONE of your 2650s. If we solve the issue
>for one, we'll most likely have the solution for all...
>1. What are the versions of the ESM, Backplane, BIOS and PERC firmware (all on that
>   first BIOS screen)?
>2. How is it physically configured (number/type of CPUs, memory, internal disks, PCI cards,
>   network connections, other external connections)?
>3. Have you booted from (a recent) Server Assistant CD and run full diagnostics (should 
>   take on the order of 30 hours [based on what I had done previously with my 2550])?
>   Results?
>4. What OS are you running (RH7.X, 8, 9, SuSE, etc.)? Any Dell-specific drivers installed
>   (e.g., Is this a retail version of RH or a Dell-packaged/modified version)?
>5. What updates/patches have you applied to the OS?
>6. How did you install the OS (standard CDROM install from a "known release", rebuilt 
>   kernel manually, downloaded new kernel RPMs, etc.)?
>7. What system software have you added in addition to the base OS install?
>Let's start at the physical level and work our way up... 

More information about the Linux-PowerEdge mailing list