PERC3/Di failure workaround hypothesis

John Logsdon j.logsdon at quantex-research.com
Sun May 23 07:43:01 CDT 2004


Dear all

I am getting a bit concerned about these reported errors.  I have a 2650
running Perc3/DI and I haven't seen anything untoward but then, since the
only time it is currently heavily used is when compiling kernels (mmm -j 8
does this in 3 minutes!), I worry that when it becomes heavily used under
production, it will fall over in the way described.

This is the system:

Twin 2.4GHz Xeon, HT enabled, 6GB memory (2GB in failover), 5x36GB 10k rpm
disks

Red Hat/Adaptec aacraid driver
AAC0: kernel 2.7.4 build 3170
AAC0: monitor 2.7.4 build 3170
AAC0: bios 2.7.0 build 3170

scsi0 : percraid
  Vendor: DELL      Model: PERCRAID Mirror   Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
  Vendor: DELL      Model: PERCRAID RAID5    Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02

So  Drives 0+1 RAID1
and Drives 2,3,4 RAID5
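
As a rough check of what that layout yields in usable space, here is a minimal Python sketch. It assumes all five disks are the nominal 36GB and ignores any controller metadata overhead; the function name is hypothetical and only for illustration.

```python
def usable_capacity_gb(level, n_disks, disk_gb):
    """Nominal usable capacity of a simple RAID1 or RAID5 array.

    Assumes equal-sized disks and no controller/metadata overhead.
    """
    if level == 1:
        if n_disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        # A mirror stores the same data on every member.
        return disk_gb
    if level == 5:
        if n_disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        # One disk's worth of space is consumed by parity.
        return (n_disks - 1) * disk_gb
    raise ValueError("only RAID1 and RAID5 are handled in this sketch")


# Layout described above: drives 0+1 mirrored, drives 2,3,4 in RAID5.
print("RAID1:", usable_capacity_gb(1, 2, 36), "GB")
print("RAID5:", usable_capacity_gb(5, 3, 36), "GB")
```

On those assumptions the mirror gives about 36GB and the RAID5 set about 72GB.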

Further details from afacli:

open afa0
AFA0> controller details
Executing: controller details
Controller Information
----------------------
         Remote Computer: .
             Device Name: AFA0
         Controller Type: PERC 3/Di
             Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = 4C10D3
         Number of Buses: 2
         Devices per Bus: 15
          Controller CPU: i960 R series
    Controller CPU Speed: 100 Mhz
       Controller Memory: 128 Mbytes
           Battery State: Ok

Component Revisions
-------------------
                CLI: 3.0-0 (Build #4880)
                API: 3.0-0 (Build #4880)
    Miniport Driver: 1.1-0 Beta (Build #9999)
Controller Software: 2.7-1 (Build #3170)
    Controller BIOS: 2.7-1 (Build #3170)
Controller Firmware: (Build #3170)


if that is helpful.

The kernel is 2.4.26 with modifications (grsecurity and others), with
aacraid built in (not a module).

From what I read, this is a prime candidate for slow disk access but I
don't know whether this problem is generic to all Perc3/DIs or just some
or only a particular version of the firmware...

hdparm -t reports 52.89MB/s for the RAID1 device and 38.32MB/s for the
RAID5 device, which aren't stunningly fast but I don't know how they
compare with other 2650s.

By comparison a cheap IDE 2GHz Athlon box that I have reported 37.87
MB/sec, a Dell 600SC (IDE) 28.08 MB/sec and my very elderly 486DX (!) box
a leisurely 1.10 MB/sec, but then maybe the 2650 is doing rather more.
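
For anyone wanting to compare their own boxes the same way, here is a small Python sketch that pulls the MB/sec figure out of an hdparm -t result line and expresses each machine relative to a baseline. The sample line format is an assumption (it may differ between hdparm versions), and the helper names are hypothetical; the figures are the ones quoted in this message.

```python
import re

# Matches the throughput at the end of an `hdparm -t` result line.
# Assumed format (may vary between hdparm versions), e.g.
#   " Timing buffered disk reads:  64 MB in  1.21 seconds = 52.89 MB/sec"
_RATE = re.compile(r"=\s*([0-9]+(?:\.[0-9]+)?)\s*MB/sec")


def parse_hdparm_rate(output):
    """Return the MB/sec figure found in hdparm -t output as a float."""
    match = _RATE.search(output)
    if match is None:
        raise ValueError("no MB/sec figure found in hdparm output")
    return float(match.group(1))


def relative_to(baseline, rates):
    """Express each rate as a fraction of the baseline rate."""
    return {name: rate / baseline for name, rate in rates.items()}


# Figures quoted in this message, in MB/sec.
rates = {
    "2650 RAID1": 52.89,
    "2650 RAID5": 38.32,
    "Athlon IDE": 37.87,
    "600SC IDE": 28.08,
    "486DX": 1.10,
}

for name, ratio in relative_to(rates["2650 RAID1"], rates).items():
    print(f"{name:12s} {ratio:5.2f}x")
```

Since hdparm -t is a short sequential read test, several runs on an otherwise idle machine give a fairer number than a single one.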

To upgrade to Perc4 would require putting a card into the box and also
some expense, so it strikes me that the better alternative might be to
ditch hardware raid altogether and use the much improved software raid. At
least I could get at the kernel, and I believe the performance is now
almost indistinguishable from hardware raid.

Another point may be to use a 2.6 kernel which may be better at organising
the read-write ordering (well it is in laptop mode I am told!).  Either
way, upgrading the hardware or using software raid would of course require
a complete re-install.

Any comments?

John

John Logsdon                               "Try to make things as simple
Quantex Research Ltd, Manchester UK         as possible but not simpler"
j.logsdon at quantex-research.com              a.einstein at relativity.org
+44(0)161 445 4951/G:+44(0)7768982349       www.quantex-research.com


On Sat, 22 May 2004, Matt Domsch wrote:

> On Sat, May 22, 2004 at 12:31:13PM -0700, Sean Bruno - TELECOM wrote:
> > O.k.  I have two PE2650's right now that are exhibiting this issue. 
> > Basically they run for a few days and then "poof" they hard lock(no
> > direct console, no logging).
> > 
> > They are still pingable, but unaccessible.  I can execute your test
> > procedures, but what types of feedback are you looking for?  
> 
> With the RAID read and write caches disabled via afacli as in my note
> Thursday, does the system still hard lock as you describe?  If not,
> great, let us know that after a few days where you might have expected
> it to fail.  If so, can you attach a serial console as in Friday
> night's note and send the output from that, as well as what time you
> think the system crashed, and what you may have been running at the
> time, including cron jobs.
>  
> > BTW, I am running both machines under RH AS 3, the two drives are in a
> > standard Raid 1 configuration.
> 
> OK, RAID1 seems to be the most likely to fail, so if the above causes
> it not to fail, then that would be good to know.  Basically, we're
> trying to make sure that the workaround (disabling the caches) does in
> fact solve everyone's failure case, and that there isn't another
> failure mode we haven't reproduced and root caused.
> 
> Thanks,
> Matt
> 
> -- 
> Matt Domsch
> Sr. Software Engineer, Lead Engineer
> Dell Linux Solutions linux.dell.com & www.dell.com/linux
> Linux on Dell mailing lists @ http://lists.us.dell.com
> 






