PERC3/Di failure workaround hypothesis
MJoyce at ccia.unsw.edu.au
Tue May 25 19:13:00 CDT 2004
What commands, if any, were used to extract this information?
Children's Cancer Institute Australia
> -----Original Message-----
> From: linux-poweredge-admin at dell.com
> [mailto:linux-poweredge-admin at dell.com] On Behalf Of John Logsdon
> Sent: Sunday, 23 May 2004 10:42 PM
> To: Linux-PowerEdge at dell.com
> Subject: Re: PERC3/Di failure workaround hypothesis
> Dear all
> I am getting a bit concerned about these reported errors. I
> have a 2650 running Perc3/DI and I haven't seen anything
> untoward but then, since the only time it is currently
> heavily used is when compiling kernels (mmm -j 8 does this in
> 3 minutes!), I worry that when it becomes heavily used under
> production, it will fall over in the way described.
> This is the system:
> twin 2.4GHz Xeon, HT enabled, 6GB memory (2GB in fallover),
> 5x36GB 10k/s disks
> Red Hat/Adaptec aacraid driver
> AAC0: kernel 2.7.4 build 3170
> AAC0: monitor 2.7.4 build 3170
> AAC0: bios 2.7.0 build 3170
> scsi0 : percraid
> Vendor: DELL Model: PERCRAID Mirror Rev: V1.0
> Type: Direct-Access ANSI SCSI revision: 02
> Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
> Type: Direct-Access ANSI SCSI revision: 02
> So Drives 0+1 RAID1
> and Drives 2,3,4 RAID5
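For reference, the usable capacity implied by this layout can be worked out as below. This is only a rough sketch: it assumes the nominal 36 GB per disk quoted above, ignores controller metadata and filesystem overhead, and the function names are made up for illustration.

```python
def raid1_usable(disk_gb: float, n_disks: int = 2) -> float:
    """A mirror holds the same data on every member, so capacity is one disk."""
    if n_disks < 2:
        raise ValueError("RAID1 needs at least two disks")
    return disk_gb


def raid5_usable(disk_gb: float, n_disks: int) -> float:
    """RAID5 spends one disk's worth of space on parity."""
    if n_disks < 3:
        raise ValueError("RAID5 needs at least three disks")
    return disk_gb * (n_disks - 1)


DISK_GB = 36  # nominal disk size quoted in the message

print("RAID1 (drives 0+1):  ", raid1_usable(DISK_GB, 2), "GB")
print("RAID5 (drives 2,3,4):", raid5_usable(DISK_GB, 3), "GB")
```

So the five disks give roughly 36 GB mirrored plus 72 GB of RAID5 space.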
> Further details from afacli:
> open afa0
> AFA0> controller details
> Executing: controller details
> Controller Information
> Remote Computer: .
> Device Name: AFA0
> Controller Type: PERC 3/Di
> Access Mode: READ-WRITE
> Controller Serial Number: Last Six Digits = 4C10D3
> Number of Buses: 2
> Devices per Bus: 15
> Controller CPU: i960 R series
> Controller CPU Speed: 100 Mhz
> Controller Memory: 128 Mbytes
> Battery State: Ok
> Component Revisions
> CLI: 3.0-0 (Build #4880)
> API: 3.0-0 (Build #4880)
> Miniport Driver: 1.1-0 Beta (Build #9999)
> Controller Software: 2.7-1 (Build #3170)
> Controller BIOS: 2.7-1 (Build #3170)
> Controller Firmware: (Build #3170)
> if that is helpful.
> The kernel is 2.4.26 with modifications (grsecurity and
> others) built in aacraid (not a module).
> From what I read, this is a prime candidate for slow disk
> access but I don't know whether this problem is generic to all
> Perc3/DIs or just some or only a particular version of the
> firmware...
> hdparm -t reports 52.89MB/s for the RAID1 device and
> 38.32MB/s for the RAID5 device, which aren't stunningly fast
> but I don't know how they compare with other 2650's.
> By comparison a cheap IDE 2GHz Athlon box that I have
> reported 37.87 MB/sec, a Dell 600SC (IDE) 28.08 MB/sec and my
> very elderly 486DX (!) box a leisurely 1.10 MB/sec but then
> maybe the 2650 is doing rather more.
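To put the figures above side by side, a small sketch like the following may help. The numbers are copied from the message as reported; the labels and the choice of the RAID5 device as the baseline are assumptions made purely for illustration.

```python
# Throughput figures quoted in the message (MB/s, from hdparm -t).
results = {
    "PE2650 PERC3/Di RAID1": 52.89,
    "PE2650 PERC3/Di RAID5": 38.32,
    "2GHz Athlon (IDE)": 37.87,
    "Dell 600SC (IDE)": 28.08,
    "486DX": 1.10,
}


def relative_to(baseline: str, data: dict) -> dict:
    """Express every rate as a multiple of the baseline entry."""
    base = data[baseline]
    return {name: rate / base for name, rate in data.items()}


ratios = relative_to("PE2650 PERC3/Di RAID5", results)

# Print fastest first, with the ratio against the RAID5 device.
for name, rate in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {rate:6.2f} MB/s  {ratios[name]:5.2f}x")
```

Note that hdparm -t only measures sequential buffered reads, and a single run can vary quite a bit, so these ratios are indicative at best and say nothing about write performance or behaviour under mixed load.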
> To upgrade to Perc4 would require putting a card into the box
> and also some expense so it strikes me that the better
> alternative might be to ditch hardware raid altogether and
> use the much improved software raid - at least I could get at
> the kernel and I believe the performance is now almost
> Another point may be to use a 2.6 kernel which may be better
> at organising the read-write ordering (well it is in laptop
> mode I am told!). Either way, upgrading the hardware or
> using software raid would of course require a complete re-install.
> Any comments?
> John Logsdon
> Quantex Research Ltd, Manchester UK
> j.logsdon at quantex-research.com
> +44(0)161 445 4951/G:+44(0)7768982349 www.quantex-research.com
>
> "Try to make things as simple as possible but not simpler"
> a.einstein at relativity.org
> On Sat, 22 May 2004, Matt Domsch wrote:
> > On Sat, May 22, 2004 at 12:31:13PM -0700, Sean Bruno - TELECOM wrote:
> > > O.k. I have two PE2650's right now that are exhibiting this issue.
> > > Basically they run for a few days and then "poof" they hard lock (no
> > > direct console, no logging).
> > >
> > > They are still pingable, but unaccessible. I can execute your test
> > > procedures, but what types of feedback are you looking for?
> > With the RAID read and write caches disabled via afacli as in my note
> > Thursday, does the system still hard lock as you describe? If not,
> > great, let us know that after a few days where you might have expected
> > it to fail. If so, can you attach a serial console as in Friday
> > night's note and send the output from that, as well as what time you
> > think the system crashed, and what you may have been running at the
> > time, including cron jobs.
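One way to estimate when a machine locked up, when nothing reaches the logs, is to leave a small heartbeat writer running and look at its last entry after the reboot. The sketch below is not part of the test procedure referred to above; the function names, the log path and the 60-second interval are all assumptions chosen for illustration.

```python
import os
import time
from datetime import datetime


def write_heartbeat(path: str) -> None:
    """Append the current time to `path` and ask the OS to flush it to disk."""
    with open(path, "a") as f:
        f.write(datetime.now().isoformat(timespec="seconds") + "\n")
        f.flush()
        os.fsync(f.fileno())


def last_heartbeat(path: str) -> str:
    """Return the last timestamp recorded, or an empty string if none."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    return lines[-1] if lines else ""


def run(path: str, interval_s: int = 60) -> None:
    """Write a heartbeat every `interval_s` seconds until interrupted."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_s)
```

Calling `run("/var/tmp/heartbeat.log")` (a hypothetical path) in the background and reading `last_heartbeat()` after the reboot gives a rough lower bound on the crash time. If the hang is in the storage path itself, the final writes may never reach the disk, so the serial console output is likely to be the more reliable record.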
> > > BTW, I am running both machines under RH AS 3, the two drives are in
> > > a standard Raid 1 configuration.
> > OK, RAID1 seems to be the most likely to fail, so if the above causes
> > it not to fail, then that would be good to know. Basically, we're
> > trying to make sure that the workaround (disabling the caches) does in
> > fact solve everyone's failure case, and that there isn't another
> > failure mode we haven't reproduced and root caused.
> > Thanks,
> > Matt
> > --
> > Matt Domsch
> > Sr. Software Engineer, Lead Engineer
> > Dell Linux Solutions linux.dell.com & www.dell.com/linux
> > Linux on Dell mailing lists @ http://lists.us.dell.com