PERC3/Di failure workaround hypothesis

Matthew Joyce MJoyce at ccia.unsw.edu.au
Tue May 25 19:13:00 CDT 2004


Hi,

What commands, if any, were used to extract this information?

Thanks,

Matt Joyce
Children's Cancer Institute Australia
http://www.ccia.org.au


> -----Original Message-----
> From: linux-poweredge-admin at dell.com 
> [mailto:linux-poweredge-admin at dell.com] On Behalf Of John Logsdon
> Sent: Sunday, 23 May 2004 10:42 PM
> To: Linux-PowerEdge at dell.com
> Subject: Re: PERC3/Di failure workaround hypothesis
> 
> 
> Dear all
> 
> I am getting a bit concerned about these reported errors.  I have a
> 2650 running a PERC3/Di and I haven't seen anything untoward, but
> since the only time it is currently heavily used is when compiling
> kernels (mmm -j 8 does this in 3 minutes!), I worry that when it
> comes under heavy use in production it will fall over in the way
> described.
> 
> This is the system:
> 
> twin 2.4GHz Xeon, HT enabled, 6GB memory (2GB in failover),
> 5x36GB 10k rpm disks
> 
> Red Hat/Adaptec aacraid driver
> AAC0: kernel 2.7.4 build 3170
> AAC0: monitor 2.7.4 build 3170
> AAC0: bios 2.7.0 build 3170
> 
> scsi0 : percraid
>   Vendor: DELL      Model: PERCRAID Mirror   Rev: V1.0
>   Type:   Direct-Access                      ANSI SCSI revision: 02
>   Vendor: DELL      Model: PERCRAID RAID5    Rev: V1.0
>   Type:   Direct-Access                      ANSI SCSI revision: 02
> 
> So  Drives 0+1 RAID1
> and Drives 2,3,4 RAID5
> 
> Further details from afacli:
> 
> open afa0
> AFA0> controller details
> Executing: controller details
> Controller Information
> ----------------------
>          Remote Computer: .
>              Device Name: AFA0
>          Controller Type: PERC 3/Di
>              Access Mode: READ-WRITE
> Controller Serial Number: Last Six Digits = 4C10D3
>          Number of Buses: 2
>          Devices per Bus: 15
>           Controller CPU: i960 R series
>     Controller CPU Speed: 100 Mhz
>        Controller Memory: 128 Mbytes
>            Battery State: Ok
> 
> Component Revisions
> -------------------
>                 CLI: 3.0-0 (Build #4880)
>                 API: 3.0-0 (Build #4880)
>     Miniport Driver: 1.1-0 Beta (Build #9999)
> Controller Software: 2.7-1 (Build #3170)
>     Controller BIOS: 2.7-1 (Build #3170)
> Controller Firmware: (Build #3170)
> 
> 
> if that is helpful.
> 
> The kernel is 2.4.26 with modifications (grsecurity and others), with
> aacraid built in (not as a module).
> 
> From what I read, this is a prime candidate for slow disk access, but I
> don't know whether this problem is generic to all PERC3/Dis, just
> some, or only a particular version of the firmware...
> 
> hdparm -t reports 52.89 MB/s for the RAID1 device and 38.32 MB/s for
> the RAID5 device, which aren't stunningly fast, but I don't know how
> they compare with other 2650s.
> 
> By comparison, a cheap IDE 2GHz Athlon box that I have reported
> 37.87 MB/s, a Dell 600SC (IDE) 28.08 MB/s and my very elderly 486DX (!)
> box a leisurely 1.10 MB/s, but then maybe the 2650 is doing rather
> more.
> 
> To upgrade to PERC4 would require putting a card into the box and also
> some expense, so it strikes me that the better alternative might be to
> ditch hardware RAID altogether and use the much improved software RAID:
> at least I could get at the kernel, and I believe the performance is
> now almost indistinguishable.
> 
> Another point may be to use a 2.6 kernel which may be better 
> at organising the read-write ordering (well it is in laptop 
> mode I am told!).  Either way, upgrading the hardware or 
> using software raid would of course require a complete re-install.
> 
> Any comments?
> 
> John
> 
> John Logsdon                               "Try to make things as simple
> Quantex Research Ltd, Manchester UK         as possible but not simpler"
> j.logsdon at quantex-research.com              a.einstein at relativity.org
> +44(0)161 445 4951/G:+44(0)7768982349       www.quantex-research.com
> 
> 
> On Sat, 22 May 2004, Matt Domsch wrote:
> 
> > On Sat, May 22, 2004 at 12:31:13PM -0700, Sean Bruno - TELECOM wrote:
> > > O.k.  I have two PE2650's right now that are exhibiting this issue.
> > > Basically they run for a few days and then "poof" they hard lock (no
> > > direct console, no logging).
> > >
> > > They are still pingable, but inaccessible.  I can execute your test
> > > procedures, but what types of feedback are you looking for?
> > 
> > With the RAID read and write caches disabled via afacli as in my note
> > Thursday, does the system still hard lock as you describe?  If not,
> > great, let us know that after a few days where you might have expected
> > it to fail.  If so, can you attach a serial console as in Friday
> > night's note and send the output from that, as well as what time you
> > think the system crashed, and what you may have been running at the
> > time, including cron jobs.
> >  
> > > BTW, I am running both machines under RH AS 3, the two drives are in
> > > a standard RAID 1 configuration.
> > 
> > OK, RAID1 seems to be the most likely to fail, so if the above causes
> > it not to fail, then that would be good to know.  Basically, we're
> > trying to make sure that the workaround (disabling the caches) does in
> > fact solve everyone's failure case, and that there isn't another
> > failure mode we haven't reproduced and root caused.
> > 
> > Thanks,
> > Matt
> > 
> > --
> > Matt Domsch
> > Sr. Software Engineer, Lead Engineer
> > Dell Linux Solutions linux.dell.com & www.dell.com/linux
> > Linux on Dell mailing lists @ http://lists.us.dell.com
> > 
> 
> 
> 
> 
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge at dell.com
> http://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq or search 
> the list archives at http://lists.us.dell.com/htdig/
> 
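
As a side note on the hdparm -t figures quoted above, they may be easier to compare when expressed relative to one of the machines. The following is only a rough sketch: the numbers are copied from John's message, while the dictionary labels, the `relative_to` helper and the choice of the Athlon box as baseline are assumptions made for illustration. It also says nothing about how each box was loaded when it was measured.

```python
# Buffered-read throughput in MB/s, copied from the hdparm -t figures
# quoted above. The labels are informal names chosen for this sketch.
throughput_mb_s = {
    "PE2650 RAID1 (PERC3/Di)": 52.89,
    "PE2650 RAID5 (PERC3/Di)": 38.32,
    "2GHz Athlon (IDE)": 37.87,
    "Dell 600SC (IDE)": 28.08,
    "486DX": 1.10,
}


def relative_to(baseline, figures):
    """Return each figure divided by the baseline machine's figure."""
    base = figures[baseline]
    return {name: value / base for name, value in figures.items()}


# Hypothetical choice of baseline: the cheap IDE Athlon box.
ratios = relative_to("2GHz Athlon (IDE)", throughput_mb_s)

# Print the machines from fastest to slowest with their ratio to the baseline.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:26s} {throughput_mb_s[name]:6.2f} MB/s  {ratio:5.2f}x")
```

On these figures the RAID1 device comes out at roughly 1.4 times the Athlon box, and the RAID5 device is within a couple of percent of it.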



