PERC3/Di failure workaround hypothesis

Paul Anderson Paul.Anderson at priorityhealthcare.com
Wed May 26 13:20:24 CDT 2004


We run Oracle Apps Servers on our front end and Oracle 9i RAC on the back end.  All of our front-end boxes for the Applications Servers are 2650s.  We found early on that the onboard Adaptec controller cannot handle high I/O loads; its firmware does not respond quickly enough.  We switched to PERC 3/DC cards and found that, although their performance degrades under high I/O, they keep running and can handle the load.

Paul Anderson

 -----Original Message-----
From: 	linux-poweredge-admin at dell.com [mailto:linux-poweredge-admin at dell.com]  On Behalf Of Matthew Joyce
Sent:	Tuesday, May 25, 2004 8:12 PM
To:	Linux-PowerEdge at dell.com
Subject:	RE: PERC3/Di failure workaround hypothesis


Hi,

What commands, if any, were used to extract this information?

Thanks,

Matt Joyce
Children's Cancer Institute Australia
http://www.ccia.org.au


> -----Original Message-----
> From: linux-poweredge-admin at dell.com 
> [mailto:linux-poweredge-admin at dell.com] On Behalf Of John Logsdon
> Sent: Sunday, 23 May 2004 10:42 PM
> To: Linux-PowerEdge at dell.com
> Subject: Re: PERC3/Di failure workaround hypothesis
> 
> 
> Dear all
> 
> I am getting a bit concerned about these reported errors.  I
> have a 2650 running a PERC3/Di and I haven't seen anything
> untoward, but the only time it is currently used heavily is
> when compiling kernels (make -j 8 does this in 3 minutes!).
> I worry that when it goes into heavy production use, it will
> fall over in the way described.
> 
> This is the system:
> 
> Twin 2.4 GHz Xeons (HT enabled), 6 GB memory (2 GB in failover),
> 5 x 36 GB 10k RPM disks
> 
> Red Hat/Adaptec aacraid driver
> AAC0: kernel 2.7.4 build 3170
> AAC0: monitor 2.7.4 build 3170
> AAC0: bios 2.7.0 build 3170
> 
> scsi0 : percraid
>   Vendor: DELL      Model: PERCRAID Mirror   Rev: V1.0
>   Type:   Direct-Access                      ANSI SCSI revision: 02
>   Vendor: DELL      Model: PERCRAID RAID5    Rev: V1.0
>   Type:   Direct-Access                      ANSI SCSI revision: 02
> 
> So drives 0+1 are RAID1
> and drives 2, 3, 4 are RAID5.
> 
> Further details from afacli:
> 
> open afa0
> AFA0> controller details
> Executing: controller details
> Controller Information
> ----------------------
>          Remote Computer: .
>              Device Name: AFA0
>          Controller Type: PERC 3/Di
>              Access Mode: READ-WRITE
> Controller Serial Number: Last Six Digits = 4C10D3
>          Number of Buses: 2
>          Devices per Bus: 15
>           Controller CPU: i960 R series
>     Controller CPU Speed: 100 Mhz
>        Controller Memory: 128 Mbytes
>            Battery State: Ok
> 
> Component Revisions
> -------------------
>                 CLI: 3.0-0 (Build #4880)
>                 API: 3.0-0 (Build #4880)
>     Miniport Driver: 1.1-0 Beta (Build #9999)
> Controller Software: 2.7-1 (Build #3170)
>     Controller BIOS: 2.7-1 (Build #3170)
> Controller Firmware: (Build #3170)
> 
> 
> if that is helpful.
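Matt Joyce's question further up asks how this information was gathered. The transcript itself shows afacli's `open afa0` followed by `controller details`. When comparing several 2650s (for example, to see whether failures track a particular firmware build), it can help to turn that saved output into structured data. The following is a minimal sketch, assuming the output was captured as plain text in the `Key: Value` layout shown above; `parse_controller_details` is a hypothetical helper, not part of afacli.

```python
import re

# Matches "Key: Value" lines such as "Controller Type: PERC 3/Di".
# Assumes the layout in the afacli transcript above: a right-aligned key,
# a colon, then the value. Headings and dashed rules do not match.
LINE_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(.*?)\s*$")


def parse_controller_details(text: str) -> dict[str, str]:
    """Hypothetical helper: map saved 'controller details' output to a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        # Drop leading mail-quote markers if the text was copied from a reply.
        line = line.lstrip("> ").rstrip()
        # Skip blank lines and "-----" underlines.
        if not line or set(line) == {"-"}:
            continue
        match = LINE_RE.match(line)
        if match:
            key, value = match.groups()
            fields[key] = value
    return fields


if __name__ == "__main__":
    sample = """\
     Controller Type: PERC 3/Di
 Controller Firmware: (Build #3170)
"""
    info = parse_controller_details(sample)
    print(info["Controller Type"], info["Controller Firmware"])
```

Note that the `Executing: controller details` echo line also parses as a key; it is harmless when only specific fields such as `Controller Firmware` are compared across machines.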
> 
> The kernel is 2.4.26 with modifications (grsecurity and
> others), with aacraid built in (not as a module).
> 
> From what I have read, this is a prime candidate for slow disk
> access, but I don't know whether the problem affects all
> PERC3/Dis, just some, or only a particular firmware version...
> 
> hdparm -t reports 52.89 MB/s for the RAID1 device and
> 38.32 MB/s for the RAID5 device, which isn't stunningly fast,
> but I don't know how that compares with other 2650s.
> 
> By comparison, a cheap IDE 2 GHz Athlon box of mine reported
> 37.87 MB/s, a Dell 600SC (IDE) 28.08 MB/s, and my very elderly
> 486DX (!) box a leisurely 1.10 MB/s, but then maybe the 2650
> is doing rather more.
> 
> Upgrading to a PERC4 would mean putting a card into the box,
> plus some expense, so it strikes me that the better alternative
> might be to ditch hardware RAID altogether and use the much
> improved software RAID.  At least then I could get at the
> kernel, and I believe the performance is now almost
> indistinguishable.
> 
> Another option may be to use a 2.6 kernel, which may be better
> at organising read/write ordering (it is in laptop mode, I am
> told!).  Either way, upgrading the hardware or moving to software
> RAID would of course require a complete reinstall.
> 
> Any comments?
> 
> John
> 
> John Logsdon                               "Try to make things as simple
> Quantex Research Ltd, Manchester UK         as possible but not simpler"
> j.logsdon at quantex-research.com           a.einstein at relativity.org
> +44(0)161 445 4951/G:+44(0)7768982349       www.quantex-research.com
> 
> 
> On Sat, 22 May 2004, Matt Domsch wrote:
> 
> > On Sat, May 22, 2004 at 12:31:13PM -0700, Sean Bruno - TELECOM wrote:
> > > OK.  I have two PE2650s right now that are exhibiting this issue.
> > > Basically they run for a few days and then "poof", they hard lock
> > > (no direct console, no logging).
> > > 
> > > They are still pingable, but inaccessible.  I can execute your test
> > > procedures, but what types of feedback are you looking for?
> > 
> > With the RAID read and write caches disabled via afacli as in my note
> > Thursday, does the system still hard lock as you describe?  If not,
> > great; let us know once a few days have passed in which you would have
> > expected it to fail.  If so, can you attach a serial console as in
> > Friday night's note and send the output from that, as well as the time
> > you think the system crashed and what you may have been running at the
> > time, including cron jobs.
> >  
> > > BTW, I am running both machines under RH AS 3; the two drives are in
> > > a standard RAID 1 configuration.
> > 
> > OK, RAID1 seems to be the most likely to fail, so if the above
> > prevents the failure, that would be good to know.  Basically, we're
> > trying to make sure that the workaround (disabling the caches) does in
> > fact solve everyone's failure case, and that there isn't another
> > failure mode we haven't yet reproduced and root-caused.
> > 
> > Thanks,
> > Matt
> > 
> > --
> > Matt Domsch
> > Sr. Software Engineer, Lead Engineer
> > Dell Linux Solutions linux.dell.com & www.dell.com/linux
> > Linux on Dell mailing lists @ http://lists.us.dell.com
> > 
> 
> 
> 
> 
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge at dell.com
> http://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq or search 
> the list archives at http://lists.us.dell.com/htdig/
> 
