PE2650 / Perc 3Di crash

James Bourne jbourne at mtroyal.ca
Sat Aug 9 12:46:16 CDT 2003


On Tue, 5 Aug 2003, Salyzyn, Mark wrote:

> Deanna no longer works for Adaptec :-(. I have been instructed to take over
> her duties regarding the aacraid and dpt_i2o drivers for Linux.
> 
> 2.4.22-pre6-ac1 has 100 ...
> 
> James Bourne has already indicated that with 100, the problem still occurs.
> Can someone experiment with, let's say, 64 on an offending system? Adaptec
> *is* exploring how to get back to 512 reliably on all variants of adapters,
> but I must remind you that not all adapters have this `low' limit. This is
> but a case of least common denominator ...

FYI, this is the kernel I posted on my web site, with 64 and write cache on.
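
For anyone who wants to try another value before rebuilding, here is a minimal sketch of changing the constant in a copy of the header text. It assumes the value is set by a plain numeric `#define AAC_NUM_IO_FIB` line in drivers/scsi/aacraid/aacraid.h; the exact spacing of that line and the helper name `set_num_io_fib` are assumptions for illustration, not part of the driver.

```python
import re

# Assumption: the header holds one line like "#define AAC_NUM_IO_FIB  512".
# Whitespace is matched loosely because the exact layout is not known here.
_DEFINE_RE = re.compile(
    r"^([ \t]*#[ \t]*define[ \t]+AAC_NUM_IO_FIB[ \t]+)(\d+)", re.MULTILINE
)


def set_num_io_fib(header_text: str, new_value: int) -> str:
    """Return header_text with the AAC_NUM_IO_FIB value replaced (hypothetical helper)."""
    patched, count = _DEFINE_RE.subn(
        lambda m: f"{m.group(1)}{new_value}", header_text
    )
    if count != 1:
        # Refuse to guess if the define is missing or appears more than once.
        raise ValueError(f"expected exactly one AAC_NUM_IO_FIB define, found {count}")
    return patched


if __name__ == "__main__":
    sample = "#define AAC_NUM_IO_FIB\t\t512\n"
    print(set_num_io_fib(sample, 64), end="")
```

The function only rewrites text; the kernel still has to be rebuilt and booted for the new limit to take effect.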

This morning at 07:00:52 I received a timeout on the RAID, then
shortly after that the adapter hung and I started to get I/O errors.
Here's the kernel log for the event.

Aug  9 07:00:52 midgarth-st kernel: aacraid:ID(1:03:0) Timeout detected on cmd[0x28]
Aug  9 07:00:52 midgarth-st kernel: aacraid:SCSI Channel[1]: Timeout Detected On 3 Command(s)
Aug  9 07:00:57 midgarth-st kernel: aacraid:SCSI Channel[1]: Timeout Detected On 1 Command(s)
Aug  9 07:01:07 midgarth-st kernel: aacraid:SCSI Channel[1]: Timeout Detected On 4 Command(s)
Aug  9 07:01:46 midgarth-st kernel: aacraid: <...repeats 2 more times>
Aug  9 07:01:46 midgarth-st kernel: aacraid:ID(1:03:0) Timeout detected on cmd[0x28]
Aug  9 07:01:46 midgarth-st kernel: aacraid:SCSI Channel[1]: Timeout Detected On 4 Command(s)
Aug  9 07:01:54 midgarth-st kernel: aacraid: Host adapter reset request. SCSI hang ?
Aug  9 07:03:35 midgarth-st kernel: 9660816
Aug  9 07:03:35 midgarth-st kernel:  I/O error: dev 08:11, sector 19660816
Aug  9 07:03:36 midgarth-st last message repeated 96 times
Aug  9 07:03:36 midgarth-st kernel:  I/O error280
Aug  9 07:03:36 midgarth-st kernel: 280
Aug  9 07:03:36 midgarth-st kernel:  I/O error: dev 08:11, sector 65987280
Aug  9 07:03:36 midgarth-st last message repeated 7 times
Aug  9 07:03:36 midgarth-st kernel:  I/O error: d280
Aug  9 07:03:36 midgarth-st kernel: 280
Aug  9 07:03:36 midgarth-st kernel:  I/O error: dev 08:12280
Aug  9 07:03:37 midgarth-st kernel:  I/O erro280
Aug  9 07:03:37 midgarth-st kernel:  I/O error: dev 08:11, sect22280
Aug  9 07:03:37 midgarth-st kernel: <4280
Aug  9 07:03:37 midgarth-st kernel:  I/O error: dev 08:2280
Aug  9 07:03:37 midgarth-st kernel:  I/O erro280
Aug  9 07:03:37 midgarth-st kernel:  I/O error: dev 08:11, sector 6598728280
Aug  9 07:03:37 midgarth-st kernel:  I/280
Aug  9 07:03:37 midgarth-st kernel:  I/O error: dev 08:1128280
Aug  9 07:03:37 midgarth-st kernel:  I/O error: dev 08:11, sector 659280280
Aug  9 07:03:37 midgarth-st kernel:  I/280
Aug  9 07:03:37 midgarth-st kernel:  I/O error: dev280
Aug  9 07:03:37 midgarth-st kernel:  I/O erro280
Aug  9 07:03:37 midgarth-st kernel:  I/O error:280
Aug  9 07:03:37 midgarth-st kernel:  I/O error: de280
Aug  9 07:03:37 midgarth-st kernel:  I/O error: dev 08280
Aug  9 07:03:37 midgarth-st kernel:  I/O error: dev 08:11, sector 65987280

It looks like syslog here is having a hard time keeping up with the errors,
hence the strangeness in the entries.
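
Since the entries are interleaved, a small filter helps pull out the ones that are still intact. The sketch below keeps only lines that end in a complete "I/O error: dev MM:mm, sector N" message and counts them per device and sector. It assumes the device is printed as hexadecimal major:minor and that major 8 is the SCSI disk driver with 16 minors per disk, so dev 08:11 would map to sdb1. The names `sd_name` and `summarize` are made up for this example. A merged line can still look well formed (for instance a sector number with extra digits appended), so the counts are only approximate.

```python
import re
from collections import Counter

# Only accept a complete message at the end of the line; truncated
# fragments such as "I/O error: d280" do not match.
IO_ERROR_RE = re.compile(
    r"I/O error: dev ([0-9a-f]{2}):([0-9a-f]{2}), sector (\d+)\s*$"
)


def sd_name(major: int, minor: int) -> str:
    """Map a device number to an sd name (assumes major 8, 16 minors per disk)."""
    if major != 8:
        return f"{major}:{minor}"
    disk = chr(ord("a") + minor // 16)
    part = minor % 16
    return f"sd{disk}{part}" if part else f"sd{disk}"


def summarize(lines):
    """Count intact I/O error entries per (device name, sector)."""
    counts = Counter()
    for line in lines:
        m = IO_ERROR_RE.search(line)
        if m is None:
            continue  # truncated or unrelated line
        major, minor = int(m.group(1), 16), int(m.group(2), 16)
        counts[(sd_name(major, minor), int(m.group(3)))] += 1
    return counts


if __name__ == "__main__":
    log = [
        "Aug  9 07:03:35 midgarth-st kernel:  I/O error: dev 08:11, sector 19660816",
        "Aug  9 07:03:36 midgarth-st kernel:  I/O error: dev 08:11, sector 65987280",
        "Aug  9 07:03:36 midgarth-st kernel:  I/O error: d280",
    ]
    for (dev, sector), n in summarize(log).items():
        print(dev, sector, n)
```

The "last message repeated N times" lines are not counted, so the totals understate how many errors were actually reported.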

Anything else I should try?

Regards
James Bourne


> 
> Good on the verify not finding an issue; this means that you are more likely
> to have *this* bug rather than a troublesome drive, but it does not
> necessarily mean that you do not have a troublesome drive. There is a
> possibility that the combination of outstanding commands and error recovery
> on troublesome drives could be giving us your headache.
> 
> Sincerely -- Mark Salyzyn
> 
> -----Original Message-----
> From: Matthias Pigulla [mailto:mp at webfactory.de]
> Sent: Tuesday, August 05, 2003 10:27 AM
> To: Salyzyn, Mark; James Bourne
> Cc: linux-poweredge at dell.com; linux-aacraid-devel at dell.com;
> matt_domsch at dell.com; deanna_bonds at adaptec.com
> Subject: RE: PE2650 / Perc 3Di crash
> 
> 
> Hi,
> 
> I added deanna_bonds at adaptec.com to the list of recipients, as the
> drivers/scsi/aacraid/README file says that this driver is supported by
> Adaptec and lists her as a contact:
> 
> Deanna Bonds <deanna_bonds at adaptec.com> (non-DASD support, PAE fibs and
>   64 bit, added new adaptec controllers, added new ioctls, changed scsi
>   interface to use new error handler, increased the number of fibs and
>   outstanding commands to a container)
> 
> ... she seems to have increased the number of fibs (whatever they are :).
> 
> I'd like to hear some more ("official") opinions on either decreasing
> AAC_NUM_IO_FIB to 100 and rebuilding one of the newer kernel versions, or
> immediately switching to 2.4.22-pre6-ac1 with a value of 100. Possible
> consequences, side effects?
> 
> Best regards,
> Matthias
> 
> PS. @Mark I did the dd as well as the afacli/disk verify and got no errors.
> 
> > -----Original Message-----
> > From: Salyzyn, Mark [mailto:mark_salyzyn at adaptec.com] 
> > Sent: Tuesday, August 5, 2003 16:13
> > To: 'James Bourne'; Matthias Pigulla
> > Cc: linux-poweredge at dell.com; linux-aacraid-devel at dell.com; 
> > matt_domsch at dell.com
> > Subject: RE: PE2650 / Perc 3Di crash
> > 
> > 
> > If this is the case ... AAC_NUM_IO_FIB, defined in 
> > drivers/scsi/aacraid/aacraid.h, which was originally set to 
> > 512 and is reduced to 116 in the 2.4.19 generic variant of 
> > the driver, might have to be reduced. The 2.4.20 driver has 
> > this value *increased* to 512 (!!!!)
> > 
> > In Adaptec's release of the driver it is reduced to a value 
> > of 100, only because we determined experimentally that 128 
> > would crash the adapter, and 100 did not under all test 
> > circumstances for a sample of card variants. I have *no* idea 
> > where the 116 came from in the . The theoretical maximum in 
> > the adapter is 512 with *one* array including the RAID 
> > splitting and other Firmware tasks which have to absorb some 
> > of the spares above this limit.
> > 
> > My suggestion is to drop the AAC_NUM_IO_FIB to 100, *maybe* 
> > 116, but *not* leave it at 512.
> > 
> > Sincerely -- Mark Salyzyn
> > 
> > Value of AAC_NUM_IO_FIB for various kernels:
> ...
> 
> > 
> > -----Original Message-----
> > From: James Bourne [mailto:jbourne at mtroyal.ab.ca]
> > Sent: Tuesday, August 05, 2003 9:43 AM
> > To: Matthias Pigulla
> > Cc: linux-poweredge at dell.com; linux-aacraid-devel at dell.com; 
> > matt_domsch at dell.com
> > Subject: Re: PE2650 / Perc 3Di crash
> > 
> > 
> > On Tue, 5 Aug 2003, Matthias Pigulla wrote:
> > 
> > > Hello everyone,
> > > 
> > > tonight, I lost one of my PowerEdge boxes with a kernel panic. I'm 
> > > running a PERC 3/Di, RAID10, on Debian woody with a custom 2.4.19 
> > > kernel. I'll try to provide all the information I can collect, and I 
> > > hope someone can help me track this issue down. Please bear with me 
> > > if it's long :)
> > 
> > FYI, this is what we have seen on our aacraid systems under 
> > heavy I/O and CPU load.  It's unclear at this time if this is 
> > a firmware issue or a driver issue, but I do know that now 
> > Dell and Adaptec are working on a resolution...
> > 
> > Turning off write caching provides a workaround: you will 
> > still get timeouts, but it looks as though the crashes will 
> > be prevented.
> > 
> > Regards
> > James Bourne
> 

-- 
James Bourne, Supervisor Data Centre Operations
Mount Royal College, Calgary, AB, CA
www.mtroyal.ab.ca

"There are only 10 types of people in this world: those who
understand binary and those who don't."



