kernel: aacraid: Host adapter reset request. SCSI hang ?

Salyzyn, Mark mark_salyzyn at adaptec.com
Wed Aug 27 14:28:00 CDT 2003


I may have a root cause on this issue, even though I have not been able to
duplicate it yet.

There is code that does the following in the driver:

	scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8 | SAM_STAT_TASK_SET_FULL;
	aac_io_done(scsicmd);
	return -1;

This is *wrong*. Because we use the new error handler, the nonzero return
causes the system to hold the command in the queue, yet we have also
completed the command as `BUSY'. On top of that, aac_io_done relocks
io_request_lock, so the caller has to unlock before calling it, and that
leaves a hole that SMP machines fill. By dropping the result and done calls
in these situations, and holding the locks in the caller of such routines,
I believe we will close this hole.
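
To make the double-ownership problem concrete, here is a minimal user-space
sketch, not the driver code. It assumes a simplified model of the mid-layer:
a nonzero return means "hold the command and retry it later". The names
mock_cmd, mock_io_done, midlayer_submit and the two queue_when_full_*
routines are hypothetical; only the three constants come from the SCSI
headers, and no locking is modelled.

```c
#include <assert.h>

/* Values as in the Linux SCSI headers, repeated so this builds in user space. */
#define DID_OK                 0x00
#define COMMAND_COMPLETE       0x00
#define SAM_STAT_TASK_SET_FULL 0x28

/* Hypothetical stand-in for the SCSI command: only what the sketch needs. */
struct mock_cmd {
	int result;     /* host byte << 16 | message byte << 8 | status byte */
	int done_calls; /* number of times the command was completed */
};

/* Hypothetical stand-in for aac_io_done(): marks the command completed. */
static void mock_io_done(struct mock_cmd *cmd)
{
	cmd->done_calls++;
}

/* The pattern quoted above: completes the command AND returns nonzero. */
static int queue_when_full_buggy(struct mock_cmd *cmd)
{
	cmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8 | SAM_STAT_TASK_SET_FULL;
	mock_io_done(cmd);
	return -1;
}

/* Proposed shape: no result, no done call, only "not accepted". */
static int queue_when_full_fixed(struct mock_cmd *cmd)
{
	(void)cmd; /* command is left untouched for the caller to requeue */
	return -1;
}

/*
 * Assumed mid-layer model: a nonzero return means the command is held for
 * a later retry. Returns 1 if the command ends up both held and completed.
 */
static int midlayer_submit(struct mock_cmd *cmd, int (*queue)(struct mock_cmd *))
{
	int held = queue(cmd) != 0;

	return held && cmd->done_calls > 0;
}

/* Small self-check showing the difference between the two shapes. */
static void run_demo(void)
{
	struct mock_cmd a = { 0, 0 };
	struct mock_cmd b = { 0, 0 };

	assert(midlayer_submit(&a, queue_when_full_buggy) == 1); /* owned twice */
	assert(midlayer_submit(&b, queue_when_full_fixed) == 0); /* owned once */
}
```

In the first case the command is both on the retry queue and completed; in
the second it has exactly one owner, which is the state the proposed change
aims for.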

Thanks, in part, to Josef Möllers for pointing out this locking problem
under SMP, serendipitously a day after I had noticed the other problem with
the inaccurate busy return sequences in the code and started making the
changes to investigate. Kill two birds with one stone.

I will report back on my tests of these changes, but will need a volunteer
with kernel compile experience to report on the success in resolving this
issue in the field *please*.

Sincerely -- Mark Salyzyn

-----Original Message-----
From: Javier Rodriguez [mailto:jlr at jlrconsulting.com]
Sent: Wednesday, August 27, 2003 9:18 AM
To: 'Stefano Turolla'
Cc: linux-aacraid-devel at dell.com; linux-poweredge at dell.com
Subject: RE: kernel: aacraid: Host adapter reset request. SCSI hang ?


Hi,

Thank you for the feedback. For us, re-enabling hyperthreading on the 2650s
causes the problem to return, so at least we know that it is playing a role
in the SCSI hang problem. As for high sustained I/O, with hyperthreading
enabled, we've encountered the problem with both low and high I/O rates.
With hyperthreading disabled, we can now sustain high I/O loads without a
problem.

Thanks,
Jav

-----Original Message-----
From: linux-aacraid-devel-admin at dell.com
[mailto:linux-aacraid-devel-admin at dell.com] On Behalf Of Stefano Turolla
Sent: Wednesday, August 27, 2003 6:27 AM
To: Javier Rodriguez
Cc: linux-aacraid-devel at dell.com; linux-poweredge at dell.com
Subject: RE: kernel: aacraid: Host adapter reset request. SCSI hang ?


Hello,
we have the same problem with PowerEdge 1650 and 2650.
We tried different kernel versions and Red Hat releases (7.3 and 9). Kernels
tried:
2.4.18-18.7.x
2.4.18-24.7.x
2.4.18-26.7.x
2.4.20-13.7
2.4.20-19.7
2.4.21.ac2-rc2
As a workaround I removed the RAID controller from some 1650s and reinstalled
the machines with only the SCSI interface connected. We have not had any more
crashes in the last month!

Besides, for most of our machines (1650) disabling hyperthreading makes no
sense, as they have one CPU (Pentium III from 1.4 to 1.7 GHz) with no
hyperthreading, of course. On the other hand we have other 2650s that have
been running for two or three months without problems, some of them with
hyperthreading disabled. A couple of other 2650s had several crashes when
they were used as FTP servers. I don't know what it really means, but it
seems to be something not really related to hyperthreading, but only to high
sustained I/O.

On Fri, 2003-08-22 at 13:07, Javier Rodriguez wrote:
> Hello,
>  
> Does anyone developing the aacraid driver have an update regarding the 
> problem below? Disabling HyperThreading (Logical Processor) within the 
> Dell 2650 BIOS has without a doubt circumvented the problem for us (as 
> well as a few others), but it would be nice to reenable the feature.
>  
> For reference, with HyperThreading disabled, we've been able to 
> successfully execute Red Hat's distribution of Linux kernel-2.4.20-9, 
> kernel-smp-2.4.20-9, kernel-2.4.20-13.9, kernel-smp-2.4.20-13.9, 
> kernel-2.4.20-18.9 and kernel-smp-2.4.20-18.9. We currently have two 
> Dell 2650s executing kernel-smp-2.4.20-18.9 for 70 days without 
> incident. Prior to disabling HyperThreading, our systems would 
> normally crash within 24 hours (no longer than 48 hours) with both the 
> smp and non-smp version of the kernel.
>  
> Thanks,
> Javier
>         -----Original Message-----
>         From: linux-aacraid-devel-admin at dell.com
>         [mailto:linux-aacraid-devel-admin at dell.com] On Behalf Of
>         Javier Rodriguez
>         Sent: Saturday, May 31, 2003 7:43 PM
>         To: linux-aacraid-devel at dell.com
>         Subject: kernel: aacraid: Host adapter reset request. SCSI
>         hang ?
>         
>         
>         Hello,
>          
>         We recently purchased two Dell PowerEdge 2650 servers with
>         PERC3/Di controllers. Both servers are executing RedHat Linux
>         9.0. On both servers we are encountering the following error:
>          
>         <<< Portion of server message log >>>
>         May 31 16:14:07 server1 kernel: aacraid: Host adapter reset
>         request. SCSI hang ?
>         May 31 16:14:17 server1 kernel: scsi: device set offline -
>         command error recover failed: host 0 channel 0 id 0 lun 0
>         May 31 16:14:17 server1 kernel: SCSI disk error : host 0
>         channel 0 id 0 lun 0 return code = 6000000
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         83200
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         13568
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         13616
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         83200
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         22030904
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         88348712
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         72976
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         13624
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         13752
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         13768
>         May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector
>         72976
>         <<< I/O error messages continue until the server is rebooted
>         >>>
>          
>          
>         Here are a few notes regarding the error and operating
>         environment:
>          
>         - The error occurs with RedHat's kernel RPMs
>         kernel-smp-2.4.20-9 and kernel-smp-2.4.20-13.9. As of today,
>         we are testing kernel-2.4.20-9 to determine if the problem
>         occurs under a non-smp environment.
>         - The time between failures varies from several hours to
>         several days.
>         - The failures occur both during light and heavy system loads.
>         - PowerEdge 2650 BIOS is at 1.10 A10
>         - Backplane firmware is at 1.01
>         - PERC3/Di BIOS is at V2.7-1 (build 3170)
>         - A full system diagnostics has been successfully executed on
>         both servers.
>         - The RAID media has been successfully 'verified' on both
>         servers.
>          
>         Thank you in advance for your assistance in helping to get
>         this problem resolved.
>          
>         Javier
>          
>          
>         JLR Consulting, PO Box 638, Bernville, PA 19506-0638
>         mailto:jlr at jlrconsulting.com
>          
-- 
+------+---------+--------+--------+--------+---------+--------+-------+
| Stefano Turolla                             Phone : +49 89 32006537  |
| UNIX System Manager                         Fax   : +49 89 32006380  |
| European Southern Observatory (ESO):        E-Mail: sturolla at eso.org |
| Karl-Schwarzschild-strasse 2 D-85748 Garching bei Muenchen           |
+------+---------+--------+--------+--------+---------+--------+-------+
Computers are like airconditioners ,
they stop working properly if you open WINDOWS


_______________________________________________
Linux-aacraid-devel mailing list
Linux-aacraid-devel at dell.com
http://lists.us.dell.com/mailman/listinfo/linux-aacraid-devel
Please read the FAQ at http://lists.us.dell.com/faq or search the list
archives at http://lists.us.dell.com/htdig/

