Fixing SCSI aacraid driver-related FS crashes on Dell PowerEdge 2650 with Perc 3/Di running kernel 2.6.8 in Debian stable?

Olivier Berger olivier.berger at int-evry.fr
Tue Jun 13 07:41:19 CDT 2006


Hi.

I've again experienced problems with the Perc 3/Di SCSI/RAID controller
on a PE 2650 running stock Debian stable's 2.6.8 (i686-smp) kernel...

I'd be happy to hear from others comparing their experiences with this
setup, because I'm not at all sure I have fixed all the problems :(

I thought the machine was safe after I had upgraded to sarge and 2.6,
since I hadn't had any more crashes (I had also turned off the cache and
upgraded the firmware before that). Then, recently, I had another FS
crash :( This time, at least, I had a copy of the kernel messages sent
to another host on the network, which helped me google my way to the
following patch.
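Forwarding kernel messages to another host is what preserved them here, since the local file system went read-only during the crash. A minimal sketch for the sysklogd daemon shipped with sarge, assuming a hypothetical receiving machine named `loghost`; the config path is a parameter so the rule can be tried on a scratch file first:

```shell
#!/bin/sh
# Sketch: forward kernel messages to a remote syslog host (sysklogd syntax).
# Assumption: "loghost" is a hypothetical name for the receiving machine.

add_remote_kern_logging() {
    conf="${1:-/etc/syslog.conf}"   # syslog config to edit (or a scratch copy)
    host="${2:-loghost}"            # hypothetical remote log host
    # Selector and action separated by a tab, as in classic syslog.conf files.
    line=$(printf 'kern.*\t@%s' "$host")
    # Append the rule only if it is not already present.
    if ! grep -qF "$line" "$conf" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf"
    fi
}
```

After editing the real `/etc/syslog.conf`, restart the daemon with `/etc/init.d/sysklogd restart`. The receiving host must also accept remote messages; with Debian's sysklogd this is usually done by setting `SYSLOGD="-r"` in `/etc/default/syslogd`.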

The last step I took was to patch the aacraid driver in Debian's 2.6.8
kernel with aac-remove-handle-aif.patch, and it seems to have worked so far...

Thanks in advance for your comments and suggestions.

Best regards,

------------------
Here's a copy of the post on my blog
(http://www-inf.int-evry.fr/~olberger/weblog/2006/06/13/fixing-scsi-aacraid-driver-related-fs-crashes-on-dell-poweredge-2650-with-perc-3di-runing-kernel-268-in-debian-stable):

I’ve experienced random file-system crashes on a Dell PowerEdge 2650
server with a Perc 3/Di SCSI controller, running a Debian testing system
with the standard 2.6.8 Debian kernel (i686+smp). They happened mainly
during disk-intensive operations (for instance, I suspect one such crash
occurred when amanda backup tasks were launched on the machine).

There have been numerous discussions on the linux-poweredge mailing list,
and many proposed fixes for this issue (search Google for details).

The symptoms look like this:

        Jun 9 20:52:58 myhost kernel: aacraid: Host adapter reset request. SCSI hang ?
        Jun 9 20:52:58 myhost kernel: aacraid: Host adapter reset request. SCSI hang ?
        Jun 9 20:52:58 myhost kernel: aacraid: SCSI bus appears hung
        Jun 9 20:52:58 myhost kernel: aacraid: SCSI bus appears hung
        Jun 9 20:52:58 myhost syslogd: /var/log/messages: Read-only file system
        Jun 9 20:52:58 myhost kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
        Jun 9 20:52:58 myhost kernel: SCSI error : <0 0 0 0> return code = 0x6000000
        Jun 9 20:52:58 myhost kernel: end_request: I/O error, dev sda, sector 401836233
        Jun 9 20:52:58 myhost kernel: scsi0 (0:0): rejecting I/O to offline device
        Jun 9 20:52:58 myhost kernel: scsi0 (0:0): rejecting I/O to offline device
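To check whether a machine has hit the same failure, one can count these signatures in the kernel log. A small sketch, assuming the patterns copied from the messages above and Debian's default `/var/log/kern.log`; other driver versions may word the messages differently:

```shell
#!/bin/sh
# Sketch: count aacraid failure signatures in a syslog file.
# Assumption: patterns are taken from the log excerpt above only.

count_aacraid_symptoms() {
    log="${1:-/var/log/kern.log}"
    # -c prints the number of matching lines (0 if none).
    grep -cE 'aacraid: (Host adapter reset request|SCSI bus appears hung)|Device offlined - not ready|rejecting I/O to offline device' "$log"
}
```

For example, `count_aacraid_symptoms /var/log/kern.log` prints the number of matching lines. Since the local file system may have gone read-only by then, this is most useful when run against the copy kept on a remote log host.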

I think I have come closer than ever to a solution by applying the
following steps:

     1. upgrading the Perc 3/Di controller's firmware (check the Dell site
        for the right version)
     2. disabling the cache with afacli:
                # afacli

                open AFA0

                AFA0 container set cache /read_cache_enable=FALSE /write_cache_enable=FALSE 0

                AFA0 container show cache 0
                Executing: container show cache 0

                Global Container Read Cache Size : 0
                Global Container Write Cache Size : 118259712

                Read Cache Setting : DISABLE
                Write Cache Setting : DISABLE
                Write Cache Status : Inactive, cache disabled

     3. patching the 2.6.8 aacraid driver’s code with
        aac-remove-handle-aif.patch, to avoid taking the controller offline
        in some circumstances (see the explanation in this post:
        http://marc.theaimsgroup.com/?l=linux-scsi&m=110252243627410&w=2):
              1. get the kernel-source-2.6.8 package from stable
              2. unpack it and apply the patch
              3. copy the running kernel’s configuration
                 (/boot/config-`uname -r`) to
                 /usr/src/kernel-source-2.6.8/.config
              4. make-kpkg clean
              5. make oldconfig
              6. make-kpkg --append_to_version=patchaacremovehandleaif --initrd kernel_image
              7. install the resulting kernel and reboot
     4. pray ;)
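The sub-steps of step 3 can be strung together as a script. This is a sketch rather than a tested procedure: it assumes the `kernel-source-2.6.8` and `kernel-package` packages are installed, that the tarball sits at `/usr/src/kernel-source-2.6.8.tar.bz2`, and that the patch (whose path goes in the hypothetical `PATCH` variable) applies with `-p1`. `DRY_RUN=1` is likewise a hypothetical switch that only prints the commands:

```shell
#!/bin/sh
# Sketch of the kernel rebuild steps for Debian sarge with kernel-package.
# Assumptions (not from the original post): tarball location, PATCH path,
# patch level -p1, and the DRY_RUN switch.
set -e

SRC=/usr/src/kernel-source-2.6.8
PATCH="${PATCH:-/root/aac-remove-handle-aif.patch}"
SUFFIX=patchaacremovehandleaif

# Echo a command, then run it inside directory $1 unless DRY_RUN=1.
run_in() {
    dir=$1; shift
    echo "+ (cd $dir) $*"
    if [ "${DRY_RUN:-0}" != 1 ]; then
        ( cd "$dir" && "$@" )
    fi
}

rebuild_kernel() {
    run_in /usr/src tar xjf kernel-source-2.6.8.tar.bz2      # unpack source
    run_in "$SRC" patch -p1 -i "$PATCH"                      # apply the patch
    run_in "$SRC" cp "/boot/config-$(uname -r)" .config      # reuse running config
    run_in "$SRC" make-kpkg clean
    run_in "$SRC" make oldconfig
    run_in "$SRC" make-kpkg --append_to_version="$SUFFIX" --initrd kernel_image
    # make-kpkg leaves the .deb in the parent directory; reboot stays manual.
    run_in /usr/src sh -c "dpkg -i kernel-image-*${SUFFIX}*.deb"
}
```

Source the file and run `DRY_RUN=1 rebuild_kernel` to review the plan, then `rebuild_kernel` as root. `make oldconfig` may still prompt for options missing from the old configuration.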

The machine had worked almost fine since it was moved to Debian’s 2.6.8
kernel with the cache disabled and the firmware upgraded, but it finally
crashed again…

I hope the patch to the aacraid driver will solve the issue.



-- 
Olivier BERGER <olivier.berger at int-evry.fr>
Ingénieur Recherche - Dept INF
INT Evry (http://www.int-evry.fr)
OpenPGP-Id: 1024D/6B829EEC




More information about the Linux-PowerEdge mailing list