legacy megaraid based cards suck?

Eric Belhomme eric.belhomme at icsb.fr
Tue Nov 6 04:39:25 CST 2007


Hi,

I own two refurbished legacy LSI MegaRAID-based cards:
- an LSI MegaRAID i4,
- an HP NetRAID-1M (LSI MegaRAID series 475, if I remember correctly).

They are listed as follows:

02:01.0 SCSI storage controller: Adaptec AHA-2940U2/U2W / 7890/7891
02:04.0 RAID bus controller: American Megatrends Inc. MegaRAID (rev 02)
02:07.0 RAID bus controller: American Megatrends Inc. MegaRAID (rev 21)
02:04.0 0104: 101e:1960 (rev 02)
02:07.0 0104: 101e:1960 (rev 21)

The Adaptec SCSI controller is onboard (the CD-ROM drive and the DAT drive 
will be connected to it).

The devices are detected as follows:

holy:/usr/src/linux# dmesg|grep scsi
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
scsi1 : LSI Logic MegaRAID driver
scsi[1]: scanning scsi channel 0 [Phy 0] for non-raid devices
scsi[1]: scanning scsi channel 1 [Phy 1] for non-raid devices
scsi2 : LSI Logic MegaRAID driver
scsi[2]: scanning scsi channel 0 [Phy 0] for non-raid devices
scsi[1]: scanning scsi channel 2 [Phy 2] for non-raid devices
scsi[1]: scanning scsi channel 3 [Phy 3] for non-raid devices
scsi[1]: scanning scsi channel 4 [virtual] for logical drives
scsi 1:4:0:0: Direct-Access     MegaRAID LD 0 RAID5  715G N661 PQ: 0 ANSI: 2
scsi 1:4:1:0: Direct-Access     MegaRAID LD 1 RAID5  343G N661 PQ: 0 ANSI: 2
scsi 2:0:5:0: Processor         HP       SAFTE; U160/M BP 1023 PQ: 0 ANSI: 2
scsi[2]: scanning scsi channel 1 [virtual] for logical drives
scsi 2:1:0:0: Direct-Access     MegaRAID LD 0 RAID5   69G   H  PQ: 0 ANSI: 2
sd 1:4:0:0: Attached scsi generic sg0 type 0
sd 1:4:1:0: Attached scsi generic sg1 type 0
scsi 2:0:5:0: Attached scsi generic sg2 type 3
sd 2:1:0:0: Attached scsi generic sg3 type 0
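
To compare what each kernel detects, the logical-drive lines in the listing above can be pulled out with a short script. This is only a minimal sketch: the function name parse_logical_drives and the regular expression are my own, and they assume the line layout shown above, which other driver versions may not keep.

```python
import re

# Matches lines such as:
# scsi 1:4:0:0: Direct-Access     MegaRAID LD 0 RAID5  715G N661 PQ: 0 ANSI: 2
# (layout assumed from the dmesg listing above)
LD_LINE = re.compile(
    r"^scsi (\d+):(\d+):(\d+):(\d+): Direct-Access\s+MegaRAID\s+"
    r"LD (\d+) (RAID\d+)\s+(\d+)G"
)


def parse_logical_drives(dmesg_text):
    """Return one dict per MegaRAID logical drive found in dmesg output."""
    drives = []
    for line in dmesg_text.splitlines():
        m = LD_LINE.match(line.strip())
        if m is None:
            continue  # not a logical-drive line (e.g. the SAF-TE processor)
        host, channel, target, lun, ld, level, size = m.groups()
        drives.append({
            "host": int(host), "channel": int(channel),
            "target": int(target), "lun": int(lun),
            "ld": int(ld), "level": level, "size_g": int(size),
        })
    return drives


SAMPLE = """\
scsi 1:4:0:0: Direct-Access     MegaRAID LD 0 RAID5  715G N661 PQ: 0 ANSI: 2
scsi 1:4:1:0: Direct-Access     MegaRAID LD 1 RAID5  343G N661 PQ: 0 ANSI: 2
scsi 2:0:5:0: Processor         HP       SAFTE; U160/M BP 1023 PQ: 0 ANSI: 2
scsi 2:1:0:0: Direct-Access     MegaRAID LD 0 RAID5   69G   H  PQ: 0 ANSI: 2
"""

for drive in parse_logical_drives(SAMPLE):
    print(drive)
```

Feeding it the output of dmesg under each kernel would make it easy to see whether a logical drive disappears or changes address.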


Of course, the system (Debian GNU/Linux Etch) is installed on the SCSI 
RAID5 volume (/dev/sdc), and both RAID cards are detected by the 
megaraid_mbox driver:

holy:/usr/src/linux# lsmod|grep raid
megaraid_mbox          30448  3
megaraid_mm            10464  1 megaraid_mbox
scsi_mod              136620  6 sg,sd_mod,megaraid_mbox,aic7xxx,scsi_transport_spi,libata


The problem is that the volumes sda and sdb (the ones attached to the i4 
RAID card) go offline when they are accessed, and I then get this kind 
of log:

Nov  5 21:36:05 holy kernel: megaraid: aborting-1030 cmd=2a <c=4 t=1 l=0>
Nov  5 21:36:05 holy kernel: megaraid abort: 1030:1[255:129], fw owner
Nov  5 21:36:05 holy kernel: megaraid: aborting-1031 cmd=2a <c=4 t=1 l=0>
Nov  5 21:36:05 holy kernel: megaraid abort: 1031:0[255:129], fw owner
Nov  5 21:36:05 holy kernel: megaraid: aborting-1032 cmd=2a <c=4 t=1 l=0>
Nov  5 21:36:05 holy kernel: megaraid abort: 1032:3[255:129], fw owner
Nov  5 21:36:05 holy kernel: megaraid: aborting-1033 cmd=2a <c=4 t=1 l=0>
Nov  5 21:36:05 holy kernel: megaraid abort: 1033:2[255:129], fw owner
Nov  5 21:36:05 holy kernel: megaraid: aborting-1034 cmd=2a <c=4 t=1 l=0>
Nov  5 21:36:05 holy kernel: megaraid abort: 1034:4[255:129], fw owner
Nov  5 21:36:05 holy kernel: megaraid: 5 outstanding commands. Max wait 300 sec
Nov  5 21:36:05 holy kernel: megaraid mbox: Wait for 5 commands to complete:300
Nov  5 21:36:10 holy kernel: megaraid mbox: Wait for 5 commands to complete:295
Nov  5 21:36:15 holy kernel: megaraid mbox: Wait for 5 commands to complete:290
...
Nov  5 21:40:56 holy kernel: megaraid mbox: Wait for 5 commands to complete:10
Nov  5 21:41:01 holy kernel: megaraid mbox: Wait for 5 commands to complete:5
Nov  5 21:41:06 holy kernel: megaraid mbox: critical hardware error!
Nov  5 21:41:06 holy kernel: megaraid: hw error, cannot reset
Nov  5 21:41:06 holy kernel: megaraid: hw error, cannot reset
Nov  5 21:41:06 holy kernel: sd 0:4:1:0: scsi: Device offlined - not ready after error recovery
Nov  5 21:41:06 holy last message repeated 4 times
Nov  5 21:41:06 holy kernel: sd 0:4:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
Nov  5 21:41:06 holy kernel: end_request: I/O error, dev sdb, sector 351721439
Nov  5 21:41:06 holy kernel: sd 0:4:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
Nov  5 21:41:06 holy kernel: end_request: I/O error, dev sdb, sector 351721566
Nov  5 21:41:06 holy kernel: sd 0:4:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
Nov  5 21:41:06 holy kernel: end_request: I/O error, dev sdb, sector 351721694
Nov  5 21:41:06 holy kernel: sd 0:4:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
Nov  5 21:41:06 holy kernel: end_request: I/O error, dev sdb, sector 351721822
Nov  5 21:41:06 holy kernel: sd 0:4:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
Nov  5 21:41:06 holy kernel: end_request: I/O error, dev sdb, sector 351721950
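
A log like this can be condensed with a few regular expressions. The sketch below is illustrative only: the function name summarize and the patterns are my own, written against the messages quoted above, and it assumes each message sits on a single line (lines wrapped by the mail client would need to be rejoined first). The aborted commands all carry cmd=2a, which is the SCSI WRITE(10) opcode.

```python
import re
from collections import Counter

# Patterns written against the messages quoted above (an assumption:
# other driver versions may word them differently).
ABORT = re.compile(r"megaraid: aborting-(\d+) cmd=([0-9a-f]+)")
OFFLINED = re.compile(r"sd (\d+:\d+:\d+:\d+): scsi: Device offlined")
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def summarize(log_text):
    """Count aborted commands per opcode and collect offlined devices
    and failed sectors per block device."""
    aborts = Counter()
    offlined = set()
    io_errors = {}
    for line in log_text.splitlines():
        m = ABORT.search(line)
        if m:
            aborts[m.group(2)] += 1  # key is the hex opcode, e.g. "2a"
            continue
        m = OFFLINED.search(line)
        if m:
            offlined.add(m.group(1))
            continue
        m = IO_ERROR.search(line)
        if m:
            io_errors.setdefault(m.group(1), []).append(int(m.group(2)))
    return {"aborts": dict(aborts), "offlined": offlined, "io_errors": io_errors}


SAMPLE = """\
Nov  5 21:36:05 holy kernel: megaraid: aborting-1030 cmd=2a <c=4 t=1 l=0>
Nov  5 21:36:05 holy kernel: megaraid: aborting-1031 cmd=2a <c=4 t=1 l=0>
Nov  5 21:41:06 holy kernel: sd 0:4:1:0: scsi: Device offlined - not ready after error recovery
Nov  5 21:41:06 holy kernel: end_request: I/O error, dev sdb, sector 351721439
Nov  5 21:41:06 holy kernel: end_request: I/O error, dev sdb, sector 351721566
"""

print(summarize(SAMPLE))
```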


I should point out that this kind of error also occurred on the NetRAID 
controller: I was not able to install Debian on the host because the 
install process systematically crashed, with the same kernel messages, 
while copying packages from the official Etch CD-ROM to the RAID volume. 
The workaround I found was to update the NetRAID firmware to the latest 
version, which allowed me to complete the installation, but the 
problem reappeared after an "apt-get upgrade" that installed the 
"linux-image-2.6.18-4-686" kernel.
I solved it again by installing "linux-image-2.6.22-1-686" from Debian 
Testing, but note that a vanilla 2.6.23-1 kernel freshly downloaded from 
http://www.kernel.org causes the problem again...

Concerning the i4 card, here is how it is configured:
IDE channel 0 master is a Fujitsu 250 GB drive,
IDE channel 0 slave is a Seagate 120 GB drive,
IDE channel 1 master is a Fujitsu 250 GB drive,
IDE channel 1 slave is a Seagate 120 GB drive,
IDE channel 2 master is a Fujitsu 250 GB drive,
IDE channel 2 slave is a Seagate 120 GB drive,
IDE channel 3 master is a Fujitsu 250 GB drive,
IDE channel 3 slave is a Seagate 120 GB drive.
Logical volume 0 is a RAID5 volume with all 4 Fujitsu drives,
logical volume 1 is a RAID5 volume with all 4 Seagate drives,
and the firmware revision is the latest available on the LSI web site.
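
As a sanity check on this layout, the usable size of each RAID5 set can be worked out from the nominal drive sizes: one drive's worth of space goes to parity, so the capacity is (n - 1) times the smallest member. The helper name raid5_usable_gb below is hypothetical, and the sizes are the nominal ones from the list above, not measured values.

```python
def raid5_usable_gb(drive_sizes_gb):
    """Usable capacity of a RAID5 set: (number of drives - 1) * smallest drive."""
    if len(drive_sizes_gb) < 3:
        raise ValueError("RAID5 needs at least three drives")
    return (len(drive_sizes_gb) - 1) * min(drive_sizes_gb)


# Nominal sizes taken from the configuration list above.
print(raid5_usable_gb([250, 250, 250, 250]))  # 750 (logical volume 0)
print(raid5_usable_gb([120, 120, 120, 120]))  # 360 (logical volume 1)
```

These figures are close to the 715G and 343G shown in the dmesg listing; the difference is presumably a matter of units, which the listing does not state.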

It seems sda can be accessed safely (at least I was able to format it as 
xfs, but I haven't tried to stress it yet), but I cannot even manage to 
format sdb (the kernel messages listed above occurred while formatting sdb).
So maybe it is not such a good idea to use both master and slave channels 
on this card?
Moreover, this card previously worked in another computer with only the 
4 Seagate drives (each as single master on its own IDE channel), and it 
worked fine with the megaraid_mbox driver (and no other megaraid device...).

Of course, the dellmgr utility reports that all logical volumes are online 
and all physical drives are alive.


So my questions are:

- Why is the NetRAID's behaviour so erratic between kernel revisions? Is 
it due to the fact that there are 2 controllers in the host?
- Is there a known bug when multiple MegaRAID-based controllers are in 
use that could explain my problems?
- I read Documentation/scsi/megaraid.txt, provided in the kernel archive, to 
find out how to get the health status of controllers and volumes, and I also 
browsed the /sys entries, but I didn't find anything usable. So how do I 
monitor RAID status with the newgen megaraid driver?
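
As a partial stopgap for the last question, the kernel's own view of each SCSI disk can be read from the "state" attribute under /sys/block/<dev>/device/. The sketch below only reports that attribute (for example "running" or "offline"); it says nothing about the health of the array itself (degraded, rebuilding), which is exactly the information I am missing. The function names and the sysfs_root parameter are my own, the latter added only so the code can be tried against a test directory.

```python
import os


def scsi_disk_states(sysfs_root="/sys"):
    """Map each block device exposing a SCSI 'state' attribute to its value."""
    states = {}
    block_dir = os.path.join(sysfs_root, "block")
    if not os.path.isdir(block_dir):
        return states
    for name in sorted(os.listdir(block_dir)):
        state_file = os.path.join(block_dir, name, "device", "state")
        try:
            with open(state_file) as fh:
                states[name] = fh.read().strip()
        except OSError:
            continue  # not a SCSI disk (loop, ram, ...) or not readable
    return states


def not_running(states):
    """Return the devices whose state is anything other than 'running'."""
    return [name for name, state in states.items() if state != "running"]


if __name__ == "__main__":
    current = scsi_disk_states()
    for name, state in current.items():
        print(name, state)
    print("needs attention:", not_running(current))
```

Run from cron, this would at least flag a volume that the kernel has offlined, as happened to sdb in the log above.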

Sorry for this long post, and many thanks for your support :)

-- 
Rico



More information about the Linux-PowerEdge mailing list