PERC 4e/Di / Red Hat ES 3 hang on high I/O load

Mike M saetaes at gmail.com
Sun May 21 11:10:52 CDT 2006


Hi list, I'm hoping someone can help me out here...

We have about 450 PowerEdge 2850s, purchased late last summer in the
July/August timeframe.  All are running hardware RAID on the internal
PERC 4e/Di controller.  The OS is Red Hat Enterprise Linux 3, Update 5,
kernel 2.4.21-32.0.1.  The MegaRAID driver is version v2.10.8.2-RH1 and
the PERC firmware is 521X:H430.

Over the past week we have had at least 10 machines hang with the
errors shown in the attached screenshots.  (If you can't see the JPEGs
or they get stripped, there are loads of EXT3-fs errors, along with the
infamous "megaraid: aborting" and "megaraid: hardware error, cannot
reset" messages.)  The only way to fix this is to physically power down
the machines.  When rebooted, it's as if the OS had "lost" its disks:
there is nothing in /var/log/messages, and nothing else out of the
ordinary other than the fact that the machine wasn't shut down properly.

I/O load has been higher than normal, but nothing the controller
shouldn't be able to handle.  In fact, we ran similar I/O loads on
these boxes in the past, and they didn't do this.  I'm stumped, as is
Dell's technical support.

Has anyone else seen this, and if so, were you able to find a resolution?

A little more information below from a machine that crashed last night:

[root@host root]# cat /proc/megaraid/hba0/raiddrives-0-9
Logical drive: 0:, state: optimal
Span depth:  1, RAID level:  5, Stripe size:128, Row size:  6
Read Policy: Adaptive, Write Policy: Write back, Cache Policy: Direct IO


[root@host root]# cat /proc/megaraid/hba0/diskdrives-ch0
Channel: 0 Id: 0 State: Online.
  Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 1 State: Online.
  Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 2 State: Online.
  Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 3 State: Online.
  Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 4 State: Online.
  Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 5 State: Online.
  Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
  Type:   Direct-Access                      ANSI SCSI revision: 03
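
For anyone who wants to check drive state across a lot of hosts, here is
a minimal sketch of pulling the per-drive state out of output in the
format shown above.  It assumes the "Channel: N Id: N State: X." line
layout seen here; the function names and the sample text are made up for
illustration and are not part of the megaraid driver, and the "Failed"
state string in the sample is a hypothetical value, not something we saw.

```python
import re

# Matches lines such as "Channel: 0 Id: 3 State: Online."
STATE_RE = re.compile(r"^Channel:\s*(\d+)\s+Id:\s*(\d+)\s+State:\s*(.+?)\.?\s*$")


def parse_diskdrives(text):
    """Return (channel, id, state) tuples found in diskdrives-chN style text."""
    drives = []
    for line in text.splitlines():
        m = STATE_RE.match(line)
        if m:
            drives.append((int(m.group(1)), int(m.group(2)), m.group(3)))
    return drives


def not_online(drives):
    """Return only the drives whose state is anything other than "Online"."""
    return [d for d in drives if d[2] != "Online"]


# Illustrative sample: the first entry is copied from the output above,
# the second uses a hypothetical "Failed" state to exercise the filter.
SAMPLE = """\
Channel: 0 Id: 0 State: Online.
  Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 1 State: Failed.
  Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
  Type:   Direct-Access                      ANSI SCSI revision: 03
"""

if __name__ == "__main__":
    for channel, drive_id, state in not_online(parse_diskdrives(SAMPLE)):
        print("channel %d id %d is %s" % (channel, drive_id, state))
```

On a live host the text would come from reading
/proc/megaraid/hba0/diskdrives-ch0 instead of SAMPLE.  On the machine
above all six drives report Online, so the filter would return nothing.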


Thanks in advance for any help,

Mike
-------------- next part --------------
A non-text attachment was scrubbed...
Name: megaraid-crash1.jpg
Type: image/jpeg
Size: 141666 bytes
Desc: not available
Url : http://lists.us.dell.com/pipermail/linux-poweredge/attachments/20060521/620d67a3/attachment-0003.jpg 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: megaraid-crash2.jpg
Type: image/jpeg
Size: 130689 bytes
Desc: not available
Url : http://lists.us.dell.com/pipermail/linux-poweredge/attachments/20060521/620d67a3/attachment-0004.jpg 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: megaraid-crash3.jpg
Type: image/jpeg
Size: 139415 bytes
Desc: not available
Url : http://lists.us.dell.com/pipermail/linux-poweredge/attachments/20060521/620d67a3/attachment-0005.jpg 

