PERC 4e/Di/Red Hat ES 3 hang on high I/O load

Harald_Jensas at Dell.com
Mon May 22 01:29:46 CDT 2006


> -----Original Message-----
> From: linux-poweredge-bounces at dell.com 
> [mailto:linux-poweredge-bounces at dell.com] On Behalf Of Mike M
> Sent: 21 May 2006 21:44
> To: linux-poweredge-Lists
> Subject: PERC 4e/Di/Red Hat ES 3 hang on high I/O load
> 
> **Apologies for the possible duplicate message; it appears 
> the attachments to the original caused it to be trapped by 
> the listserv
> software...**
> 
> 
> Hi list, I'm hoping someone can help me out here:
> 
> We have about 450 PowerEdge 2850s purchased late last summer, 
> in the July/August timeframe.  All are running hardware RAID 
> on the internal PERC 4e/Di controller.  OS is Red Hat 
> Enterprise Linux 3, Update 5, kernel 2.4.21-32.0.1.  MegaRAID 
> driver version v2.10.8.2-RH1, PERC Firmware 521X:H430.
> 
> Over the past week we have had at least 10 machines hang with 
> the errors shown in the attached screenshots.  (If you can't 
> see the jpegs or they get stripped, there are loads of 
> EXT3-fs errors, the infamous
> "megaraid: aborting", and "megaraid: hardware error, cannot reset"
> messages).  The only way to fix this is to physically power 
> down the machines.  When rebooted, it's as if the OS "lost" 
> its disks - nothing in /var/log/messages, nothing else out 
> of the ordinary other than the fact that the machine wasn't 
> shut down properly.
> 
> I/O load has been higher than normal, but nothing the 
> controller shouldn't be able to handle.  In fact, we ran 
> similar I/O loads on these boxes in the past, and they didn't 
> do this.  I'm stumped, as is Dell's technical support.
> 
> Has anyone else seen this, and if so, were you able to find a 
> resolution?
> 
> A little more information below from a machine that crashed 
> last night:
> 
> [root at host root]# cat /proc/megaraid/hba0/raiddrives-0-9
> Logical drive: 0:, state: optimal
> Span depth:  1, RAID level:  5, Stripe size:128, Row size:  6 
> Read Policy: Adaptive, Write Policy: Write back, Cache 
> Policy: Direct IO
> 
> 
> [root at host root]# cat /proc/megaraid/hba0/diskdrives-ch0
> Channel: 0 Id: 0 State: Online.
>   Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
>   Type:   Direct-Access                      ANSI SCSI revision: 03
> Channel: 0 Id: 1 State: Online.
>   Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
>   Type:   Direct-Access                      ANSI SCSI revision: 03
> Channel: 0 Id: 2 State: Online.
>   Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
>   Type:   Direct-Access                      ANSI SCSI revision: 03
> Channel: 0 Id: 3 State: Online.
>   Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
>   Type:   Direct-Access                      ANSI SCSI revision: 03
> Channel: 0 Id: 4 State: Online.
>   Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
>   Type:   Direct-Access                      ANSI SCSI revision: 03
> Channel: 0 Id: 5 State: Online.
>   Vendor: MAXTOR    Model: ATLAS15K2_146SCA  Rev: JT00
>   Type:   Direct-Access                      ANSI SCSI revision: 03
> 
> 
> Thanks in advance for any help,
> 
> Mike
> 


I don't know whether it will help, but a later megaraid driver, v2.10.10.1, appears to be available (you are running v2.10.8.2-RH1):

perc-EM64T-2.10.10.1-A03.tar.gz


Maybe setting up your servers to log to a syslog daemon on another system will give you more information. Once access to the disks is lost, the system can no longer write to /var/log/messages, which would explain why nothing shows up there after the reboot.
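
As a minimal sketch of the remote logging idea, assuming the stock sysklogd daemon that ships with Red Hat Enterprise Linux 3 and a hypothetical log host called loghost.example.com (substitute your own), the configuration might look something like this:

```
# On each PowerEdge 2850, add to /etc/syslog.conf:
# forward every facility and priority to the remote host over UDP port 514
# (loghost.example.com is a placeholder, not a real host)
*.*     @loghost.example.com

# On the log host, in /etc/sysconfig/syslog:
# -r tells syslogd to accept messages from the network
SYSLOGD_OPTIONS="-m 0 -r"

# Then restart the daemon on both ends:
service syslog restart
```

Running "logger test" on one of the servers should then produce a line in /var/log/messages on the log host, provided UDP port 514 is not blocked in between. Messages sent this way leave over the network rather than the disks, so there is a reasonable chance the megaraid and EXT3-fs errors will be captured even after the controller stops responding.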


//
Harald Jensås


