PE1950 PERC6/E crashes

Howard Powell hbp4c at virginia.edu
Thu Sep 3 18:17:21 CDT 2009


Hi everyone,

I have a Dell PE1950 running an MD1000 RAID array via a PERC 6/E
controller card.  The system is running 64-bit CentOS 5.3 with a stock
2.6.18-128.7.1.el5 kernel.

I just updated the PE1950 BIOS to version: 2.6.1
I updated the PE1950 BMC to 2.37
I just updated the PERC 6/E firmware to: 6.2.0-0013
The PERC 6/E driver is the "in box" driver provided by the OS:  
00.00.04.01-RH1
The MD1000 firmware is version A.04

From what I can tell, everything is up to date.

Under heavy I/O load, such as when moving several hundred GB of files
over a few hours, the megasas controller will panic and the MD1000 RAID
will go offline.  I've appended to this email the kernel log output from
when this happens.  This problem has been occurring for months under
various kernels and various (older) firmware and drivers.

Any ideas how to tackle this issue?

Thanks!
Howard

------- begin log -------

Sep  3 18:17:29 halo kernel: irq 169: nobody cared (try booting with  
the "irqpoll" option)
Sep  3 18:30:32 halo kernel:  [<ffffffff800b7c61>] __do_IRQ+0xbd/0x103
Sep  3 18:30:32 halo kernel:  [<ffffffff80011fc4>] __do_softirq 
+0x89/0x133
Sep  3 18:30:32 halo kernel:  [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
Sep  3 18:30:32 halo kernel:  [<ffffffff8018d056>] acpi_processor_idle 
+0x0/0x440
Sep  3 18:30:32 halo kernel:  [<ffffffff8005d615>] ret_from_intr+0x0/0xa
Sep  3 18:30:32 halo kernel:  <EOI>  [<ffffffff8018cfe7>]  
acpi_safe_halt+0x25/0x36
Sep  3 18:30:32 halo kernel:  [<ffffffff8018d1dd>] acpi_processor_idle 
+0x187/0x440
Sep  3 18:30:32 halo kernel:  [<ffffffff8018d056>] acpi_processor_idle 
+0x0/0x440
Sep  3 18:30:32 halo automount[4454]: 2 remaining in /net
Sep  3 18:30:32 halo mountd[6799]: authenticated mount request from  
128.143.57.123:838 for /local (/local)
Sep  3 18:30:32 halo syslogd: /var/log/secure: Read-only file system
Sep  3 18:30:32 halo auditd[3954]: Record was not written to disk  
(Read-only file system)
Sep  3 18:30:32 halo nagios: Error: Unable to create temp file for  
writing status data!
Sep  3 18:30:32 halo mountd[6799]: could not open /var/lib/nfs/rmtab  
for locking
Sep  3 18:30:33 halo kernel:  [<ffffffff8018d056>] acpi_processor_idle 
+0x0/0x440
Sep  3 18:30:33 halo auditd[3954]: write: Audit daemon detected an  
error writing an event to disk (Read-only file system)
Sep  3 18:30:33 halo kernel:  [<ffffffff80048d79>] cpu_idle+0x95/0xb8
Sep  3 18:30:33 halo automount[4454]: attempting to mount entry /net/ 
astro64
Sep  3 18:30:33 halo kernel:  [<ffffffff80076c3f>] start_secondary 
+0x45a/0x469
Sep  3 18:30:33 halo kernel:
Sep  3 18:30:33 halo kernel: handlers:
Sep  3 18:30:33 halo kernel: [<ffffffff880b4443>] (megasas_isr 
+0x0/0x45 [megaraid_sas])
Sep  3 18:30:33 halo kernel: [<ffffffff880b4443>] (megasas_isr 
+0x0/0x45 [megaraid_sas])
Sep  3 18:30:33 halo kernel: Disabling IRQ #169
Sep  3 18:30:33 halo auditd[3954]: Record was not written to disk  
(Read-only file system)
Sep  3 18:30:33 halo auditd[3954]: write: Audit daemon detected an  
error writing an event to disk (Read-only file system)
Sep  3 18:30:34 halo auditd[8601]: Audit daemon failed to exec (null)
Sep  3 18:30:34 halo auditd[8601]: The audit daemon is exiting.
Sep  3 18:30:34 halo kernel: sd 1:2:0:0: megasas: RESET -341995 cmd=8a  
retries=0
Sep  3 18:30:34 halo kernel: megasas: [ 0]waiting for 108 commands to  
complete
Sep  3 18:30:34 halo auditd[8600]: Audit daemon failed to exec (null)
Sep  3 18:30:34 halo mountd[6799]: authenticated mount request from  
128.143.57.233:993 for /local (/local)
Sep  3 18:30:34 halo kernel: megasas: [ 5]waiting for 108 commands to  
complete
Sep  3 18:30:34 halo auditd[8600]: The audit daemon is exiting.
Sep  3 18:30:34 halo mountd[6799]: could not open /var/lib/nfs/rmtab  
for locking
Sep  3 18:30:36 halo kernel: megasas: [40]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [45]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [50]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [55]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [60]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [65]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [70]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [75]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [80]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [85]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [90]waiting for 108 commands to  
complete
Sep  3 18:30:36 halo kernel: megasas: [95]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [100]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [105]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [110]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [115]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [120]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [125]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [130]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [135]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [140]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: sd 0:2:0:0: megasas: RESET -149624 cmd=2a  
retries=0
Sep  3 18:30:37 halo kernel: megasas: [ 0]waiting for 16 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: reset successful
Sep  3 18:30:37 halo kernel: megasas: [145]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [150]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: sd 0:2:0:0: megasas: RESET -149624 cmd=0  
retries=0
Sep  3 18:30:37 halo kernel: megasas: [ 0]waiting for 1 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [155]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: reset successful
Sep  3 18:30:37 halo kernel: sd 0:2:0:0: megasas: RESET -149624 cmd=2a  
retries=0
Sep  3 18:30:37 halo kernel: megasas: reset successful
Sep  3 18:30:37 halo kernel: megasas: [160]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [165]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [170]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: [175]waiting for 108 commands to  
complete
Sep  3 18:30:37 halo kernel: sd 0:2:0:0: megasas: RESET -149624 cmd=0  
retries=0
Sep  3 18:30:37 halo kernel: megasas: [ 0]waiting for 1 commands to  
complete
Sep  3 18:30:37 halo kernel: megasas: reset successful
Sep  3 18:30:37 halo kernel:
Sep  3 18:30:37 halo kernel: megasas[1]: Dumping Frame Phys Address of  
all pending cmds in FW
Sep  3 18:30:37 halo kernel: megasas[1]: Total OS Pending cmds : 108
Sep  3 18:30:37 halo kernel:
Sep  3 18:30:37 halo kernel: megasas[1]: 64 bit SGLs were sent to FW
Sep  3 18:30:37 halo kernel: megasas[1]: Pending OS cmds in FW :
Sep  3 18:30:37 halo kernel: megasas[1]: Frame addr :0x37fed400 :  
<3>megasas[1]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo :  
0x316257e2, lba_hi : 0x1, sense_buf addr : 0x37fec080,sge count : 0x50
Sep  3 18:30:37 halo kernel:
Sep  3 18:30:37 halo kernel: megasas[1]: Frame addr :0x37fea000 :  
<3>megasas[1]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo :  
0xa387d932, lba_hi : 0x0, sense_buf addr : 0x37fec400,sge count : 0x50
Sep  3 18:30:37 halo kernel:
Sep  3 18:30:37 halo kernel: megasas[1]: Frame addr :0x37fea400 :  
<3>megasas[1]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo :  
0xbf50544a, lba_hi : 0x1, sense_buf addr : 0x37fec480,sge count : 0x50
Sep  3 18:30:37 halo kernel:
Sep  3 18:30:37 halo kernel: megasas[1]: Frame addr :0x37fe9000 :  
<3>megasas[1]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo :  
0x1df769ea, lba_hi : 0x2, sense_buf addr : 0x37fec600,sge count : 0x50
Sep  3 18:30:37 halo kernel:
Sep  3 18:30:37 halo kernel: megasas[1]: Frame addr :0x37fe5000 :  
<3>megasas[1]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo :  
0x1df77b6a, lba_hi : 0x2, sense_buf addr : 0x37fece00,sge count : 0x4a
Sep  3 18:30:37 halo kernel:
Sep  3 18:30:37 halo kernel: megasas[1]: Frame addr :0x37fe5c00 :  
<3>megasas[1]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo :  
0xa387d6b2, lba_hi : 0x0, sense_buf addr : 0x37fecf80,sge count : 0x50
Sep  3 18:30:38 halo kernel:
Sep  3 18:30:38 halo kernel: megasas[1]: Frame addr :0x37fdf000 :  
<3>megasas[1]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo :  
0x270abdc2, lba_hi : 0x4, sense_buf addr : 0x37fe3800,sge count : 0x50
Sep  3 18:30:38 halo kernel:
Sep  3 18:30:38 halo kernel: sd 1:2:0:0: timing out command, waited 360s
Sep  3 18:30:38 halo kernel: sd 1:2:0:0: SCSI error: return code =  
0x06000000
Sep  3 18:30:38 halo kernel: end_request: I/O error, dev sdb, sector  
17834888386
Sep  3 18:30:38 halo kernel: end_request: I/O error, dev sdb, sector  
2743584050
Sep  3 18:30:38 halo kernel: sd 1:2:0:0: timing out command, waited 360s
Sep  3 18:30:49 halo last message repeated 123 times
Sep  3 18:30:49 halo kernel: EXT3-fs error (device sda2):  
ext3_find_entry: reading directory #390145 offset 0


------- end log -------

--

Never interrupt your enemy when he is making a mistake
     Napoleon Bonaparte

Howard Powell
Computer Support Technician
Astronomy Department, UVa


