Preventing I/O starvation on MD1000s triggered by a failed disk.

Jeff Ewing jewing at aconex.com
Mon Aug 23 23:56:03 CDT 2010


On Mon, Aug 23, 2010 at 08:46:03PM -0700, Bond Masuda wrote:
> Do you know if you encountered a URE? I've been running into URE's on
> 
I have now exported the controller log from the controller that failed. I didn't see a URE around the time that my NFS requests stopped (around 11:38; the rebuild was actually 19% complete):

08/16/10 11:37:32: EVT#47438-08/16/10 11:37:32: 103=Rebuild progress on PD 33(e0x32/s14) is 18.98%(3262s)^M
08/16/10 11:40:10: EVT#47439-08/16/10 11:40:10: 103=Rebuild progress on PD 33(e0x32/s14) is 19.98%(3420s)^M
08/16/10 11:42:48: EVT#47440-08/16/10 11:42:48: 103=Rebuild progress on PD 33(e0x32/s14) is 20.98%(3578s)^M
08/16/10 11:45:26: EVT#47441-08/16/10 11:45:26: 103=Rebuild progress on PD 33(e0x32/s14) is 21.98%(3736s)^M
08/16/10 11:48:04: EVT#47442-08/16/10 11:48:04: 103=Rebuild progress on PD 33(e0x32/s14) is 22.98%(3894s)^M
08/16/10 11:50:42: EVT#47443-08/16/10 11:50:42: 103=Rebuild progress on PD 33(e0x32/s14) is 23.98%(4052s)^M
08/16/10 11:53:20: EVT#47444-08/16/10 11:53:20: 103=Rebuild progress on PD 33(e0x32/s14) is 24.98%(4210s)^M
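In case anyone wants to pull the same data, the adapter event log and the firmware terminal log (where the entries above come from) can be dumped with MegaCli along these lines. The binary path and adapter selection below are assumptions, adjust for your install:

  # Dump all adapter events to a file
  /opt/MegaRAID/MegaCli/MegaCli64 -AdpEventLog -GetEvents -f adapter-events.log -aALL
  # Dump the firmware terminal log (the EVT#/rebuild-progress lines above)
  /opt/MegaRAID/MegaCli/MegaCli64 -FwTermLog -Dsply -aALL > fwtermlog.txt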


There were errors later on; my colleague tried to set the rebuild rate to 5% to bring the server back:

08/16/10 13:22:54: EVT#47479-08/16/10 13:22:54: 103=Rebuild progress on PD 33(e0x32/s14) is 58.96%(9584s)^M
08/16/10 13:25:31: EVT#47480-08/16/10 13:25:31: 103=Rebuild progress on PD 33(e0x32/s14) is 59.96%(9741s)^M
08/16/10 13:27:06: NCQ Mode value is not valid or not found, return default^M
08/16/10 13:27:06: EVT#47481-08/16/10 13:27:06:  40=Rebuild rate changed to 5%^M
08/16/10 13:31:12: mfiIsr: idr=00000020^M
08/16/10 13:31:12: Driver detected possible FW hang, halting FW.^M
08/16/10 13:31:12: Pending Command Details:^M
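For reference, the rebuild rate can be checked and changed on the fly with MegaCli; this is only a sketch, and the binary path and adapter number are assumptions:

  # Show the current rebuild rate
  /opt/MegaRAID/MegaCli/MegaCli64 -AdpGetProp RebuildRate -aALL
  # Drop it to 5% so host I/O gets priority over the rebuild
  /opt/MegaRAID/MegaCli/MegaCli64 -AdpSetProp RebuildRate -5 -aALL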



> Were there any kernel messages when the I/O stopped? Have you dumped the
> log from the controller? If not, that's where I would start looking....


There were a lot of megasas messages in the debug log at the time the NFS requests stopped:

Aug 16 11:38:06 nas2 kernel: sd 1:2:0:0: megasas: RESET -356143413 cmd=2a retries=0
Aug 16 11:38:06 nas2 kernel: megasas: [ 0]waiting for 4 commands to complete
Aug 16 11:38:07 nas2 kernel: megasas: reset successful 

These were in dmesg also:

mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc4 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc5 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc6 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc7 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc0 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc1 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc2 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc3 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc4 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc5 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc6 not found!
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @596 - ioc7 not found!
sd 1:2:0:0: megasas: RESET -356143413 cmd=2a retries=0
megasas: [ 0]waiting for 4 commands to complete
megasas: reset successful 
sd 1:2:0:0: megasas: RESET -356143737 cmd=2a retries=0
megasas: [ 0]waiting for 5 commands to complete
megasas: reset successful 
sd 1:2:0:0: megasas: RESET -356143749 cmd=2a retries=0
megasas: [ 0]waiting for 5 commands to complete
megasas: reset successful 
sd 1:2:0:0: megasas: RESET -356143771 cmd=2a retries=0
megasas: [ 0]waiting for 4 commands to complete
megasas: reset successful 
sd 1:2:0:0: megasas: RESET -356143781 cmd=2a retries=0
megasas: [ 0]waiting for 5 commands to complete
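Those RESET lines look like the SCSI layer timing out commands and resetting the adapter while the rebuild was hogging the spindles. One thing I'm considering (just a sketch on my part; the device name and value are examples, not something I've verified fixes this) is raising the per-device SCSI command timeout so the host rides out the slow stretches instead of firing resets:

  # Check the current SCSI command timeout (in seconds) for the RAID LUN
  cat /sys/block/sdb/device/timeout
  # Raise it to give the controller more headroom during a rebuild
  echo 120 > /sys/block/sdb/device/timeout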



Thank you.

Jeff Ewing



