Massive sense key & IO errors and eventual crashing. R900 with Perc 6i

Chris Trainor ctrainor at quickhit.com
Tue May 11 13:14:53 CDT 2010


Actually, a correction to the previous note: it looks like the Sense Key messages are coming from our external storage, while the I/O errors are from the internal storage. Still strange. The external storage is an MD1120, attached via a PERC 6 as well.

--Chris



From: linux-poweredge-bounces at dell.com [mailto:linux-poweredge-bounces at dell.com] On Behalf Of Chris Trainor
Sent: Tuesday, May 11, 2010 1:50 PM
To: linux-poweredge at dell.com
Subject: Massive sense key & IO errors and eventual crashing. R900 with Perc 6i

Hi all,

We're having some odd issues with the internal storage on our R900. We get hundreds of SCSI sense key errors reported on all the disks, all day long. Then, a few times a week, we get I/O errors, the drives go offline, and the system crashes.

Here are some examples:
(just prior to crash)

May 10 02:05:57 mackey kernel: megasas: [20]waiting for 127 commands to complete
May 10 02:06:02 mackey kernel: megasas: [25]waiting for 127 commands to complete
May 10 02:06:07 mackey kernel: megasas: [30]waiting for 127 commands to complete


May 10 02:08:38 mackey kernel: megasas[0]: Frame addr :0xbfaa4800 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x1, lba lo : 0x63c01bf, lba_hi : 0x0, sense_buf addr : 0x37f47b00,sge count : 0x1
May 10 02:08:38 mackey kernel:
May 10 02:08:38 mackey kernel: megasas[0]: Frame addr :0xbfaa4c00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x1, lba lo : 0x63a991f, lba_hi : 0x0, sense_buf addr : 0x37f47b80,sge count : 0x1


May 10 02:08:38 mackey kernel: megasas[0]: Pending Internal cmds in FW :
May 10 02:08:38 mackey kernel: megasas[0]: Dumping Done.
May 10 02:08:38 mackey kernel:
May 10 02:08:38 mackey kernel: megasas: failed to do reset
May 10 02:08:38 mackey kernel: sd 0:2:1:0: megasas: RESET -264942026 cmd=2a retries=0
May 10 02:08:38 mackey kernel: megasas: cannot recover from previous reset failures
May 10 02:08:38 mackey kernel: sd 0:2:1:0: megasas: RESET -264942026 cmd=2a retries=0

May 10 02:08:38 mackey kernel: sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
May 10 02:08:38 mackey kernel: sd 0:2:1:0: scsi: Device offlined - not ready after error recovery
May 10 02:08:38 mackey last message repeated 106 times


May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 111033375
May 10 02:08:38 mackey kernel: Buffer I/O error on device dm-5, logical block 13879116
May 10 02:08:38 mackey kernel: lost page write due to I/O error on dm-5
May 10 02:08:38 mackey kernel: Buffer I/O error on device dm-5, logical block 13879117
May 10 02:08:38 mackey kernel: lost page write due to I/O error on dm-5
May 10 02:08:38 mackey kernel: Buffer I/O error on device dm-5, logical block 13879118
(dozens more of the I/O errors...)

May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 111033767
May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s

May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: Aborting journal on device dm-6.
May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: EXT3-fs error (device dm-6): read_block_bitmap: Cannot read block bitmap - block_group = 53, block_bitmap = 1736704
May 10 02:08:38 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: ext3_abort called.
May 10 02:08:38 mackey kernel: EXT3-fs error (device dm-6): ext3_journal_start_sb: Detected aborted journal
May 10 02:08:38 mackey kernel: Remounting filesystem read-only
May 10 02:08:38 mackey kernel: Aborting journal on device dm-5.
May 10 02:08:38 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: __journal_remove_journal_head: freeing b_committed_data
May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey last message repeated 3 times
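
For anyone reading the "SCSI error: return code = 0x06000000" lines above, here is a minimal Python sketch that splits that value into its four bytes. It assumes the usual Linux 2.6-era layout of the SCSI result word (status in the low byte, then message, host, and driver bytes) and the driver-byte names from the kernel's SCSI headers of that period. The function names are made up for illustration and are not part of any tool.

```python
# Sketch only: decode a Linux SCSI "return code" as printed in the logs above.
# Assumed layout (2.6-era kernels): bits 0-7 status, 8-15 message,
# 16-23 host byte, 24-31 driver byte.

# Driver-byte names as assumed from the kernel SCSI headers of that era.
DRIVER_BYTE_NAMES = {
    0x00: "DRIVER_OK",
    0x01: "DRIVER_BUSY",
    0x02: "DRIVER_SOFT",
    0x03: "DRIVER_MEDIA",
    0x04: "DRIVER_ERROR",
    0x05: "DRIVER_INVALID",
    0x06: "DRIVER_TIMEOUT",
    0x07: "DRIVER_HARD",
    0x08: "DRIVER_SENSE",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four raw bytes."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


def driver_byte_name(result):
    """Return the symbolic name of the driver byte, or a hex fallback."""
    driver = decode_scsi_result(result)["driver"]
    return DRIVER_BYTE_NAMES.get(driver, "UNKNOWN(0x%02x)" % driver)


if __name__ == "__main__":
    code = 0x06000000  # value taken from the log lines above
    print(decode_scsi_result(code), driver_byte_name(code))
```

Under that assumed layout, 0x06000000 has a driver byte of 0x06 (DRIVER_TIMEOUT) and zero in the other three bytes, which is consistent with the "timing out command, waited 360s" lines that precede it.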

And the death throes just prior to crashing/reboot. (Obviously the clock is off here... need to fix that. :) )

May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 104737343
May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 104745543
May 10 02:08:44 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May 10 02:08:44 mackey kernel: printk: 1724 messages suppressed.
May 10 02:08:44 mackey kernel: Buffer I/O error on device dm-5, logical block 13094162
May 10 02:08:44 mackey kernel: lost page write due to I/O error on dm-5
May 10 02:08:44 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May  9 22:27:02 mackey syslogd 1.4.1: restart.
May  9 22:27:02 mackey kernel: klogd 1.4.1, log source = /proc/kmsg started.
May  9 22:27:02 mackey kernel: Linux version 2.6.18-128.1.10.el5 (mockbuild at builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Thu May 7 10:35:59 EDT 2009

(after reboot... this is what shows most of the day)


May 10 02:35:52 mackey Server Administrator: Storage Service EventID: 2095  SCSI sense data Sense key:  B Sense code: 4B Sense qualifier:  4:  Physical Disk 1:0:13 Controller 0, Connector 1
May 10 02:35:54 mackey Server Administrator: Storage Service EventID: 2095  SCSI sense data Sense key:  B Sense code: 4B Sense qualifier:  4:  Physical Disk 1:0:2 Controller 0, Connector 1
May 10 02:35:54 mackey Server Administrator: Storage Service EventID: 2095  SCSI sense data Sense key:  B Sense code: 4B Sense qualifier:  4:  Physical Disk 1:0:1 Controller 0, Connector 1
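
Since these events arrive by the hundreds, it may help to tally them per physical disk to see whether one drive, one connector, or the whole enclosure is affected. The following is a rough Python sketch that assumes each event sits on a single (unwrapped) syslog line in the format shown above. The regular expression and the function name are illustrative assumptions, not anything provided by Server Administrator.

```python
# Sketch only: count Server Administrator "SCSI sense data" events per disk.
# Assumes one event per line, in the format of the EventID 2095 lines above.
import re
import sys
from collections import Counter

SENSE_RE = re.compile(
    r"Sense key:\s*([0-9A-Fa-f]+)\s+"
    r"Sense code:\s*([0-9A-Fa-f]+)\s+"
    r"Sense qualifier:\s*([0-9A-Fa-f]+):\s*"
    r"Physical Disk\s+(\S+)\s+Controller\s+(\d+),\s*Connector\s+(\d+)"
)


def tally_sense_events(lines):
    """Return a Counter keyed by (disk, sense key, sense code, qualifier)."""
    counts = Counter()
    for line in lines:
        match = SENSE_RE.search(line)
        if match is None:
            continue  # not a sense-data event line
        key, code, qualifier, disk = match.group(1, 2, 3, 4)
        # The three sense fields are printed in hex; store them as integers.
        counts[(disk, int(key, 16), int(code, 16), int(qualifier, 16))] += 1
    return counts


if __name__ == "__main__":
    # Usage (hypothetical path): python tally_sense.py < /var/log/messages
    for (disk, key, code, qual), n in tally_sense_events(sys.stdin).most_common():
        print("%6d  disk %-8s key=%X asc=%02X ascq=%02X" % (n, disk, key, code, qual))
```

As far as I can tell from the T10 code tables, sense key B is ABORTED COMMAND and 4B/04 is "NAK received", but that reading should be checked against the current ASC/ASCQ list before drawing conclusions from the counts.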




Any ideas what could be causing this? These are all internal disks, nothing external. The system runs CentOS 5.4, kernel 2.6.18-128.1.10.el5 #1 SMP Thu May 7 10:35:59 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux.


Thanks,
--Chris


Christopher M. Trainor
Manager, IT & Network Operations
Quick Hit, Inc.
o.  508.203.4857
w.  www.quickhit.com


