1. Re: Rebuild the array (pjwelsh)

brijesh patel bridgepatel at hotmail.com
Fri Jul 1 04:31:15 CDT 2011


Thanks, I will try to install OMSA and see what happens.

> From: linux-poweredge-request at dell.com
> Subject: Linux-PowerEdge Digest, Vol 85, Issue 26
> To: linux-poweredge at dell.com
> Date: Thu, 30 Jun 2011 23:03:42 -0500
> 
> Send Linux-PowerEdge mailing list submissions to
> 	linux-poweredge at dell.com
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	https://lists.us.dell.com/mailman/listinfo/linux-poweredge
> or, via email, send a message with subject or body 'help' to
> 	linux-poweredge-request at dell.com
> 
> You can reach the person managing the list at
> 	linux-poweredge-owner at dell.com
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Linux-PowerEdge digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: Rebuild the array (pjwelsh)
>    2. OMSA 6.5 Problems on Ubuntu (Mark Petersen)
>    3. Kernel crash and multiple drive failure on Dell R610 / LSI
>       SAS1068E (Stephen Vaughan)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Thu, 30 Jun 2011 12:40:42 -0500
> From: pjwelsh <pjwelsh at gmail.com>
> Subject: Re: Rebuild the array
> To: linux-poweredge at dell.com
> Message-ID: <4E0CB51A.8040600 at gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> On 06/30/2011 12:00 PM, linux-poweredge-request at dell.com wrote:
> ...
> >> > To verify/validate/fix drive parity issues (AKA "parity scrub"), add a cron
> >> > job (and/or run the command now) to make sure that the RAID groups on your
> >> > PERC RAID controllers are in perfect shape. You don't want any failed disk
> >> > rebuild surprises! Here are a couple of examples (You could make this into a
> >> > script to dynamically find the vdisks):
> >> > 
> >> > 00 17 * * 0 /opt/dell/srvadmin/bin/omconfig storage vdisk
> >> > action=checkconsistency controller=0 vdisk=0 > /tmp/omconfig-vdisk0.out 2>&1
> >> > || cat /tmp/omconfig-vdisk0.out |mail -s "omconfig issue on `hostname`" admin
> >> > 30 17 * * 0 /opt/dell/srvadmin/bin/omconfig storage vdisk
> >> > action=checkconsistency controller=0 vdisk=1 > /tmp/omconfig-vdisk1.out 2>&1
> >> > || cat /tmp/omconfig-vdisk1.out |mail -s "omconfig issue on `hostname`" admin
> >> > 
> >> > pjwelsh
> >> > 
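The cron lines above hard-code vdisk 0 and 1; as suggested, they can be generalized into a small script that discovers the vdisks. This is only a sketch: it assumes `omreport storage vdisk controller=0` prints one `ID : <n>` line per virtual disk (the exact omreport layout may differ by OMSA version), and the parsing is exercised here against a canned sample of that assumed format rather than a live OMSA install.

```shell
#!/bin/sh
# Sketch: start a consistency check on every vdisk found on controller 0.
# ASSUMPTION: omreport prints one "ID : <n>" line per virtual disk;
# verify against your OMSA version before relying on this.

OMREPORT=/opt/dell/srvadmin/bin/omreport
OMCONFIG=/opt/dell/srvadmin/bin/omconfig
CTRL=0

# Pull the vdisk IDs out of omreport-style output.
list_vdisks() {
    awk -F: '/^ID[ \t]*:/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Canned sample standing in for `$OMREPORT storage vdisk controller=$CTRL`,
# so the parsing can be tested without OMSA installed.
sample_output="ID                  : 0
Status              : Ok
Name                : Virtual Disk 0
ID                  : 1
Status              : Ok
Name                : Virtual Disk 1"

vdisks=$(printf '%s\n' "$sample_output" | list_vdisks)
echo "found vdisks: $vdisks"

for vd in $vdisks; do
    # On a real system, replace the echo with:
    # "$OMCONFIG" storage vdisk action=checkconsistency controller=$CTRL vdisk=$vd
    echo "would run: omconfig storage vdisk action=checkconsistency controller=$CTRL vdisk=$vd"
done
```

On a live box you would pipe the real omreport output into `list_vdisks` instead of the sample text.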
> >> > 
> >> > 
> > Thanks for the reply, pjwelsh.
> >  
> > One thing is that I don't have the srvadmin tool installed, as I am using Mandriva and it's not supported (I tried to install it but had no luck). Second, I am using RAID 10, so I don't think there is any parity involved.
> >
> > In short, if I want to rebuild this array, would it be sufficient to back up all the data, recreate the array, and copy the data back? Would that work?
> >
> > Brijesh
> 
> Just to clarify, even a (PERC) RAID 10 will benefit from a scrub. The consistency
> check will verify that the copy/mirror matches the primary. It is unfortunate that
> you cannot get OMSA on your system. There are MANY benefits to working with a
> better support system like OMSA. Please consider some additional research into
> whether Mandriva can run it, or think about switching OSes...
> 
> To answer your question: yes, back up the data on the array and rebuild! You could
> still be prone to this issue given your situation and the lack of omconfig...
> 
> pjwelsh
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Thu, 30 Jun 2011 20:13:45 +0000
> From: Mark Petersen <mpetersen at peak6.com>
> Subject: OMSA 6.5 Problems on Ubuntu
> To: "linux-poweredge at dell.com" <linux-poweredge at dell.com>
> Message-ID:
> 	<443D5D30774F8E4F8B2D304016C65D7A0485848D at sswchi5pmbx1.peak6.net>
> Content-Type: text/plain; charset="us-ascii"
> 
> Hello,
> 
> I have a number of Dell R610/710 servers.  Most of these are running OMSA 6.5 on Ubuntu 10.04.2 with no issues.  On a few of them, when you restart dataeng it will attempt to start dsm_sa_snmpd and report that it started, but it doesn't.  The only interesting message in the logs is 'dataeng: dsm_sa_snmpd startup failed'.  I've been unsuccessful in troubleshooting this myself, and either my google-fu is weak or there's nothing written on this topic.
> 
> I'm hoping someone can point me in the right direction to troubleshoot this (turning up logging, the correct way to start the process without the upstart/init script so I can trace it, etc.).  Since it works on a number of servers with the same OS, firmware, etc., there must be some way I can get it going.  I've purged and re-installed a couple of times, but that didn't help (however, purging with apt removes /opt, which happens to be a symlink on these systems, and that caused some issues of course).
> 
> 
> Thanks,
> Mark
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Fri, 1 Jul 2011 14:03:39 +1000
> From: Stephen Vaughan <stephenvaughan at gmail.com>
> Subject: Kernel crash and multiple drive failure on Dell R610 / LSI
> 	SAS1068E
> To: linux-poweredge at dell.com
> Message-ID: <BANLkTi=no2UXRpXQ2AxrQh7Tn9Cv00xFfA at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> Hi,
> 
> We recently experienced a few hiccups with one of our R610 boxes; first we
> experienced a kernel panic/system lockup, followed about a month later by
> what appeared to be a multiple drive failure.
> 
> The system is an R610 with 2x 73GB SAS drives (RAID 1), an LSI SAS1068E
> controller, Fujitsu drives (Model: MBE2073RC, Rev: D701), RHEL 5.2
> (kernel 2.6.18-92.el5), and mptlinux-3.04.05 drivers.
> 
> *Incident #1.*
> The system locks up, possibly a kernel panic. Service is restored smoothly via a
> power cycle. We suspect the cause of this was our SCSI controller, but we
> don't know for certain.
> 
> In /var/log/messages, this was the last message to be logged:
> May 10 02:36:55 boxname kernel: mptbase: ioc0: LogInfo(0x31140000):
> Originator={PL}, Code={IO Executed}, SubCode(0x0000)
> 
> We also have a screenshot from the drac console:
> http://i51.tinypic.com/2qlaaes.png
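That LogInfo word can be decoded by hand. The field layout below (bus type in bits 31–28, originator in 27–24, code in 23–16, subcode in 15–0) is taken from the Linux Fusion MPT driver source, so treat it as an assumption for your exact driver version; under it, 0x31140000 breaks down as SAS / PL / code 0x14 / subcode 0, matching the kernel's own decode:

```shell
#!/bin/sh
# Decode an mptbase LogInfo word into its fields.
# Field layout assumed from the Linux Fusion MPT driver source:
#   [31:28] bus type (3 = SAS), [27:24] originator (0=IOP, 1=PL, 2=IR),
#   [23:16] code, [15:0] subcode.
loginfo=0x31140000

bus_type=$((   (loginfo >> 28) & 0xF  ))
originator=$(( (loginfo >> 24) & 0xF  ))
code=$((       (loginfo >> 16) & 0xFF ))
subcode=$((     loginfo        & 0xFFFF ))

echo "bus_type=$bus_type originator=$originator code=$code subcode=$subcode"
# For 0x31140000: bus_type=3 (SAS), originator=1 (PL),
# code=0x14 (printed by the kernel as "IO Executed"), subcode=0.
```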
> 
> *Incident #2.*
> The file system (ext3) goes into read-only mode at around 8:42am; the system is
> rebooted at around 9:25pm and fails to restore service.
> 
> Several errors are logged before the file system goes read-only:
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: LogInfo(0x31110b00):
> Originator={PL}, Code={Reset}, SubCode(0x0b00)
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 0
> id=1
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0:   PhysDisk is now failed
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 0
> id=1
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0:   PhysDisk is now failed, out of
> sync
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: RAID STATUS CHANGE for VolumeID 0
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0:   volume is now degraded, enabled
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: LogInfo(0x31140000):
> Originator={PL}, Code={IO Executed}, SubCode(0x0000)
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: LogInfo(0x31140000):
> Originator={PL}, Code={IO Executed}, SubCode(0x0000)
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: LogInfo(0x31140000):
> Originator={PL}, Code={IO Executed}, SubCode(0x0000)
> 
> Jun  4 08:42:20  last message repeated 2 times
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: LogInfo(0x31140000):
> Originator={PL}, Code={IO Executed}, SubCode(0x0000)
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: LogInfo(0x31110b00):
> Originator={PL}, Code={Reset}, SubCode(0x0b00)
> 
> Jun  4 08:42:20  kernel: mptscsih: ioc0: attempting task abort!
> (sc=ffff81018fe5eb00)
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0:
> 
> Jun  4 08:42:20  kernel:         command: Write(10): 2a 00 06 65 87 b0 00 00
> 88 00
> 
> Jun  4 08:42:20  kernel: mptscsih: ioc0: WARNING - TM Handler for type=1:
> IOC Not operational (0x40008015)!
> 
> Jun  4 08:42:20  kernel: mptscsih: ioc0: WARNING -  Issuing HardReset!!
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: Initiating recovery
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: WARNING - IOC is in FAULT state!!!
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: WARNING -            FAULT code =
> 8015h
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: mptscsih: ioc0: completing cmds:
> fw_channel 0, fw_id 0, sc=ffff81018fe5ec80, mf = ffff81022f402e80, idx=d
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: mptscsih: ioc0: completing cmds:
> fw_channel 0, fw_id 0, sc=ffff81022fd41980, mf = ffff81022f403480, idx=19
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: mptscsih: ioc0: completing cmds:
> fw_channel 0, fw_id 0, sc=ffff81022fd41500, mf = ffff81022f403f80, idx=2f
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: mptscsih: ioc0: completing cmds:
> fw_channel 0, fw_id 0, sc=ffff81018fe5ee00, mf = ffff81022f404300, idx=36
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: mptscsih: ioc0: completing cmds:
> fw_channel 0, fw_id 0, sc=ffff81018fe5e080, mf = ffff81022f404f80, idx=4f
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: mptscsih: ioc0: completing cmds:
> fw_channel 0, fw_id 0, sc=ffff81018fe5eb00, mf = ffff81022f405080, idx=51
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: mptscsih: ioc0: completing cmds:
> fw_channel 0, fw_id 0, sc=ffff81022fd41680, mf = ffff81022f405300, idx=56
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: mptscsih: ioc0: completing cmds:
> fw_channel 0, fw_id 0, sc=ffff81022fd41200, mf = ffff81022f405500, idx=5a
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: mptscsih: ioc0: completing cmds:
> fw_channel 0, fw_id 0, sc=ffff81022fd41080, mf = ffff81022f407380, idx=97
> 
> Jun  4 08:42:20  kernel: mptbase: ioc0: Recovered from IOC FAULT
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: rejecting I/O to offline device
> 
> Jun  4 08:42:20  last message repeated 3 times
> 
> Jun  4 08:42:20  kernel: sd 0:1:0:0: rejecting I/O to offline device
> 
> Jun  4 08:42:22  kernel: EXT3-fs error (device sda3): ext3_find_entry:
> reading directory #16351233 offset 0
> 
> 
> DRAC logs the following, indicating only a single drive has failed:
> 
> Sat Jun 04 2011 08:37:20 Storage Drive 1: Drive Slot sensor for
> Storage, drive fault was asserted
> Sat Jun 04 2011 21:28:38 Storage Drive 1: Drive Slot sensor for
> Storage, drive removed
> Sat Jun 04 2011 21:28:38 Storage Drive 1: Drive Slot sensor for
> Storage, drive fault was deasserted
> Sat Jun 04 2011 21:30:31 Storage Drive 1: Drive Slot sensor for
> Storage, drive presence was asserted
> Sat Jun 04 2011 21:34:02 Storage Drive 1: Drive Slot sensor for
> Storage, drive removed
> Sat Jun 04 2011 21:34:07 Storage Drive 1: Drive Slot sensor for
> Storage, drive presence was asserted
> Sat Jun 04 2011 21:40:57 Storage Drive 1: Drive Slot sensor for
> Storage, drive removed
> Sat Jun 04 2011 21:41:17 Storage Drive 1: Drive Slot sensor for
> Storage, drive presence was asserted
> 
> 
> Yet, Dell Server Administrator logs that both drives have failed:
> 
> Jun 4 08:42:20  Server Administrator: Storage Service EventID: 2095 SCSI
> sense data Sense key: 3 Sense code: 11 Sense qualifier: 1: Physical Disk
> 0:0:0 Controller 0, Connector 0
> 
> Jun 4 08:42:20  Server Administrator: Storage Service EventID: 2095 SCSI
> sense data Sense key: 3 Sense code: 11 Sense qualifier: 1: Physical Disk
> 0:0:0 Controller 0, Connector 0
> 
> Jun 4 08:42:20  Server Administrator: Storage Service EventID: 2095 SCSI
> sense data Sense key: 3 Sense code: 11 Sense qualifier: 1: Physical Disk
> 0:0:0 Controller 0, Connector 0
> 
> Jun 4 08:42:20  Server Administrator: Storage Service EventID: 2350 There
> was an unrecoverable disk media error during the rebuild or recovery
> operation: Physical Disk 0:0:0 Controller 0, Connector 0
> 
> Jun 4 08:42:20  Server Administrator: Storage Service EventID: 2350 There
> was an unrecoverable disk media error during the rebuild or recovery
> operation: Physical Disk 0:0:1 Controller 0, Connector 0
> 
> Jun 4 08:42:20 Server Administrator: Storage Service EventID: 2048 Device
> failed: Physical Disk 0:0:1 Controller 0, Connector 0
> 
> Jun 4 08:42:20 Server Administrator: Storage Service EventID: 2123
> Redundancy lost: Virtual Disk 0 (Virtual Disk 0) Controller 0 (SAS 6/iR
> Integrated)
> 
> Jun 4 08:42:20 Server Administrator: Storage Service EventID: 2057 Virtual
> disk degraded: Virtual Disk 0 (Virtual Disk 0) Controller 0 (SAS 6/iR
> Integrated)
> 
> Jun 4 08:42:20 Server Administrator: Storage Service EventID: 2348 The
> rebuild failed due to errors on the target physical disk.: Physical Disk
> 0:0:1 Controller 0, Connector 0
> 
> 
>  Jun 4 20:19:26 Server Administrator: Storage Service EventID: 2048 Device
> failed: Physical Disk 0:0:0 Controller 0, Connector 0
> 
> 
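The EventID 2095 entries above carry standard SCSI sense data, which can be decoded without any Dell tooling: sense key 3 is MEDIUM ERROR per the SCSI spec, which fits the "unrecoverable disk media error" events that follow. (OMSA's "Sense code: 11" is read here as ASC 0x11, the unrecovered-read-error family; that interpretation is an assumption.) A minimal lookup sketch:

```shell
#!/bin/sh
# Decode the SCSI sense triplet reported by OMSA EventID 2095.
# Sense key names are from the SCSI spec; only the keys relevant to
# this log are mapped. "Sense code: 11" is ASSUMED to mean ASC 0x11.

sense_key=3
asc=0x11
ascq=1

case "$sense_key" in
    1) key_name="RECOVERED ERROR" ;;
    3) key_name="MEDIUM ERROR" ;;
    4) key_name="HARDWARE ERROR" ;;
    *) key_name="OTHER ($sense_key)" ;;
esac

# ASC 0x11 is the unrecovered-read-error family in the T10 ASC/ASCQ tables;
# the exact ASCQ variant should be looked up there.
if [ $((asc)) -eq $((0x11)) ]; then
    asc_name="unrecovered read error (see T10 ASC/ASCQ tables for qualifier $ascq)"
else
    asc_name="see T10 ASC/ASCQ tables"
fi

echo "sense key $sense_key = $key_name; ASC/ASCQ $asc/$ascq = $asc_name"
```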
> The system is rebooted; around 8:20pm the console is reporting I/O errors for its
> RAID 1 device:
> 
> 
> Jun  4 20:19:52  kernel: end_request: I/O error, dev sda, sector 0
> Jun  4 20:19:52  kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
> Jun  4 20:19:52  kernel: sda: Current: sense key: Hardware Error
> Jun  4 20:19:52  kernel:     <<vendor>> ASC=0xc4 ASCQ=0x1ASC=0xc4 ASCQ=0x1
> Jun  4 20:19:52  kernel:
> Jun  4 20:19:52  kernel: end_request: I/O error, dev sda, sector 0
> Jun  4 20:19:52  kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
> Jun  4 20:19:52  kernel: sda: Current: sense key: Hardware Error
> Jun  4 20:19:53  kernel:     <<vendor>> ASC=0xc4 ASCQ=0x1ASC=0xc4 ASCQ=0x1
> Jun  4 20:19:53  kernel:
> Jun  4 20:19:53  kernel: end_request: I/O error, dev sda, sector 0
> Jun  4 20:19:53  kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
> Jun  4 20:19:53  kernel: sda: Current: sense key: Hardware Error
> Jun  4 20:19:53  kernel:     <<vendor>> ASC=0xc4 ASCQ=0x1ASC=0xc4 ASCQ=0x1
> 
> 
> The OS also starts showing individual drives after the reboot (sda and sdb):
> 
> 
> Jun  4 20:19:18  kernel:   Vendor: FUJITSU   Model: MBE2073RC         Rev:
> D701
> Jun  4 20:19:18  kernel:   Type:   Direct-Access                      ANSI
> SCSI revision: 05
> Jun  4 20:19:19  kernel: SCSI device sdb: 143374650 512-byte hdwr sectors
> (73408 MB)
> Jun  4 20:19:19  kernel: sdb: Write Protect is off
> Jun  4 20:19:19  kernel: sdb: Mode Sense: c7 00 00 08
> Jun  4 20:19:19  kernel: SCSI device sdb: drive cache: write through
> Jun  4 20:19:19  kernel: SCSI device sdb: 143374650 512-byte hdwr sectors
> (73408 MB)
> Jun  4 20:19:19  kernel: sdb: Write Protect is off
> Jun  4 20:19:19  kernel: sdb: Mode Sense: c7 00 00 08
> Jun  4 20:19:19  kernel: SCSI device sdb: drive cache: write through
> Jun  4 20:19:19  kernel:  sdb: sdb1 sdb2 sdb3
> Jun  4 20:19:19  kernel: sd 0:0:1:0: Attached scsi disk sdb
> 
> 
> The OS activates swap on /dev/sdb2, since after the reboot it sees both drives
> individually and that partition carries a swap label:
> 
> 
> Jun  4 20:19:45  kernel: Adding 2096472k swap on /dev/sdb2.  Priority:-1
> extents:1 across:2096472k
> Jun  4 20:19:45  kernel: IA-32 Microcode Update Driver: v1.14a <
> tigran at veritas.com>
> Jun  4 20:19:45  kernel: Fusion MPT misc device (ioctl) driver 3.04.05
> Jun  4 20:19:45  kernel: mptctl: Registered with Fusion MPT base driver
> Jun  4 20:19:45  kernel: mptctl: /dev/mptctl @ (major,minor=10,220)
> Jun  4 20:19:45  kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
> 
> 
> Some time later, more errors are reported by the kernel, this time for
> /dev/sdb3:
> 
> 
> Jun  4 21:28:32  kernel: sd 0:0:1:0: SCSI error: return code = 0x00010000
> Jun  4 21:28:32  kernel: end_request: I/O error, dev sdb, sector 135316304
> Jun  4 21:28:32  kernel: printk: 33 messages suppressed.
> Jun  4 21:28:32  kernel: Buffer I/O error on device sdb3, logical block
> 16352263
> Jun  4 21:28:32  kernel: lost page write due to I/O error on sdb3
> Jun  4 21:28:32  kernel: sd 0:0:1:0: SCSI error: return code = 0x00010000
> Jun  4 21:28:32  kernel: end_request: I/O error, dev sdb, sector 135316272
> Jun  4 21:28:32  kernel: Buffer I/O error on device sdb3, logical block
> 16352259
> Jun  4 21:28:32  kernel: lost page write due to I/O error on sdb3
> Jun  4 21:28:32  kernel: Buffer I/O error on device sdb3, logical block
> 16352260
> Jun  4 21:28:32  kernel: lost page write due to I/O error on sdb3
> Jun  4 21:28:32  kernel: Buffer I/O error on device sdb3, logical block
> 16352261
> Jun  4 21:28:32  kernel: lost page write due to I/O error on sdb3
> Jun  4 21:28:32  kernel: sd 0:0:1:0: SCSI error: return code = 0x00010000
> Jun  4 21:28:32  kernel: end_request: I/O error, dev sdb, sector 105889496
> Jun  4 21:28:32  kernel: Buffer I/O error on device sdb3, logical block
> 12673912
> 
> Jun  4 21:28:36  kernel: scsi 0:0:1:0: rejecting I/O to dead device
> Jun  4 21:28:36  kernel: EXT3-fs error (device sdb3): ext3_find_entry:
> reading directory #8389380 offset 0
> Jun  4 21:28:36  kernel: scsi 0:0:1:0: rejecting I/O to dead device
> Jun  4 21:28:36  kernel: scsi 0:0:1:0: rejecting I/O to dead device
> Jun  4 21:28:36  kernel: EXT3-fs error (device sdb3): ext3_find_entry:
> reading directory #11042817 offset 0
> Jun  4 21:28:36  kernel: scsi 0:0:1:0: rejecting I/O to dead device
> Jun  4 21:28:36  kernel: EXT3-fs error (device sdb3): ext3_find_entry:
> reading directory #11042817 offset 0
> Jun  4 21:28:36  kernel: scsi 0:0:1:0: rejecting I/O to dead device
> 
> 
> The kernel continues to spill out these errors; at this point the box is
> written off as toast, both drives are replaced, and the OS is
> reinstalled.
> 
> Ultimately the system had to be rebuilt with two new drives; however, we are
> curious about a few points:
> 
> 1. Is incident #1 related to #2? Surely they are.
> 2. Why are there conflicting reports about which drives failed? The DRAC
> says only one drive failed, yet Server Administrator says both
> failed. The lights/display on the front of the server also reported
> only a single drive failure.
> 3. Is it possible the controller and/or backplane is to blame for the kernel
> panic and multiple drive failure?
> 4. Did the OS destroy /dev/sdb when it activated the swap partition on boot?
> 
> Thanks.
> 
> -- 
> 
> Stephen
> 
> ------------------------------
> 
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge at dell.com
> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
> 
> End of Linux-PowerEdge Digest, Vol 85, Issue 26
> ***********************************************

