Changing a drive in predictable failure state

Bertrand LUPART bertrand.lupart at linkeo.com
Wed Jun 18 10:44:47 CDT 2008


Hello,

> Best case steps for this (least possible chance of losing data):

Thank you for your answer.

Just to be sure: is there any chance that the rebuild (point 5) targets
the wrong disk if the replacement drive carries a foreign configuration?
Or is it safer to clear the spare disk before the operation?


For later reference, below is what I did on a spare PE 2950 with the same
hard drives and RAID setup, for testing.
The commands are for pdisk #3, in vdisk #1 (300GB RAID-1 SAS).


> 1. perform a consistency check on the vd

----8<----8<----8<----8<----
$ sudo omconfig storage vdisk action=checkconsistency controller=0 vdisk=1
---->8---->8---->8---->8----

Then I got this in /var/log/syslog:
----8<----8<----8<----8<----
Jun 18 10:53:15 myserver Server Administrator: Storage Service EventID:
2058  Virtual disk Check Consistency started:  Virtual Disk 1 (Virtual
Disk 1) Controller 0 (PERC 5/i Integrated)
Jun 18 12:05:08 myserver Server Administrator: Storage Service EventID:
2085  Virtual disk Check Consistency completed:  Virtual Disk 1 (Virtual
Disk 1) Controller 0 (PERC 5/i Integrated)
---->8---->8---->8---->8----
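
To avoid watching the log by hand, the check above can be scripted. This
is only a minimal sketch: cc_status is a name I made up, and it assumes
each event sits on a single line in syslog (the lines above are wrapped
by my mail client) and that 2058/2085 are the only events of interest.

```shell
#!/bin/sh
# cc_status (hypothetical helper): read syslog lines on stdin and print
# the state of the most recent consistency check seen in them.
# Event IDs are taken from the log excerpt above:
#   2058 = Check Consistency started, 2085 = Check Consistency completed.
cc_status() {
    awk '
        /Storage Service EventID:[ \t]*2058/ { state = "running" }
        /Storage Service EventID:[ \t]*2085/ { state = "completed" }
        END {
            if (state == "") state = "none"
            print state
        }
    '
}
```

Usage would be something like "cc_status < /var/log/syslog"; step 2
should only start once it prints "completed".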

I guess everything should be fine.


> 2. when the CC has finished, issue the offline command to the SMART disk

I checked that the drive I was about to remove was the right one:
----8<----8<----8<----8<----
$ sudo omconfig storage pdisk action=blink controller=0 pdisk=0:0:3
---->8---->8---->8---->8----
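
On the production machine, besides blinking the LED, the disk to pull
can be cross-checked against the "Failure Predicted" field that omreport
prints for each pdisk. A rough sketch, assuming the field layout shown
in the omreport output further down in this mail; predicted_failures is
a hypothetical name, not an OMSA command.

```shell
#!/bin/sh
# predicted_failures (hypothetical helper): read the output of
# "omreport storage pdisk controller=N" on stdin and print the ID of
# every physical disk whose "Failure Predicted" field is "Yes".
predicted_failures() {
    awk '
        # Remember the ID of the disk record we are currently in.
        /^ID[ \t]*:/ {
            id = $0
            sub(/^ID[ \t]*:[ \t]*/, "", id)
            sub(/[ \t]+$/, "", id)
        }
        # Report that ID when the record flags a predicted failure.
        /^Failure Predicted[ \t]*:[ \t]*Yes/ { print id }
    '
}
```

Something like "sudo omreport storage pdisk controller=0 |
predicted_failures" should then print the same ID that is about to be
passed to action=offline.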

Since I didn't understand the difference between remove and offline, I
went for offline :)
----8<----8<----8<----8<----
$ sudo omconfig storage pdisk action=offline controller=0 pdisk=0:0:3
---->8---->8---->8---->8----
The drive LED is now alternating amber/green.

Got this in the logs:
----8<----8<----8<----8<----
Jun 18 15:10:42 myserver Server Administrator: Storage Service EventID:
2123  Redundancy lost:  Virtual Disk 1 (Virtual Disk 1) Controller 0
(PERC 5/i Integrated)
Jun 18 15:10:43 myserver Server Administrator: Storage Service EventID:
2057  Virtual disk degraded:  Virtual Disk 1 (Virtual Disk 1) Controller
0 (PERC 5/i Integrated)
Jun 18 15:10:43 myserver Server Administrator: Storage Service EventID:
2050  Physical disk offline:  Physical Disk 0:0:3 Controller 0,
Connector 0
---->8---->8---->8---->8----


> 3. remove SMART disk
> 4. replace with replacement drive
> 5. If rebuild has not started after 5 minutes, assign the replacement
> drive as a dedicated hotspare to the vd

Seconds after the new drive was inserted, the LEDs for vdisk #1 (pdisk
#2 & #3) started to blink:
----8<----8<----8<----8<----
Jun 18 15:21:16 myserver Server Administrator: Storage Service EventID:
2052  Physical disk inserted:  Physical Disk 0:0:3 Controller 0,
Connector 0
Jun 18 15:21:16 myserver Server Administrator: Storage Service EventID:
2121  Device returned to normal:  Physical Disk 0:0:3 Controller 0,
Connector 0
Jun 18 15:21:17 myserver Server Administrator: Storage Service EventID:
2050  Physical disk offline:  Physical Disk 0:0:3 Controller 0,
Connector 0
Jun 18 15:21:17 myserver Server Administrator: Storage Service EventID:
2065  Physical disk Rebuild started:  Physical Disk 0:0:3 Controller 0,
Connector 0
Jun 18 15:21:17 myserver Server Administrator: Storage Service EventID:
2121  Device returned to normal:  Physical Disk 0:0:3 Controller 0,
Connector 0
---->8---->8---->8---->8----

----8<----8<----8<----8<----
$ sudo omreport storage vdisk
...
ID                  : 1
Status              : Non-Critical
Name                : Virtual Disk 1
State               : Degraded
Progress            : Not Applicable
Layout              : RAID-1
Size                : 278.88 GB (299439751168 bytes)
Device Name         : /dev/sdb
Type                : SAS
Read Policy         : Adaptive Read Ahead
Write Policy        : Write Through
Cache Policy        : Not Applicable
Stripe Element Size : 64 KB
Disk Cache Policy   : Disabled
...
---->8---->8---->8---->8----

----8<----8<----8<----8<----
$ sudo omreport storage pdisk controller=0
...

ID                        : 0:0:2
Status                    : Ok
Name                      : Physical Disk 0:0:2
State                     : Online
Failure Predicted         : No
Progress                  : Not Applicable
Type                      : SAS
Capacity                  : 278.88 GB (299439751168 bytes)
Used RAID Disk Space      : 278.88 GB (299439751168 bytes)
Available RAID Disk Space : 0.00 GB (0 bytes)
Hot Spare                 : No
Vendor ID                 : DELL    
Product ID                : ST3300555SS     
Revision                  : T106
Serial No.                : 3LM0DSGY            
Negotiated Speed          : Not Available
Capable Speed             : Not Available
Manufacture Day           : 07
Manufacture Week          : 02
Manufacture Year          : 2005
SAS Address               : 5000C50001E8DBF9

ID                        : 0:0:3
Status                    : Ok
Name                      : Physical Disk 0:0:3
State                     : Rebuilding
Failure Predicted         : No
Progress                  : 5% complete
Type                      : SAS
Capacity                  : 278.88 GB (299439751168 bytes)
Used RAID Disk Space      : 278.88 GB (299439751168 bytes)
Available RAID Disk Space : 0.00 GB (0 bytes)
Hot Spare                 : No
Vendor ID                 : DELL    
Product ID                : HUS153030VLS300 
Revision                  : A280
Serial No.                : J8W3LMYC            
Negotiated Speed          : Not Available
Capable Speed             : Not Available
Manufacture Day           : 03
Manufacture Week          : 08
Manufacture Year          : 2008
SAS Address               : 5000CCA0053EEA69
...
---->8---->8---->8---->8----
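
The full pdisk report is long, so to follow the rebuild it helps to keep
only ID, State and Progress. A minimal sketch, assuming the
"Key : Value" layout shown above, with Progress coming after ID and
State in each record; pdisk_states is a hypothetical name.

```shell
#!/bin/sh
# pdisk_states (hypothetical helper): read omreport pdisk (or vdisk)
# output on stdin and print one "ID State Progress" line per record.
pdisk_states() {
    awk '
        {
            # Split at the FIRST colon only: values such as 0:0:3
            # contain colons themselves.
            sep = index($0, ":")
            if (sep == 0) next
            key = substr($0, 1, sep - 1)
            val = substr($0, sep + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key == "ID"       { id = val }
        key == "State"    { state = val }
        # Progress is the last of the three fields in each record.
        key == "Progress" { print id, state, val }
    '
}
```

Fed with the report above, this should give "0:0:2 Online Not
Applicable" and "0:0:3 Rebuilding 5% complete".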



Then, about one hour later:

----8<----8<----8<----8<----
Jun 18 16:28:29 myserver Server Administrator: Storage Service EventID:
2092  Physical disk Rebuild completed:  Physical Disk 0:0:3 Controller
0, Connector 0
Jun 18 16:28:30 myserver Server Administrator: Storage Service EventID:
2121  Device returned to normal:  Virtual Disk 1 (Virtual Disk 1)
Controller 0 (PERC 5/i Integrated)
Jun 18 16:28:30 myserver Server Administrator: Storage Service EventID:
2124  Redundancy normal:  Virtual Disk 1 (Virtual Disk 1) Controller 0
(PERC 5/i Integrated)
Jun 18 16:28:30 myserver Server Administrator: Storage Service EventID:
2158  Physical disk online:  Physical Disk 0:0:3 Controller 0, Connector
0
---->8---->8---->8---->8----
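
To plan the maintenance window on the production machine, the rebuild
time can be computed from the two events (2065 started, 2092 completed).
A small sketch, assuming one event per line in syslog, the usual
"Mon DD HH:MM:SS" prefix, and a rebuild that starts and ends on the same
day; rebuild_seconds is a hypothetical name.

```shell
#!/bin/sh
# rebuild_seconds (hypothetical helper): read syslog lines on stdin and
# print the seconds elapsed between "Rebuild started" (EventID 2065)
# and "Rebuild completed" (EventID 2092).
rebuild_seconds() {
    awk '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,   p) {
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        # The third whitespace-separated field is the HH:MM:SS timestamp.
        /Storage Service EventID:[ \t]*2065/ { start = secs($3) }
        /Storage Service EventID:[ \t]*2092/ { print secs($3) - start }
    '
}
```

With the timestamps above (15:21:17 and 16:28:29) this gives 4032
seconds, i.e. a bit over 67 minutes for the 300GB RAID-1.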

----8<----8<----8<----8<----
$ sudo omreport storage vdisk
...

ID                  : 1
Status              : Ok
Name                : Virtual Disk 1
State               : Ready
Progress            : Not Applicable
Layout              : RAID-1
Size                : 278.88 GB (299439751168 bytes)
Device Name         : /dev/sdb
Type                : SAS
Read Policy         : Adaptive Read Ahead
Write Policy        : Write Through
Cache Policy        : Not Applicable
Stripe Element Size : 64 KB
Disk Cache Policy   : Disabled

...
---->8---->8---->8---->8----
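
For a final go/no-go, the Status fields can be checked from a script
instead of by eye. A minimal sketch, assuming that "Ok" is the only
healthy value (the degraded vdisk earlier reported "Non-Critical");
all_status_ok is a hypothetical name.

```shell
#!/bin/sh
# all_status_ok (hypothetical helper): read omreport vdisk or pdisk
# output on stdin; succeed only if at least one "Status" field was seen
# and every one of them is "Ok".
all_status_ok() {
    awk '
        /^Status[ \t]*:/ {
            seen = 1
            val = $0
            sub(/^Status[ \t]*:[ \t]*/, "", val)
            sub(/[ \t]+$/, "", val)
            if (val != "Ok") bad = 1
        }
        END {
            if (seen && !bad) exit 0
            exit 1
        }
    '
}
```

For example: "sudo omreport storage vdisk | all_status_ok && echo
healthy". Empty input counts as a failure, so a broken omreport call is
not mistaken for a healthy array.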

----8<----8<----8<----8<----
$ sudo omreport storage pdisk controller=0
...

ID                        : 0:0:3
Status                    : Ok
Name                      : Physical Disk 0:0:3
State                     : Online
Failure Predicted         : No
Progress                  : Not Applicable
Type                      : SAS
Capacity                  : 278.88 GB (299439751168 bytes)
Used RAID Disk Space      : 278.88 GB (299439751168 bytes)
Available RAID Disk Space : 0.00 GB (0 bytes)
Hot Spare                 : No
Vendor ID                 : DELL    
Product ID                : HUS153030VLS300 
Revision                  : A280
Serial No.                : J8W3LMYC            
Negotiated Speed          : Not Available
Capable Speed             : Not Available
Manufacture Day           : 03
Manufacture Week          : 08
Manufacture Year          : 2008
SAS Address               : 5000CCA0053EEA69

...
---->8---->8---->8---->8----


Worked perfectly. I now have to reproduce that on the real production
machine.

Thank you,

-- 
Bertrand LUPART

http://bertrand.gotpike.org/



More information about the Linux-PowerEdge mailing list