Hard Drive Rebuild under Linux

Steve_Boley at Dell.com
Tue Jan 14 08:32:01 CST 2003


You can try this and see if it fixes it.  You have a bad spot on drive 8
where the parity stripe needed to rebuild the drive is located.  Pull your
drives out of the system, go into setup, and change the controller from RAID
to SCSI.  Answer "y" to the two loss-of-data questions, and after the F1/F2
"no boot device" prompt, reboot and put the drives back in.

You will now see an Adaptec 7899 come up instead of the RAID controller.
Press Ctrl-A, go to the channel A controller, and then into SCSI Disk
Utilities.  It will scan the bus; if you see the drives you are on the right
controller, but if not, escape back and choose the other one.  Go to ID 8,
hit Enter, and perform a Disk Verify on it.  When it finds a bad sector,
say yes to remap it, and continue through the whole drive.

After this, do the same as above: pull the drives, go back into setup, and
change the controller back to RAID.  Reboot, answer yes to the loss-of-data
questions again, and after F1/F2 put the drives back in.  Then see whether
the drive will rebuild; if not, you will also need a replacement for ID 8
before rebuilding the system.
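If you would rather check the suspect range from Linux once the controller is
in SCSI mode, a sketch along these lines could work -- the device node
/dev/sdh is my assumption (check dmesg for the real one), and the block range
is the one from your controller log:

```shell
# Hedged sketch, not a tested procedure: with the controller in plain SCSI
# mode, the drive at ID 8 shows up as an ordinary SCSI disk, and the suspect
# range can be scanned read-only with badblocks (from e2fsprogs).
DEV=/dev/sdh      # ASSUMED device node for ID 8 -- verify in dmesg first!
FIRST=1672000     # start of the bad range reported in the controller log
LAST=1672063      # end of the bad range

if [ -b "$DEV" ]; then
    # Read-only scan of just that range, using the drive's 512-byte blocks.
    badblocks -b 512 -sv "$DEV" "$LAST" "$FIRST"
else
    echo "no block device at $DEV; skipping scan"
fi
```

Note a read-only scan only confirms the bad spot; getting the drive to remap
the sectors still takes a write to them, which the Ctrl-A Disk Verify does
for you, so the utility route above is the safer one.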
Steve

-----Original Message-----
From: Jean Lofts [mailto:jean.lofts at eng.ox.ac.uk]
Sent: Tuesday, January 14, 2003 6:45 AM
To: Boley, Steve
Cc: linux-poweredge at exchange.dell.com
Subject: Re: Hard Drive Rebuild under Linux



Steve

Many thanks for your response.

I received the replacement drive today and attempted the rebuild
again.  Same behaviour as previously seen: the rebuild started OK
but finished at around 2% done in state BAD.  I have now discovered
the following in the controller log:


[56]: parallel rebuild container 0
[57]: ID(0:08:0); Error Event [command:0x28]
[58]: ID(0:08:0); Medium Error, Block Range 1672000 : 1672063
[59]: ID(0:08:0); Unrecovered Read Error
.
.
.
[77]: Container 0 failed REBUILD task: I/O error - drive 0:8:0 fa
[78]: iled
.
.
.
Presumably, this is telling me that the rebuild of SCSI ID 0 failed because
of errors on SCSI ID 8?
Is there anything I can do to recover from this situation other than
a reinstall of the OS and a restore of the data?  I have never seen errors
in the system log to indicate that there was a problem with ID 8.
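Incidentally, the numbers line up: with 71132959-block members (from the
disk list output further down), the bad range starting at block 1672000 sits
roughly 2.4% of the way into the drive, which matches where the rebuild
stopped.  A quick check (my own arithmetic, not controller output):

```python
# Where does the reported bad range fall within one 33.8GB member?
blocks_per_member = 71132959   # from the afacli "disk list" (512-byte blocks)
bad_block_start = 1672000      # from controller log entry [58]

pct = bad_block_start / blocks_per_member * 100
print(f"bad range starts {pct:.1f}% into the member")
```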

Jean

Steve_Boley at Dell.com wrote:

> After the rebuild failed, you obviously needed to do a controller rescan;
> you would then have seen the drive in failed status again.  You need to
> replace ID 0.  Pull it while the system is rebooting, before the PERC
> posts, so that it comes up as a missing member, and then replace it as
> soon as possible.
> Steve
>
> -----Original Message-----
> From: Jean Lofts [mailto:jean.lofts at eng.ox.ac.uk]
> Sent: Friday, January 10, 2003 7:51 AM
> To: linux-poweredge at exchange.dell.com
> Subject: Hard Drive Rebuild under Linux
>
> Hello All
>
> I have a Dell PE 4400 with PERC 3/Di running RedHat Linux 6.2.
> It has eight drives configured as a single RAID-5 container.
>
> On reboot the system reported
>
> following containers have missing members and are degraded
> container #0 RAID 5 237.29GB critical
>
> afacli reported
>
> AFA0> container list
> Executing: container list
> Num          Total  Oth Chunk          Scsi   Partition
> Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
> ----- ------ ------ --- ------ ------- ------ -------------
>  0    RAID-5  237GB       32KB Open    0:00:0 64.0KB!33.8GB
>  /dev/sda                              0:01:0 64.0KB:33.8GB
>                                        0:02:0 64.0KB:33.8GB
>                                        0:03:0 64.0KB:33.8GB
>                                        0:04:0 64.0KB:33.8GB
>                                        0:05:0 64.0KB:33.8GB
>                                        0:08:0 64.0KB:33.8GB
>                                        0:09:0 64.0KB:33.8GB
>
> AFA0> disk list
> Executing: disk list
>
> B:ID:L  Device Type     Blocks    Bytes/Block Usage            Shared
> ------  --------------  --------- ----------- ---------------- ------
> 0:00:0   Disk            71132959  512         Initialized     NO
> 0:01:0   Disk            71132959  512         Initialized     NO
> 0:02:0   Disk            71132959  512         Initialized     NO
> 0:03:0   Disk            71132959  512         Initialized     NO
> 0:04:0   Disk            71132959  512         Initialized     NO
> 0:05:0   Disk            71132959  512         Initialized     NO
> 0:08:0   Disk            71132959  512         Initialized     NO
> 0:09:0   Disk            71132959  512         Initialized     NO
>
> AFA0> disk show space
> Executing: disk show space
>
> Scsi B:ID:L Usage      Size
> ----------- ---------- -------------
>   0:00:0     Dead      64.0KB:33.8GB
>   0:00:0     Free      33.8GB:59.0KB
>   0:01:0     Container 64.0KB:33.8GB
>   0:01:0     Free      33.8GB:59.0KB
>   0:02:0     Container 64.0KB:33.8GB
>   0:02:0     Free      33.8GB:59.0KB
>   0:03:0     Container 64.0KB:33.8GB
>   0:03:0     Free      33.8GB:59.0KB
>   0:04:0     Container 64.0KB:33.8GB
>   0:04:0     Free      33.8GB:59.0KB
>   0:05:0     Container 64.0KB:33.8GB
>   0:05:0     Free      33.8GB:59.0KB
>   0:08:0     Container 64.0KB:33.8GB
>   0:08:0     Free      33.8GB:59.0KB
>   0:09:0     Container 64.0KB:33.8GB
>   0:09:0     Free      33.8GB:59.0KB
>
> After researching the newsgroups, I attempted to rebuild the failed drive
> as follows:
>
> disk remove dead_partitions (0,0,0)
> container set failover 0 (0,0,0)
>
> task list
>
> Controller Tasks
>
> TaskId Function Done%  Container State Specific1 Specific2
> ------ -------- ------- --------- ----- --------- ---------
>   100   Rebuild   0.1%     00     RUN   00000000  00000000
>
> So far so good.  But the rebuild task only continued for about 5 min
> before task list reported that there were no current tasks.
> The previous task list, done only a minute earlier, had reported
> only about 1% done.  After the rebuild finished, I could see no
> activity on the failed disk when there was clearly activity on the
> other seven drives.
>
> afacli now reports
>
> AFA0> container list
> Executing: container list
> Num          Total  Oth Chunk          Scsi   Partition
> Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
> ----- ------ ------ --- ------ ------- ------ -------------
>  0    RAID-5  237GB       32KB Open    0:00:0 64.0KB:33.8GB
>  /dev/sda                              0:01:0 64.0KB:33.8GB
>                                        0:02:0 64.0KB:33.8GB
>                                        0:03:0 64.0KB:33.8GB
>                                        0:04:0 64.0KB:33.8GB
>                                        0:05:0 64.0KB:33.8GB
>                                        0:08:0 64.0KB:33.8GB
>                                        0:09:0 64.0KB:33.8GB
>
> which looks good, but
>
> AFA0> enclosure show slot
> Executing: enclosure show slot
>
> Enclosure
> ID (B:ID:L) Slot scsiId Insert  Status
> ----------- ---- ------ ------ ------------------------------------------
>  0  0:06:0   0   0:00:0     0   OK FAILED CRITICAL ACTIVATE
>  0  0:06:0   1   0:01:0     0   OK FAILED CRITICAL ACTIVATE
>  0  0:06:0   2   0:02:0     0   OK FAILED CRITICAL ACTIVATE
>  0  0:06:0   3   0:03:0     0   OK FAILED CRITICAL ACTIVATE
>  0  0:06:0   4   0:04:0     0   OK FAILED CRITICAL ACTIVATE
>  0  0:06:0   5   0:05:0     0   OK FAILED CRITICAL ACTIVATE
>  0  0:06:0   6   0:08:0     0   OK FAILED CRITICAL ACTIVATE
>  0  0:06:0   7   0:09:0     0   OK FAILED CRITICAL ACTIVATE
>
> would seem to indicate that there is still a problem.
>
> I have now rebooted the system and I receive the same message from
> the controller
>
> following containers have missing members and are degraded
> container #0 RAID 5 237.29GB critical
>
> On attempting to view the container information in the Configuration
> Utility, I am presented with the message
>
> configuration changes have been detected in the system. If you reject
> the change you will not be able to modify the current configuration.
> If you accept it will be updated to the current configuration.
>
> Is there any risk in choosing accept here? I currently have a working
> system and don't want to risk doing further damage.
>
> Any suggestions or advice would be most welcome. I would prefer
> to rebuild from the afacli utility if possible, but will take
> the system down and rebuild in the Configuration Utility if necessary.
>
> Thanks
>
> Jean
>
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge at dell.com
> http://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq or search the list
> archives at http://lists.us.dell.com/htdig/

--
Jean Lofts                     E-mail: jean.lofts at eng.ox.ac.uk

Computing Officer (Medical Vision Laboratory)
Dept of Engineering Science
University of Oxford
Parks Rd                            Tel: (0)1865-280921
Oxford OX1 3PJ UK                   Fax: (0)1865-280922






