PV220S in a bad state. Recovery advice needed
philippe.gramoulle at mmania.com
Tue May 18 18:28:00 CDT 2004
Thanks for the lengthy description.
On Tue, 18 May 2004 12:14:47 -0500
<Jason_Mick at Dell.com> wrote:
| I would say that you are having some communication issues on this vault. Perhaps the controller, cables, or ZEMM firmware. From what I can tell I would say that this is the sequence of what has happened on this system.
| 1. Drive 4 failed
| 2. Drive 14 was assigned to rebuild
| 3. Drive 14 failed to respond to the rebuild task so it was failed
| 4. Drive 15 jumped in for drive 14.
| 5. Drive 15 failed to respond to the rebuild task so it was failed.
| 6. Sometime after this process drive 0 also failed to respond to a command so it was failed.
| At this point the entire volume was off line.
| If we assume that none of the drives were accessible at this point then it is possible to force only drive 0 online and then boot the system up. That does not take into account what data was possibly attempting to be written to that container at the time of failure. Since you have a logical spanned volume over two RAID 5 sets there may not be any recovery, since a spanned volume is not redundant. At the time of this failure one of the sets in your volume was still accessible (the RAID 5 set A01-00 thru A01-05); the other was not. I am not sure if there will be any recovery in this situation without restoration from backup.
| I guess the best approach will be to force drive 0 back online and see if the logical volume will mount.
Indeed, this is what was done and it seems to have done the trick. Right now, after the hotspare disks were replaced, disk #4 is rebuilding and the data is available in degraded mode.
Layout now looks like this:
0° ONLINE A00-00
1° ONLINE A00-01
2° ONLINE A00-02
3° ONLINE A00-03
5° ONLINE A00-05
8° ONLINE A01-00
9° ONLINE A01-01
10° ONLINE A01-02
11° ONLINE A01-03
12° ONLINE A01-04
13° ONLINE A01-05
14° REBLD A00-04
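As a sanity check on the spanned-volume point in the quoted reply, here is a toy sketch of why forcing drive 0 back online was required. The drive-to-set mapping comes from the layout above; the check itself is my simplification, not the actual PERC/SEMM logic:

```python
# Toy model of the spanned volume: a linear span over two RAID 5 sets is
# only available while *each* set has at most one failed member.
# Drive-to-set mapping is taken from the layout above; this check is a
# simplification, not the real controller behaviour.

SET_A00 = {0, 1, 2, 3, 4, 5}       # A00-00 .. A00-05
SET_A01 = {8, 9, 10, 11, 12, 13}   # A01-00 .. A01-05

def volume_online(failed_drives):
    for raid5 in (SET_A00, SET_A01):
        if len(raid5 & set(failed_drives)) > 1:
            return False           # this RAID 5 set lost redundancy
    return True

# Sequence from the post: drive 4 failed first (hotspares 14/15 sit
# outside both sets), then drive 0 was failed as well.
print(volume_online({4}))      # True  - degraded but up
print(volume_online({4, 0}))   # False - two failures in A00, span offline
```

Forcing drive 0 back online returns the A00 set to a single-failure (degraded) state, which is why the logical volume mounts again.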
| None of the steps above do anything to determine the nature of this failure. It is my opinion that this failure is not drive related. The only possible way that a drive could have caused this is if the drive was failing in a way that it was causing a communication issue on the entire bus.
I think it was the cables in the first place: it looks like cables with the wrong part number were shipped, and replacements were due in a few days.
| Things that I would check in attempts to root cause...
| What cables are you using?
Like I said, we had diagnosed that the wrong cables were installed and we were about to change them.
It is rather unlikely that two disk drives break at the same time in the same PowerVault.
| Do any of the cables have excessive bends in them?
Don't think so, just not the right ones, it seems.
| What version of ZEMM firmware are you using?
E.10. We'll upgrade to E17 asap.
| Do the drives that are installed have any available firmware updates? (check support.dell.com)
No, I've checked: all drives are Hitachi DK32EJ-72NC (160 MB/sec), and no firmware update is available on the Dell support site.
| Did the controller log any messages prior to this failure?
Nope, just SCSI errors directly:
May 18 16:19:41 server39: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 40001
May 18 16:19:43 server39: I/O error: dev 08:11, sector 847375472
May 18 16:19:44 server39: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 40001
May 18 16:19:49 server39: I/O error: dev 08:11, sector 847375528
May 18 16:19:49 server39: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 40001
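For what it's worth, a few lines of stdlib Python are enough to summarize this kind of syslog noise by device and sector. This is a hypothetical helper of my own; the field layout it parses matches the 2.4-era kernel messages quoted above:

```python
import re

# Summarize Linux "SCSI disk error" / "I/O error" syslog lines like the
# ones quoted above (hypothetical helper, standard library only).

SCSI_RE = re.compile(r"SCSI disk error : host (\d+) channel (\d+) id (\d+) lun (\d+)")
IO_RE = re.compile(r"I/O error: dev (\S+), sector (\d+)")

def summarize(lines):
    devices, sectors = set(), []
    for line in lines:
        m = SCSI_RE.search(line)
        if m:
            devices.add("host%s/chan%s/id%s/lun%s" % m.groups())
        m = IO_RE.search(line)
        if m:
            sectors.append(int(m.group(2)))
    return devices, sectors

log = [
    "May 18 16:19:41 server39: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 40001",
    "May 18 16:19:43 server39: I/O error: dev 08:11, sector 847375472",
]
devs, secs = summarize(log)
print(devs, secs)   # {'host1/chan0/id0/lun0'} [847375472]
```

Repeated errors pinned to one host/channel/id point at a single drive; errors spread across many ids on the same channel point at the bus (cables, terminator, or the enclosure module), which is consistent with the cable diagnosis here.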
More information about the Linux-PowerEdge mailing list