problem restoring RAID5 container

Chris Pakkala cpakkala at salu.com
Mon Oct 13 14:15:00 CDT 2003


After pulling the disk out and reinitializing it completely, I believe I've
isolated the problem.  Here's the current status:
AFA0> container list
Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
 0    RAID-5 16.0GB       32KB Open    0:00:0 64.0KB:5.33GB
 /dev/sda             root             0:01:0 64.0KB:5.33GB
                                       0:02:0 64.0KB:5.33GB
                                         --- Missing ---

 1    RAID-5 85.6GB       32KB Open    0:00:0 5.33GB:28.5GB
 /dev/sdb             data             0:01:0 5.33GB:28.5GB
                                       0:02:0 5.33GB:28.5GB
                                         --- Missing ---
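
(For what it's worth, the numbers are consistent with four-member RAID-5
containers: usable size is (members - 1) x member size, so 3 x 5.33GB is
about the 16.0GB shown for container 0 and 3 x 28.5GB is about the 85.6GB
shown for container 1.  Only the scrap on the missing fourth disk needs to
be put back.)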

Now I've disabled automatic failover so that I can specify which container I
want to fail over:
AFA0> controller show automatic_failover
Executing: controller show automatic_failover
Automatic failover DISABLED
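
For reference, I believe the command I used to turn it off was the
corresponding set form (I'm quoting the syntax from memory, so please verify
it against the afacli reference guide):

AFA0> controller set automatic_failover disable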

And disk (0:03:0) is initialized and ready to go:
AFA0> disk show space
Executing: disk show space

Scsi B:ID:L Usage      Size
----------- ---------- -------------
  0:00:0     Container 64.0KB:5.33GB
  0:00:0     Container 5.33GB:28.5GB
  0:00:0     Free      33.8GB: 123KB
  0:01:0     Container 64.0KB:5.33GB
  0:01:0     Container 5.33GB:28.5GB
  0:01:0     Free      33.8GB: 123KB
  0:02:0     Container 64.0KB:5.33GB
  0:02:0     Container 5.33GB:28.5GB
  0:02:0     Free      33.8GB: 123KB
  0:03:0     Free      64.0KB:33.8GB

Here's the problem.  When I do a "container set failover 0 (0:03:0)" and
wait a few minutes, nothing happens:
AFA0> task list
Executing: task list

Controller Tasks

TaskId Function  Done%  Container State Specific1 Specific2
------ -------- ------- --------- ----- --------- ---------

No tasks currently running on controller

However, when I do the same command on the second container, it works just
fine:
AFA0> container set failover 1 (0:03:0)
Executing: container set failover 1 (BUS=0,ID=3,LUN=0)

AFA0> task list
Executing: task list

Controller Tasks

TaskId Function  Done%  Container State Specific1 Specific2
------ -------- ------- --------- ----- --------- ---------
  102   Rebuild   0.0%      1      RUN   00000000  00000000


This explains why the rebuild happened automatically for container 1, but not
container 0, the first time I rebooted.  Does anyone have a clue why this
might be happening?  Could it be related to the fact that container 0 holds
the root partition?



-----Original Message-----
From: linux-poweredge-admin at dell.com
[mailto:linux-poweredge-admin at dell.com]On Behalf Of Chris Pakkala
Sent: Monday, October 13, 2003 9:36 AM
To: linux-poweredge at dell.com
Subject: problem restoring RAID5 container


I inherited a Red Hat 7.3 system that has 2 RAID 5 containers with 4
partitions each.  One of my drives (0:03:0) went bad, as was evident from the
"!" in the output of "container list".  I read both the afacli user's guide
and the reference guide, and neither one gave clear instructions on what to
do when a drive goes bad.  So, I replaced the drive and tried a "container
restore RAID5" on both labels.  The prompt returned right away and nothing
had changed.  So, I rebooted to enter the RAID configuration utility at the
PROM level, where I saw that the machine had automatically started restoring
container "1" with the new drive.  I let the restore complete, thinking that
it would continue on and restore container "0" as well, but it never did.  I
rebooted again, hoping that it would initiate another restore on container
"0", and it never did.  I pressed Ctrl-R to manually start the restore, but
was frightened off by the warning that I might lose data and canceled the
request.  Now the machine is fully booted and I see this:

Num          Total  Oth Chunk          Scsi   Partition                                      Creation        System
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size   State   RO Lk Task    Done%  Ent Date   Time      Files
----- ------ ------ --- ------ ------- ------ ------------- ------- -- -- ------- ------ --- ------ -------- ------
 0    RAID-5 16.0GB       32KB Open    0:00:0 64.0KB:5.33GB UnProt                         0  012802 14:12:13
 /dev/sda             root             0:01:0 64.0KB:5.33GB UnProt                         1  012802 14:12:13
                                       0:02:0 64.0KB:5.33GB UnProt                         2  012802 14:12:13
                                         --- Missing ---

 1    RAID-5 85.6GB       32KB Open    0:00:0 5.33GB:28.5GB                                0  012802 14:12:35
 /dev/sdb             data             0:01:0 5.33GB:28.5GB                                1  012802 14:12:35
                                       0:02:0 5.33GB:28.5GB                                2  012802 14:12:35
                                       0:03:0 64.0KB:28.5GB                                3  012802 14:12:35

As you can see, things are messed up now, because container "1" started the
new disk (0:03:0) at the beginning of the drive (offset 64.0KB), which is
where the partition for container "0" should start.  So my question is: how
do I remove the (0:03:0) disk from both containers and add it back (with the
correct offset:size values) without losing data?  If there is better
documentation available (on what to actually do, not just a list of commands
and switches), please let me know.  Any help would be greatly appreciated!
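
(To spell out what "correct" means here: judging from the other three disks
above, the scrap for container "0" on (0:03:0) should be 64.0KB:5.33GB and
the scrap for container "1" should be 5.33GB:28.5GB, rather than the single
64.0KB:28.5GB piece it has now.)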

Thank you,
Chris
