Dell 2650 Perc 3/Di RAID failure, Server wouldn't reboot during rebuild filesystem dead
Ben Russo
Ben at muppethouse.com
Wed Apr 7 17:57:01 CDT 2004
A Story of my problem with DELL PERC 3/Di...
It hung the OS when one disk of a RAID5 volume was
falsely accused of being "FAILED"
And then delayed rebooting for about 18 hours while
it "REBUILT" (it wouldn't reboot during this time)
I have a DELL PowerEdge 2650 that was bought in January of 2003
It came with
DUAL 2.6GHz P4 CPUs
3GB RAM
onboard Hardware RAID (PERC 3 Di)
five 142GB (or thereabouts) hard disks
I configured the 5 disks in a single BIG raid 5 volume
I installed RedHat AS 2.1 (as purchased from Dell with a 3 year license).
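As a rough check on the size of that volume: RAID 5 gives up one disk's worth of space to parity, so the usable capacity is (number of disks - 1) times the per-disk size. Here is a minimal sketch of that arithmetic, assuming the approximate 142GB figure quoted above. The function name is made up for illustration.

```python
def raid5_usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a RAID 5 set: one disk's worth of space holds parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


# Five disks at roughly 142GB each (the size is approximate, per the list above).
print(raid5_usable_gb(5, 142))  # prints 568
```

The controller later reports the volume as 536GB, which works out to about 134GB per disk, so the size the controller actually uses is presumably a bit below the nominal figure.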
The box had been running for about a year without a reboot when
at 5:50am yesterday the box stopped responding to SSH connections
I went into the lab and saw that DISK0 had an amber light and the LCD
panel on the server was Orange and had the text "DISK FAIL" scrolling
across it.
I could not log in on the console either; I would just get error
messages saying something like: "Ext3-fs error: cannot write inode
SCSI sector XXXXXXX" (approximately that anyway). They would repeat
every few minutes and just scroll up the screen endlessly.
I called Dell and the guy there told me to power cycle the machine.
"Really?" I asked, "This box has a 400+GB filesystem, I am worried about
doing that, is there any way to get the box to shutdown gracefully? I
wouldn't think that a single hard disk failure should cause the OS to
halt, I mean that *is* why we bought Hardware RAID right?"
The Dell support staff member told me that the procedure in this case
was to power cycle the machine and then go into the PERC BIOS SETUP
utility. Which we did. He had me go into the SCSI Disk utilities and
run the "DISK VERIFY" utility. This proceeded to 10% without reporting
any errors and I noticed that the Amber light had gone out. The Dell
technician told me that the hard disk error was therefore a false alarm
and that the RAID would now rebuild and that I could reboot the system if
I wanted to.
I wasn't happy about the fact that the RAID would "false alarm" and
cause the OS to hang! But what the heck, I have about two dozen such
servers and if this only happened to one of the two dozen once every
year or two I could live with that. Besides it didn't seem to have
caused any serious problem (just yet). I figured it would just take an
hour for the FSCK of the big filesystem and then we would be on our way.
Boy, was I wrong.
The server wouldn't reboot. It would get to the point where it normally
would show me a GRUB boot kernel selection menu. But instead of showing
me anything about GRUB at all I just got a message like:
No boot device found, press <F1> to try again, or <F2> to enter setup.
The DELL support tech told me that I might have to wait until the
rebuild was at 5% before it would reboot. Huh? That doesn't make any
sense, but OK, whatever, I will wait until it gets to 5%. So back to
the RAID controller BIOS setup utility I go, to watch it painfully slowly
rebuild a 536GB RAID 5 volume. (It rebuilt at just a little less than
10% per hour.)
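For what it's worth, the rebuild rate gives a quick way to guess how long the wait will be. Below is a small sketch of that arithmetic, assuming the rate stays constant (which it may well not). The function name and the sample numbers are only illustrative.

```python
def rebuild_hours_remaining(percent_done, percent_per_hour):
    """Hours left in a rebuild, assuming a constant rate."""
    if percent_per_hour <= 0:
        raise ValueError("rate must be positive")
    if not 0 <= percent_done <= 100:
        raise ValueError("percent_done must be between 0 and 100")
    return (100 - percent_done) / percent_per_hour


# From a standing start at the 10% per hour upper bound quoted above.
print(rebuild_hours_remaining(0, 10))  # prints 10.0
```

Anything slower than 10% per hour pushes the estimate past ten hours.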
In the meantime I asked the Dell support Tech if he could raise a red
flag on this issue (of a single disk problem causing the OS to halt).
He recommended I go to "linux.dell.com". I explained to him that I love
this mailing list and the website, but that I don't pay thousands of
dollars for name brand hardware and support so that I can go to a
community news group for support.
I explained to him that my boss would demand that Dell represent itself
for the money we pay for our servers, and that if I didn't have
something to show for it he would ask why we didn't just go buy
off-the-shelf server components and support them via the newsgroups all
the time. The Dell tech said he would get me a manager and put me on
hold. After about 10-20 minutes I got disconnected.
At this point I realized that I had never been given a case number or a
trouble ticket number. That is OK I thought, I'm sure they track these
things using the Express Svc code of the server or the Service Tag
number, right? So I called back Dell support... no such luck. They
didn't know the name of the guy I had been talking to and seemed to be
unaware of the history (1 hour's worth) that I had already established
with the Tech I had been working with. So I started all over again.
Got to the same point of being put on hold to get a manager. Got
disconnected again!
Called back. Again (at first) they seemed to be unaware of my history of
support calls and couldn't locate the two different persons I had been
working with. Then, somewhat to my surprise, they managed to find the
manager who had talked to the original tech. The manager explained that
my case was rare and that it seemed abnormal. But he said that there
was only one RAID/SCSI controller and only one SCSI bus in my server.
So he would send a new motherboard (with the onboard RAID controller)
and would send a new SCSI backplane and RAID cache RAM DIMM.
Dell delivered on this, and did it by 2:10pm. (By this time I had spent
almost 2 hours on the phone, and the only real progress on the server
outage was that it had been power cycled and was 15% done rebuilding the
RAID volume.)
After the hardware was replaced I tried booting up the OS again. Still
no dice. I figured that even if I had to re-install the server OS from
scratch I would still have to rebuild the RAID group. So I might as
well wait to see if it finished.
It did finish at 11:50PM that night (almost 18 hours later!)
To my delight it could find the GRUB boot menu now and I could select a
kernel and hit enter. However all the kernels would get to the stage:
Unpacking the kernel... OK, booting up Linux.
Then the server would just sit there and do nothing.
(And by the way the server was going to be slow as heck for at least
another 10 hours because the RAID volume was "SCRUBBING" now.)
I was able to boot from the RHAS 2.1 install CD in RESCUE mode.
It was able to see /dev/sda and find all the partitions and mount all
the filesystems. I fsck'd them all and examined the partition table.
Everything seemed OK.
I chroot'd to the /mnt/sysimage and then ran
rpm --verify kernel-smp-2.4.9-e.38 grub filesystem
Everything seemed just fine. I ran grub-install /dev/sda which
succeeded with no errors and looked good.
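For anyone retracing these steps from the rescue environment, here is a minimal sketch that collects the two commands above into one script. It assumes the installed system is mounted at /mnt/sysimage and the boot disk is /dev/sda, as described above. The helper names are made up, and it only prints the commands unless told to run them.

```python
import shlex
import subprocess

SYSROOT = "/mnt/sysimage"  # where rescue mode mounted the installed system
BOOT_DISK = "/dev/sda"     # the RAID 5 volume as seen by Linux


def rescue_commands(sysroot=SYSROOT, boot_disk=BOOT_DISK):
    """Return the verification and boot-loader commands, run inside the chroot."""
    return [
        ["chroot", sysroot, "rpm", "--verify",
         "kernel-smp-2.4.9-e.38", "grub", "filesystem"],
        ["chroot", sysroot, "grub-install", boot_disk],
    ]


def run_rescue(dry_run=True):
    """Print each command; execute it as well only when dry_run is False."""
    for cmd in rescue_commands():
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failure; note that rpm --verify
            # exits non-zero when it finds any discrepancy.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_rescue(dry_run=True)
```

Running the real commands needs root inside the rescue shell. The dry run is there so the command lines can be checked first.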
Then I tried rebooting again. Still no dice.
I called Dell again. They were pretty stumped too. They asked me to
make sure I had an /initrd directory and to check the permissions on
bunches of stuff. They had me force re-install the kernel RPM and do the
mkinitrd and grub-install again. Still no dice. I couldn't get the box
to reboot off of the RAID 5 volume. Dell said that even though I had
bought this server (with Hardware RAID) and a 3 year 4 hour support
contract, the contract was for HW only, and since I could now get to the
OS BOOT menu it was my problem. Jerks. They sell a HW RAID
controller that hangs the system and causes the RAID volume to go
non-bootable and they refuse to help. I could understand if they just said
"Well we have no ideas, you got any left?"
But rather than just say that I was stuck with re-installing and that they
felt bad about it (which I suspected was the case), they instead told me
that I was "on my own" because I didn't spend $299 for a software support
contract.
Anyway, I backed up /etc, /var, and /usr/share/rhn. Then I did a
minimal install using the existing partitions and filesystems on the
existing RAID 5 volume without re-formatting anything. After the
minimal install I booted (waited through another interminable fsck) and
then got into single user mode and restored /etc and /usr/share/rhn,
then I started up networking and ran up2date -u. Then I restored the
/var (where my RPM-database was) and rebooted again.
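One way to take the kind of backup described above is a tar archive per directory. This is only a sketch under that assumption, not necessarily how it was done here. The function name is hypothetical, and the demo runs against a throwaway directory rather than the real /etc.

```python
import tarfile
import tempfile
from pathlib import Path


def backup_dirs(dirs, dest_dir):
    """Write one gzip-compressed tar archive per directory into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archives = []
    for d in dirs:
        src = Path(d)
        # Name the archive after the path, e.g. /usr/share/rhn -> usr_share_rhn.tar.gz
        parts = src.parts[1:] if src.is_absolute() else src.parts
        archive = dest / ("_".join(parts) + ".tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(src, arcname=src.name)  # recurses into the directory
        archives.append(archive)
    return archives


if __name__ == "__main__":
    # Demo on a temporary directory; nothing on the real system is touched.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "etc"
        demo.mkdir()
        (demo / "hosts").write_text("127.0.0.1 localhost\n")
        for path in backup_dirs([demo], Path(tmp) / "backup"):
            print(path.name)
```

For the real thing the list would be /etc, /var, and /usr/share/rhn, written somewhere off the RAID volume being re-installed.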
Everything seems to be up now, and other than the 36 hour hiatus
everything is OK.
Dell, please... fix the problem with not being able to find the boot
device during rebuild of failed disk0 on a RAID5 volume. And it doesn't
make any sense not to assist a customer who paid many thousands of
dollars for HW that failed in a way that it shouldn't have. It kinda
ticks me off. If your hardware performed as advertised I wouldn't have
been asking you for support.
-Ben.