Dell 2650 PERC 3/Di RAID failure; server wouldn't reboot during rebuild, filesystem dead

Ben Russo Ben at muppethouse.com
Wed Apr 7 17:57:01 CDT 2004


	A Story of my problem with DELL PERC 3/Di...
	It hung the OS when one disk of a RAID5 volume was
		falsely accused of being "FAILED"
	And then delayed rebooting for about 18 hours while
		it "REBUILT"  (it wouldn't reboot during this time)

I have a DELL PowerEdge 2650 that was bought in January of 2003.
It came with
	DUAL 2.6GHz P4 CPUs
	3GB RAM
	onboard Hardware RAID (PERC 3/Di)
	five 142GB (or thereabouts) hard disks

I configured the 5 disks in a single BIG RAID 5 volume.
I installed RedHat AS 2.1 (as purchased from Dell with a 3 year license).

The box had been running for about a year without a reboot when, at
5:50am yesterday, it stopped responding to SSH connections.
I went into the lab and saw that DISK0 had an amber light and the LCD
panel on the server was orange and had the text "DISK FAIL" scrolling
across it.

I could not log in on the console either; I would just get error
messages saying something like: "Ext3-fs error: cannot write inode
SCSI sector XXXXXXX" (approximately that anyway).  They would repeat
every few minutes and just scroll up the screen endlessly.

I called Dell and the guy there told me to power cycle the machine.
"Really?" I asked. "This box has a 400+GB filesystem, I am worried about
doing that.  Is there any way to get the box to shut down gracefully?  I
wouldn't think that a single hard disk failure should cause the OS to
halt; I mean, that *is* why we bought Hardware RAID, right?"

The Dell support staff member told me that the procedure in this case
was to power cycle the machine and then go into the PERC BIOS SETUP
utility, which we did.  He had me go into the SCSI Disk utilities and
run the "DISK VERIFY" utility.  This proceeded to 10% without reporting
any errors, and I noticed that the amber light had gone out.  The Dell
technician told me that the hard disk error was therefore a false alarm,
that the RAID would now rebuild, and that I could reboot the system now
if I wanted to.

I wasn't happy about the fact that the RAID would "false alarm" and
cause the OS to hang!  But what the heck, I have about two dozen such
servers, and if this only happened to one of the two dozen once every
year or two I could live with that.  Besides, it didn't seem to have
caused any serious problem (just yet)...  I figured it would just take an
hour for the FSCK of the big filesystem and then we would be on our way.
Boy, was I wrong.

The server wouldn't reboot.  It would get to the point where it normally
would show me the GRUB boot kernel selection menu, but instead of showing
me anything about GRUB at all I just got a message like:

	No boot device found, press <F1> to try again, or <F2> to enter setup.

The DELL support tech told me that I might have to wait until the
rebuild was at 5% before it would reboot.  Huh?  That doesn't make any
sense, but OK, whatever, I will wait until it gets to 5%.  So back to
the RAID controller BIOS setup utility I go, to watch it painfully slowly
rebuild a 536GB RAID 5 volume.  (It rebuilt at just a little less than
10% per hour.)

In the meantime I asked the Dell support Tech if he could raise a red 
flag on this issue (of a single disk problem causing the OS to halt).

He recommended I go to "linux.dell.com".  I explained to him that I love
this mailing list and the website, but that I don't pay thousands of
dollars for name brand hardware and support so that I can go to a
community news group for support.

I explained to him that my boss would demand that Dell represent itself
for the money we pay for our servers, and that if I didn't have
something to show for it he would ask why we didn't just buy
Off-The-Shelf server components and support them via the newsgroups all
the time.  The Dell tech said he would get me a manager and put me on
hold.  After about 10-20 minutes I got disconnected.

At this point I realized that I had never been given a case number or a
trouble ticket number.  That's OK, I thought, I'm sure they track these
things using the Express Svc code of the server or the Service Tag
number, right?  So I called Dell support back... no such luck.  They
didn't know the name of the guy I had been talking to and seemed to be
unaware of the history (1 hour's worth) that I had already established
with the Tech I had been working with.  So I started all over again.
Got to the same point of being put on hold to get a manager.  Got
disconnected again!

Called back.  Again (at first) they seemed to be unaware of my history of
support calls and couldn't locate the two different persons I had been
working with.  Then, a little to my surprise, they managed to find the
manager who had talked to the original tech.  The manager explained that
my case was rare and that it seemed abnormal, but he said that there
was only one RAID/SCSI controller and only one SCSI bus in my server.
So he would send a new motherboard (with the onboard RAID controller),
a new SCSI backplane, and a new RAID cache RAM DIMM.

Dell delivered on this, and did it by 2:10pm.  (By this time I had spent
almost 2 hours on the phone, and the only real progress on the server
outage was that it had been power cycled and was 15% done REBUILDing the
RAID volume.)

After the hardware was replaced I tried booting up the OS again.  Still 
no dice.  I figured that even if I had to re-install the server OS from 
scratch I would still have to rebuild the RAID group.  So I might as 
well wait to see if it finished.

It did finish at 11:50PM that night (almost 18 hours later!).
To my delight it could find the GRUB boot menu now, and I could select a
kernel and hit enter.  However, all the kernels would get to the stage:
	Unpacking the kernel... OK, booting up Linux.
Then the server would just sit there and do nothing.
(And by the way, the server was going to be slow as heck for at least
another 10 hours because the RAID volume was "SCRUBBING" now.)

I was able to boot from the RHAS 2.1 install CD in RESCUE mode.

It was able to see /dev/sda, find all the partitions, and mount all
the filesystems.  I fsck'd them all and examined the partition table.
Everything seemed OK.
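
For anyone following along, the rescue-shell checks were roughly along
these lines (the partition numbering is just an example, not necessarily
my exact layout):

	# from the RHAS 2.1 rescue shell (example partition numbers)
	fdisk -l /dev/sda                      # look over the partition table
	umount /mnt/sysimage/boot              # unmount before fsck'ing
	umount /mnt/sysimage
	e2fsck -f /dev/sda1                    # check each ext3 filesystem
	e2fsck -f /dev/sda2
	mount /dev/sda2 /mnt/sysimage          # remount for the chroot work below
	mount /dev/sda1 /mnt/sysimage/boot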

I chroot'd to /mnt/sysimage and then ran
	rpm --verify kernel-smp-2.4.9-e.38 grub filesystem
Everything seemed just fine.  I ran grub-install /dev/sda, which
succeeded with no errors and looked good.

Then I tried rebooting again.  Still no dice.

I called Dell again.  They were pretty stumped too.  They asked me to
make sure I had an /initrd directory and to check the permissions on
bunches of stuff; they had me force re-install the kernel RPM and do the
mkinitrd and grub-install again (roughly the sequence sketched below).
Still no dice.  I couldn't get the box to reboot off of the RAID 5
volume.  Dell said that even though I had bought this server (with
Hardware RAID) and a 3 year 4 hour support contract, it was for HW only,
and since I was now able to get to the OS BOOT menu it was my problem.
Jerks.  They sell a HW RAID controller that hangs the system and causes
the RAID volume to go non-bootable, and they refuse to help.  I could
understand if they just said
	"Well we have no ideas, you got any left?"
But rather than just saying that I was stuck with re-installing and that
they felt bad about it (which I suspected was the case), they instead told
me that I was "on my own" because I didn't spend $299 for a software
support contract.
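
For completeness, the forced kernel re-install / mkinitrd / grub-install
pass we did from inside the chroot went roughly like this.  The RPM
filename and initrd image name below are reconstructed from the
kernel-smp-2.4.9-e.38 package mentioned above, so treat them as
approximate:

	# inside the chroot on /mnt/sysimage, with the kernel RPM available
	rpm -ivh --force kernel-smp-2.4.9-e.38.i686.rpm      # force re-install the SMP kernel
	mkinitrd -f /boot/initrd-2.4.9-e.38smp.img 2.4.9-e.38smp   # rebuild its initrd
	grub-install /dev/sda                                # put GRUB back on the MBR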

Anyway, I backed up /etc, /var, and /usr/share/rhn.  Then I did a
minimal install using the existing partitions and filesystems on the
existing RAID 5 volume, without re-formatting anything.  After the
minimal install I booted (waited through another interminable fsck) and
then got into single user mode and restored /etc and /usr/share/rhn,
then I started up networking and ran up2date -u.  Then I restored
/var (where my RPM database was) and rebooted again.
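
Roughly, the backup-and-restore dance looked like the following; the
/mnt/backup path is just a placeholder (the tarballs obviously need to
live somewhere the re-install won't clobber, like another host or
external media):

	# before the minimal install: stash copies of the config and databases
	tar czf /mnt/backup/etc.tar.gz /etc
	tar czf /mnt/backup/var.tar.gz /var
	tar czf /mnt/backup/rhn.tar.gz /usr/share/rhn

	# after the minimal install, from single user mode:
	tar xzf /mnt/backup/etc.tar.gz -C /        # restore /etc first
	tar xzf /mnt/backup/rhn.tar.gz -C /        # and the RHN registration info
	# bring up networking, then pull updates:
	up2date -u
	# finally restore /var (with the RPM database) and reboot
	tar xzf /mnt/backup/var.tar.gz -C /
	reboot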

Everything seems to be up now, and other than the 36 hour hiatus 
everything is OK.

Dell, please... fix the problem with not being able to find the boot
device during the rebuild of a failed disk0 on a RAID 5 volume.  And it
doesn't make any sense not to assist a customer who paid many thousands
of dollars for HW that failed in a way that it shouldn't have.  It kinda
ticks me off.  If your hardware performed as advertised I wouldn't have
been asking you for support.

-Ben.



