Bad crash. Any idea?

Fabrice LORRAIN fabrice.lorrain at
Thu Aug 8 12:29:01 CDT 2002

Hi all,

Two days ago our main PE4400 + PERC 3Di crashed. We have just finished putting a 
spare server online, but we are missing some data. Any help appreciated.

Here is the story:
- On Tuesday the server froze and I could not connect to it (ssh).
- I found the console spitting
"failed to exec /sbin/modprobe -s -k binfmt-b1d7, errno=8"
like mad. I couldn't log in, and a three-finger reboot (Ctrl-Alt-Del) didn't work.
At the same time, the 5 disks of our RAID5 pool were lit up like a Christmas 
tree (blinking orange; the lone volume disk seemed to be OK).

-> AC power off / AC power on
During the POST:
container #0 RAID 5 critical (known problem)
container #1 unknown --> the real problem
container #2 Volume OK

Bye-bye to our 60 GB /home on sdb1 (container #1); sda2 (/ on ext3) seems to 
be beyond salvation too. sda[5-7] are OK.

Right now, I am booting the server into an nfsroot environment with afacli available.

What I would like to know is:
- where the binfmt error message comes from,
- whether there is any chance we can get container #1 back online,
- some explanation of how this mess could happen (i.e. how can we lose a 
whole container because of an AC power cut).
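
On the first question, one detail that can at least be decoded mechanically is the errno value in the console message. A minimal Python sketch, assuming it is run on a Linux box (errno numbering is platform-specific); the helper name describe_errno is only for illustration:

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

if __name__ == "__main__":
    # errno=8 is the value printed in the console message quoted earlier.
    name, message = describe_errno(8)
    print(f"errno=8 -> {name}: {message}")
```

On Linux, 8 is ENOEXEC (exec format error), so the message seems to say that the kernel could not execute /sbin/modprobe itself. Whether that points at a damaged binary on sda2 is only a guess on my part.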

Technical info (I can provide more if needed):
- hardware: PowerEdge 4400 + PERC 3Di, dual Xeon 933 MHz, 1 GB RAM, Alteon 
copper Gigabit NIC (+ onboard Intel)
- BIOS A06, ESM 5.22, array monitor v2.1-3
- distribution: Debian potato
- kernel: vanilla 2.2.19 + SMP + aacraid patch from Matt's page + ext3 patch 
(maybe kernel 2.2.18)

The server is our main file server (Samba) + DHCP server + DNS slave.
It had been running like a charm for more than a year with almost no load.

During June, we had a problem with drive 4. I changed the disk, but the 
automatic rebuild didn't do what I expected (see the "container list" 
output below). I left the first container in critical state because we were 
supposed to replace the server shortly...

AFA0> container list
Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
  0    RAID-5 8.00GB       32KB Valid   0:00:0 64.0KB:2.00GB
  /dev/sda             system           0:01:0 64.0KB:2.00GB
                                        0:02:0 64.0KB:2.00GB
                                        0:03:0 64.0KB:2.00GB
                                        --- Missing ---

  1    RAID-5 59.7GB       32KB Valid   0:00:0 2.00GB!14.9GB
  /dev/sdb             donnees          0:01:0 2.00GB!14.9GB
                                        0:02:0 2.00GB!14.9GB
                                        0:03:0 2.00GB!14.9GB
                                        0:04:0 64.0KB!14.9GB

  2    Volume 16.9GB             Open    0:09:0 64.0KB:16.9GB
  /dev/sdc             dump
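
As a sanity check on the listing above, the totals match the usual RAID-5 rule that usable capacity is (members - 1) x member size, with one member's worth of space going to parity. A small Python sketch; the function name and the member counts (5 per container, counting the missing one in container 0) are my reading of the listing, not something afacli reports:

```python
def raid5_usable_gb(members, member_size_gb):
    """Usable capacity of a RAID-5 set: one member's worth of space holds parity."""
    if members < 3:
        raise ValueError("RAID-5 needs at least 3 members")
    return (members - 1) * member_size_gb

if __name__ == "__main__":
    # Container 0: 4 partitions listed + 1 missing, 2.00GB each.
    print(round(raid5_usable_gb(5, 2.00), 2))
    # Container 1: 5 partitions of 14.9GB each (sizes are rounded in the listing).
    print(round(raid5_usable_gb(5, 14.9), 2))
```

This gives 8.0 and 59.6, against the 8.00GB and 59.7GB reported; the small difference for container 1 is presumably rounding of the per-partition size.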

Thanks for any insight.

	F. Lorrain
	systems and network administrator
	universite de Marne-la-Vallee

More information about the Linux-PowerEdge mailing list