Linux Debian 5 on PE R610 : random freeze at boot

ROUSSEL Kévin Kevin.Roussel at loria.fr
Mon Sep 14 11:48:30 CDT 2009


Hello,

I am in charge of three PowerEdge R610 servers at work, and we plan to
operate them with Debian Linux v. 5 ("Lenny") in 64-bit mode ("amd64").

I managed to install the system without much problem (using the
additional bnx2 firmware for the NIC), but when we began to use the
servers in production, we faced a big problem.

The problem is this: each time we reboot any of the three machines, the
system hangs at random (in about 50% of cases) at the beginning of the
boot process.

To be more precise:
All three of our PE R610 servers use SAS disks (300 GB Seagate
ST9300603SS disks) behind a PERC 6/i controller, which groups them into
RAID-1 arrays. The controller presents those arrays to the Linux kernel,
which in turn uses them as LVM physical volumes.

At boot, the kernel detects them in sequence, producing messages like
these:

...

[    1.451160] scsi 0:0:32:0: Enclosure         DP       BACKPLANE 1.07 PQ: 0 ANSI: 5
[    1.465155] scsi 0:2:0:0: Direct-Access     DELL     PERC 6/i 1.22 PQ: 0 ANSI: 5
[    1.465167] scsi 0:2:1:0: Direct-Access     DELL     PERC 6/i 1.22 PQ: 0 ANSI: 5
[    1.465951] scsi 0:2:2:0: Direct-Access     DELL     PERC 6/i 1.22 PQ: 0 ANSI: 5
[    1.491462] Driver 'sd' needs updating - please use bus_type methods
[    1.492786] sd 0:2:0:0: [sda] 584843264 512-byte hardware sectors (299440 MB)
[    1.492808] sd 0:2:0:0: [sda] Write Protect is off
[    1.492808] sd 0:2:0:0: [sda] Mode Sense: 1f 00 10 08
[    1.492849] sd 0:2:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.492885] sd 0:2:0:0: [sda] 584843264 512-byte hardware sectors (299440 MB)
[    1.492905] sd 0:2:0:0: [sda] Write Protect is off
[    1.492905] sd 0:2:0:0: [sda] Mode Sense: 1f 00 10 08
[    1.492947] sd 0:2:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.492947]  sda: sda1 sda2 sda3
[    1.507952] sd 0:2:0:0: [sda] Attached SCSI disk
[    1.510478] sd 0:2:1:0: [sdb] 584843264 512-byte hardware sectors (299440 MB)
[    1.510486] sd 0:2:1:0: [sdb] Write Protect is off
[    1.510486] sd 0:2:1:0: [sdb] Mode Sense: 1f 00 10 08
[    1.510505] sd 0:2:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.510520] sd 0:2:1:0: [sdb] 584843264 512-byte hardware sectors (299440 MB)
[    1.510529] sd 0:2:1:0: [sdb] Write Protect is off
[    1.510529] sd 0:2:1:0: [sdb] Mode Sense: 1f 00 10 08
[    1.510548] sd 0:2:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.510548]  sdb: sdb1
[    1.520318] sd 0:2:1:0: [sdb] Attached SCSI disk
[    1.520330] sd 0:2:2:0: [sdc] 584843264 512-byte hardware sectors (299440 MB)
[    1.523459] sd 0:2:2:0: [sdc] Write Protect is off
[    1.523459] sd 0:2:2:0: [sdc] Mode Sense: 1f 00 10 08
[    1.523484] sd 0:2:2:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.523500] sd 0:2:2:0: [sdc] 584843264 512-byte hardware sectors (299440 MB)
[    1.523506] sd 0:2:2:0: [sdc] Write Protect is off
[    1.523506] sd 0:2:2:0: [sdc] Mode Sense: 1f 00 10 08
[    1.523525] sd 0:2:2:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.523525]  sdc: sdc1
[    1.531530] sd 0:2:2:0: [sdc] Attached SCSI disk
[    1.765212] device-mapper: uevent: version 1.0.3
[    1.765948] device-mapper: ioctl: 4.13.0-ioctl (2007-10-18) initialised: dm-devel at redhat.com
[    1.847550] PM: Starting manual resume from disk
[    1.875987] kjournald starting.  Commit interval 5 seconds
[    1.875987] EXT3-fs: mounted filesystem with ordered data mode.
[    2.959687] udevd version 125 started
[    3.117734] dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.2)

...


The above is the 'dmesg' output recorded during a successful boot.
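
For anyone who wants to compare several boots, here is a minimal Python
sketch that pulls the "Attached SCSI disk" timestamps out of a saved
'dmesg' capture. The function name attach_times and the regular
expression are my own assumptions, written against the line format seen
in the log above; the sample lines are copied from that log.

```python
import re

# Assumed line format, taken from the log above, e.g.:
# [    1.507952] sd 0:2:0:0: [sda] Attached SCSI disk
ATTACH_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+sd\s+\S+\s+\[(\w+)\]\s+Attached SCSI disk"
)


def attach_times(lines):
    """Return {device: seconds since kernel start} for each attached disk."""
    times = {}
    for line in lines:
        match = ATTACH_RE.match(line)
        if match:
            times[match.group(2)] = float(match.group(1))
    return times


# Three lines copied from the successful boot shown above.
SAMPLE = [
    "[    1.507952] sd 0:2:0:0: [sda] Attached SCSI disk",
    "[    1.520318] sd 0:2:1:0: [sdb] Attached SCSI disk",
    "[    1.531530] sd 0:2:2:0: [sdc] Attached SCSI disk",
]

if __name__ == "__main__":
    for device, seconds in sorted(attach_times(SAMPLE).items()):
        print(f"{device}: attached at {seconds:.6f} s")
```

Running the same function over the output of several successful boots
would show how much the attach times vary from one boot to the next.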

When the boot is going to fail, the SCSI/SAS detection sequence shown
above is interrupted by a message like:

'Volume group "vg-server" not found.'

Then, immediately after the SCSI/SAS sequence completes (in the above
example, this would be just after the 'sd 0:2:2:0: [sdc] Attached SCSI
disk' line), this line appears on the console:

'Begin: Waiting for root file system ...'

and the system just drops dead! It becomes totally unresponsive (even to
the "SysRq" key combinations); the only way to revive the server in this
state is to power it off and back on.
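
To check a captured console log (for example from a serial console) for
the ordering described above, a small Python sketch along these lines
could be used. The function name and the FAILED_BOOT sample are
hypothetical: the sample is reconstructed from my description, not an
actual capture, and the exact position of the "Volume group" line in it
is an assumption. The check only detects the ordering; it says nothing
about the cause.

```python
def vg_lookup_before_last_attach(lines):
    """Return True if a 'Volume group ... not found' line appears before
    the last 'Attached SCSI disk' line in the captured console output."""
    vg_index = None
    last_attach = None
    for index, line in enumerate(lines):
        if vg_index is None and "Volume group" in line and "not found" in line:
            vg_index = index
        if "Attached SCSI disk" in line:
            last_attach = index
    if vg_index is None or last_attach is None:
        return False
    return vg_index < last_attach


# Hypothetical console capture of a failing boot, reconstructed from the
# description above (not a real log; message position is assumed).
FAILED_BOOT = [
    "sd 0:2:0:0: [sda] Attached SCSI disk",
    'Volume group "vg-server" not found.',
    "sd 0:2:1:0: [sdb] Attached SCSI disk",
    "sd 0:2:2:0: [sdc] Attached SCSI disk",
    "Begin: Waiting for root file system ...",
]

if __name__ == "__main__":
    print(vg_lookup_before_last_attach(FAILED_BOOT))
```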


The problem occurs on all three servers, which each have a different
amount of RAM, a different number of disks (from 2 to 6, always arranged
in RAID-1 arrays), and a different number of processors (one Xeon L5520
CPU for one of the machines, two Xeon L5520 for the two others). The
only hardware constants are the server model (PE R610) and the SAS RAID
controller (PERC 6/i).

The computers are brand new (received in August 2009), and I made sure
the firmware was up to date (all three servers have the 1.2.6 BIOS
revision, the PERC 6/i controllers have the 6.2.0-0013 FW package, and
there doesn't seem to be any upgrade available for the Seagate
ST9300603SS disks).

In the current situation, upgrading the operating system (for example:
when 'apt-get upgrade' downloads and installs a new kernel) is a very
risky operation! We just can't use these servers in production!


I honestly can't understand what's happening here... Does anyone have an
idea? In fact, I would even be happy to know whether anyone is
successfully using the Debian Lenny amd64 distro on R610 servers...


Thanks in advance,


    K. Roussel
    INRIA-Lorraine



