Linux Debian 5 on PE R610 : random freeze at boot

PONSARD François fponsard at ecritel.net
Mon Sep 14 12:03:17 CDT 2009


Hi, 

We've tested the R610 with Debian (x86 and amd64, stable and testing) without any major problems, but we don't use LVM volumes.

Your error looks more like an LVM error.



-- 
François PONSARD :: Research Department :: Systems Engineer
Ecritel :: www.ecritel.fr
Clichy site:
7/9 rue Petit 
92582 Clichy Cedex
Groupe Euro Asian Equities

             


-----Original Message-----
From: linux-poweredge-bounces at lists.us.dell.com [mailto:linux-poweredge-bounces at lists.us.dell.com] On behalf of ROUSSEL Kévin
Sent: Monday, September 14, 2009 18:49
To: linux-poweredge at lists.us.dell.com
Subject: Linux Debian 5 on PE R610 : random freeze at boot

Hello,

I am in charge of three PowerEdge R610 servers at work, and we plan to operate them with Debian Linux v. 5 ("Lenny") in 64-bit mode ("amd64").

I managed to install the system without much problem (using the additional bnx2 firmware for the NIC), but when we began to use the servers in production, we faced a big problem.

The problem is this: whenever we reboot any of the three machines, the system hangs early in the boot process about half of the time.

To be more precise:
All three of our PE R610 servers use SAS disks (300 GB Seagate ST9300603SS) behind a PERC 6/i controller, which groups them into RAID-1 arrays. The controller presents those arrays to the Linux kernel, which in turn uses them as LVM physical volumes.

At boot, the kernel detects them in sequence, producing messages like these:

...

[    1.451160] scsi 0:0:32:0: Enclosure         DP       BACKPLANE 1.07 PQ: 0 ANSI: 5
[    1.465155] scsi 0:2:0:0: Direct-Access     DELL     PERC 6/i 1.22 PQ: 0 ANSI: 5
[    1.465167] scsi 0:2:1:0: Direct-Access     DELL     PERC 6/i 1.22 PQ: 0 ANSI: 5
[    1.465951] scsi 0:2:2:0: Direct-Access     DELL     PERC 6/i 1.22 PQ: 0 ANSI: 5
[    1.491462] Driver 'sd' needs updating - please use bus_type methods
[    1.492786] sd 0:2:0:0: [sda] 584843264 512-byte hardware sectors (299440 MB)
[    1.492808] sd 0:2:0:0: [sda] Write Protect is off
[    1.492808] sd 0:2:0:0: [sda] Mode Sense: 1f 00 10 08
[    1.492849] sd 0:2:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.492885] sd 0:2:0:0: [sda] 584843264 512-byte hardware sectors (299440 MB)
[    1.492905] sd 0:2:0:0: [sda] Write Protect is off
[    1.492905] sd 0:2:0:0: [sda] Mode Sense: 1f 00 10 08
[    1.492947] sd 0:2:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.492947]  sda: sda1 sda2 sda3
[    1.507952] sd 0:2:0:0: [sda] Attached SCSI disk
[    1.510478] sd 0:2:1:0: [sdb] 584843264 512-byte hardware sectors (299440 MB)
[    1.510486] sd 0:2:1:0: [sdb] Write Protect is off
[    1.510486] sd 0:2:1:0: [sdb] Mode Sense: 1f 00 10 08
[    1.510505] sd 0:2:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.510520] sd 0:2:1:0: [sdb] 584843264 512-byte hardware sectors (299440 MB)
[    1.510529] sd 0:2:1:0: [sdb] Write Protect is off
[    1.510529] sd 0:2:1:0: [sdb] Mode Sense: 1f 00 10 08
[    1.510548] sd 0:2:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.510548]  sdb: sdb1
[    1.520318] sd 0:2:1:0: [sdb] Attached SCSI disk
[    1.520330] sd 0:2:2:0: [sdc] 584843264 512-byte hardware sectors (299440 MB)
[    1.523459] sd 0:2:2:0: [sdc] Write Protect is off
[    1.523459] sd 0:2:2:0: [sdc] Mode Sense: 1f 00 10 08
[    1.523484] sd 0:2:2:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.523500] sd 0:2:2:0: [sdc] 584843264 512-byte hardware sectors (299440 MB)
[    1.523506] sd 0:2:2:0: [sdc] Write Protect is off
[    1.523506] sd 0:2:2:0: [sdc] Mode Sense: 1f 00 10 08
[    1.523525] sd 0:2:2:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    1.523525]  sdc: sdc1
[    1.531530] sd 0:2:2:0: [sdc] Attached SCSI disk
[    1.765212] device-mapper: uevent: version 1.0.3
[    1.765948] device-mapper: ioctl: 4.13.0-ioctl (2007-10-18) initialised: dm-devel at redhat.com
[    1.847550] PM: Starting manual resume from disk
[    1.875987] kjournald starting.  Commit interval 5 seconds
[    1.875987] EXT3-fs: mounted filesystem with ordered data mode.
[    2.959687] udevd version 125 started
[    3.117734] dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.2)

...


That is the 'dmesg' sequence recorded during a successful boot.

When the boot is going to fail, the SCSI/SAS detection sequence shown above is interrupted by a message like:
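
For comparing one boot against another, the moment each disk becomes available can be pulled out of a saved 'dmesg' capture. The following is only a minimal Python sketch: the function name attached_disks and the regular expression are illustrative assumptions based on the log lines quoted above, not part of any existing tool, and the pattern expects each message on a single line.

```python
import re

# Matches lines such as:
#   [    1.507952] sd 0:2:0:0: [sda] Attached SCSI disk
ATTACH_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+sd\s+(?P<addr>[\d:]+):\s+"
    r"\[(?P<dev>\w+)\]\s+Attached SCSI disk"
)


def attached_disks(dmesg_text):
    """Return (timestamp_in_seconds, device_name) for every attached disk."""
    found = []
    for line in dmesg_text.splitlines():
        match = ATTACH_RE.match(line.strip())
        if match:
            found.append((float(match.group("ts")), match.group("dev")))
    return found


if __name__ == "__main__":
    sample = """\
[    1.507952] sd 0:2:0:0: [sda] Attached SCSI disk
[    1.520318] sd 0:2:1:0: [sdb] Attached SCSI disk
[    1.531530] sd 0:2:2:0: [sdc] Attached SCSI disk
[    1.765212] device-mapper: uevent: version 1.0.3
"""
    for ts, dev in attached_disks(sample):
        print(f"{dev} attached at {ts:.6f} s")
```

Running the same extraction on captures from a good boot and a bad boot would show whether the disks appear later than usual when the hang occurs.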

'Volume group "vg-server" not found.'

Then, immediately after the SCSI/SAS sequence completes (in the above example, this would be just after the 'sd 0:2:2:0: [sdc] Attached SCSI disk' line), the system prints this line:

'Begin: Waiting for root file system ...'

and just drops dead! It becomes totally unresponsive (even to the "SysRq" key combinations); the only way to revive the server in this state is to turn it off and back on.
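
The failure signature described above (the volume group message appearing while disks are still being detected) can be checked mechanically in a captured console log, for instance one recorded over a serial console. This is a hedged Python sketch only: it assumes the messages appear verbatim as quoted in this mail, the function name vg_lookup_preceded_disks is hypothetical, and a positive result shows the ordering of the messages, not the cause of the hang.

```python
import re

# Message printed when the volume group cannot be found, as quoted above.
VG_NOT_FOUND_RE = re.compile(r'Volume group "(?P<vg>[^"]+)" not found')
# Kernel message printed once a disk is fully attached.
ATTACHED_RE = re.compile(r"\[(?P<dev>sd[a-z]+)\] Attached SCSI disk")


def vg_lookup_preceded_disks(console_text):
    """Return True if a 'Volume group ... not found' message appears
    before the last 'Attached SCSI disk' line in the capture."""
    lines = console_text.splitlines()
    vg_idx = [i for i, line in enumerate(lines) if VG_NOT_FOUND_RE.search(line)]
    att_idx = [i for i, line in enumerate(lines) if ATTACHED_RE.search(line)]
    if not vg_idx or not att_idx:
        # Without both kinds of message there is nothing to compare.
        return False
    return vg_idx[0] < att_idx[-1]


if __name__ == "__main__":
    failing = """\
[    1.507952] sd 0:2:0:0: [sda] Attached SCSI disk
  Volume group "vg-server" not found
[    1.531530] sd 0:2:2:0: [sdc] Attached SCSI disk
Begin: Waiting for root file system ...
"""
    print(vg_lookup_preceded_disks(failing))
```

If the check is positive on failing boots and negative on successful ones, that would be consistent with the volume group being looked up before all disks are ready.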


The problem occurs with our three servers, which all have a different amount of RAM, a different number of disks (from 2 to 6, always arranged in RAID-1 arrays), and a different number of processors (one Xeon L5520 CPU for one of the machines, two Xeon L5520 for the two others). The only hardware constants are the server model (PE R610) and the SAS RAID controller (PERC 6/i).

The computers are brand new (received in August 2009), and I made sure the firmware was up to date (all three servers have the 1.2.6 BIOS revision, the PERC 6/i controllers have the 6.2.0-0013 FW package, and there doesn't seem to be any upgrade available for the Seagate ST9300603SS disks).

In the current situation, upgrading the operating system (for example, when 'apt-get upgrade' downloads and installs a new kernel) is a very risky operation! We just can't use these servers in production!


I honestly can't understand what's happening here... Does anyone have an idea? In fact, I would even be happy to know whether anyone is successfully running the Debian Lenny amd64 distro on R610 servers...
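
To put a rough number on that risk: with a hang on about half of all reboots, the chance that all three servers come back unattended after a kernel upgrade is small. The Python sketch below is a back-of-the-envelope calculation only; it assumes the 50% figure quoted earlier and that hangs on different machines are independent, which has not been verified, and the function names are illustrative.

```python
def p_all_boot(p_hang, servers):
    """Probability that every server boots, assuming independent hangs."""
    return (1.0 - p_hang) ** servers


def p_any_hang(p_hang, servers):
    """Probability that at least one server hangs during the reboot round."""
    return 1.0 - p_all_boot(p_hang, servers)


if __name__ == "__main__":
    # Assumed inputs: ~50% hang probability per reboot, three servers.
    print(f"all three boot:    {p_all_boot(0.5, 3):.3f}")   # 0.125
    print(f"at least one hang: {p_any_hang(0.5, 3):.3f}")   # 0.875
```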


Thanks in advance,


    K. Roussel
    INRIA-Lorraine


