hanging DRAC5 cards

Alex Younts ayounts at rcac.purdue.edu
Thu Jun 10 08:58:04 CDT 2010


Greetings,

I've been seeing this issue as well on our third-generation 1950s.
We have a great many of them, spread between several data
centers. We cable up the dedicated ports on our DRACs to a management
network and use them for the remote KVM, but we don't do a lot of
monitoring of the DRAC except to ping it every hour to ensure it's
still there. Sometimes the cards can be reset or reflashed, or the
server power cycled, to bring them back to life. We see that happen a
few times a month. Other times, we can't get the cards back at all and
have to resort to warranty replacement. In fact, we just did warranty
replacements for three last week that we had finally lost all hope of
reviving.
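
A minimal sketch of an hourly reachability check like the one described
above. This is not the script actually in use here: the hostnames and
function names are hypothetical, and the ping flags assume Linux iputils
ping. Timestamping each failure may help narrow down when a card stopped
responding.

```python
import subprocess
from datetime import datetime, timezone


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to host gets a reply.

    Assumes Linux iputils ping: -c is the packet count, -W the reply
    timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_unreachable(hosts, probe=ping_once):
    """Return the hosts for which probe() reports no reply, in input order."""
    return [h for h in hosts if not probe(h)]


def report(hosts, probe=ping_once, now=None):
    """Return one timestamped line per unreachable host."""
    now = now or datetime.now(timezone.utc)
    return [f"{now.isoformat()} {h} unreachable"
            for h in find_unreachable(hosts, probe)]


if __name__ == "__main__":
    # Hypothetical DRAC hostnames on the management network.
    for line in report(["drac-node001.example.org",
                        "drac-node002.example.org"]):
        print(line)
```

Run from cron once an hour, the output could be appended to a log so that
the first failure time for each card is on record.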

Given the number of them that we have in service, a few locking up or
dying isn't a terrible problem for us. But I'd be interested in
knowing anything you find out. If you have any suggestions on
additional information I could collect for the effort when we find
dying DRACs, let me know.

-Alex

On Wed, Jun 9, 2010 at 5:59 AM, Pawel Kudzia <kudzia at gmail.com> wrote:
>> We've had a DRAC5 for a couple of years running without any problems.
>> It sounds like yours have some kind of hardware issue - were they all
>> purchased at the same time?  Maybe you should try to get one replaced
>> to see if you still have the problem.
>>
>> You could try logging in to the DRAC as proper root[1] and running dmesg
>> or something in a loop to see if any interesting messages appear just before
>> the cards break.
>
> Hello Adam, thanks a lot for your mail!
>
> sometimes i think it's bad luck or some curse... servers were bought in a
> couple of batches over a period of ~1 year. some of them are 1st gen 1950, 2950,
> some are 3rd gen 1950/2950.
>
> i know this method of gaining real root on dracs - it worked fine with 1.50
> firmware, but it does not anymore with 1.5.1:
>
> $ racadm util mode -set vendor
> ERROR: Invalid subcommand specified.
> $ racadm util mode -isvendor
> ERROR: Invalid subcommand specified.
> $ su -
>
>
> if i think reasonably, there can be the following causes:
> * dracs are faulty in general [ but your opinion and the folks at dell suggest it
>  is not the case ]
> * there is something strange about my host operating systems
>  [ but it happened on machines with esxi 4, debian, windows 2008 - so not
>   really ]
> * there is something strange about my operating environment - but then again it
>  happened to servers colocated in 2 datacenters and one office; connected
>  to dell / hp / dlink switches; some with very simple network infrastructure
> * my monitoring makes those cards misbehave - but i stopped any attempts
>  to monitor the servers and yet after a few months i discovered [when needed]
>  that the devices were hung.
>
> i got a few more diagnostics requests from dell at my private address - i will
> check what exactly the bios reports on the server with the hung drac during reboot.
> i will also run diagnostic tools that try to communicate with the hung drac
> from the host system.
>
> --
> regards,
> Pawel Kudzia / .PaKud
>
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge at dell.com
> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq
>
