Is the BMC robust enough to recover from system hangs? ipmitool unresponsive

Rahul Nabar rpnabar at gmail.com
Sun Aug 29 16:17:01 CDT 2010


I typically use out-of-band ipmitool to reboot machines that might
once in a while be unreachable via ssh remotely because something went
wrong.

Earlier today I was running a new, challenging parallel job over 32
servers and something went wrong. I suspect the nodes ran out of
memory, and after that a bunch of nodes became unresponsive. On some I
was able to use ipmitool to reboot:

/usr/bin/ipmitool -f ~/ipmi_pw -I lanplus -U root -H 172.16.0.13 power cycle

But on a bunch of them the BMC just doesn't respond to ipmitool:

/usr/bin/ipmitool -f ~/ipmi_pw -I lanplus -U root -H 172.16.0.13 power status
just hangs.
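
When sweeping many nodes, one silent BMC can stall the whole loop. A minimal sketch of a bounded probe follows, assuming GNU coreutils `timeout` is installed (it exits with status 124 when the time limit is hit). The function names `bmc_power_status` and `probe_bmcs` and the `PROBE_TIMEOUT` variable are hypothetical; the ipmitool options are the ones used in the commands above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of ipmitool. Assumes GNU coreutils `timeout`.
IPMITOOL="${IPMITOOL:-/usr/bin/ipmitool}"
IPMI_PWFILE="${IPMI_PWFILE:-$HOME/ipmi_pw}"
IPMI_USER="${IPMI_USER:-root}"
PROBE_TIMEOUT="${PROBE_TIMEOUT:-15}"   # seconds to wait per BMC (assumed value)

# Ask one BMC for its power status, giving up after PROBE_TIMEOUT seconds.
bmc_power_status() {
    timeout "$PROBE_TIMEOUT" "$IPMITOOL" -f "$IPMI_PWFILE" -I lanplus \
        -U "$IPMI_USER" -H "$1" power status
}

# Probe every address given as an argument and print one line per BMC.
probe_bmcs() {
    for host in "$@"; do
        rc=0
        out=$(bmc_power_status "$host" 2>&1) || rc=$?
        case "$rc" in
            0)   echo "$host responsive: $out" ;;
            124) echo "$host TIMEOUT after ${PROBE_TIMEOUT}s" ;;
            *)   echo "$host error (exit $rc): $out" ;;
        esac
    done
}
```

Calling `probe_bmcs 172.16.0.13 172.16.0.14` would then list which BMCs answer and which time out, instead of blocking on the first dead one.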

Isn't the whole point of the BMC the ability to connect and recover
out-of-band? Even if (worst case) my kernel panicked and the TCP stack
collapsed, shouldn't ipmitool still be able to talk to the BMC? I had
checked, before the job crashed the nodes, that the BMCs were working
and responsive to IPMI.

Am I doing something wrong here, or is this non-robustness a documented
shortcoming of BMCs? Any comments from others using BMCs / IPMI
are very welcome!

Verbose:
no clues

More verbose:
IPMI LAN host 172.16.0.14 port 623

>> Sending IPMI command payload
>>    netfn   : 0x06
>>    command : 0x38
>>    data    : 0x8e 0x04


>> Sending IPMI command payload
>>    netfn   : 0x06
>>    command : 0x38
>>    data    : 0x8e 0x04

Most verbose:
IPMI LAN host 172.16.0.14 port 623

>> Sending IPMI command payload
>>    netfn   : 0x06
>>    command : 0x38
>>    data    : 0x8e 0x04

BUILDING A v1.5 COMMAND
>> IPMI Request Session Header
>>   Authtype   : NONE
>>   Sequence   : 0x00000000
>>   Session ID : 0x00000000
>> IPMI Request Message Header
>>   Rs Addr    : 20
>>   NetFn      : 06
>>   Rs LUN     : 0
>>   Rq Addr    : 81
>>   Rq Seq     : 00
>>   Rq Lun     : 0
>>   Command    : 38

Are there any more options I can pass to ipmitool to increase
verbosity and find out exactly where the hang is occurring?
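
For reference, NetFn 0x06 with command 0x38 is the IPMI "Get Channel Authentication Capabilities" request, which is the first packet of session setup. So the traces above suggest the BMC is not answering even the initial request. The ipmitool man page describes `-v` as repeatable, with more repetitions giving more debug output. A minimal sketch for capturing such a trace to a file under a hard time limit follows, assuming GNU coreutils `timeout`. The function name `bmc_trace` and its argument order are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of ipmitool. Assumes GNU coreutils `timeout`.
IPMITOOL="${IPMITOOL:-/usr/bin/ipmitool}"
IPMI_PWFILE="${IPMI_PWFILE:-$HOME/ipmi_pw}"
IPMI_USER="${IPMI_USER:-root}"

# bmc_trace HOST LEVEL LIMIT LOGFILE
#   HOST    BMC address
#   LEVEL   how many times to repeat -v (0 means no -v at all)
#   LIMIT   seconds before the command is killed
#   LOGFILE where stdout and stderr are written
bmc_trace() {
    host=$1; level=$2; limit=$3; log=$4

    # Build the string "vvv..." with LEVEL letters.
    vflags=""
    i=0
    while [ "$i" -lt "$level" ]; do
        vflags="${vflags}v"
        i=$((i + 1))
    done

    rc=0
    # ${vflags:+-$vflags} expands to e.g. -vvv, or to nothing when LEVEL is 0.
    timeout "$limit" "$IPMITOOL" ${vflags:+-$vflags} -f "$IPMI_PWFILE" \
        -I lanplus -U "$IPMI_USER" -H "$host" power status >"$log" 2>&1 || rc=$?

    echo "$host: exit status $rc, trace in $log"
    return "$rc"
}
```

For example, `bmc_trace 172.16.0.14 5 30 /tmp/bmc14.log` would keep whatever was printed before the 30 second limit. Newer ipmitool releases also document `-R` (retry count) and `-N` (seconds between retransmissions) for the lan and lanplus interfaces; whether a given build supports them is an assumption to check against its man page.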



-- 
Rahul



More information about the Linux-PowerEdge mailing list