IPMI over LAN response time (was: IPMI over LAN under PE-x9xx machines)
Celso K. Webber
celso at webbertek.com.br
Fri Oct 6 23:15:50 CDT 2006
Hello Andrew,
I totally agree with your point of view.
Now a question to those list members that have played with IPMI on
PE-x9xx machines: did you experience a delay between sending the IPMI
command (such as "chassis power off") and the machine having the command
actually executed?
In my tests, whenever I issue:
ipmitool -I lan -H x.x.x.x -U root -P password chassis power off
the command completes successfully almost immediately, but the remote
machine takes around 3 to 5 seconds to actually power off.
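
One way to put a number on this delay is to issue the power-off and then poll "chassis power status" until the BMC reports the chassis as off. The Python below is only a rough sketch under assumptions: the function names, the 0.2 s poll interval and the 30 s timeout are made up for illustration, and the figure it reports also includes the polling interval and however long the BMC takes to update its reported status.

```python
import subprocess
import time


def make_ipmitool_runner(host, user, password):
    """Build a runner that calls ipmitool over LAN (hypothetical helper)."""
    base = ["ipmitool", "-I", "lan", "-H", host, "-U", user, "-P", password]

    def run(args):
        # Raises CalledProcessError if ipmitool exits with a non-zero status.
        result = subprocess.run(base + args, capture_output=True,
                                text=True, check=True)
        return result.stdout

    return run


def power_is_off(run):
    """True if 'chassis power status' reports 'Chassis Power is off'."""
    return "is off" in run(["chassis", "power", "status"]).lower()


def measure_power_off_delay(run, timeout=30.0, interval=0.2,
                            clock=time.monotonic, sleep=time.sleep):
    """Issue 'chassis power off', then poll until the chassis reports off.

    Returns the elapsed seconds, or None if the timeout expires first.
    The timeout and interval defaults are arbitrary illustrative values.
    """
    start = clock()
    run(["chassis", "power", "off"])
    while clock() - start < timeout:
        if power_is_off(run):
            return clock() - start
        sleep(interval)
    return None
```

With run = make_ipmitool_runner("x.x.x.x", "root", "password"), a call to measure_power_off_delay(run) returns the elapsed seconds, or None if the chassis never reports off within the timeout.
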
Is this normal on these machines? Is there a way to remove this delay?
Thank you all.
Regards,
Celso.
Mann, Andrew wrote:
> It sounds like you're testing with a situation that your setup is specifically weak against. In order to function properly, communication must be possible and reliable between the node requesting the reboot and the fence device. If you want to reboot in the event of a network failure, then your fence device must not be connected over the same network. By simultaneously disconnecting the node-node communication path and the node-fence communication path, you're creating a situation which can't be solved.
> Even if the systems shut down faster, you still have a race condition as both systems race to shut the other down.
>
> A node reboot is unlikely to correct a network related failure (such as the cables being disconnected) anyway. The closest situation which a node reboot might fix is if the node stops communicating over the Ethernet devices (due to driver error, or the system locking up). In this case, presumably, the IPMI interface would still be able to communicate over the network, and so a reboot request from the other node would be acted upon. An appropriate test for this might be an 'ifdown' on the interface of one system.
>
> If you're really worried about your entire bonded channel network going down, and you think that a node reboot might solve the problem, then you need another redundant path which the fence device can communicate over.
>
> Andrew
>
> -----Original Message-----
> From: linux-poweredge-bounces at dell.com [mailto:linux-poweredge-bounces at dell.com] On Behalf Of Celso K. Webber
> Sent: Friday, September 29, 2006 9:14 AM
> Cc: Dell Poweredge Linux List
> Subject: Re: IPMI over LAN under PE-x9xx machines
>
> Hello,
>
> I appreciate your recommendations about solving the problem with the
> Cluster.
>
> In fact, the network problem I mentioned was caused on purpose: after
> configuring the environment I usually try taking out the cables just to
> check the cluster behaviour. That is, this time I took out the network
> cables (already configured for redundancy through Linux channel bonding)
> and plugged them back two seconds later.
>
> That caused the problem. In my point of view, I believe Dell's new
> implementation of IPMI is not fast enough for its use with the Cluster,
> so I'd like to know if the delay I have noticed is on purpose or if it
> is because of some technical issues.
>
> Could someone from Dell comment on this? If there is a firmware update
> to solve this, please inform me so that we can plan ahead for an upgrade
> cycle.
>
> Regards to all,
>
> Celso.
>
> Sean Dilda wrote:
>> Celso K. Webber wrote:
>>> Thanks Johan,
>>>
>>> That's not the case here, I think. Both the normal network access,
>>> IPMI fencing, and heartbeat are channel bonded over the onboard NICs
>>> (eth0 and eth1).
>>>
>>> The problem is that if I lose network link on both interfaces for a
>>> small amount of time, both servers have the opportunity to shoot each
>>> other because of the delay between the reception of the IPMI command
>>> and its actual execution by the BMC controller.
>>>
>> It sounds like there's a core problem here that you're somehow losing
>> the network link when you shouldn't. What Johan suggested was using two
>> heartbeat paths. That way even if you lose the network link for a
>> second, they can still heartbeat over a serial connection which will
>> keep either of them from trying to shoot the other.
>>
>
--
*Celso Kopp Webber*
celso at webbertek.com.br
*Webbertek - Opensource Knowledge*
(41) 8813-1919
(41) 3284-3035