Watchdog functionality in OpenManage software.

Chris Pascoe c.pascoe at itee.uq.edu.au
Wed Nov 13 02:58:00 CST 2002


Hi Nuno,

Nuno Leitao writes:

> is there a way for third-party applications to use the watchdog
> device in Dell PowerEdge servers ?
>
> What I need to do is for the watchdog to monitor my application
> (as opposed to monitoring only the kernel/operating system) and
> to reboot if my app goes down.

As far as I can determine this functionality isn't built in, but as I had a
few spare minutes today I wrote some code to integrate it into the current
esm module.  Basically, I implemented a software watchdog inside the esm
module.  When that software watchdog expires, it suppresses the normal
wakeup ticks that are sent to the hardware watchdog, so the ESM hardware
takes its reboot action.
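
The gating idea described above can be sketched as a small user-space
model.  This is only an illustration of the logic, not the code from the
attached patch: the class name, method names and the use of seconds as the
time unit are all assumptions made for the example.

```python
class GatedWatchdog:
    """Toy model: a software watchdog that gates ticks to a hardware one.

    Hypothetical names throughout; the real logic lives in the patched
    dcesm.c, which is not reproduced here.
    """

    def __init__(self, timeout):
        self.timeout = timeout    # software watchdog timeout, in seconds
        self.last_write = 0.0     # watchdog assumed armed at t = 0
        self.expired = False      # latched once set (see the caveats)

    def _check(self, now):
        # Treat "exactly timeout seconds of silence" as expired.
        if now - self.last_write >= self.timeout:
            self.expired = True

    def app_write(self, now):
        """The application wrote to the watchdog device at time `now`."""
        self._check(now)
        if not self.expired:
            self.last_write = now

    def forward_tick(self, now):
        """Periodic wakeup: True if the hardware watchdog gets its tick."""
        self._check(now)
        return not self.expired
```

Once forward_tick() starts returning False, the hardware stops being fed
and its own recovery timer runs out.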

Attached are patches for:
/usr/lib/dell/openmanage/omsa/drivers/open_src/esm/dcesm.c
and
/usr/lib/dell/openmanage/omsa/dellomsaesm (symlinked from /etc/init.d)
that create a watchdog device, named /dev/esm0wd, which an application must
write to periodically to keep the machine alive.
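
On the application side, a keepalive loop might look like the sketch
below.  It assumes the patched module is loaded and /dev/esm0wd exists; the
function name, the `healthy` callback, the 10 second interval and the
`max_writes` limit are all made up for the example.

```python
import os
import time

WATCHDOG_DEVICE = "/dev/esm0wd"  # device name given in the post


def keep_alive(device=WATCHDOG_DEVICE, interval=10.0,
               healthy=lambda: True, max_writes=None):
    """Write to the watchdog device while the application is healthy.

    `healthy` is a hypothetical callback supplied by the application.
    `max_writes` exists only so the loop can be bounded for testing;
    leave it as None to run until `healthy` returns False.
    Returns the number of writes performed.
    """
    writes = 0
    fd = os.open(device, os.O_WRONLY)
    try:
        while max_writes is None or writes < max_writes:
            if not healthy():
                break  # stop writing and let the watchdog expire
            os.write(fd, b"\0")  # any write counts as a keepalive
            writes += 1
            if max_writes is None or writes < max_writes:
                time.sleep(interval)
    finally:
        # What the patched driver does when the device is closed is not
        # described in the post, so do not rely on it.
        os.close(fd)
    return writes
```

The interval should be comfortably shorter than the timeout configured on
the device, so that one delayed write does not trigger a restart.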

It implements the Linux Watchdog API so should behave like any other
watchdog device on Linux, with the following caveats:

* You need to set a system recovery timer and action (via OpenManage), say
by running:
  omconfig system recovery action=powercycle timer=120

* The timeout that you set on this watchdog device is added to the one set
via OpenManage.  That is, if you set the watchdog timeout to 60 seconds and
the system recovery timer to 120 seconds (as above), the machine will
actually take 180 seconds to restart if nothing writes to the /dev/esm0wd
device.

* Once the watchdog timeout has expired, there is no way out; that is, in
the above example, if 60 seconds pass without a write and then your process
writing to /dev/esm0wd starts writing again, the machine will still reboot
at time == 180 seconds.

* The no-way-out functionality isn't really no-way-out - if you disable the
system recovery action, then the system won't reboot/turn off/powercycle.
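
The timing caveats above can be put into a small worked example.  It only
follows the arithmetic given in this message (watchdog timeout plus
recovery timer, with the expiry latched); the function and parameter names
are invented, and real timings will vary a little depending on when the
last hardware tick was sent.

```python
def reboot_time(write_times, wd_timeout, recovery_timer,
                recovery_enabled=True):
    """Estimate when the machine restarts, in seconds from arming.

    `write_times` lists when the application wrote to the device; it is
    assumed to stop writing after the last entry.  Returns None if the
    system recovery action has been disabled.
    """
    last = 0.0  # watchdog assumed armed at t = 0
    for t in sorted(write_times):
        if t - last >= wd_timeout:
            break  # already expired: this and later writes change nothing
        last = t
    if not recovery_enabled:
        return None  # no recovery action, so no restart
    return last + wd_timeout + recovery_timer


# The example from the caveats: 60 s watchdog, 120 s recovery timer.
print(reboot_time([], 60, 120))          # no writes at all
print(reboot_time([90, 100], 60, 120))   # writes resume after expiry
print(reboot_time([30], 60, 120))        # one write in time, then silence
```

The first two calls give the same answer, which is the "no way out"
behaviour: writes that arrive after the 60 second expiry do not push the
restart back.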

If I get some time, I'll see if I can figure out a way around these caveats
for the future (anyone at Dell want to provide specifications for all of the
IOCTLs?).  For now, it seems to work as described above - I've given it a
few hours of testing.

(Of course there's an easy way to do this: install the software watchdog
and the watchdog RPM, and write action scripts for it that run 'omconfig
system shutdown action=powercycle osfirst=false'.  However, in the case of
a scheduler lockup that doesn't give you the option of a full power-cycle,
just a plain soft reboot.)

Regards,
Chris
--
Christopher Pascoe
IT Infrastructure Group Manager
School of Information Technology and Electrical Engineering
The University of Queensland   Brisbane  QLD  4072  Australia
Web: http://www.itee.uq.edu.au/~chrisp      Email: c.pascoe at itee.uq.edu.au
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dcesm.c.watchdog-diff
Type: application/octet-stream
Size: 6495 bytes
Desc: not available
Url : http://lists.us.dell.com/pipermail/linux-poweredge/attachments/20021113/f4e95c42/dcesm.c.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dellomsaesm.watchdog-diff
Type: application/octet-stream
Size: 526 bytes
Desc: not available
Url : http://lists.us.dell.com/pipermail/linux-poweredge/attachments/20021113/f4e95c42/dellomsaesm.obj

