semaphore leaks via RHEL3u5 WS + OMSA 5.1.0-354 + check_openmanage 3.4.9 + 'high load'

Trond Hasle Amundsen t.h.amundsen at usit.uio.no
Wed Apr 7 10:51:23 CDT 2010


Nick Silkey <nick at silkey.org> writes:

> We have quite a few production RHEL3u5 WS hosts running OMSA
> 5.1.0-354.  We find that pointing Trond Amundsen's check_openmanage
> Nagios plugin at them eventually leads to semaphore resource
> exhaustion, but only when the machine is under high load (low load ==
> semaphores come, do their job, and go as expected; high load ==
> orphaned semaphores build up over time and eventually hit kernel
> limits).
>
> Example syslog from an affected host:
>
> Apr 5 12:00:53 qweqaz Server Administrator (SMIL): Data Engine
> EventID: 0 A semaphore set has to be created but the system limit for
> the maximum number of semaphore sets has been exceeded
>
> Semaphore state on an affected machine:
>
> -bash-2.05b$ ipcs
>
> ------ Shared Memory Segments --------
> key        shmid      owner      perms      bytes      nattch     status
> 0x0001ffb8 0          root      666        76         3
> 0x00025990 32769      root      666        8308       3
> 0x00027cb9 65538      root      666        132256     1
> 0x00027cba 98307      root      666        132256     1
> 0x00027cdc 323092484  nagios    666        132256     0
> 0x00027cdd 393412613  nagios    666        132256     0
> 0x00027cde 393838598  nagios    666        132256     0
> 0x00027cdf 394231815  nagios    666        132256     0
> 0x00027ce0 413794312  nagios    666        132256     0
> 0x00027ce1 455770121  nagios    666        132256     0
> 0x00027ce2 483229706  nagios    666        132256     0
>
> ------ Semaphore Arrays --------
> key        semid      owner      perms      nsems
> 0x00000000 393216     root      666        1
> 0x00000000 425985     root      666        1
> 0x00000000 458754     root      666        1
> 0x00000000 491523     root      666        1
> 0x00000000 524292     root      666        1
> 0x00000000 557061     root      666        1
> 0x00000000 622599     root      666        1
> 0x00000000 655368     root      666        1
> 0x0001ffb8 688137     root      666        1
> 0x000251c0 720906     root      666        1
> 0x000255a8 753675     root      666        1
> 0x00025990 786444     root      666        1
> 0x00000000 1179661    root      666        1
> 0x000278d1 884750     root      666        1
> 0x00027cb9 917519     root      666        1
> 0x00000000 950288     root      666        1
> 0x00000000 983057     root      666        1
> 0x00000000 1015826    root      666        1
> 0x00000000 1048595    root      666        1
> 0x00000000 1081364    root      666        1
> 0x000278d2 1114133    root      666        1
> 0x00027cba 1146902    root      666        1
> 0x000278f4 197754904  nagios    666        1
> 0x00027cdc 197787673  nagios    666        1
> 0x000278f5 378044442  nagios    666        1
> 0x00027cdd 378077211  nagios    666        1
> 0x000278f6 379322396  nagios    666        1
> 0x00027cde 379355165  nagios    666        1
> 0x000278f7 380502046  nagios    666        1
> 0x00027cdf 380534815  nagios    666        1
> 0x000278f8 430571552  nagios    666        1
> 0x00027ce0 430604321  nagios    666        1
> 0x000278f9 538050594  nagios    666        1
> 0x00027ce1 538083363  nagios    666        1
> 0x000278fa 608436260  nagios    666        1
> 0x00027ce2 608469029  nagios    666        1
>
> ------ Message Queues --------
> key        msqid      owner      perms      used-bytes   messages
>
> Kernel proc bits on affected machines:
>
> -bash-2.05b$ cat /proc/sys/kernel/{sem,shm{all,max,mni}}
> 250	32000	32	128
> 2097152
> 33554432
> 4096
>
> This is the latest version of OMSA supported under RHEL3 per Dell Support.
>
> We are also several point releases behind the current check_openmanage
> Nagios plugin.  However, the changelog doesn't appear to indicate that
> there is a helpful bugfix.  I mention this only because we have run a
> local heap of Big Brother bash mess against local OMSA on these
> high-load hosts for years without any semaphore problems.
>
> We _could_ bump kernel limits, _or_ splay Nagios checks to prolong
> the intervals between semaphore exhaustions, _or_ have something like
> cron or cfengine sweep in and ungracefully blast the orphaned
> semaphores at a fixed interval, but we would welcome a more elegant
> fix.  Curious to know 1) whether others have experienced this bug and
> 2) how they have gotten around it.

Hi Nick,

I've been trying to reproduce this on a PowerEdge 2650 running RHEL3 U9
with OMSA 5.4.0. No problems with semaphores whatsoever: semaphores are
created, but they are cleaned up promptly. I can't pinpoint what causes
the semaphore leakage at your end, but things seem better with later
RHEL3 updates and/or newer OMSA. My guess is that OMSA is the culprit
and that upgrading would help.
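
If you want to rule out the plugin, you could compare the semaphore
count around a batch of omreport runs; check_openmanage itself never
touches SysV IPC directly. A rough, untested sketch, run as the nagios
user while the box is under load (since that is when you see the
leak):

  # count nagios-owned semaphore arrays before and after 50 runs
  before=$(ipcs -s | grep -c nagios)
  for i in $(seq 1 50); do omreport chassis >/dev/null; done
  after=$(ipcs -s | grep -c nagios)
  echo "semaphore arrays gained: $((after - before))"

If the count climbs, the leak is in OMSA and no plugin change will
make a difference.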

OMSA 5.1.0 is the latest version listed for RHEL3 on support.dell.com,
but at least on my 2650 later versions work OK. This is not surprising,
as ESX3.5 was supported with OMSA up to and including 5.5.0 IIRC, and
ESX3.5 is "based" on RHEL3.

Is OMSA 5.1.0 really the latest supported version on RHEL3? Perhaps
some Dell folks can shed some light on this...

As for check_openmanage versions... Your assumption is correct: there
are no recent changes that I can think of that are relevant to your
problem. In local mode, the plugin simply executes omreport commands to
determine the current status of the components you wish to monitor
(usually all of them). If omreport leaks semaphores, there isn't much
that check_openmanage can do about it.
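
As for the cron/cfengine sweep you mention: inelegant, but it works as
a stopgap. A rough, untested sketch, run from cron as root, that reaps
only nagios-owned objects and leaves the root-owned ones that the OMSA
data engine itself needs (column positions match your ipcs output
above). Beware that it can also hit a semaphore belonging to a check
running at that very moment, so schedule it between check intervals:

  #!/bin/bash
  # remove semaphore arrays owned by nagios
  for id in $(ipcs -s | awk '$3 == "nagios" {print $2}'); do
      ipcrm sem "$id"    # newer util-linux syntax: ipcrm -s "$id"
  done
  # remove nagios-owned shared memory segments with no processes
  # attached (nattch == 0) -- your ipcs output shows those leak too
  for id in $(ipcs -m | awk '$3 == "nagios" && $6 == 0 {print $2}'); do
      ipcrm shm "$id"    # newer util-linux syntax: ipcrm -m "$id"
  done

Bumping the kernel limits only postpones the exhaustion, but if you go
that route, the limit your syslog message complains about is SEMMNI,
the fourth field of /proc/sys/kernel/sem. The other fields can stay as
they are, since each leaked array holds just one semaphore:

  echo "250 32000 32 1024" > /proc/sys/kernel/sem

(or set kernel.sem in /etc/sysctl.conf to make it survive a reboot).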

PS. We don't monitor hardware health on our RHEL3 boxes. Luckily, RHEL3
will soon be EOL :)

Cheers,
-- 
Trond H. Amundsen <t.h.amundsen at usit.uio.no>
Center for Information Technology Services, University of Oslo


