Nagios plugin for Perc Arrays [was: poweredge hardware monitoring]

Lewis Getschel lgetschel at denver.westerngeco.slb.com
Wed Mar 2 13:34:06 CST 2005


Hmmm, it looks like the mailer wraps the lines funny...

OK, here goes: the 2nd plugin, this one for monitoring PERC arrays in
PowerVaults. Keep in mind the following:

1) This is provided AS-IS; no promises are made as to its usability.
2) This version of check_win_perc was written specifically for
Windows 2k Server (running Dell's OpenManage). I have dual CPUs. If you
don't, change accordingly.
3) I wrote these for myself to use and didn't "intend" to distribute
them (but as soon as I found this list I figured someone would want it
<smile>).
4) I'll provide what I can as far as "help", but don't expect me to
rewrite it for you.
5) If you improve this, let me know, and if it warrants it, I'll start a
'thread' on SourceForge plugin development.
6) I could use some help figuring out a Linux version (though I suspect
the trouble is that the guy who set up the Linux system didn't install
OpenManage).

BTW, I'm kinda new to writing Linux scripts, so YOU may very well find
better ways to accomplish the same things.
I also used to teach Pascal programming (in a previous life), so some of
my code may appear 'less efficient'
because I tried to write it so that others can read it more easily.

Lastly, I write notes to myself in my code, sorry if that makes me look crazy <grin>.

  (Have I written enough of a novel yet? Yes? ... I'll quit here. <wink>)

Lewis

==== cut here ======= Plugin: check_win_cpu_temp =======
#!/bin/bash  
#-x
# Script to check the Windows Dell-PERC for current status
#               
# Written by:   Lewis Getschel
# Date:         12/29/04
# Parameters:   1 - the ip address of the system to check
# Operation:    
# Limitation:   It seems that Nagios will NOT run a /bin/tcsh script at all!!
#               I had to change the script to /bin/sh (bash) to get it to even run a 3-line script
#               that just echoed $1 into the /tmp/file.
#               According to the Nagios plugin recommendations, I used absolute
#               paths to all commands.
#
# Version History:
# 12/29/2004   First try, Turned out VERY good. Keeping a temp file seemed the best way to go on this
#               This allows seeing changes. I initially didn't show the number of Global/Dedicated
#               HotSpares, but after a few minutes of monitoring, I realized that since each 
#               "at-that-time-purchased" group had different standards for how they were configured
#               I needed to see the actual numbers of spares
#
# Notes:        The "baseline" (the temp file) is never actually replaced anywhere in this code. If
#               a new baseline is desired, then simply delete the appropriate temp file. This routine
#               will create a NEW baseline (/tmp) file, and use that onward.
#   Additional note:
#               Whenever something changes on the array (Ready to Offline, etc.) 2 things happen:
#               1) Nagios goes to the critical state
#               2) Nagios will STAY that way until you delete (or rename) the 'baseline' file in /tmp.
#                  I just leave it that way until the new drive arrives, then I delete the file.
#                  I let the "new config" be the Warning state for the 1st check; that way it shows
#                  up better in the event log.
#
#  Example:
#  Using the oid for the perc adapter (from arymgr.mib)  
#  This retrieves the "Disks name as represented in Array Manager"
#   /usr/bin/snmpget -v1 -c public master0010:161 1.3.6.1.4.1.674.10893.1.1.140.2.1.2.1
# SNMPv2-SMI::enterprises.674.10893.1.1.140.2.1.2.1 = STRING: "Disk 2"
#   dvws001(dets05)12/29 10:23 /usr/lib/nagios/plugins> /usr/bin/snmpget -v1 -c public m010:161 1.3.6.1.4.1.674.10893.1.1.140.2.1.2.2
# SNMPv2-SMI::enterprises.674.10893.1.1.140.2.1.2.2 = STRING: "Disk 0"
#   dvws001(dets05)12/29 10:23 /usr/lib/nagios/plugins> /usr/bin/snmpget -v1 -c public m010:161 1.3.6.1.4.1.674.10893.1.1.140.2.1.2.3
# SNMPv2-SMI::enterprises.674.10893.1.1.140.2.1.2.3 = STRING: "Disk 1"
#
# Useful OID's (as _I_ see it <smile>)
#    1.3.6.1.4.1.674.10893.1.1.130.1.1.5.x "Status of this controllers subsystem (which includes any devices connected to it)"
#       Problem that I see, shows status at THAT moment, shows 6:Degraded while rebuild occurs
#       otherwise it shows 1:Ready (before rebuild, and after rebuild)
#
#    1.3.6.1.4.1.674.10893.1.1.110.1.0 "Global health information for the subsystem"
#    1.3.6.1.4.1.674.10893.1.1.110.2.0 "Previous Global health information for the subsystem"
#       Problem is that I don't know how previous it is (seems to be until rebooted, because on M010
#       it showed 2:Warning until I rebooted and ran Diags, then it showed 1:Normal)
#
# I've been thinking that if I keep an array of integers (i.e. "222222222222222222222222222223")
# that represent the current "status of the array disk as a spare" 
# 1.3.6.1.4.1.674.10893.1.1.130.4.1.22.x (1-30). This way I can tell:
#       1) whether HotSpares are in correct place
#       2) when they change positions
#       Problem is that I'd need to write this out somewhere to keep for compares (use OID
#               1.3.6.1.4.1.674.10893.1.1.130.3.1.7.x (1-3) "enclosure ID" (i.e. serial number))
#       Problem is that the internal enclosure is "Null"
#
# 1.3.6.1.4.1.674.10893.1.1.140.2.1.4.x (1-3) "Current state of the Disk"
# 1.3.6.1.4.1.674.10893.1.1.130.4.1.4.x (1-30) "Current state of the (individual) array disk"
#
# This next one is the 1st of several OID's that show the disk name
# 1.3.6.1.4.1.674.10893.1.1.130.4.1.2.x (1-30) "Name of the array disk represented in Array Manager"
#
# =================================== Script starts below ================================
#
systemdifferences=0
hostnam=$1
# echo $1 >> /tmp/nagios_event_debug.txt
# echo --- `date` --- >> /tmp/nagios_event_debug.txt

if [ "$#" -eq "0" ]; then
   echo "Unknown - No parameter specified"
   exit 3
fi

# These system statuses don't hold after a reboot!
# NOTE: both snmpget calls below query the same OID, so the two values will always
# match; the variables are only referenced in commented-out code further down.
currentsystemstatus=`/usr/bin/snmpget -v1 -c public $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.5.2 | awk '{print $NF}'`
previoussystemstatus=`/usr/bin/snmpget -v1 -c public $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.5.2 | awk '{print $NF}'`
total_drives=`/usr/bin/snmpwalk -c public -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.1 | tail -1 | awk '{print $NF}'`

for ((a=1; a <= total_drives ; a++))  # Double parentheses, and "total_drives" with no "$".
do
   current_disks_state[${a}]=`/usr/bin/snmpget -v1 -c public $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.4.${a} | awk '{print $NF}'`
done                           # A construct borrowed from 'ksh93'.
# current_disks_state=`snmpwalk -c public -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.4 | awk '{print $NF}'`
system_serial_number=`/usr/bin/snmpwalk -v 1 -c public $hostnam .1.3.6.1.4.1.674.10892.1.300.10.1.11 | awk '{print $NF}' | /bin/sed 's/\"//g'`

# === if there is a previousdata file for previous run, read it in.
if [ -e /tmp/${hostnam}_$system_serial_number.txt ]; then
   for ((a=1; a <= total_drives ; a++))  # Double parentheses, and "total_drives" with no "$".
   do
      previous_disks_state[${a}]=`/bin/sed -ne ${a}p /tmp/${hostnam}_$system_serial_number.txt`
   done
   previousdata=1
else # no previous file data, make it now from current (or should I make it manually as 4 3 3 3 1 ..??)
   currentdrive=1
   previousdata=0
   /bin/touch /tmp/${hostnam}_$system_serial_number.txt
   while [ $currentdrive -le $total_drives ]
   do
      echo ${current_disks_state[$currentdrive]} >> /tmp/${hostnam}_$system_serial_number.txt
      currentdrive=`expr $currentdrive + 1`
   done
   echo "WARNING - PERC array wrote first status file on dvws001 /tmp/${hostnam}_$system_serial_number.txt"
   exit 1
fi

# =========== If current status = previous status then it's OK ===================
# It seems that previous and current seem to match even when the system is rebuilding
# So it doesn't seem to be reliable to depend on this

# if [ $currentsystemstatus -eq $previoussystemstatus ]; then
totalhotspares=`/usr/bin/snmpwalk -c public -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.22 | awk '{print $NF}'| awk 'BEGIN {x=0} /3/ {++x} END {print x}'`
totaldedicatedspares=`/usr/bin/snmpwalk -c public -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.22 | awk '{print $NF}'| awk 'BEGIN {x=0} /4/ {++x} END {print x}'`
#fi
# ========= If current status != previous status then it's Broken, figure out where =============
# except for the FIRST time this script runs, this code only runs because of a mismatch in states
# it seems safe to assume that I should check each array position for where the problem is.
currentdrive=1
while [ $currentdrive -le $total_drives ]
do
   # Compare as strings so that a missing or empty baseline entry does not break the test
   if [ "${current_disks_state[$currentdrive]}" != "${previous_disks_state[$currentdrive]}" ]; then
      # HERE is where they differ
      systemdifferences=1
      echo -n `/usr/bin/snmpget -v1 -c public $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.2.$currentdrive | awk -F\" '{print $2}'`" "
      case "${current_disks_state[$currentdrive]}" in
         "0" )
            echo -n "Unknown";;
         "1" )
            echo -n "Ready"
            case "`/usr/bin/snmpget -v1 -c public $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.22.$currentdrive | awk '{print $NF}'`" in
               "1" )
                  echo -n "-member of virtual disk.";;
               "2" )
                  echo -n "-member of disk group.";;
               "3" )
                  echo -n "-global hot spare.";;
               "4" )
                  echo -n "-dedicated hot spare.";;
                * )
                  echo -n "Bad_ERROR_Code.";;
            esac;;
         "2" )
            echo -n "Failed";;
         "3" )
            echo -n "Online";;
         "4" )
            echo -n "Offline";;
         "6" )
            echo -n "Degraded";;
         "7" )
            echo -n "Recovering";;
         "11" )
            echo -n "Removed";;
         "15" )
            echo -n "Resyncing";;
         "24" )
            echo -n "Rebuild";;
         "25" )
            echo -n "No Media";;
         "26" )
            echo -n "Formatting";;
         "28" )
            echo -n "Diagnostics";;
         "35" )
            echo -n "Initializing";;
         * )
            echo -n "Bad_ERROR_Code";;
      esac
      echo -n " Was: "
      case "${previous_disks_state[$currentdrive]}" in
         "0" )
            echo -n "Unknown. ";;
         "1" )
            echo -n "Ready. ";;
         "2" )
            echo -n "Failed. ";;
         "3" )
            echo -n "Online. ";;
         "4" )
            echo -n "Offline. ";;
         "6" )
            echo -n "Degraded. ";;
         "7" )
            echo -n "Recovering. ";;
         "11" )
            echo -n "Removed. ";;
         "15" )
            echo -n "Resyncing. ";;
         "24" )
            echo -n "Rebuild. ";;
         "25" )
            echo -n "No Media. ";;
         "26" )
            echo -n "Formatting. ";;
         "28" )
            echo -n "Diagnostics. ";;
         "35" )
            echo -n "Initializing. ";;
         * )
            echo -n "Bad_ERROR_Code. ";;
      esac
   fi
   currentdrive=`expr $currentdrive + 1`
done
if [ $systemdifferences -eq 0 ];
then
   echo "OK - PERC Array Status, Global HotSpares=$totalhotspares, DedicatedSpares=$totaldedicatedspares"
   exit 0
else
   echo ""
   exit 2
fi

# echo values for debug/sanity
#echo CurrentSystemStatus = $currentsystemstatus
#echo PreviousSystemStatus = $previoussystemstatus
#echo Total Disk Drives = $total_drives
#echo Total Virtual Disks = $total_virtual_disks
#echo current_disks_state = ${current_disks_state[*]}
#echo current_disks_state-14 = ${current_disks_state[14]}
#echo system_serial_number = $system_serial_number
#echo previous_disks_state = ${previous_disks_state[*]}
==== cut here ======= Plugin: check_win_cpu_temp =======
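
The plugin above repeats the same state-code-to-name table in two case
statements and spreads the baseline comparison over several loops. As a
rough sketch only, the same logic could be factored into two bash
functions. The names perc_state_name and compare_with_baseline are made
up for illustration and are not part of the plugin, and "Slot N" stands
in for the disk name that the plugin fetches over SNMP. No SNMP calls
are made here; the current states are passed in as arguments.

```shell
#!/bin/bash
# Sketch only: both function names are hypothetical, not part of the plugin.

# Map a numeric array-disk state to the label the plugin prints.
perc_state_name() {
   case "$1" in
      0)  echo "Unknown" ;;
      1)  echo "Ready" ;;
      2)  echo "Failed" ;;
      3)  echo "Online" ;;
      4)  echo "Offline" ;;
      6)  echo "Degraded" ;;
      7)  echo "Recovering" ;;
      11) echo "Removed" ;;
      15) echo "Resyncing" ;;
      24) echo "Rebuild" ;;
      25) echo "No Media" ;;
      26) echo "Formatting" ;;
      28) echo "Diagnostics" ;;
      35) echo "Initializing" ;;
      *)  echo "Bad_ERROR_Code" ;;
   esac
}

# Compare a baseline file (one state code per line) with the current
# state codes given as the remaining arguments. Prints one line per
# changed slot; returns 2 (CRITICAL) if anything differs, else 0 (OK).
compare_with_baseline() {
   local baseline_file=$1
   shift
   local -a previous=()
   local line current
   local i=0 differences=0

   # Read the baseline into an array, one entry per line.
   while IFS= read -r line; do
      previous+=("$line")
   done < "$baseline_file"

   for current in "$@"; do
      # String comparison, so a missing baseline entry counts as a change.
      if [ "$current" != "${previous[$i]}" ]; then
         echo "Slot $((i + 1)): $(perc_state_name "$current") Was: $(perc_state_name "${previous[$i]}")"
         differences=1
      fi
      i=$((i + 1))
   done

   if [ "$differences" -eq 0 ]; then
      return 0
   fi
   return 2
}
```

For example, with a baseline file holding the three lines 3, 3 and 1,
calling "compare_with_baseline /tmp/baseline.txt 3 2 1" should print
"Slot 2: Failed Was: Online" and return 2. The return values match the
Nagios exit codes the plugin already uses (0 for OK, 2 for critical).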

pzero wrote:

> Hi,
>
> Would you be so kind to share your Nagios plugins for Dell servers HW 
> monitoring with the list?
>
> Thanks in advance.


-- 
Lewis Getschel             | Today is done...
WesternGeco                |     Today was fun...
1625 Broadway              |         Tomorrow is another one.
Denver, CO 80202           |
Direct Phone - 303-389-4407|        -- Dr. Seuss --







