[Linux-PowerEdge] hard disk predictive failures and snmp

Anthony Godford epilepticeel at yahoo.com
Fri Sep 27 13:55:41 CDT 2013


smartd - part of the smartmontools package - can certainly do similar polling and alerting. If you go that route, keep in mind you'll have to use the megaraid or sat+megaraid device type. For example, with one of my SSD-based RAID 5 arrays on an R720, I have to run:

smartctl -a /dev/sda -d sat+megaraid,0  # to query the first physical disk in the /dev/sda logical disk

smartctl -a /dev/sda -d sat+megaraid,1 # to query the second physical disk in the /dev/sda logical disk
...

(For SAS drives on the H710P, use just -d megaraid,N.)


I never found an easy or fast way to have smartd's automatic discovery (DEVICESCAN) enumerate all the drives behind an LSI MegaRAID-based PERC.
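A rough workaround, if it's useful: list each physical disk explicitly in smartd.conf, and probe for valid indices with a shell loop first. The device type, indices, and mail address below are placeholders from my own setup, so adjust them to match yours:

# /etc/smartd.conf -- DEVICESCAN won't see disks behind the PERC, so
# name each one explicitly (sat+megaraid for SATA, plain megaraid for SAS)
/dev/sda -d sat+megaraid,0 -a -m root@localhost
/dev/sda -d sat+megaraid,1 -a -m root@localhost

# quick probe to find which indices actually answer (bash)
for i in $(seq 0 15); do
  smartctl -i /dev/sda -d sat+megaraid,$i >/dev/null 2>&1 && echo "disk index $i responds"
done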


With regards to OMSA and the iDRACs, I noticed that the iDRAC's native web interface does show the status of the PERC storage arrays. Also, the OMSA utilities are able to display the same information through omreport. While RAID cards and BMCs have traditionally not spoken to each other, there's no real reason why they shouldn't, especially in controlled-configuration situations like an R720 and its PERC. All that's necessary is a reasonable amount of firmware work and a serial line between the Renesas processor of the iDRAC and the weirdo embedded LSI ARM core of the MegaRAID.


I just double-checked and it appears the iDRACs still don't feature their own SNMP server. Instead, OMSA passes information through to the host snmpd as an SMUX peer. But since IPMI isn't terribly fast, I imagine OMSA answers queries about the storage subsystem by querying the hardware directly rather than requesting data from the iDRAC.
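For what it's worth, the usual hookup is a smuxpeer line in the host's snmpd.conf; if the Dell subtree isn't answering, that's the first thing I'd confirm. The OID below is the one I believe OMSA registers, and the community string is only an example, so check both against your install:

# /etc/snmp/snmpd.conf
smuxpeer .1.3.6.1.4.1.674.10892.1

# after restarting snmpd and the OMSA services, re-walk the disk-state
# column that John's Nagios check is reading:
snmpwalk -v2c -c public localhost .1.3.6.1.4.1.674.10893.1.20.130.4.1.4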


All of this, however, doesn't really answer John's question. I'd check the OMSA log files for any interesting warnings. I'd also check, through the web interface and/or omreport, that the controller actually did notice the failure condition. I was going to check the MIB, but it appears that my Python code from a few years ago, meant to make head or tail of a MIB database, is, uhh, not working nearly as well as I remembered.
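Something along these lines, assuming controller 0 (adjust for your own layout):

omreport storage pdisk controller=0    # per-disk state as OMSA sees it
omreport system alertlog               # recent OMSA alert log entries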


If these drives are spinners, do you have a regular self-test queued on them?
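If not, smartd can schedule them; a minimal sketch for smartd.conf, with the device entry and schedule string purely as examples (short test nightly at 02:00, long test Saturdays at 03:00):

/dev/sda -d sat+megaraid,0 -a -s (S/../.././02|L/../../6/03) -m root@localhost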


Oh, also, I'd keep around one of those failed drives for testing your alerting in the future.

Good Luck
-A



________________________________
 From: Ryan Cox <ryan_cox at byu.edu>
To: linux-poweredge at dell.com 
Sent: Friday, September 27, 2013 12:38 PM
Subject: Re: [Linux-PowerEdge] hard disk predictive failures and snmp
 


You may want to look at smartd.  smartd can email you about SMART errors and run custom scripts when a problem occurs.  It works well for us.

I don't think that hard drives, RAID cards, etc. usually communicate with the BMC (aka iDRAC), so it wouldn't have that information at the hardware level. I could be wrong, but that certainly seems to be the case.

Ryan


On 09/27/2013 10:10 AM, John v2.0 wrote:

> Hello List,
>
> I have a few PowerEdge R710s running Ubuntu 12.04 and OMSA 7.1.0-1 which were reporting disks in 'predictive failure' in OMSA, but when queried via snmp they were not indicating any failure. Because of this, our Nagios checks failed to notify us that the disks were predicted to fail, and I'm trying to understand why.
>
> When looking at the Nagios check/script, I saw the OIDs it was querying and the values returned:
>
> iso.3.6.1.4.1.674.10893.1.20.130.4.1.4.1 = INTEGER: 3 (Online)
> iso.3.6.1.4.1.674.10893.1.20.130.4.1.4.2 = INTEGER: 3 (Online)
> iso.3.6.1.4.1.674.10893.1.20.130.4.1.4.3 = INTEGER: 3 (Online)
> iso.3.6.1.4.1.674.10893.1.20.130.4.1.4.4 = INTEGER: 3 (Online)
> iso.3.6.1.4.1.674.10893.1.20.130.4.1.4.5 = INTEGER: 3 (Online)
> iso.3.6.1.4.1.674.10893.1.20.130.4.1.4.6 = INTEGER: 3 (Online)  <-- this disk is in predictive failure; it should be reporting '34'
> iso.3.6.1.4.1.674.10893.1.20.130.4.1.4.7 = INTEGER: 2 (Failed)
> iso.3.6.1.4.1.674.10893.1.20.130.4.1.4.8 = INTEGER: 3 (Online)
>
> The disks have since been replaced, so I don't have a way to test this at the moment; however, I was hoping someone could provide some insight into why this may be happening.
>
> Thanks,
> John
-- 
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
http://tech.ryancox.net
_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
https://lists.us.dell.com/mailman/listinfo/linux-poweredge