OMSA continually reports power supply issues

Wayne_Weilnau at Dell.com
Thu Sep 8 23:05:53 CDT 2011


Chuck,
Sorry to see that you are having so many issues getting updates applied.  I have very limited experience with the update process for NICs, so I can't really give you any advice there.  It is highly unlikely that the NIC issues have any relationship to your power supply issues.  I suspect that if you upgrade the power supply firmware, you will see the problem go away.  (The other option would be to swap supplies with the good system to see if the problem follows the power supplies.)

Wayne Weilnau
Systems Management Technologist
Dell | OpenManage Software Development 

Please consider the environment before printing this email.

Confidentiality Notice | This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential or proprietary information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, immediately contact the sender by reply e-mail and destroy all copies of the original message.


-----Original Message-----
From: linux-poweredge-bounces-Lists On Behalf Of Chuck Anderson
Sent: Thursday, September 08, 2011 6:52 PM
To: linux-poweredge-Lists
Subject: Re: OMSA continually reports power supply issues

On Tue, Sep 06, 2011 at 11:41:42PM -0500, Wayne_Weilnau at Dell.com wrote:
> Chuck,
> The messages for event ID 1151 have a status of unknown.  My guess (without getting somebody to look at code) is that this indicates that the OMSA agent is unable to retrieve readings from the iDrac/BMC or the iDrac/BMC is unable to retrieve the reading from the power supplies.  The fact that the recovery messages come within a few minutes of the failure messages but the failure messages can be hours apart leads me to further suspect that there is a firmware bug most likely in the power supplies.  A few questions:
> 
> 1.  Are you seeing any other monitoring errors?

I'm having a bunch of issues on this one server.  OMSA's SNMP daemon
keeps crashing, and I'm also experiencing SSH freezes.  The 10GbE NIC
was showing weird errors until I finally ended up removing it after
attempting to update its firmware.

> 2.  If you look at the hardware log (SEL) via OMSA or iDrac, do you see any of these power supply events?

No events.

> 3.  If you swap power supplies with your good system, does the problem follow the power supply?

Haven't tried this yet.

> 4.  Do your working supplies have the same version of firmware?

No, they have 08.12.00.  The failing ones have 08.05.00.

> 5.  If it is possible to connect the problem system to 110V, do you still see issues?

Haven't tried this yet.

> 6.  What is the FRU data for the power supplies (manufacturer and model) on the failing system?  What about the good system?  (We may have multiple suppliers and the issue could be specific to the supplier or firmware version.)

I assume I need to look at the physical PS stickers?  I haven't done
that yet.

> 7.  What version of iDrac FW and OMSA software are you using?

iDRAC 1.70 (Build 21)
OMSA 6.5.0

> I have not seen this issue reported elsewhere, but the technical support staff is more likely to have seen this type of issue than I am.  In general, I would recommend you ensure you are at the latest iDrac and PS firmware versions.  Technical support may be able to give you more timely and accurate advice than I can, though I am not sure how receptive they will be to your request, since you are running a distro that is not officially supported.

Will contact support after I try a few more things.

What I have done so far:

update_firmware -y:

BIOS to 3.0.0 
NICs to 6.2.14 (but the BCM957711 10G SFP+ Dual Port NIC wouldn't "take" this update)
PERC 6/i to 6.3.0-001

Nothing has helped, and I think the NIC update made things worse.  I
removed the 10gig NIC to rule out any problems it might have been
causing.

I have two other R710s like this one, so I've been trying to compare them
to find what is different.  I've since tried updating the firmware on
one of the others, to see if I could reproduce these issues with that
one.  The BCM957711 in that one "took" the update to 6.2.14, but after
a reboot it reverted to firmware 5.0.13.  update_firmware offers to
load 6.2.14 on it if I run it again.  And now I'm getting "Parity
errors detected in blocks: MCP SCPAD" whenever something pokes at the
10gig NIC.  The built-in 1gig NICs took the update and work fine with
6.2.14.

I think using update_firmware to load 6.2.14 on the BCM NICs was a bad
idea.  So I went looking for even newer firmware, and support.dell.com
had 6.4.4, A03.  When I tried to load that one the first time, I got
"Unsupported update package".  But after a few reboots of trying other
things, I just tried to apply NETW_FRMW_LX_R309327.BIN again, and this
time it did the update.  Unfortunately, after rebooting, I was still
at 6.2.14 on the GbE and 5.0.13 on the 10GbE NICs.

> Wayne Weilnau
> Systems Management Technologist
> Dell | OpenManage Software Development 
> 
> 
> -----Original Message-----
> From: linux-poweredge-bounces-Lists On Behalf Of Chuck Anderson
> Sent: Tuesday, September 06, 2011 5:47 PM
> To: linux-poweredge-Lists
> Subject: Re: OMSA continually reports power supply issues
> 
> BTW, this is Scientific Linux 6.1 (RHEL 6.1 clone) with
> srvadmin-all-6.5.0-1.1.1.el6.x86_64, running on a Dell PowerEdge R710.
> 
> And I have another pretty much identical R710 with the same setup
> where this is NOT happening.  A notable difference is that one is
> running on 208V instead of 120V.
> 
> On Tue, Sep 06, 2011 at 06:41:37PM -0400, Chuck Anderson wrote:
> > OMSA is telling me both of my power supplies keep changing from 118
> > Volts input to 0 Volts input.  I've checked and rechecked the power
> > cords, reseated the power supplies, etc. but the logs still keep
> > coming in.  The iDRAC reports no issues with the power supplies.  Has
> > anyone else seen this?  Is this a software/firmware issue or some
> > real hardware issue?
> > 
> > According to iDRAC, the power supplies have firmware 08.05.00:
> > 
> > Individual Power Supply Elements
> >   Status  Location  Type  Input Wattage  Max Wattage  Online Status  FW Version
> >           PS 1      AC    1080           870          Present        08.05.00
> >           PS 2      AC    1080           870          Present        08.05.00
> > 
> > Sep  6 11:27:51 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > Sep  6 13:22:29 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000
> > Sep  6 13:26:57 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > Sep  6 13:57:04 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000
> > Sep  6 14:00:57 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > Sep  6 15:31:36 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000
> > Sep  6 15:34:41 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > Sep  6 16:03:55 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000
> > Sep  6 16:04:20 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > Sep  6 18:02:10 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000
> > Sep  6 18:04:40 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > 
> > Thanks,
> > Chuck
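
The observation in this thread that recovery messages follow failure messages within a few minutes can be checked directly against the syslog excerpt. The following is a minimal Python sketch, not anything that ships with OMSA: it pairs each EventID 1151 (value unknown) with the next EventID 1152 (returned to normal) on the same sensor and reports how long the gap lasted. The function name, the regular expression, and the assumed year (syslog timestamps carry none) are illustrative choices; the sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Sample lines copied verbatim from the syslog excerpt in this thread.
SAMPLE = [
    "Sep  6 11:27:51 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000",
    "Sep  6 13:22:29 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000",
    "Sep  6 13:26:57 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000",
    "Sep  6 13:57:04 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000",
    "Sep  6 14:00:57 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000",
]

# Hypothetical pattern written for the line layout shown above.
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) .*?EventID: (?P<id>\d+) "
    r".*?Sensor location: (?P<sensor>PS \d+) Voltage"
)


def outage_durations(lines, year=2011):
    """Return (sensor, seconds) for each 1151 -> 1152 pair, in log order.

    The year is an assumption because syslog timestamps omit it.
    """
    pending = {}   # sensor -> time of the unmatched 1151 event
    outages = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue
        stamp = " ".join(m.group("ts").split())  # collapse padded day field
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        sensor = m.group("sensor")
        if m.group("id") == "1151":
            pending[sensor] = when
        elif m.group("id") == "1152" and sensor in pending:
            delta = when - pending.pop(sensor)
            outages.append((sensor, int(delta.total_seconds())))
    return outages


if __name__ == "__main__":
    for sensor, seconds in outage_durations(SAMPLE):
        print(f"{sensor}: reading unavailable for {seconds} s")
```

A 1152 event with no preceding 1151 for the same sensor (such as the first sample line) is skipped, since the start of that gap is not in the log. Running the same function over a full /var/log/messages capture would show whether the gaps stay in the range of a few minutes.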

_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
https://lists.us.dell.com/mailman/listinfo/linux-poweredge


