Linux-PowerEdge Digest, Vol 88, Issue 12

prashant.sashidharan at wipro.com prashant.sashidharan at wipro.com
Thu Sep 8 23:08:52 CDT 2011


Hi Mario,

This is a hardware failure. The failed drive needs to be replaced by
Dell, so you will need to log a support call with them.

-----Original Message-----
From: linux-poweredge-bounces at dell.com
[mailto:linux-poweredge-bounces at dell.com] On Behalf Of
linux-poweredge-request at dell.com
Sent: Friday, September 09, 2011 9:36 AM
To: linux-poweredge at dell.com
Subject: Linux-PowerEdge Digest, Vol 88, Issue 12

Today's Topics:

   1. sdb:<3>Buffer I/O error on device sdb, logical block 0
      (Mario Chancay)
   2. Re: OMSA continually reports power supply issues (Chuck Anderson)
   3. RE: OMSA continually reports power supply issues
      (Wayne_Weilnau at Dell.com)


----------------------------------------------------------------------

Message: 1
Date: Thu, 8 Sep 2011 13:07:13 -0700 (PDT)
From: Mario Chancay <mario.chancay at yahoo.com>
Subject: sdb:<3>Buffer I/O error on device sdb, logical block 0
To: "linux-poweredge at dell.com" <linux-poweredge at dell.com>
Message-ID:
	<1315512433.95368.YahooMailNeo at web45201.mail.sp1.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Hi, we have a Dell PowerEdge R710 with 6 x 600 GB SAS drives. Today
we started to notice the following error messages:

sdb: assuming drive cache: write through
 sdb:<3>Buffer I/O error on device sdb, logical block 0
Buffer I/O error on device sdb, logical block 0
Buffer I/O error on device sdb, logical block 0
Buffer I/O error on device sdb, logical block 0
Buffer I/O error on device sdb, logical block 0
Buffer I/O error on device sdb, logical block 0
Buffer I/O error on device sdb, logical block 0
 unable to read partition table
sdb : READ CAPACITY failed.
sdb : status=0, message=00, host=7, driver=00
sdb : sense not available.
sdb: Write Protect is off
sdb: Mode Sense: 23 00 00 00
sdb: assuming drive cache: write through
sdb : READ CAPACITY failed.
sdb : status=0, message=00, host=7, driver=00
sdb : sense not available.
sdb: Write Protect is off
sdb: Mode Sense: 23 00 00 00

The front-panel LCD displays the following error message:

E1810 Hard drive 0 fault. Review & clear SEL.

We need to understand how to proceed with this kind of error message.

Regards

Mario
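
To get a quick picture of which device and block the kernel is complaining about, the repeated messages above can be tallied. This is only a minimal sketch in Python: the function name `summarize_io_errors` is made up for illustration, and it assumes the kernel log text has already been captured into a string (for example from dmesg output). It does not diagnose the drive; it only counts matching lines.

```python
import re
from collections import Counter

# Matches kernel messages such as:
#   Buffer I/O error on device sdb, logical block 0
# \s+ is used between words so that mail-wrapped log text still matches.
BUFFER_IO_ERROR = re.compile(
    r"Buffer\s+I/O\s+error\s+on\s+device\s+(\w+),\s+logical\s+block\s+(\d+)"
)


def summarize_io_errors(log_text):
    """Count buffer I/O errors per (device, logical block) pair."""
    return Counter(
        (device, int(block))
        for device, block in BUFFER_IO_ERROR.findall(log_text)
    )


if __name__ == "__main__":
    # Sample text modelled on the messages quoted in this post.
    sample = (
        "sdb: assuming drive cache: write through\n"
        "Buffer I/O error on device sdb, logical block 0\n"
        "Buffer I/O error on device sdb, logical block 0\n"
        "sdb : READ CAPACITY failed.\n"
    )
    for (device, block), count in summarize_io_errors(sample).items():
        print(f"{device} block {block}: {count} error(s)")
```

Errors that all point at logical block 0 of a single device, together with READ CAPACITY failing, suggest the device is not answering at all rather than having one bad sector, which fits the drive fault shown on the front panel.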

------------------------------

Message: 2
Date: Thu, 8 Sep 2011 19:51:45 -0400
From: Chuck Anderson <cra at WPI.EDU>
Subject: Re: OMSA continually reports power supply issues
To: linux-poweredge at lists.us.dell.com
Message-ID: <20110908235145.GV19355 at angus.ind.WPI.EDU>
Content-Type: text/plain; charset=us-ascii

On Tue, Sep 06, 2011 at 11:41:42PM -0500, Wayne_Weilnau at Dell.com wrote:
> Chuck,
> The messages for event ID 1151 have a status of unknown.  My guess
> (without getting somebody to look at code) is that this indicates that
> the OMSA agent is unable to retrieve readings from the iDrac/BMC or the
> iDrac/BMC is unable to retrieve the reading from the power supplies.
> The fact that the recovery messages come within a few minutes of the
> failure messages but the failure messages can be hours apart leads me to
> further suspect that there is a firmware bug most likely in the power
> supplies.  A few questions:
> 
> 1.  Are you seeing any other monitoring errors?

I'm having a bunch of issues on this one server.  OMSA's SNMP daemon
keeps crashing, and I'm also experiencing SSH freezes.  The 10Gb NIC was
showing weird errors until I finally ended up removing it after
attempting to update its firmware.

> 2.  If you look at the hardware log (SEL) via OMSA or iDrac, do you
> see any of these power supply events?

No events.

> 3.  If you swap power supplies with your good system, does the problem
> follow the power supply?

Haven't tried this yet.

> 4.  Do your working supplies have the same version of firmware?

No, they have 08.12.00.  The failing ones have 08.05.00.

> 5.  If it is possible to connect the problem system to 110V, do you
> still see issues?

Haven't tried this yet.

> 6.  What is the FRU data for the power supplies (manufacturer and 
> model) on the failing system?  What about the good system?  (We may 
> have multiple suppliers and the issue could be specific to the 
> supplier or firmware version.)

I assume I need to look at the physical PS stickers?  I haven't done
that yet.

> 7.  What version of iDrac FW and OMSA software are you using?

iDRAC 1.70 (Build 21)
OMSA 6.5.0

> I have not seen this issue reported elsewhere, but the technical
> support staff is more likely to have seen this type of issue than
> myself.  In general, I would recommend you ensure you are at the latest
> iDrac and PS firmware versions.  Technical support may be able to give
> you more timely and accurate advice than myself......not sure how
> receptive they will be to your request since you are running a distro
> that is not officially supported.

Will contact support after I try a few more things.

What I have done so far:

update_firmware -y:

BIOS to 3.0.0
NICs to 6.2.14 (but the BCM957711 10G SFP+ Dual Port NIC wouldn't "take"
this update)
PERC 6/i to 6.3.0-001

Nothing has helped, and I think the NIC update made things worse.  I
removed the 10gig NIC to rule out any problems it might have been
causing.

I have two other of these R710s, so I've been trying to compare them to
find what is different.  I've since tried updating the firmware on one
of the others, to see if I could reproduce these issues with that one.
The BCM957711 in that one "took" the update to 6.2.14, but after a
reboot it reverted to firmware 5.0.13.  update_firmware offers to load
6.2.14 on it if I run it again.  And now I'm getting "Parity errors
detected in blocks: MCP SCPAD" whenever something pokes at the 10gig NIC.
The built-in 1gig NICs took the update and work fine with 6.2.14.

I think using update_firmware to load 6.2.14 on the BCM NICs was a bad
idea.  So I went looking for even newer firmware, and support.dell.com
had 6.4.4, A03.  When I tried to load that one the first time, I got
"Unsupported update package".  But after a few reboots of trying other
things, I just tried to apply NETW_FRMW_LX_R309327.BIN again, and this
time it did the update.  Unfortunately, after rebooting, I was still at
6.2.14 on the GbE and 5.0.13 on the 10GbE NICs.
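
Since the versions keep reverting after reboots, it may help to check what each device reports against the version the package was supposed to install. The following is a minimal Python sketch, not a Dell tool: the helper names `parse_version` and `needs_update` are hypothetical, and the version strings are simply the ones mentioned in this thread, typed in by hand.

```python
def parse_version(text):
    """Turn a dotted version string such as '6.2.14' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def needs_update(installed, target):
    """Return True when the installed version is older than the target."""
    return parse_version(installed) < parse_version(target)


if __name__ == "__main__":
    # Versions as reported in this thread; the device labels are illustrative.
    target = "6.4.4"
    installed = {"GbE NICs": "6.2.14", "10GbE NIC": "5.0.13"}
    for name, version in installed.items():
        status = "older than" if needs_update(version, target) else "at or above"
        print(f"{name}: {version} is {status} {target}")
```

Comparing integer tuples rather than raw strings avoids mis-ordering values such as 6.2.9 and 6.2.14, and it also copes with zero-padded power supply versions like 08.05.00 and 08.12.00.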

> Wayne Weilnau
> Systems Management Technologist
> Dell | OpenManage Software Development
> 
> Please consider the environment before printing this email.
> 
> Confidentiality Notice | This e-mail message, including any
> attachments, is for the sole use of the intended recipient(s) and may
> contain confidential or proprietary information. Any unauthorized
> review, use, disclosure or distribution is prohibited. If you are not
> the intended recipient, immediately contact the sender by reply e-mail
> and destroy all copies of the original message.
> 
> 
> -----Original Message-----
> From: linux-poweredge-bounces-Lists On Behalf Of Chuck Anderson
> Sent: Tuesday, September 06, 2011 5:47 PM
> To: linux-poweredge-Lists
> Subject: Re: OMSA continually reports power supply issues
> 
> BTW, this is Scientific Linux 6.1 (RHEL 6.1 clone) with 
> srvadmin-all-6.5.0-1.1.1.el6.x86_64, running on a Dell PowerEdge R710.
> 
> And I have another pretty much identical R710 with the same setup 
> where this is NOT happening.  A notable difference is that one is 
> running on 208V instead of 120V.
> 
> On Tue, Sep 06, 2011 at 06:41:37PM -0400, Chuck Anderson wrote:
> > OMSA is telling me both of my power supplies keep changing from 118 
> > Volts input to 0 Volts input.  I've checked and rechecked the power 
> > cords, reseated the power supplies, etc. but the logs still keep 
> > coming in.  The iDRAC reports no issues with the power supplies.  
> > Has anyone else seen this?  Is this is software/firmware issue or 
> > some real hardware issue?
> > 
> > According to iDRAC, the power supplies have firmware 08.05.00:
> > 
> > Individual Power Supply Elements
> > Status   Location  Type  Input Wattage  Max Wattage  Online Status  FW Version
> >          PS 1      AC    1080           870          Present        08.05.00
> >          PS 2      AC    1080           870          Present        08.05.00
> > 
> > Sep  6 11:27:51 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > Sep  6 13:22:29 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000
> > Sep  6 13:26:57 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > Sep  6 13:57:04 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000
> > Sep  6 14:00:57 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > Sep  6 15:31:36 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000
> > Sep  6 15:34:41 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > Sep  6 16:03:55 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000
> > Sep  6 16:04:20 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 2 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > Sep  6 18:02:10 hostname Server Administrator: Instrumentation Service EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: OK (Normal) #012Voltage sensor value (in Volts): 0.000
> > Sep  6 18:04:40 hostname Server Administrator: Instrumentation Service EventID: 1152  Voltage sensor returned to a normal value #012Sensor location: PS 1 Voltage #012Chassis location: Main System Chassis #012Previous state was: Unknown #012Voltage sensor value (in Volts): 118.000
> > 
> > Thanks,
> > Chuck
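
The timing pattern discussed in this thread (recoveries within minutes, failures hours apart) can be checked directly against the quoted syslog entries. The Python sketch below is only an illustration under stated assumptions: the names `parse_events` and `outages` are hypothetical, the year 2011 is taken from the message dates because syslog timestamps carry no year, month names are assumed to be in English, and the pattern matches only the event layout shown in this post. It tolerates mail quoting ("> ") and line wrapping.

```python
import re
from datetime import datetime

# One OMSA voltage event: timestamp, event ID, and the PS sensor it refers to.
EVENT = re.compile(
    r"(\w{3} \d{1,2} \d{2}:\d{2}:\d{2}) \S+ Server Administrator: "
    r"Instrumentation Service EventID: (\d+) .*?Sensor location: (PS \d+) Voltage"
)


def parse_events(text, year=2011):
    """Return (timestamp, event_id, sensor) tuples in log order."""
    # Strip mail quote markers, then undo wrapping and repeated spaces.
    unquoted = re.sub(r"^(?:> ?)+", "", text, flags=re.MULTILINE)
    flat = " ".join(unquoted.split())
    return [
        (datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"), int(eid), sensor)
        for stamp, eid, sensor in EVENT.findall(flat)
    ]


def outages(events):
    """Pair each 1151 (value unknown) with the next 1152 (normal) per sensor."""
    pending, result = {}, []
    for when, event_id, sensor in events:
        if event_id == 1151:
            pending[sensor] = when
        elif event_id == 1152 and sensor in pending:
            start = pending.pop(sensor)
            result.append((sensor, start, when - start))
    return result


if __name__ == "__main__":
    sample = (
        "Sep  6 13:22:29 hostname Server Administrator: Instrumentation Service "
        "EventID: 1151  Voltage sensor value unknown #012Sensor location: PS 1 Voltage\n"
        "Sep  6 13:26:57 hostname Server Administrator: Instrumentation Service "
        "EventID: 1152  Voltage sensor returned to a normal value "
        "#012Sensor location: PS 1 Voltage\n"
    )
    for sensor, start, duration in outages(parse_events(sample)):
        print(f"{sensor}: unknown from {start} for {duration}")
```

Short, similar-length gaps between each 1151 and the following 1152 on both supplies would be consistent with a polling or firmware reporting problem rather than a real loss of input power, though this sketch cannot prove either.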



------------------------------

Message: 3
Date: Thu, 8 Sep 2011 23:05:53 -0500
From: <Wayne_Weilnau at Dell.com>
Subject: RE: OMSA continually reports power supply issues
To: <cra at WPI.EDU>, <linux-poweredge at lists.us.dell.com>
Message-ID:
	<07E32F241046DA418A0381C1225BFC011A94ACC055 at AUSX7MCPS301.AMER.DELL.COM>
Content-Type: text/plain; charset="us-ascii"

Chuck,
Sorry to see that you are having so many issues getting updates applied.
I have very limited experience with the update process for NICs, so I
can't really give you any advice.  It is highly unlikely that the NIC
issues have any relationship to your power supply issues.  I suspect
that if you upgrade the power supply firmware, you will see the problem
go away.  (The other option would be to swap supplies with the good
system to see if the problem follows the power supplies.)

Wayne Weilnau
Systems Management Technologist
Dell | OpenManage Software Development 



_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
https://lists.us.dell.com/mailman/listinfo/linux-poweredge



------------------------------

End of Linux-PowerEdge Digest, Vol 88, Issue 12
***********************************************
