memory test on Linux

Fred Skrotzki fskrotzki at textwise.com
Thu Oct 12 20:40:57 CDT 2006


Here is how to use the BMC to determine which memory DIMMs are bad on the 1425SC, 2850 and 2950. (Those are the only models we have, so they are the only ones I can confirm this works on.)
 
First off, list the BMC's System Event Log (SEL):
    ipmitool -I open sel list                 (if local on the box)
    ipmitool -I lan -U root .... sel list     (if not local)
 
You will then get a list of all events:
   1 | 04/16/2005 | 04:55:51 | Event Logging Disabled #0x51 | Log area reset/cleared | Asserted
   2 | Pre-Init Time-stamp   | Physical Security #0x52 | General Chassis intrusion | Asserted
   3 | Pre-Init Time-stamp   | Physical Security #0x52 | General Chassis intrusion | Deasserted
   4 | Pre-Init Time-stamp   | Power Supply #0x42 | Failure detected | Asserted
   5 | Pre-Init Time-stamp   | Power Supply #0x42 | Power Supply AC lost | Asserted
...
  58 | 09/13/2006 | 06:27:59 | Memory #0x01 | Correctable ECC | Asserted
  59 | 09/16/2006 | 14:54:27 | Memory #0x01 | Correctable ECC | Asserted
  5a | 09/27/2006 | 15:12:49 | Memory #0x01 | Correctable ECC | Asserted
  5b | 10/03/2006 | 12:56:34 | Memory #0x01 | Correctable ECC | Asserted
  5c | 10/04/2006 | 11:55:33 | Memory #0x01 | Correctable ECC | Asserted
  5d | 10/07/2006 | 12:42:21 | Memory #0x01 | Correctable ECC | Asserted
  5e | 10/08/2006 | 11:41:20 | Memory #0x01 | Correctable ECC | Asserted
  5f | 10/09/2006 | 06:50:30 | Memory #0x01 | Correctable ECC | Asserted
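
With a long log it can help to pick out the memory events in a script. Here is a minimal Python sketch, assuming the pipe-separated layout shown above. The function name memory_events is made up for illustration, and the sample text is copied from the listing above rather than read from a live BMC.

```python
def memory_events(sel_text):
    """Return (record_id, timestamp, description) for each Memory event."""
    events = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Skip anything that is not an event row (e.g. the "..." line).
        if len(fields) < 5:
            continue
        # Count from the end: "Pre-Init Time-stamp" rows have one
        # timestamp column, dated rows have two (date and time).
        sensor, description = fields[-3], fields[-2]
        if sensor.startswith("Memory"):
            timestamp = " ".join(fields[1:-3])
            events.append((fields[0], timestamp, description))
    return events


SAMPLE = """\
   4 | Pre-Init Time-stamp   | Power Supply #0x42 | Failure detected | Asserted
...
  58 | 09/13/2006 | 06:27:59 | Memory #0x01 | Correctable ECC | Asserted
  5f | 10/09/2006 | 06:50:30 | Memory #0x01 | Correctable ECC | Asserted
"""

for event in memory_events(SAMPLE):
    print(event)
```

To run it against a real log, the output of the sel list command above could be saved to a file and passed in as the sel_text string.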

Next, if you have a LONG list of events like this server does (yes, we have a memory issue too; it seems to be a general issue with 2 GB DIMMs lately), you can get more detail on each event by adding -v to the end of the command. For a long list, a command like sel list last 5 -v will dump just the last 5 events.
 
SEL Record ID          : 005b
 Record Type           : 02
 Timestamp             : 10/03/2006 12:56:34
 Generator ID          : 00b1
 EvM Revision          : 04
 Sensor Type           : Memory
 Sensor Number         : 01
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : a0f000
 Description           : Correctable ECC
SEL Record ID          : 005c
 Record Type           : 02
 Timestamp             : 10/04/2006 11:55:33
 Generator ID          : 00b1
 EvM Revision          : 04
 Sensor Type           : Memory
 Sensor Number         : 01
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : a0f000
 Description           : Correctable ECC
SEL Record ID          : 005d
 Record Type           : 02
 Timestamp             : 10/07/2006 12:42:21
 Generator ID          : 00b1
 EvM Revision          : 04
 Sensor Type           : Memory
 Sensor Number         : 01
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : a0f000
 Description           : Correctable ECC
SEL Record ID          : 005e
 Record Type           : 02
 Timestamp             : 10/08/2006 11:41:20
 Generator ID          : 00b1
 EvM Revision          : 04
 Sensor Type           : Memory
 Sensor Number         : 01
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : a0f000
 Description           : Correctable ECC
SEL Record ID          : 005f
 Record Type           : 02
 Timestamp             : 10/09/2006 06:50:30
 Generator ID          : 00b1
 EvM Revision          : 04
 Sensor Type           : Memory
 Sensor Number         : 01
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : a0f000
 Description           : Correctable ECC
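
If you want to pull the Event Data field out of many verbose records at once, something like the following Python sketch may help. It assumes the "Field : value" layout shown above, with each record starting at a "SEL Record ID" line. The function name parse_sel_verbose is hypothetical, and the sample is trimmed from the output above.

```python
def parse_sel_verbose(text):
    """Split verbose SEL output into one dict per record."""
    records = []
    for line in text.splitlines():
        # partition() splits on the FIRST colon only, so the colons
        # inside a timestamp such as 12:56:34 stay in the value.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "SEL Record ID":
            records.append({})  # a new record begins here
        if records:
            records[-1][key] = value
    return records


SAMPLE = """\
SEL Record ID          : 005b
 Timestamp             : 10/03/2006 12:56:34
 Sensor Type           : Memory
 Event Data            : a0f000
 Description           : Correctable ECC
SEL Record ID          : 005c
 Timestamp             : 10/04/2006 11:55:33
 Sensor Type           : Memory
 Event Data            : a0f000
 Description           : Correctable ECC
"""

for record in parse_sel_verbose(SAMPLE):
    print(record["SEL Record ID"], record["Timestamp"], record["Event Data"])
```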
 
Now, to decode exactly which DIMM is at fault, look at the Event Data. The fourth and sixth digits give the DIMM group and the position within the pair. So in our case (a0f000) it is DIMM group 0, first memory stick. a0f001 would be the second DIMM in group 0. a0f100 would be the first DIMM in group 1.
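
The decoding rule can be written down as a small Python sketch. This only encodes the rule described in this post, which I can confirm on the 1425SC, 2850 and 2950 only; other BMCs may lay out the Event Data differently. The function name decode_event_data is made up, and treating each digit as hexadecimal is an assumption. Both returned values count from 0, so position 0 is the first DIMM of the pair.

```python
def decode_event_data(event_data):
    """Return (group, position) from a 6-digit Event Data string.

    Rule taken from this post: the fourth digit is the DIMM group,
    the sixth digit is the position within the pair (0 = first).
    """
    digits = event_data.strip().lower()
    if len(digits) != 6:
        raise ValueError("expected 6 hex digits, got %r" % event_data)
    # Assumption: each digit is read as hexadecimal.
    group = int(digits[3], 16)     # fourth digit
    position = int(digits[5], 16)  # sixth digit
    return group, position


for sample in ("a0f000", "a0f001", "a0f100"):
    group, position = decode_event_data(sample)
    print("%s: group %d, position %d" % (sample, group, position))
```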
 
I've used exactly what I've posted above to get memory swapped out without an issue multiple times. The nice thing about this information is that it is OS independent, so you don't need to talk to a specialist in Linux support or Windows support, just anybody in hardware support. That normally lets me get in and out in around 5 minutes, or 10 minutes if I use the online chat with a tech. Some techs who have never seen it before are very impressed that I can get this specific information from the server.

	-----Original Message----- 
	From: linux-poweredge-bounces at dell.com on behalf of Artur Shnayder 
	Sent: Tue 10/10/2006 3:01 PM 
	To: linux-poweredge at dell.com 
	Cc: 
	Subject: memory test on Linux
	
	

	 Hello,

	 

	I need to test physical memory in PE1425SC. I have serial console to the computer and I run RHE4 OS. Dell provides only ISO images http://support.dell.com/support/downloads/format.aspx?c=us&cs=04&l=en&s=bsd&SystemID=PWE_XEO_1425SC&os=LIN4&osl=en&deviceid=3053&typecnt=1&libid=13&releaseid=R102626&vercnt=2. Can anyone advise a way to run Dell memory test remotely, like http://www.memtest86.com/, where special memory test kernel can be specified in boot manager.

	 

	Thanks,

	Artur

	 

	   



