Memory issues and BMC detecting them...

Fred Skrotzki fskrotzki at textwise.com
Wed May 3 11:39:22 CDT 2006


This has nothing to do directly with Linux, except that it was running
on each of the boxes mentioned below.  This is purely a hardware issue.
 
We had a 2 GB stick of memory suddenly go bad in one of our 2850s over
the weekend.  What has me a bit confused is the following.
 
We know when it started acting up, as we have an entry in the system log
and the system suddenly stopped responding at 10:16am.  But the BMC and
memory reporting didn't show the error for another 14 minutes, and then
labeled it as Memory #0x01 | Correctable ECC | Asserted.  Boy, was that
wrong: it was not correctable.
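
For anyone who wants to measure this kind of lag rather than eyeball it,
here is a rough Python sketch.  It assumes the BMC log lines use the
"id | date | time | sensor | event | state" layout (the same pipe-separated
style as the entry quoted above).  The function names and the sample date
are made up for illustration; only the 10:16 hang time and the 14 minute
gap come from our case.  It also assumes the BMC clock and the system
clock agree, which should be checked separately.

```python
from datetime import datetime

def parse_sel_line(line):
    """Split one pipe-separated log line into its fields (assumed layout)."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 6:
        raise ValueError("unexpected log line: %r" % line)
    record_id, date, time, sensor, event, state = fields[:6]
    stamp = datetime.strptime(date + " " + time, "%m/%d/%Y %H:%M:%S")
    return {"id": record_id, "time": stamp, "sensor": sensor,
            "event": event, "state": state}

def report_lag_minutes(hang_time, sel_line):
    """Minutes between the system hanging and the BMC logging the event."""
    entry = parse_sel_line(sel_line)
    return (entry["time"] - hang_time).total_seconds() / 60.0

# Hypothetical sample: the date is invented, the times match the post.
sample = "   1 | 04/30/2006 | 10:30:00 | Memory #0x01 | Correctable ECC | Asserted"
hang = datetime(2006, 4, 30, 10, 16, 0)  # time taken from the system log
lag = report_lag_minutes(hang, sample)
print("BMC logged the event %.0f minutes after the hang" % lag)
```

Running the same calculation over every memory entry in the log would
show whether the delay is constant or varies from one failure to the next.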
 
If you place this stick of memory into ANY system (another 2850, or
even a 1425SC) in the first DIMM slot and power the unit up, it starts
beeping and complaining about memory (exactly the same as if none were
installed).
 
Now, during our tests to determine which stick was bad, we used a
spare 1425SC.  If this stick was NOT in the first slot, the system would
start memory testing and then suddenly just reboot (no error showing
up in the BMC logs).  Only when we placed this stick in the first slot
and powered it up, thus getting the beeping, would it show up in the
logs, and then only after several minutes.  Now you'll ask: are you sure
it took several minutes and it was NOT just a difference between your
clock and the computer's?  I'm sure, because I did the test: I opened
the unit, installed the memory, and closed the unit in one minute, then
powered it up.  It reported 8+ minutes between the time the computer was
opened and the first detection of a memory failure; meanwhile it was
just beeping away.
 
Questions.
 
Why did it take so long to report the error?
Why was the behavior the same on both server classes?
Why did it NOT report a total failure of the memory, which is what the
real issue was?
 
Now we are concerned that we cannot trust the BMC's reporting of memory
problems because of these issues, so how can we be reassured?
 