Drive failed? or not?

Jeff Boyce jboyce at meridianenv.com
Wed Apr 8 12:28:07 CDT 2009


Greetings -

I had a hard drive fail last night, and my RAID rebuilt onto the hot spare.
Looking through my log files this morning, I am no longer sure that the
drive actually failed.  My system details are: Dell PE2600, PERC 4/Di,
RAID 5 with five 36GB drives, OMSA 5.1, RHEL 3 Update 9, all firmware
updated last week.  The relevant log excerpts are listed below, followed by
my interpretation and questions.

Embedded System Management (ESM) Log :  (sorted most recent at top)
Ok;Tue Apr  7 20:50:33 2009;Drive 5 drive slot sensor drive ok
Critical;Tue Apr  7 20:50:33 2009;Drive 5 drive slot sensor drive fault detected

/var/log/megaserv.log
[04/04/2009 (19:11:19)]:
 Adapter 0 Channel 0 Target 0:  Media Error Count=0, Other Error Count=1
[04/04/2009 (19:11:19)]:
 Adapter 0 Channel 0 Target 1:  Media Error Count=0, Other Error Count=1
[04/04/2009 (19:11:19)]:
 Adapter 0 Channel 0 Target 2:  Media Error Count=0, Other Error Count=1
[04/04/2009 (19:11:19)]:
 Adapter 0 Channel 0 Target 3:  Media Error Count=0, Other Error Count=1
[04/04/2009 (19:11:19)]:
 Adapter 0 Channel 0 Target 4:  Media Error Count=0, Other Error Count=1
[04/04/2009 (19:11:19)]:
 Adapter 0 Channel 0 Target 5:  Media Error Count=0, Other Error Count=1
[04/05/2009 (09:47:34)]:
 Adapter 0:  No of Charge Cycles = 1082
[04/06/2009 (13:37:19)]:
 Adapter 0:  No of Charge Cycles = 1083
[04/07/2009 (20:50:42)]:
 Adapter 0 Logical Drive 0  is DEGRADED.
[04/07/2009 (20:50:46)]:
 Adapter 0 Channel 0 Target 2:
  Physical Drive[HITACHI HUS151436VL3800 S3BA]is Changed to REBUILD.
[04/07/2009 (20:50:47)]:
 Adapter 0 Channel 0 Target 5:
  Physical Drive[SEAGATE ST336753LC      DX10] is Changed to READY.
 Reason_0=Fail by host. Reason_1=Select timeout.
[04/07/2009 (20:51:43)]:
 Adapter 0 Channel 0 Target 2:
  Physical Drive[HITACHI HUS151436VL3800 S3BA]: REBUILD PROGRESS 1%

/var/log/messages
Apr  7 20:51:04 Bison Server Administrator: Storage Service EventID: 2065 Physical disk Rebuild started:  Physical Disk 0:2 Controller 0, Connector 0
Apr  7 20:51:05 Bison Server Administrator: Storage Service EventID: 2196 Dedicated hot spare unassigned physical disk 0:2:  Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 4/Di)
Apr  7 20:51:05 Bison snmpd[2636]: Got trap from peer on fd 11
Apr  7 20:51:05 Bison Server Administrator: Storage Service EventID: 2123 Redundancy lost:  Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 4/Di)
Apr  7 20:51:05 Bison Server Administrator: Storage Service EventID: 2057 Virtual disk degraded:  Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 4/Di)
Apr  7 20:51:06 Bison snmpd[2636]: Got trap from peer on fd 11
Apr  7 20:51:08 Bison last message repeated 2 times
Apr  7 22:19:45 Bison Server Administrator: Storage Service EventID: 2124 Redundancy normal:  Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 4/Di)
Apr  7 22:19:45 Bison snmpd[2636]: Got trap from peer on fd 11
Apr  7 22:19:46 Bison Server Administrator: Storage Service EventID: 2092 Physical disk Rebuild completed:  Physical Disk 0:2 Controller 0, Connector 0
Apr  7 22:19:46 Bison snmpd[2636]: Got trap from peer on fd 11

I have looked through the OMSA documents and have done a fair amount of 
googling, but have not come up with any clear conclusions.

Interpretation / Questions
1.  The ESM log shows that a critical fault was detected on the drive, and
that the drive was listed as OK again within the same second.
2.  The megaserv.log shows that a few days earlier (April 4) some type of
error was recorded for each of Targets 0 through 5 (Other Error Count=1).
Does anyone know how I might identify what this error was, and whether it
is related to the fault recorded in the ESM log?
3.  The megaserv.log also shows that my failed drive (Target 5) was changed
from "online" to "ready" with the reasons "Fail by host" and "Select
timeout".  Can anyone point me to descriptions or interpretations of what
these reasons mean?
4.  The messages log shows several snmpd traps being received.  Do these
indicate errors on the drive?  How can I find a more detailed description
of these traps?
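
To compare the logs quoted above side by side, one option is to merge them
into a single timeline.  The following is only a rough sketch in Python, not
a Dell or OMSA tool: the function names (parse_megaserv, parse_esm, timeline)
are made up for illustration, and the parsing assumes the two log formats
look exactly like the excerpts in this message (a bracketed date header
followed by indented lines for megaserv.log, and "Severity;Date;Message"
for the ESM log).  The ESM date parsing also assumes an English locale.

```python
import re
from datetime import datetime

# Header line of a megaserv.log entry, e.g. "[04/07/2009 (20:50:47)]:"
MEGASERV_HEADER = re.compile(
    r"^\[(\d{2}/\d{2}/\d{4}) \((\d{2}:\d{2}:\d{2})\)\]:$"
)


def parse_megaserv(text):
    """Return (timestamp, message) pairs from megaserv.log text.

    All indented lines under one header are joined into one message.
    """
    entries = []
    stamp, lines = None, []
    for raw in text.splitlines():
        line = raw.strip()
        match = MEGASERV_HEADER.match(line)
        if match:
            if stamp is not None:
                entries.append((stamp, " ".join(lines)))
            stamp = datetime.strptime(
                " ".join(match.groups()), "%m/%d/%Y %H:%M:%S"
            )
            lines = []
        elif line and stamp is not None:
            lines.append(line)
    if stamp is not None:
        entries.append((stamp, " ".join(lines)))
    return entries


def parse_esm(text):
    """Return (timestamp, message) pairs from 'Severity;Date;Message' lines.

    Lines without two semicolons (for example a wrapped continuation
    such as a lone "detected") are skipped.
    """
    entries = []
    for raw in text.splitlines():
        parts = raw.strip().split(";", 2)
        if len(parts) != 3:
            continue
        severity, date_text, message = parts
        # Collapse the padding in dates such as "Tue Apr  7 20:50:33 2009"
        date_text = " ".join(date_text.split())
        stamp = datetime.strptime(date_text, "%a %b %d %H:%M:%S %Y")
        entries.append((stamp, f"{severity}: {message.strip()}"))
    return entries


def timeline(megaserv_text, esm_text):
    """Merge both logs into one list of (timestamp, source, message)."""
    events = [(t, "megaserv", m) for t, m in parse_megaserv(megaserv_text)]
    events += [(t, "esm", m) for t, m in parse_esm(esm_text)]
    return sorted(events)
```

Run against the excerpts above, this places the two ESM entries (20:50:33)
ahead of the megaserv.log DEGRADED entry (20:50:42) and the Target 5
"Fail by host" entry (20:50:47).  It only orders the events; it does not
explain what the reason codes mean.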

The drive that failed was a newly purchased (rebuilt) drive that I
installed about two months ago, when I expanded my RAID from 3 drives to
5 drives.  Since the supposedly failed drive was listed as "ready" in OMSA,
I have assigned it as a global hot spare for the RAID.  But without knowing
for sure what happened to this drive, can I really trust it?

Thanks for any advice and pointers you can provide.

Jeff Boyce
www.meridianenv.com


