Drive failed? or not?
Jeff Boyce
jboyce at meridianenv.com
Wed Apr 8 12:28:07 CDT 2009
Greetings -
I had a hard drive fail last night, and my RAID rebuilt to the hot spare.
Looking through my log files this morning, I am no longer sure that the
drive completely failed. My system details are: Dell PE2600, PERC 4/Di,
RAID 5 with five 36GB drives, OMSA 5.1, RHEL 3 Update 9, all firmware
updated last week. The relevant log excerpts are listed below, followed by
my interpretation and questions.
Embedded System Management (ESM) Log : (sorted most recent at top)
Ok;Tue Apr 7 20:50:33 2009;Drive 5 drive slot sensor drive ok
Critical;Tue Apr 7 20:50:33 2009;Drive 5 drive slot sensor drive fault detected
/var/log/megaserv.log
[04/04/2009 (19:11:19)]:
Adapter 0 Channel 0 Target 0: Media Error Count=0, Other Error Count=1
[04/04/2009 (19:11:19)]:
Adapter 0 Channel 0 Target 1: Media Error Count=0, Other Error Count=1
[04/04/2009 (19:11:19)]:
Adapter 0 Channel 0 Target 2: Media Error Count=0, Other Error Count=1
[04/04/2009 (19:11:19)]:
Adapter 0 Channel 0 Target 3: Media Error Count=0, Other Error Count=1
[04/04/2009 (19:11:19)]:
Adapter 0 Channel 0 Target 4: Media Error Count=0, Other Error Count=1
[04/04/2009 (19:11:19)]:
Adapter 0 Channel 0 Target 5: Media Error Count=0, Other Error Count=1
[04/05/2009 (09:47:34)]:
Adapter 0: No of Charge Cycles = 1082
[04/06/2009 (13:37:19)]:
Adapter 0: No of Charge Cycles = 1083
[04/07/2009 (20:50:42)]:
Adapter 0 Logical Drive 0 is DEGRADED.
[04/07/2009 (20:50:46)]:
Adapter 0 Channel 0 Target 2:
Physical Drive[HITACHI HUS151436VL3800 S3BA]is Changed to REBUILD.
[04/07/2009 (20:50:47)]:
Adapter 0 Channel 0 Target 5:
Physical Drive[SEAGATE ST336753LC DX10] is Changed to READY.
Reason_0=Fail by host. Reason_1=Select timeout.
[04/07/2009 (20:51:43)]:
Adapter 0 Channel 0 Target 2:
Physical Drive[HITACHI HUS151436VL3800 S3BA]: REBUILD PROGRESS 1%
/var/log/messages
Apr 7 20:51:04 Bison Server Administrator: Storage Service EventID: 2065 Physical disk Rebuild started: Physical Disk 0:2 Controller 0, Connector 0
Apr 7 20:51:05 Bison Server Administrator: Storage Service EventID: 2196 Dedicated hot spare unassigned physical disk 0:2: Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 4/Di)
Apr 7 20:51:05 Bison snmpd[2636]: Got trap from peer on fd 11
Apr 7 20:51:05 Bison Server Administrator: Storage Service EventID: 2123 Redundancy lost: Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 4/Di)
Apr 7 20:51:05 Bison Server Administrator: Storage Service EventID: 2057 Virtual disk degraded: Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 4/Di)
Apr 7 20:51:06 Bison snmpd[2636]: Got trap from peer on fd 11
Apr 7 20:51:08 Bison last message repeated 2 times
Apr 7 22:19:45 Bison Server Administrator: Storage Service EventID: 2124 Redundancy normal: Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 4/Di)
Apr 7 22:19:45 Bison snmpd[2636]: Got trap from peer on fd 11
Apr 7 22:19:46 Bison Server Administrator: Storage Service EventID: 2092 Physical disk Rebuild completed: Physical Disk 0:2 Controller 0, Connector 0
Apr 7 22:19:46 Bison snmpd[2636]: Got trap from peer on fd 11
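As a cross-check on the timeline, the rebuild time can be worked out from
the two Storage Service events in the excerpt above (EventID 2065, rebuild
started, and EventID 2092, rebuild completed). What follows is only a rough
sketch that assumes a Python 3 interpreter, so it would probably have to run
on another machine against a copy of the log. The names parse_events and
rebuild_duration are made up for illustration, and the year has to be
supplied by hand because syslog timestamps omit it.

```python
import re
from datetime import datetime

# Two lines copied from the /var/log/messages excerpt, with the mail
# client's line wrapping undone.
LINES = [
    "Apr 7 20:51:04 Bison Server Administrator: Storage Service EventID: 2065 "
    "Physical disk Rebuild started: Physical Disk 0:2 Controller 0, Connector 0",
    "Apr 7 22:19:46 Bison Server Administrator: Storage Service EventID: 2092 "
    "Physical disk Rebuild completed: Physical Disk 0:2 Controller 0, Connector 0",
]

# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> Server Administrator: ..."
EVENT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+Server Administrator: "
    r"Storage Service EventID: (?P<event_id>\d+)\s+(?P<text>.*)$"
)


def parse_events(lines, year=2009):
    """Return (timestamp, event_id, text) for each Storage Service line."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line)
        if match is None:
            continue  # snmpd traps and other lines are skipped
        when = datetime.strptime(
            f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
        )
        events.append((when, int(match.group("event_id")), match.group("text")))
    return events


def rebuild_duration(events):
    """Time between the first 2065 (started) and first 2092 (completed)."""
    start = next(when for when, event_id, _ in events if event_id == 2065)
    end = next(when for when, event_id, _ in events if event_id == 2092)
    return end - start


print(rebuild_duration(parse_events(LINES)))
```

For the two lines above this reports a rebuild time of 1:28:42.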
I have looked through the OMSA documents and have done a fair amount of
googling, but have not come up with any clear conclusions.
Interpretation / Questions
1. The ESM log shows that a critical fault was detected on drive 5 and that
the drive was listed as OK again within the same second.
2. The megaserv.log shows that on April 4 every target (0 through 5)
reported "Other Error Count=1". Does anyone know how I might identify what
this error is, and whether it is related to the fault recorded in the ESM
log?
3. The megaserv.log also shows that my failed drive (Target 5) was changed
from "online" to "ready" with the reasons "Fail by host" and "Select
timeout". Can anyone point me to a description or interpretation of what
these reasons mean?
4. The messages log shows several traps received by snmpd. Do these
indicate errors on the drive? How can I find a fuller description of these
traps?
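To keep each state change together with its reason strings (question 3),
the megaserv.log entries can be grouped by timestamp and target. This is a
minimal sketch in Python 3, assuming the log always follows the layout seen
in the excerpt above. The function state_changes and its regular expressions
are hypothetical, and they only collect the strings; they do not explain
what the controller means by them.

```python
import re

# Entries copied from the megaserv.log excerpt in this message.
SAMPLE = """\
[04/07/2009 (20:50:46)]:
Adapter 0 Channel 0 Target 2:
Physical Drive[HITACHI HUS151436VL3800 S3BA]is Changed to REBUILD.
[04/07/2009 (20:50:47)]:
Adapter 0 Channel 0 Target 5:
Physical Drive[SEAGATE ST336753LC DX10] is Changed to READY.
Reason_0=Fail by host. Reason_1=Select timeout.
"""

STAMP_RE = re.compile(r"^\[(\d{2}/\d{2}/\d{4}) \((\d{2}:\d{2}:\d{2})\)\]:$")
TARGET_RE = re.compile(r"^Adapter (\d+) Channel (\d+) Target (\d+):")
# The log sometimes omits the space before "is", hence \s*.
STATE_RE = re.compile(r"^Physical Drive\[(.+?)\]\s*is Changed to (\w+)\.")
REASON_RE = re.compile(r"Reason_\d+=([^.]+)\.")


def state_changes(text):
    """Return one dict per 'is Changed to' entry, with any reasons attached."""
    changes = []
    stamp = target = None
    for raw in text.splitlines():
        line = raw.strip()
        match = STAMP_RE.match(line)
        if match:
            # A new timestamp starts a new entry, so forget the old target.
            stamp, target = f"{match.group(1)} {match.group(2)}", None
            continue
        match = TARGET_RE.match(line)
        if match:
            target = int(match.group(3))
            continue
        match = STATE_RE.match(line)
        if match:
            changes.append({
                "when": stamp,
                "target": target,
                "drive": match.group(1),
                "state": match.group(2),
                "reasons": [],
            })
            continue
        if changes and line.startswith("Reason_"):
            # Reason lines belong to the state change just above them.
            changes[-1]["reasons"] = REASON_RE.findall(line)
    return changes


for change in state_changes(SAMPLE):
    print(change["when"], "target", change["target"],
          change["state"], change["reasons"])
```

Run against the full log, the same loop would list every state change per
target, which makes it easier to see whether Target 5 has a history of
similar events.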
The drive that failed was a newly purchased (rebuilt) drive that I
installed in the system about two months ago, when I expanded my RAID from
3 drives to 5 drives. Since the supposedly failed drive was listed as
"ready" in OMSA, I have assigned it as a global hot spare for the RAID. But
without knowing for sure what happened to this drive, can I really trust it?
Thanks for any advice and pointers you can provide.
Jeff Boyce
www.meridianenv.com
More information about the Linux-PowerEdge
mailing list