new PERC 5/i firmware crash problems

Joe Malicki jmalicki at
Wed Jan 3 18:07:56 CST 2007

After upgrading to the new 5.0.3-0001 "package build" firmware, released
12/12/06, from,
we just experienced a firmware problem that left a clear traceback.
I don't know whether this is

1) The same problem we were experiencing before, for which the new
firmware introduced debugging/a detailed error message (if this is the
case, I really do appreciate that Dell did this, since it may help to
fix these problems eventually),
2) A problem introduced by the new firmware, or
3) A preexisting problem that we never happened to experience before.

In the firmware logs at the end of this message, note that just 15
minutes after a battery relearn finished and the battery finished
charging, we see the message:

01/02/07  0:33:50: Diag Retention test is running...all activities are

This corresponds to when the megasas driver timed out SCSI commands and
the controller stopped responding.

1) Does anyone know what a "Diag Retention test" is?  Documentation
mentions "BBU Retention tests" and "NVRAM Retention tests", but not
"Diag Retention test" - is the "Diag Retention test" a synonym for one
of these, or is it something different?
2) Has anyone seen a similar failure?

Note that 4 hours after the controller went offline, a stack backtrace,
with a firmware source code file and line number, appears in the
firmware logs, which is something I wouldn't expect to happen under any
circumstances on a stable product.  The firmware also seems to drop to a
debug console.  (We haven't tried hooking up a serial port to what look
like the headers on the PERC card; we didn't experiment much the first
time it happened, as it's a production machine we wanted to get back up
quickly.)

We have previously noticed failures corresponding with patrol reads.
This failure took place several hours after a patrol read completed, and
the traceback happens within the "PatrolReadTimer" procedure.  Is this
the same failure as before?
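
For anyone comparing their own logs, here is a minimal Python sketch
that computes the intervals between the events discussed above.  The
timestamps are copied from the logs at the end of this message; the
helper names (parse_fw_time, gap) and the event labels are made up for
illustration, and the timestamp format is assumed from what these
particular log lines look like.

```python
from datetime import datetime

def parse_fw_time(stamp):
    """Parse a firmware log timestamp such as '01/02/07  0:33:50'.

    Assumes MM/DD/YY followed by a time whose hour may be space-padded.
    """
    date_part, time_part = stamp.split()
    return datetime.strptime(f"{date_part} {time_part}", "%m/%d/%y %H:%M:%S")

# Timestamps copied from the firmware log; the labels are illustrative.
EVENTS = {
    "patrol_read_complete": "01/01/07 20:16:57",
    "battery_charge_complete": "01/02/07  0:18:05",
    "diag_retention_test": "01/02/07  0:33:50",
    "taskadd_failure": "01/02/07  4:41:08",
}

def gap(earlier, later):
    """Return the time elapsed between two labelled events."""
    return parse_fw_time(EVENTS[later]) - parse_fw_time(EVENTS[earlier])

for earlier, later in [
    ("battery_charge_complete", "diag_retention_test"),
    ("diag_retention_test", "taskadd_failure"),
    ("patrol_read_complete", "diag_retention_test"),
]:
    print(f"{earlier} -> {later}: {gap(earlier, later)}")
```

This only restates the timing already visible in the logs; it says
nothing about cause.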

We don't yet have a clear reproduction case, but we are working on one
using the additional information from this crash.  We've begun remote
logging to capture the state of the machine as it's dying, since in
previous crashes syslog failed because it couldn't write to disk, which
reduced the amount of information we could get.
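
As an illustration of the remote logging idea (not necessarily how we
set it up), a minimal Python sketch using the standard library's
SysLogHandler could forward messages over UDP so they survive even when
the local disk is unreachable.  The function name, the logger name, the
"perc-watch" prefix, and the host in the usage comment are all
hypothetical.

```python
import logging
import logging.handlers

def make_remote_logger(host, port=514):
    """Return a logger that forwards records to a remote syslog daemon.

    SysLogHandler uses UDP by default, so nothing is written to the
    local disk on the sending machine.
    """
    logger = logging.getLogger("perc-crash-capture")  # hypothetical name
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    # Hypothetical prefix so the messages are easy to find on the log host.
    handler.setFormatter(logging.Formatter("perc-watch: %(message)s"))
    logger.addHandler(handler)
    return logger

# Usage (the host name is a placeholder):
#   log = make_remote_logger("loghost.example.com")
#   log.warning("megasas: command timed out")
```

UDP delivery is not guaranteed, but it needs no local storage, which is
the property that matters when the RAID controller has stopped
responding.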


Logs follow:

01/01/07 20:16:57: PR cycle complete
01/01/07 20:16:57: EVT#06277-01/01/07 20:16:57:  35=Patrol Read complete
01/01/07 20:16:57: Next PR scheduled to start at 01/02/07 18:13:20
01/01/07 21:17:01: EVT#06278-01/01/07 21:17:01:  44=Time established as
01/01/07 21:17:01; (1727059 seconds since power on)
01/01/07 21:23:40: EVT#06279-01/01/07 21:23:40: 162=Current capacity of
the battery is below threshold
01/01/07 21:23:40: EVT#06280-01/01/07 21:23:40: 195=BBU disabled;
changing WB virtual disks to WT
01/01/07 21:26:40: EVT#06281-01/01/07 21:26:40: 153=Battery relearn
01/01/07 21:26:40: Learn completed successfully
01/01/07 21:26:40: Next Learn will start on 04 01 2007

01/01/07 21:26:40:       *** BATTERY FEATURE PROPERTIES ***
01/01/07 21:26:40:  _________________________________________________

01/01/07 21:26:40:       Auto Learn Period     : 90  days
01/01/07 21:26:40:       Next Learn Time       : 228778000
01/01/07 21:26:40:       Battery ID            : 34ec019f
01/01/07 21:26:40:       Delayed Learn Interval: 0  hours from scheduled
01/01/07 21:26:40:       Next Learn cheduled on: 04 01 2007
01/01/07 21:26:40:  _________________________________________________

01/01/07 21:26:55: EVT#06282-01/01/07 21:26:55: 147=Battery started charging
01/01/07 21:26:55: EVT#06283-01/01/07 21:26:55: 162=Current capacity of
the battery is below threshold
01/01/07 21:49:40: EVT#06284-01/01/07 21:49:40: 163=Current capacity of
the battery is above threshold
01/01/07 21:49:40: EVT#06285-01/01/07 21:49:40: 194=BBU enabled;
changing WT virtual disks to WB
01/01/07 23:16:52: EVT#06286-01/01/07 23:16:52:  73=VD 00/0 Properties
updated to [ID=00,dcp=0d,ccp=0c,ap=0,dc=0,dbgi=0] (from
01/02/07  0:18:05: EVT#06287-01/02/07  0:18:05: 242=Battery charge complete
01/02/07  0:33:50: Diag Retention test is running...all activities are
01/02/07  4:41:08: TaskAdd: No more tasks available!!!
[0]: fp=a00ffde4, lr=a0885aac  -  TaskAdd+7c
[1]: fp=a00ffe00, lr=a086a3ac  -  PatrolReadTimer+fc
[2]: fp=a00ffe40, lr=a0885f2c  -  TimerISR+a4
[3]: fp=a00ffe60, lr=a088e428  -  FIQ_isr+48
[4]: fp=a00ffe88, lr=a000a848  -  dbits+1787e34
[5]: fp=a00ffe9c, lr=a000a24c  -  dbits+1787838
[6]: fp=a00ffee4, lr=a0883440  -  kbhit+48
[7]: fp=a00ffef8, lr=a0866e28  -  MonCheck+14
[8]: fp=a00fff0c, lr=a0815930  -  diagRetentionCmdBlockDone+7c
[9]: fp=a00fff34, lr=a084d630  -  CmdBlocked+1b4
[10]: fp=a00fff60, lr=a0874c28  -  set_state+278
[11]: fp=a00fff94, lr=a08748b0  -  raid_task+2f0
[12]: fp=a00fffb8, lr=a088e0b0  -  main+3b0
[13]: fp=a00fffe4, lr=a088c774  -  c_start+30
[14]: fp=a00ffffc, lr=9e8804cc  -  _start+6c
[15]: fp=a0018344, lr=a00061d0  -  dbits+17837bc
[16]: fp=a00183fc, lr=4c0  -  000004c0
MonTask: line 100 in file ../../raid/taskman.c
INTCTL=16c00000:1003dcf, IINTSRC=0:0, FINTSRC=0:0, CPSR=600000d3,

T0: LSI Logic MegaRAID firmware loaded
T0: Firmware version 1.00.02-0163 built on Nov 13 2006 at 18:32:21
T0: Board is type 1028/0015/1028/1f03

T0: Initializing 1MB memory pool
T0: LogInit: Flushing events from previous boot
T0: EVT#06288-01/02/07  4:41:08:  15=Fatal firmware error: Line 100 in

T0: EVT#06289-T0:   0=Firmware initialization started (PCI ID
T0: EVT#06290-T0:   1=Firmware version 1.00.02-0163
T0: EVT#06291-T0: 209=BBU Retention test was initiated on previous boot
T12: EVT#06292-T12: 210=BBU Retention test passed
T12: EVT#06293-T12: 212=NVRAM Retention test was initiated on previous boot
T12: EVT#06294-T12: 213=NVRAM Retention test passed
T12: Authenticating RAID key: Done!

