megaraid_sas xscale interrupt mask?
Joe Malicki
jmalicki at metacarta.com
Wed Jan 3 19:41:24 CST 2007
Hi Sumant,
While trying to debug Dell PERC 5/i RAID controller problems we've been
having with the megaraid_sas driver, we've been inspecting differences
between the Red Hat EL 4 kernel (which Dell officially supports) versus
the stock Linux 2.6.17.13 driver we use. We found a very interesting
change, introduced into Linux 2.6.16, that seems very odd to us:
http://groups.google.com/group/fa.linux.kernel/browse_frm/thread/51f889bd09bafd2d/cbbe2a30b8c2eb94?lnk=st&q=outbound_intr_mask+0x1f+0x00000001&rnum=1#cbbe2a30b8c2eb94
The thread is titled "megaraid_sas: new template defined to
represent each type of controllers", and it introduces this curious change:
/**
* megasas_disable_intr - Disables interrupts
* @regs: MFI register set
*/
static inline void
megasas_disable_intr(struct megasas_register_set __iomem * regs)
{
- u32 mask = readl(&regs->outbound_intr_mask) & (~0x00000001);
+ u32 mask = 0x1f;
writel(mask, &regs->outbound_intr_mask);
/* Dummy readl to force pci flush */
Interrupts are enabled by writing "1" to the same register.
Is there a specific reason for this? Is it possible that Dell PERC 5/i
controllers differ from LSI controllers in this respect? It seems odd
that this change would be introduced without any explanation for what
it's meant to do, so I am very curious if it could be an inadvertently
introduced bug that is causing some problems.
Thanks!
Joe Malicki
--
Joseph Malicki
Software Engineer
Metacarta, Inc.
350 Massachusetts Avenue
4th Floor
Cambridge, MA 02451 USA
email: joe.malicki at metacarta.com
http://www.metacarta.com
Joe Malicki wrote:
> After upgrading to the new 5.0.3-0001 "package build" firmware, released
> 12/12/06, from
> http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&osl=en&deviceid=9182&releaseid=R141188,
> we just experienced one firmware problem that's leaving a clear
> traceback. I don't know if this is
>
> 1) the same problem we were experiencing before, that the new firmware
> introduced debugging/a detailed error message for (if this is the case,
> I do really appreciate that Dell did this, since it may help to fix
> these problems eventually),
> 2) A problem introduced by the new firmware, or
> 3) A preexisting problem that we never happened to experience before.
>
> In the firmware logs at the end of this message, note that just 15
> minutes after a battery relearn is finished and the battery finished
> charging, we see the message:
>
> 01/02/07 0:33:50: Diag Retention test is running...all activities are
> stopped
>
> This corresponds to when the megasas driver timed out SCSI commands and
> the controller stopped responding.
>
> 1) Does anyone know what a "Diag Retention test" is? Documentation
> mentions "BBU Retention tests" and "NVRAM Retention tests", but not
> "Diag Retention test" - is the "Diag Retention test" a synonym for one
> of these, or is it something different?
> 2) Has anyone seen a similar failure?
>
> Note that 4 hours after the controller has been offline, a stack
> backtrace, with a firmware source code file and line number, appears in
> the firmware logs - which is something I wouldn't expect to happen under
> any circumstances on a stable product - and seems to drop to a debug
> console (we haven't tried hooking up a serial port to what look like the
> headers on the PERC card; we didn't experiment much the first time
> it happened, as it's a production machine we wanted to get back up quickly).
>
> We have previously noticed failures corresponding with patrol reads, and
> this failure takes place several hours later, and the traceback happens
> within the "PatrolReadTimer" procedure - is this the same failure as before?
>
> We don't yet have a clear reproduction case, but are working on it with
> additional information we have from this crash (as we've begun remote
> logging to capture the state of the machine as it's dying, since syslog
> failing because it couldn't write to disk in previous crashes lowered
> the amount of information we could get).
>
> Thanks,
> Joe
>
> Logs follow:
>
> 01/01/07 20:16:57: PR cycle complete
> 01/01/07 20:16:57: EVT#06277-01/01/07 20:16:57: 35=Patrol Read complete
> 01/01/07 20:16:57: Next PR scheduled to start at 01/02/07 18:13:20
> 01/01/07 21:17:01: EVT#06278-01/01/07 21:17:01: 44=Time established as
> 01/01/07 21:17:01; (1727059 seconds since power on)
> 01/01/07 21:23:40: EVT#06279-01/01/07 21:23:40: 162=Current capacity of
> the battery is below threshold
> 01/01/07 21:23:40: EVT#06280-01/01/07 21:23:40: 195=BBU disabled;
> changing WB virtual disks to WT
> 01/01/07 21:26:40: EVT#06281-01/01/07 21:26:40: 153=Battery relearn
> completed
> 01/01/07 21:26:40: Learn completed successfully
> 01/01/07 21:26:40: Next Learn will start on 04 01 2007
>
> 01/01/07 21:26:40: *** BATTERY FEATURE PROPERTIES ***
> 01/01/07 21:26:40: _________________________________________________
>
> 01/01/07 21:26:40: Auto Learn Period : 90 days
> 01/01/07 21:26:40: Next Learn Time : 228778000
> 01/01/07 21:26:40: Battery ID : 34ec019f
> 01/01/07 21:26:40: Delayed Learn Interval: 0 hours from scheduled
> time
> 01/01/07 21:26:40: Next Learn cheduled on: 04 01 2007
> 01/01/07 21:26:40: _________________________________________________
>
> 01/01/07 21:26:55: EVT#06282-01/01/07 21:26:55: 147=Battery started charging
> 01/01/07 21:26:55: EVT#06283-01/01/07 21:26:55: 162=Current capacity of
> the battery is below threshold
> 01/01/07 21:49:40: EVT#06284-01/01/07 21:49:40: 163=Current capacity of
> the battery is above threshold
> 01/01/07 21:49:40: EVT#06285-01/01/07 21:49:40: 194=BBU enabled;
> changing WT virtual disks to WB
> 01/01/07 23:16:52: EVT#06286-01/01/07 23:16:52: 73=VD 00/0 Properties
> updated to [ID=00,dcp=0d,ccp=0c,ap=0,dc=0,dbgi=0] (from
> [ID=00,dcp=0c,ccp=0c,ap=0,dc=0,dbgi=0])
> 01/02/07 0:18:05: EVT#06287-01/02/07 0:18:05: 242=Battery charge complete
> 01/02/07 0:33:50: Diag Retention test is running...all activities are
> stopped
> 01/02/07 4:41:08: TaskAdd: No more tasks available!!!
> [0]: fp=a00ffde4, lr=a0885aac - TaskAdd+7c
> [1]: fp=a00ffe00, lr=a086a3ac - PatrolReadTimer+fc
> [2]: fp=a00ffe40, lr=a0885f2c - TimerISR+a4
> [3]: fp=a00ffe60, lr=a088e428 - FIQ_isr+48
> [4]: fp=a00ffe88, lr=a000a848 - dbits+1787e34
> [5]: fp=a00ffe9c, lr=a000a24c - dbits+1787838
> [6]: fp=a00ffee4, lr=a0883440 - kbhit+48
> [7]: fp=a00ffef8, lr=a0866e28 - MonCheck+14
> [8]: fp=a00fff0c, lr=a0815930 - diagRetentionCmdBlockDone+7c
> [9]: fp=a00fff34, lr=a084d630 - CmdBlocked+1b4
> [10]: fp=a00fff60, lr=a0874c28 - set_state+278
> [11]: fp=a00fff94, lr=a08748b0 - raid_task+2f0
> [12]: fp=a00fffb8, lr=a088e0b0 - main+3b0
> [13]: fp=a00fffe4, lr=a088c774 - c_start+30
> [14]: fp=a00ffffc, lr=9e8804cc - _start+6c
> [15]: fp=a0018344, lr=a00061d0 - dbits+17837bc
> [16]: fp=a00183fc, lr=4c0 - 000004c0
> MonTask: line 100 in file ../../raid/taskman.c
> INTCTL=16c00000:1003dcf, IINTSRC=0:0, FINTSRC=0:0, CPSR=600000d3,
> sp=a00ffb28
> MegaMon>
>
> T0: LSI Logic MegaRAID firmware loaded
> T0: Firmware version 1.00.02-0163 built on Nov 13 2006 at 18:32:21
> T0: Board is type 1028/0015/1028/1f03
>
> T0: Initializing 1MB memory pool
> T0: LogInit: Flushing events from previous boot
> T0: EVT#06288-01/02/07 4:41:08: 15=Fatal firmware error: Line 100 in
> ../../raid/taskman.c
>
> T0: EVT#06289-T0: 0=Firmware initialization started (PCI ID
> 0015/1028/1f03/1028)
> T0: EVT#06290-T0: 1=Firmware version 1.00.02-0163
> T0: EVT#06291-T0: 209=BBU Retention test was initiated on previous boot
> T12: EVT#06292-T12: 210=BBU Retention test passed
> T12: EVT#06293-T12: 212=NVRAM Retention test was initiated on previous boot
> T12: EVT#06294-T12: 213=NVRAM Retention test passed
> T12: Authenticating RAID key: Done!
>