rejecting I/O to offline device (PERC woes)

Kurt_Olsson at Dell.com
Tue Apr 22 09:50:14 CDT 2008


Please attempt to run a consistency check on the logical disk (RAID set)
that is made up of the disks that show the fault(s).

-----Original Message-----
From: linux-poweredge-bounces at dell.com
[mailto:linux-poweredge-bounces at dell.com] On Behalf Of Curtis H. Wilbar Jr.
Sent: Tuesday, April 22, 2008 9:22 AM
To: linux-poweredge-Lists
Subject: Re: rejecting I/O to offline device (PERC woes)

On Mon, 2008-04-21 at 17:50, Tino Schwarze wrote:
> On Mon, Apr 21, 2008 at 05:38:21PM -0400, Curtis H. Wilbar Jr. wrote:
> 
> > Haven't gotten any tips on a solution to the problem below.
> > It happened again this weekend.
> > 
> > My next test steps (order not determined):
> > 
> > 1. Downgrade to CentOS 4/RHEL 4
> > 2. Swap out PERC controller with a spare
> > 
> > I have never had a problem with the PERC4/DC controllers on our
> > other machines (RHEL3/4, CentOS 4).  Although, I've no other
> > machine that has 5 300G Fujitsu SCSI drives either.
> > 
> > Any suggestions on the below, or which order on the above to
> > try ?
> > 
> > Thanks,
> > 
> > -- Curt
> > 
> > -------------------------------
> > 
> > I have a 6650 with a PERC4/DC running CentOS5.
> > 
> > After 1 to 3 weeks of operation (running VMWare Server) it
> > 'dies' (raid array gets taken offline) and you get rejecting
> > I/O to offline device.
> > 
> > When this system was setup late last year, the 6650 was
> > given all the latest firmware along with the PERC4/DC.
> > 
> > using linttylog, the last entries from when the system must
> > have 'checked out' last night, I see the data attached below.
> > 
> > Some time back I thought I had cured this problem by adding
> > noapic to the kernel boot parameters in boot.conf.  It had
> > gone away for a long time... but is now back.
> > 
> > according to lintty, it reports controller firmware is:
> > 
> > T0: Firmware version 352D build on Mar 19 2007 at 17:43:23
> > T0: MegaRAID Series 518 firmware version 352D
> > 
> > using strings tty.log | grep 'MedErr on pd' | cut -c17- | sort |
> > uniq -c | sort -n I see:
> > 
> >     163 REC:log MedErr on pd[1] #retries=0
> >     165 REC:log MedErr on pd[4] #retries=0
> >     168 REC:log MedErr on pd[2] #retries=0
> > 
> > If I am to believe this, Patrol read is finding media errors on
> > physical drives 1, 2, and 4 ! ?
> > 
> > These drives are not even a year old, and to have an almost even
> > distribution of errors across 3 drives seems far fetched (unless
> > patrol read is reading past the end of drive ?, but then it would
> > be doing that with all 5 drives).
> > 
> > Is the PERC busted ?  driver issue ?
> > 
> > I'm running CentOS 5 with kernel 2.6.18-53.1.13.el5PAE
> > 
> > from dmesg, megaraid related driver versions:
> > 
> > megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
> > megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
> > 
> > Anyone seen this behavior before ?  Anyone have a solution ?
> > We have several Dells in a hosting environment with PERC4/DC
> > running RHEL3, RHEL4.X, and CentOS4.X.  We have not had this
> > issue on any of them (though they do not have 5 300G Fujitsu
> > SCSI drives in a RAID 5 config either (as this one does)).
> > 
> > Hoping someone can shed some light on this... so far I keep
> > coming up short on finding a solution.
> > 
> > Here is the full content of the last lines recorded in the PERC
> > as pulled by linttylog:
> > 
> > 03/24 21:43:41: Next PR scheduled to start at 03/25  0:00:00
> > 03/25  3:47:41: REC:log MedErr on pd[4] #retries=0
> > 03/25  3:47:41: LogSense: pd=04, cdb=2f 00 14 30 03 12 00 ff ff 00
> > 03/25  3:47:41:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
> > 03/25  3:47:41: REC: MedErr on LD[4] BadLba=143014e6
> > 03/25  3:47:41: prCallback: Medium Error on pd=04, StartLba=14300312, ErrLba=143014e6
> > 03/25  3:47:42: prRecQueue: starting pd=04 recovery - blocking host commands
> [...]
> 
> I've had a similar issue with an external RAID last year. IIRC, it was a
> PERC3/Di. The problem was that the external RAID took too long to
> recover from media errors therefore the whole thing got disconnected.
> 
> The solution was to turn off disconnect in the BIOS although I've got no
> idea why this helped - I just tried it out of despair. This is an old
> SuSE box, though, using the aacraid driver.
> 
> I'd replace the failing disk ASAP anyway.

I'd have to replace all of the disks.  The PERC4/DC only has
a predictive failure count for drives.  I find it hard to believe
that 4-5 drives less than 6 months old would all be meeting their
demise... and I had this problem from the start.  
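
The sense bytes in the controller log quoted above can be decoded by hand
to see what the drive itself reported. A minimal sketch in Python, assuming
the bytes follow the standard SCSI fixed-format sense layout (sense key in
byte 2, information field in bytes 3-6, ASC/ASCQ in bytes 12-13) and
assuming 512-byte sectors. The function name decode_fixed_sense and the
small lookup tables are made up for illustration; they are not part of
linttylog or any PERC tool.

```python
# Decode a SCSI fixed-format sense buffer as printed in the PERC log.
# Assumptions: fixed format (response code 0x70/0x71), 512-byte sectors.

SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
}

# Only the two codes relevant here; the full table is in the SCSI standard.
ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x11, 0x01): "READ RETRIES EXHAUSTED",
}


def decode_fixed_sense(hex_bytes):
    """Return the interesting fields of a fixed-format sense buffer."""
    b = bytes.fromhex(hex_bytes)  # whitespace between byte pairs is ignored
    if len(b) < 14:
        raise ValueError("sense buffer too short")
    if (b[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = b[2] & 0x0F
    asc, ascq = b[12], b[13]
    return {
        "info_valid": bool(b[0] & 0x80),          # top bit of byte 0
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "lba": int.from_bytes(b[3:7], "big"),     # information field
        "asc_ascq": ASC_ASCQ.get((asc, ascq), f"{asc:02x}/{ascq:02x}"),
    }


# The sense line logged for pd=04 on 03/25 3:47:41.
sense = "f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f"
d = decode_fixed_sense(sense)
print(d["sense_key"], d["asc_ascq"], hex(d["lba"]))
print("approx. offset in GB:", d["lba"] * 512 / 1e9)
```

Decoded this way, the LBA in the information field is 143014e6, the same
value the controller printed as ErrLba. At 512 bytes per sector that is
roughly 173 GB into the disk, so this particular error is not a read past
the end of a 300G drive.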

Also, the drives with errors have almost matching error counts...
I find the odds of that happening by chance close to nil.
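
One way to dig further into the matching counts is to tally not just the
MedErr lines per drive but also the distinct ErrLba values per drive, to
see whether each drive has many bad sectors or the same few being hit over
and over. A rough sketch in Python, assuming the log text (for example the
output of strings tty.log) has the two line formats quoted above; the
function name summarize and the sample lines are illustrative only.

```python
import re
from collections import Counter, defaultdict

# Patterns based on the two log line formats quoted in this thread.
MEDERR = re.compile(r"REC:log MedErr on pd\[(\d+)\]")
ERRLBA = re.compile(
    r"Medium Error on pd=(\d+),\s*StartLba=[0-9a-fA-F]+,\s*ErrLba=([0-9a-fA-F]+)"
)


def summarize(lines):
    """Count MedErr entries and collect distinct error LBAs per drive."""
    counts = Counter()
    lbas = defaultdict(set)
    for line in lines:
        m = MEDERR.search(line)
        if m:
            counts[int(m.group(1))] += 1
        m = ERRLBA.search(line)
        if m:
            lbas[int(m.group(1))].add(int(m.group(2), 16))
    return counts, lbas


# Two lines taken from the log excerpt; a real run would read the whole log.
sample = [
    "03/25  3:47:41: REC:log MedErr on pd[4] #retries=0",
    "03/25  3:47:41: prCallback: Medium Error on pd=04, "
    "StartLba=14300312, ErrLba=143014e6",
]
counts, lbas = summarize(sample)
for pd in sorted(counts):
    print(f"pd[{pd}]: {counts[pd]} MedErr entries, "
          f"{len(lbas[pd])} distinct ErrLba values")
```

If the distinct LBA count per drive turns out to be much smaller than the
MedErr count, the near-identical totals may just reflect how many patrol
read passes have run, rather than three drives failing at the same rate.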

I did add noapic to the boot options for the kernel when I first
found it (based on a recommendation from a thread out on the net)
and it initially helped.

I don't see any such option on the PERC4/DC (LSI MegaRAID).
I gather the PERC3 is Adaptec, probably with completely
different options.

I'm really at a loss. The only things I can guess at are:

1. flaky controller
2. controller/drive firmware incompatibilities
3. bug in either megaraid drivers or sd drivers that pops out under
    specific conditions

Since the drives are not 'Dell' drives and I don't see firmware
for Fujitsu 300G drives, I can't test #2.  That leaves me with
figuring out how to swap in a spare PERC4/DC that we have (I've
never had to do that) to test #1, and downgrading to CentOS 4 or
RHEL4 to see if that might confirm #3.

I do find a number of occurrences of this on the net, but nowhere
have I found a definitive "this was the problem, this was the fix".

-- Curt


> 
> HTH,
> 
> Tino.

_______________________________________

Curtis Wilbar ~  Senior Systems and Network Administrator

125 CambridgePark Drive
Cambridge, MA 02140
e: curtis at cmarket.com
t: 617.252.6415
f: 617.374.9015
www.cmarket.com
www.biddingforgood.com


