rejecting I/O to offline device (PERC woes)
Kurt_Olsson at Dell.com
Tue Apr 22 09:50:14 CDT 2008
Please attempt to run a consistency check on the logical disk (RAID set)
that comprises the disks showing the fault(s).
-----Original Message-----
From: linux-poweredge-bounces at dell.com
[mailto:linux-poweredge-bounces at dell.com] On Behalf Of Curtis H. Wilbar Jr.
Sent: Tuesday, April 22, 2008 9:22 AM
To: linux-poweredge-Lists
Subject: Re: rejecting I/O to offline device (PERC woes)
On Mon, 2008-04-21 at 17:50, Tino Schwarze wrote:
> On Mon, Apr 21, 2008 at 05:38:21PM -0400, Curtis H. Wilbar Jr. wrote:
>
> > Haven't gotten any tips on a solution to the problem below.
> > It happened again this weekend.
> >
> > My next test steps (order not determined):
> >
> > 1. Downgrade to CentOS 4/RHEL 4
> > 2. Swap out PERC controller with a spare
> >
> > I have never had a problem with the PERC4/DC controllers on our
> > other machines (RHEL3/4, CentOS 4). Although, I've no other
> > machine that has 5 300G Fujitsu SCSI drives either.
> >
> > Any suggestions on the below, or which order on the above to
> > try ?
> >
> > Thanks,
> >
> > -- Curt
> >
> > -------------------------------
> >
> > I have a 6650 with a PERC4/DC running CentOS5.
> >
> > After 1 to 3 weeks of operation (running VMWare Server) it
> > 'dies' (raid array gets taken offline) and you get rejecting
> > I/O to offline device.
> >
> > When this system was set up late last year, the 6650 was
> > given all the latest firmware along with the PERC4/DC.
> >
> > Using linttylog, I pulled the last entries from when the system must
> > have 'checked out' last night; the data is attached below.
> >
> > Some time back I thought I had cured this problem by adding
> > noapic to the kernel boot parameters in boot.conf. It had
> > gone away for a long time... but is now back.
> >
> > according to lintty, it reports controller firmware is:
> >
> > T0: Firmware version 352D build on Mar 19 2007 at 17:43:23
> > T0: MegaRAID Series 518 firmware version 352D
> >
> > using
> >
> >   strings tty.log | grep 'MedErr on pd' | cut -c17- | sort | uniq -c | sort -n
> >
> > I see:
> >
> > 163 REC:log MedErr on pd[1] #retries=0
> > 165 REC:log MedErr on pd[4] #retries=0
> > 168 REC:log MedErr on pd[2] #retries=0
> >
> > If I am to believe this, Patrol read is finding media errors on
> > physical drives 1, 2, and 4 ! ?
> >
> > These drives are not even a year old, and to have an almost even
> > distribution of errors across 3 drives seems far fetched (unless
> > patrol read is reading past the end of drive ?, but then it would
> > be doing that with all 5 drives).
> >
> > Is the PERC busted ? driver issue ?
> >
> > I'm running CentOS 5 with kernel 2.6.18-53.1.13.el5PAE
> >
> > from dmesg, megaraid related driver versions:
> >
> > megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
> > megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
> >
> > Anyone seen this behavior before ? Anyone have a solution ?
> > We have several Dells in a hosting environment with PERC4/DC
> > running RHEL3, RHEL4.X, and CentOS4.X. We have not had this
> > issue on any of them (though they do not have 5 300G Fujitsu
> > SCSI drives in a RAID 5 config either (as this one does)).
> >
> > Hoping someone can shed some light on this... so far I keep
> > coming up short on finding a solution.
> >
> > Here is the full content of the last lines recorded in the PERC
> > as pulled by linttylog:
> >
> > 03/24 21:43:41: Next PR scheduled to start at 03/25 0:00:00
> > 03/25 3:47:41: REC:log MedErr on pd[4] #retries=0
> > 03/25 3:47:41: LogSense: pd=04, cdb=2f 00 14 30 03 12 00 ff ff 00
> > 03/25 3:47:41: sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
> > 03/25 3:47:41: REC: MedErr on LD[4] BadLba=143014e6
> > 03/25 3:47:41: prCallback: Medium Error on pd=04, StartLba=14300312, ErrLba=143014e6
> > 03/25 3:47:42: prRecQueue: starting pd=04 recovery - blocking host commands
> [...]
>
> I've had a similar issue with an external RAID last year. IIRC, it was
> a PERC3/Di. The problem was that the external RAID took too long to
> recover from media errors, so the whole thing got disconnected.
>
> The solution was to turn off disconnect in the BIOS, although I've got
> no idea why this helped - I just tried it out of despair. This is an old
> SuSE box, though, using the aacraid driver.
>
> I'd replace the failing disk ASAP anyway.
I'd have to replace all of the disks. The PERC4/DC only has
a predictive failure count for drives. I find it hard to believe
that 4-5 drives less than 6 months old would all be meeting their
demise... and I had this problem from the start.
Also, the drives with errors have almost matching error counts...
the statistical probability of this I find near impossible.
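
For anyone wanting to repeat the per-drive tally without the shell
pipeline, here is a minimal Python sketch. It assumes the linttylog
output has already been saved as text and that media errors appear in the
"REC:log MedErr on pd[N]" form quoted earlier in this thread; the function
name is made up for illustration.

```python
import re
from collections import Counter

# Matches firmware log lines of the form quoted in this thread, e.g.
# "03/25 3:47:41: REC:log MedErr on pd[4] #retries=0".
# The "REC: MedErr on LD[...]" lines are deliberately not counted.
MEDERR = re.compile(r"REC:log MedErr on pd\[(\d+)\]")


def count_media_errors(lines):
    """Return a Counter mapping physical-drive number to MedErr count."""
    counts = Counter()
    for line in lines:
        match = MEDERR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Two lines taken from the log excerpt in this thread.
excerpt = [
    "03/25 3:47:41: REC:log MedErr on pd[4] #retries=0",
    "03/25 3:47:41: REC: MedErr on LD[4] BadLba=143014e6",
]

# Print lowest count first, like the trailing "sort -n" in the pipeline.
for drive, n in sorted(count_media_errors(excerpt).items(), key=lambda kv: kv[1]):
    print(f"{n:7d} pd[{drive}]")
```

To run it against a real tty.log, read the file in binary mode and decode
it leniently (for example with errors="replace"), since the original
pipeline needed strings to get text out of it.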
I did add noapic to the boot options for the kernel when I first
found it (based on a recommendation from a thread out on the net)
and it initially helped.
There are no options in the PERC4/DC (LSI Megaraid) that I see
(I'm gathering PERC3 is Adaptec, probably with completely
different options).
I'm really at a loss.... only thing I can 'guess' at is:
1. flaky controller
2. controller/drive firmware incompatibilities
3. bug in either megaraid drivers or sd drivers that pops out under
specific conditions
Since the drives are not 'Dell' drives and I don't see firmware
for Fujitsu 300G drives, I can't test #2. That leaves me with
figuring out how to swap in a spare PERC4/DC that we have (I've
never done that (never had to)) and downgrading to CentOS 4 or
RHEL4 to see if that might confirm #3.
I do find a number of occurrences of this on the net, but nowhere
have I found a definitive "this was the problem, this was the fix".
-- Curt
>
> HTH,
>
> Tino.
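
The LogSense entry quoted above (the cdb= and sense= bytes) can be decoded
by hand. The sketch below assumes the firmware log prints the raw 10-byte
CDB and raw fixed-format SCSI sense data, which is not confirmed anywhere
in this thread but is consistent with the StartLba and ErrLba values the
controller logs next to them. Function names are made up for illustration,
and the sense-key table is only a partial list.

```python
def parse_hex(text):
    """Turn a string like 'f0 00 03' into bytes."""
    return bytes(int(byte, 16) for byte in text.split())


# Partial table of SCSI sense keys (low nibble of sense byte 2).
SENSE_KEYS = {
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
}


def decode_fixed_sense(sense):
    """Decode the main fields of fixed-format sense data (codes 0x70/0x71)."""
    response_code = sense[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return {
        "valid": bool(sense[0] & 0x80),              # INFORMATION field is valid
        "sense_key": SENSE_KEYS.get(sense[2] & 0x0F, "other"),
        "information": int.from_bytes(sense[3:7], "big"),  # failing LBA here
        "asc": sense[12],                            # additional sense code
        "ascq": sense[13],                           # additional sense code qualifier
    }


def decode_cdb10(cdb):
    """Decode opcode, starting LBA, and length from a 10-byte CDB."""
    return {
        "opcode": cdb[0],
        "lba": int.from_bytes(cdb[2:6], "big"),
        "length": int.from_bytes(cdb[7:9], "big"),
    }


# Bytes copied from the log excerpt in this thread.
cdb = decode_cdb10(parse_hex("2f 00 14 30 03 12 00 ff ff 00"))
sense = decode_fixed_sense(
    parse_hex("f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f")
)

print(f"opcode=0x{cdb['opcode']:02x} lba=0x{cdb['lba']:08x} len=0x{cdb['length']:04x}")
print(f"key={sense['sense_key']} info=0x{sense['information']:08x} "
      f"asc/ascq={sense['asc']:02x}/{sense['ascq']:02x}")
```

Under those assumptions, opcode 0x2f is VERIFY(10) starting at LBA
0x14300312, and the sense data reports a MEDIUM ERROR at LBA 0x143014e6
with ASC/ASCQ 11/01, which the SCSI tables list as "read retries
exhausted". In other words, the drive itself is returning the error to the
controller during patrol read.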
_______________________________________
Curtis Wilbar ~ Senior Systems and Network Administrator
125 CambridgePark Drive
Cambridge, MA 02140
e: curtis at cmarket.com
t: 617.252.6415
f: 617.374.9015
www.cmarket.com
www.biddingforgood.com