rejecting I/O to offline device (PERC woes)

Nick_Parrott at Dell.com
Tue Apr 22 11:44:06 CDT 2008


A consistency check may fix this, but the chances are slim.

What you've got here is a punctured stripe, caused by a double-fault
scenario. A bad block (a genuine one) has occurred, and while it was
being re-mapped, another was encountered. That block is re-mapped as
well, but the parity is not reconstructed. I'm not aware of the details
beyond this, but I have seen it enough times to know you've got bad
parity.
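
To illustrate why a second fault in the same stripe is a problem, here
is a minimal sketch of XOR parity in Python. It is a generic
illustration only: the 4-disk layout and the two-byte blocks are
made-up values, and it does not describe how the PERC firmware
actually handles parity or re-mapping.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-disk RAID 5: three data blocks plus parity.
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(data)

# Single fault: block 1 is unreadable, so rebuild it from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
print("single fault recovered:", rebuilt == data[1])

# Double fault: blocks 0 and 1 are both unreadable. The survivors only
# give the XOR of the two missing blocks, not either block by itself.
combined = xor_blocks([data[2], parity])
print("double fault yields only d0 XOR d1:", combined.hex())
```

With one block missing the stripe can be rebuilt; with two missing the
survivors no longer contain enough information, which is the situation
a punctured stripe records.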

The media errors are cosmetic, not genuine, and can be removed with the
following procedure. You can run consistency checks till the cows come
home, but this won't improve, and the disk diags will fail again and
again as the disk reports media errors. Rebuilding onto a shiny new
disk will copy this bad block onto that disk too (before someone asks!)

Back up your data first: the data on these disks will be destroyed, so
the RAID array may as well be deleted beforehand.
Use MHDD to write empty data to the block, repeating this for each
affected disk and block (only block 143014e6 (hex) / 338695398 (dec) in
your case), then re-create the RAID array.
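
MHDD wants the LBA in decimal, while the controller log reports it in
hex. As a rough sketch (the function name and the regular expression
are illustrative assumptions, not part of any Dell or LSI tool), the
failing LBAs can be pulled out of the log lines quoted further down
this thread and converted like this:

```python
import re

# Log lines copied from the linttylog output quoted later in this thread.
log_lines = [
    "03/25  3:47:41: REC: MedErr on LD[4] BadLba=143014e6",
    "03/25  3:47:41: prCallback: Medium Error on pd=04, "
    "StartLba=14300312, ErrLba=143014e6",
]

def bad_lbas(lines):
    """Return the set of failing LBAs (as integers) named in the log lines."""
    pattern = re.compile(r"\b(?:BadLba|ErrLba)=([0-9a-fA-F]+)")
    return {int(m.group(1), 16) for line in lines for m in pattern.finditer(line)}

for lba in sorted(bad_lbas(log_lines)):
    print(f"hex {lba:x} = decimal {lba}")
```

For the two lines above this prints a single entry, hex 143014e6 =
decimal 338695398, matching the block number given here.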

Clearing media errors in free space
1. Boot the system on the MHDD diskette. (google for this.. common
utility!) 

2. When prompted, select whether the disks are connected to PERC
channel 0 or channel 1.

3. Select which disk you want to work with, in this case we started with
disk 6 ( = disk 0).
    MHDD will present disk 0 as disk 6, disk 1 as disk 7 etc.

4. At the MHDD prompt, type FF (writing sectors from file to the drive).

5. Enter Sector size [512] (press enter to accept default 512 bytes).

6. Source filename: 512b (a 512 byte empty file that will be used to
overwrite the sector).

7. Enter start LBA to write: 21032076 (LBA 140ec8c in decimal, see
above).

8. Enter end LBA: 21032077 (start LBA + 1).

LBA 140ec8c on disk 0 has now been cleared. Repeat steps 4 - 8 to clear
the three remaining LBAs on disk 0 (1410de1, 1411a64 and 15af285).

9. Press Shift + F3 to get back to the physical disk selection.

10. Select disk 7 ( = disk 1).

11. Repeat steps 4 - 8 until all four LBAs have been cleared on disk 1.

12. Press Alt + X to exit.

(*These instructions are generic. MHDD may have changed since; the
important parts are the byte size and, of course, the sector.)
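
The source file from step 6 and the decimal start/end values from
steps 7 and 8 can be prepared ahead of time. The following Python is a
minimal sketch under some assumptions: the helper names are made up
for illustration, the file is written to the current directory, and
getting it onto the MHDD boot diskette is left to you.

```python
from pathlib import Path

SECTOR_SIZE = 512  # bytes, the default accepted in step 5

def write_blank_sector_file(path, size=SECTOR_SIZE):
    """Create a file of `size` zero bytes to use as the FF source file."""
    Path(path).write_bytes(bytes(size))

def lba_ranges(hex_lbas):
    """Map each hex LBA to the (start, end) decimal pair for steps 7 and 8."""
    return [(int(h, 16), int(h, 16) + 1) for h in hex_lbas]

# The four LBAs named in the example walkthrough above.
example_lbas = ["140ec8c", "1410de1", "1411a64", "15af285"]

write_blank_sector_file("512b")
for hex_lba, (start, end) in zip(example_lbas, lba_ranges(example_lbas)):
    print(f"LBA {hex_lba}: start {start}, end {end}")
```

The first pair printed is 21032076 and 21032077, the values shown in
steps 7 and 8. For your own array, substitute the LBAs from your
controller log.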

Nick Parrott
Dell Enterprise Storage Support
Ire +353 (1) 850 543 543
UK +44 (0) 870 908 0500

-----Original Message-----
From: linux-poweredge-bounces at dell.com
[mailto:linux-poweredge-bounces at dell.com] On Behalf Of
Kurt_Olsson at dell.com
Sent: 22 April 2008 15:50
To: curtis+dpeml at cmarket.com; linux-poweredge-Lists
Subject: RE: rejecting I/O to offline device (PERC woes)

Please attempt to run a consistency check on the logical disk (RAID set)
that comprises the disks that show the fault(s).

-----Original Message-----
From: linux-poweredge-bounces at dell.com
[mailto:linux-poweredge-bounces at dell.com] On Behalf Of Curtis H. Wilbar
Jr.
Sent: Tuesday, April 22, 2008 9:22 AM
To: linux-poweredge-Lists
Subject: Re: rejecting I/O to offline device (PERC woes)

On Mon, 2008-04-21 at 17:50, Tino Schwarze wrote:
> On Mon, Apr 21, 2008 at 05:38:21PM -0400, Curtis H. Wilbar Jr. wrote:
> 
> > Haven't gotten any tips on a solution to the problem below.
> > It happened again this weekend.
> > 
> > My next test steps (order not determined):
> > 
> > 1. Downgrade to CentOS 4/RHEL 4
> > 2. Swap out PERC controller with a spare
> > 
> > I have never had a problem with the PERC4/DC controllers on our
> > other machines (RHEL3/4, CentOS 4).  Although, I've no other
> > machine that has 5 300G Fujitsu SCSI drives either.
> > 
> > Any suggestions on the below, or which order on the above to
> > try ?
> > 
> > Thanks,
> > 
> > -- Curt
> > 
> > -------------------------------
> > 
> > I have a 6650 with a PERC4/DC running CentOS5.
> > 
> > After 1 to 3 weeks of operation (running VMWare Server) it
> > 'dies' (raid array gets taken offline) and you get rejecting
> > I/O to offline device.
> > 
> > When this system was setup late last year, the 6650 was
> > given all the latest firmware along with the PERC4/DC.
> > 
> > using linttylog, the last entries from when the system must
> > have 'checked out' last night, I see the data attached below.
> > 
> > Some time back I thought I had cured this problem by adding
> > noapic to the kernel boot parameters in boot.conf.  It had
> > gone away for a long time... but is now back.
> > 
> > according to lintty, it reports controller firmware is:
> > 
> > T0: Firmware version 352D build on Mar 19 2007 at 17:43:23
> > T0: MegaRAID Series 518 firmware version 352D
> > 
> > using strings tty.log | grep 'MedErr on pd' | cut -c17- | sort |
> > uniq -c | sort -n I see:
> > 
> >     163 REC:log MedErr on pd[1] #retries=0
> >     165 REC:log MedErr on pd[4] #retries=0
> >     168 REC:log MedErr on pd[2] #retries=0
> > 
> > If I am to believe this, Patrol read is finding media errors on
> > physical drives 1, 2, and 4 ! ?
> > 
> > These drives are not even a year old, and to have an almost even
> > distribution of errors across 3 drives seems far fetched (unless
> > patrol read is reading past the end of drive ?, but then it would
> > be doing that with all 5 drives).
> > 
> > Is the PERC busted ?  driver issue ?
> > 
> > I'm running CentOS 5 with kernel 2.6.18-53.1.13.el5PAE
> > 
> > from dmesg, megaraid related driver versions:
> > 
> > megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
> > megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
> > 
> > Anyone seen this behavior before ?  Anyone have a solution ?
> > We have several Dells in a hosting environment with PERC4/DC
> > running RHEL3, RHEL4.X, and CentOS4.X.  We have not had this
> > issue on any of them (though they do not have 5 300G Fujitsu
> > SCSI drives in a RAID 5 config either (as this one does)).
> > 
> > Hoping someone can shed some light on this... so far I keep
> > coming up short on finding a solution.
> > 
> > Here is the full content of the last lines recorded in the PERC
> > as pulled by linttylog:
> > 
> > 03/24 21:43:41: Next PR scheduled to start at 03/25  0:00:00
> > 03/25  3:47:41: REC:log MedErr on pd[4] #retries=0
> > 03/25  3:47:41: LogSense: pd=04, cdb=2f 00 14 30 03 12 00 ff ff 00
> > 03/25  3:47:41:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
> > 03/25  3:47:41: REC: MedErr on LD[4] BadLba=143014e6
> > 03/25  3:47:41: prCallback: Medium Error on pd=04, StartLba=14300312, ErrLba=143014e6
> > 03/25  3:47:42: prRecQueue: starting pd=04 recovery - blocking host commands
> [...]
> 
> I've had a similar issue with an external RAID last year. IIRC, it was
> a PERC3/Di. The problem was that the external RAID took too long to
> recover from media errors therefore the whole thing got disconnected.
> 
> The solution was to turn off disconnect in the BIOS although I've got
> no idea why this helped - I just tried it out of despair. This is an
> old SuSE box, though, using the aacraid driver.
> 
> I'd replace the failing disk ASAP anyway.

I'd have to replace all of the disks.  The PERC4/DC only has
a predictive failure count for drives.  I find it hard to believe
that 4-5 drives less than 6 months old would all be meeting their
demise... and I had this problem from the start.  

Also, the drives with errors have almost matching error counts...
the statistical probability of this I find near impossible.

I did add noapic to the boot options for the kernel when I first
found it (based on a recommendation from a thread out on the net)
and it initially helped.

There are no options in the PERC4/DC (LSI Megaraid) that I see
(I'm gathering PERC3 is Adaptec, probably with completely 
different options).

I'm really at a loss....  only thing I can 'guess' at is:

1. flaky controller
2. controller/drive firmware incompatibilities
3. bug in either megaraid drivers or sd drivers that pops out under
    specific conditions

Since the drives are not 'Dell' drives and I don't see firmware
for Fujitsu 300G drives, I can't test #2.  That leaves me with
figuring out how to swap in a spare PERC4/DC that we have (I've
never done that (never had to)) and downgrading to CentOS 4 or
RHEL4 to see if that might confirm #3.

I do find a number of occurrences of this on the net, but nowhere
have I found a definitive "this was the problem, this was the fix".

-- Curt


> 
> HTH,
> 
> Tino.

_______________________________________

Curtis Wilbar ~  Senior Systems and Network Administrator

125 CambridgePark Drive
Cambridge, MA 02140
e: curtis at cmarket.com
t: 617.252.6415
f: 617.374.9015
www.cmarket.com
www.biddingforgood.com



_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
http://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
