Linux, MD1000, megasas timeout woes

simonpe at ferrett.com
Fri Apr 6 05:25:50 CDT 2007


Hi All,

I have seen some chatter about this on the list, but there does not seem to
have been much of a consensus on a solution.  I was wondering if there is
any more information out there on how to work around or resolve this
problem.

The problem: under heavy load, the megasas driver attempts a reset, the
reset times out, the driver does not recover, and the filesystem stays
offline until the system is rebooted.

The configuration is a Dell PowerEdge 1950 server:
BIOS is Phoenix 1.10 1.2.0
Dual-core 3.0 GHz, 2 GB RAM
PowerEdge Expandable RAID Controller Version 5.0.1 (Build December 01, 2005)
PERC 5/i 5.0.2-0003 (Bus 2 Dev 14, HA -0) - 1 Logical Drive
PERC 5/E 5.0.2-0003 (Bus 15 Dev 14, HA -1) - 1 Logical Drive

Connected to the 5/E is a Dell MD1000 array holding an 8 TB RAID 5 set
built from SATA disks.

The server runs Linux.  I have tried two kernels: 2.6.21-rc6 (home build
with megasas 00.00.03.10-rc3) and 2.6.18-4-amd64 (Debian 2.6.18.dfsg.1-12
with megasas 00.00.03.01).

Here are the (hopefully relevant) parts of the kernel startup:
(With the 2.6.18 build):
megasas: 00.00.03.01 Sun May 14 22:49:52 PDT 2006
megasas: 0x1028:0x0015:0x1028:0x1f03: bus 2:slot 14:func 0
megasas: 0x1028:0x0015:0x1028:0x1f01: bus 15:slot 14:func 0

(With the 2.6.21 build):
megasas: 00.00.03.10-rc3 Wed Mar 28 10:25:52 PST 2007
megasas: 0x1028:0x0015:0x1028:0x1f03: bus 2:slot 14:func 0
megasas: 0x1028:0x0015:0x1028:0x1f01: bus 15:slot 14:func 0

(General bootup messages about disk/driver)
scsi0 : LSI Logic SAS based MegaRAID driver
  Vendor: FUJITSU   Model: MAX3073RC         Rev: D206
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: FUJITSU   Model: MAX3073RC         Rev: D206
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: DP        Model: BACKPLANE         Rev: 1.00
  Type:   Enclosure                          ANSI SCSI revision: 05
  Vendor: DELL      Model: PERC 5/i          Rev: 1.00
  Type:   Direct-Access                      ANSI SCSI revision: 05
megasas: 0x1028:0x0015:0x1028:0x1f01: bus 15:slot 14:func 0
GSI 20 sharing vector 0x5A and IRQ 20
ACPI: PCI Interrupt 0000:0f:0e.0[A] -> GSI 18 (level, low) -> IRQ 90
input: Dell Dell USB Keyboard as /class/input/input0
input: USB HID v1.10 Keyboard [Dell Dell USB Keyboard] on usb-0000:00:1d.7-5.1
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
scsi1 : LSI Logic SAS based MegaRAID driver
  Vendor: DELL      Model: MD1000            Rev: A.00
  Type:   Enclosure                          ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: ATA       Model: ST3750640NS       Rev: E
  Type:   Direct-Access                      ANSI SCSI revision: 05
  Vendor: DELL      Model: MD1000            Rev: A.00
  Type:   Enclosure                          ANSI SCSI revision: 05
  Vendor: DELL      Model: PERC 5/E Adapter  Rev: 1.00
  Type:   Direct-Access                      ANSI SCSI revision: 05
ESB2: IDE controller at PCI slot 0000:00:1f.1
ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 16 (level, low) -> IRQ 169
ESB2: chipset revision 9
ESB2: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0xfc00-0xfc07, BIOS settings: hda:DMA, hdb:pio
SCSI device sda: 142082048 512-byte hdwr sectors (72746 MB)
sda: test WP failed, assume Write Enabled
sda: asking for cache data failed
sda: assuming drive cache: write through
SCSI device sda: 142082048 512-byte hdwr sectors (72746 MB)
sda: test WP failed, assume Write Enabled
sda: asking for cache data failed
sda: assuming drive cache: write through
 sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 >
sd 0:2:0:0: Attached scsi disk sda
sdb : very big device. try to use READ CAPACITY(16).
SCSI device sdb: 17568890880 512-byte hdwr sectors (8995272 MB)
sdb: test WP failed, assume Write Enabled
sdb: asking for cache data failed
sdb: assuming drive cache: write through
sdb : very big device. try to use READ CAPACITY(16).
SCSI device sdb: 17568890880 512-byte hdwr sectors (8995272 MB)
sdb: test WP failed, assume Write Enabled
sdb: asking for cache data failed
sdb: assuming drive cache: write through
 sdb: sdb1
sd 1:2:0:0: Attached scsi disk sdb

The problem occurs most readily under high write loads.  The server is an
NFS server and all disk access is via NFS.  

When things go south, this is what is logged, and a hard reboot is then
required to get things back online:
sd 1:2:0:0: megasas: RESET -286287 cmd=8a
megasas: [ 0]waiting for 12 commands to complete
megasas: [ 5]waiting for 12 commands to complete
megasas: [10]waiting for 12 commands to complete
megasas: [15]waiting for 12 commands to complete
...
megasas: [170]waiting for 12 commands to complete
megasas: [175]waiting for 12 commands to complete
megasas: failed to do reset
sd 1:2:0:0: megasas: RESET -286287 cmd=8a
megasas: cannot recover from previous reset failures
sd 1:2:0:0: megasas: RESET -286287 cmd=8a
megasas: cannot recover from previous reset failures
sd 1:2:0:0: scsi: Device offlined - not ready after error recovery

At that point the filesystem is offline.

The 00.00.03.10-rc3 megasas driver behaves the same, though its initial
reset messages carry a little extra detail:
sd 1:2:0:0: megasas: RESET -626179 cmd=2a retries=0
megasas: [ 0]waiting for 3 commands to complete
megasas: [ 5]waiting for 3 commands to complete
...
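
For anyone wanting to spot this condition from monitoring rather than from
hung NFS clients, the messages above can be counted in the kernel log.  The
following is only a sketch: the function name megasas_reset_summary is made
up for illustration, and the patterns are copied from the excerpts above, so
other driver versions may word their messages differently.

```shell
# Summarise megasas reset activity from kernel log text read on stdin.
# megasas_reset_summary is a hypothetical helper name, not part of any tool.
# The patterns match the message text quoted in this post only.
megasas_reset_summary() {
    awk '
        /megasas: RESET/                                   { resets++ }
        /megasas: failed to do reset/                      { failed++ }
        /megasas: cannot recover from previous reset/      { dead++ }
        /Device offlined - not ready after error recovery/ { offline++ }
        END {
            # Counters that never matched print as 0.
            printf "resets=%d failed=%d unrecoverable=%d offlined=%d\n",
                   resets, failed, dead, offline
        }
    '
}
```

It could be fed with something like "dmesg | megasas_reset_summary"; any
non-zero "failed" or "offlined" count corresponds to the state described
above where a reboot was needed.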

I have had some limited success by following a couple of suggestions I have
found online:

*) Increasing the device timeout to 120s:
   echo 120 > /sys/block/sdb/device/timeout
*) Decreasing the queue size:
   echo 16 > /sys/block/sdb/queue/nr_requests
   (I also tried nr_requests at 8 and 4)
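
Values written to sysfs this way do not persist across a reboot, so the two
settings have to be reapplied at each boot, e.g. from an init script.  A
minimal sketch of a helper that does so is below.  The function name
set_blk_tunables and the SYSFS_ROOT override are hypothetical and exist only
for illustration; the two sysfs paths are the ones used above.

```shell
# Apply the two tunables described above to a single block device.
# set_blk_tunables and SYSFS_ROOT are illustrative names, not an existing tool.
# SYSFS_ROOT defaults to /sys and can point at a scratch directory for testing.
set_blk_tunables() {
    dev=$1          # block device name, e.g. sdb
    timeout=$2      # device timeout in seconds, e.g. 120
    nr_requests=$3  # request queue size, e.g. 16
    root=${SYSFS_ROOT:-/sys}

    t_file="$root/block/$dev/device/timeout"
    q_file="$root/block/$dev/queue/nr_requests"

    # Refuse to continue if either attribute is missing or not writable.
    for f in "$t_file" "$q_file"; do
        if [ ! -w "$f" ]; then
            echo "cannot write $f" >&2
            return 1
        fi
    done

    echo "$timeout" > "$t_file" || return 1
    echo "$nr_requests" > "$q_file" || return 1

    # Read the values back so the caller sees what is now in effect.
    echo "$dev: timeout=$(cat "$t_file") nr_requests=$(cat "$q_file")"
}
```

Run as root, "set_blk_tunables sdb 120 16" would reproduce the settings
described above.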

Decreasing nr_requests can make the system run a little longer before
falling over, but everything I have tried so far has eventually ended the
same way: the series of messages above and an offline filesystem.  As long
as the filesystem does not see heavy write load, the system appears to be
stable.

In summary I have tried:
*) New kernel with the 00.00.03.10-rc3 megasas drivers
*) Increased device timeout (with both 3.10-rc3 and 3.01 megasas)
*) Reduced nr_requests (with both 3.10-rc3 and 3.01 megasas)

Does anyone have further information on this problem?  It is beginning to
cause concern for our rollout schedule and so any advice or assistance would
be very much appreciated.  

Also, if there is any further detail that I can provide, I will happily do so.

Thanks,
Simon







More information about the Linux-PowerEdge mailing list