Perc5/e megasas reset's - advice sought...

Mansell, Gary Gary.Mansell at ricardo.com
Tue Mar 13 02:35:36 CST 2007


In your opinions (as you seem more knowledgeable than I), do you think
that there is any chance of data loss/corruption due to this matter with
the new driver and firmware that Dell are due to supply?

i.e. if I keep experiencing these resets (Dell suggest that I will), is
there a chance that the filesystem could become damaged? I ask this
because a reset that I experienced a while back (with the earlier
driver/firmware) screwed up the journal on the filesystem:

Dec 23 10:08:07 dfgsrv1 kernel: journal_bmap: journal block not found at offset 1135 on dm-6
Dec 23 10:08:07 dfgsrv1 kernel: Aborting journal on device dm-6.
Dec 23 10:08:07 dfgsrv1 kernel: __journal_remove_journal_head: freeing b_frozen_data
Dec 23 10:08:07 dfgsrv1 kernel: __journal_remove_journal_head: freeing b_frozen_data
Dec 23 10:08:07 dfgsrv1 kernel: ext3_abort called.
Dec 23 10:08:07 dfgsrv1 kernel: EXT3-fs error (device dm-6): ext3_journal_start_sb: Detected aborted journal
Dec 23 10:08:07 dfgsrv1 kernel: ext3_abort called.
Dec 23 10:08:07 dfgsrv1 kernel: EXT3-fs error (device dm-6): ext3_journal_start_sb: Detected aborted journal
Dec 23 10:08:07 dfgsrv1 kernel: Remounting filesystem read-only
Dec 23 10:08:07 dfgsrv1 kernel: EXT3-fs error (device dm-6) in start_transaction: Journal has aborted
Dec 23 10:08:07 dfgsrv1 kernel: EXT3-fs error (device dm-6) in start_transaction: Journal has aborted
Dec 23 11:10:33 dfgsrv1 kernel: kjournald starting.  Commit interval 5 seconds
Dec 23 11:10:33 dfgsrv1 kernel: EXT3 FS on dm-7, internal journal
Dec 23 11:10:33 dfgsrv1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Dec 23 12:11:14 dfgsrv1 kernel: __journal_remove_journal_head: freeing b_committed_data
Dec 23 12:12:07 dfgsrv1 kernel: megasas: RESET -107419313 cmd=2a <c=2 t=0 l=0>
Dec 23 12:12:07 dfgsrv1 kernel: megasas: reset successful
Dec 23 12:12:37 dfgsrv1 kernel: megasas: RESET -107434828 cmd=2a <c=2 t=0 l=0>
Dec 23 12:12:37 dfgsrv1 kernel: megasas: reset successful
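
To judge whether a new driver or firmware actually reduces the resets, it may help to count them per day before and after the change. The following is only a rough sketch in Python: the regular expression is inferred from the two RESET lines in the excerpt above rather than from the driver source, and the names (RESET_RE, find_resets, resets_per_day, SAMPLE) are made up for illustration.

```python
import re
from collections import Counter

# Start of a reset entry as it appears in the excerpt above, e.g.
#   Dec 23 12:12:07 dfgsrv1 kernel: megasas: RESET -107419313 cmd=2a <c=2 t=0 l=0>
# Only the leading fields are matched, so mail-wrapped lines still count.
# "reset successful" lines are deliberately not matched (lower-case "reset").
RESET_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+megasas:\s+RESET\s+(?P<serial>-?\d+)\s+"
    r"cmd=(?P<cmd>[0-9a-fA-F]+)"
)


def find_resets(lines):
    """Return one dict of fields per megasas RESET line in an iterable of lines."""
    return [m.groupdict() for m in map(RESET_RE.match, lines) if m]


def resets_per_day(lines):
    """Count RESET lines per "Mon DD" so a trend is visible over time."""
    return Counter(f"{r['month']} {r['day']}" for r in find_resets(lines))


# Sample taken from the log excerpt above.
SAMPLE = [
    "Dec 23 12:11:14 dfgsrv1 kernel: __journal_remove_journal_head: freeing b_committed_data",
    "Dec 23 12:12:07 dfgsrv1 kernel: megasas: RESET -107419313 cmd=2a <c=2 t=0 l=0>",
    "Dec 23 12:12:07 dfgsrv1 kernel: megasas: reset successful",
    "Dec 23 12:12:37 dfgsrv1 kernel: megasas: RESET -107434828 cmd=2a <c=2 t=0 l=0>",
    "Dec 23 12:12:37 dfgsrv1 kernel: megasas: reset successful",
]

for day, count in resets_per_day(SAMPLE).items():
    print(day, count)
```

For a real run, an open syslog file can be passed in place of SAMPLE; where that file lives depends on the syslog configuration.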

Your advice would be gladly received as I am under pressure to go live
with these machines in a critical production fileserver environment.

Best Regards




On Mon, 2007-03-12 at 22:52 -0500, Patrick_Boyd at Dell.com wrote:

> There have been some recent changes pushed into the Megaraid sas driver
> by Dell and LSI that should help alleviate this problem. These new
> drivers are currently undergoing evaluation by Dell for posting on
> support.dell.com, however if you want them you should be able to pull
> the latest megaraid_sas driver out of the kernel tree and use that...
> However as I have stated these drivers haven't completed validation by
> Dell yet so they aren't officially supported by us. 
> 
> -----Original Message-----
> From: linux-poweredge-bounces at dell.com
> [mailto:linux-poweredge-bounces at dell.com] On Behalf Of Richard Ford
> Sent: Tuesday, March 13, 2007 7:25 AM
> To: Joe Malicki
> Cc: Gary.Mansell at ricardo.com; linux-poweredge-Lists
> Subject: Re: Perc5/e megasas reset's - advice sought...
> 
> This issue is a lot like the DRAC issue and virtual CDROM and FDD units.
> 
> The SCSI layer in Linux could not handle IDE devices going offline and
> also changes to the system.
> 
> It caused massive SCSI requests and then pushed your file systems to
> read only.
> 
> Dell had a fix that set the DRAC Virtual CD and FDD up as IDE-SCSI,
> which allows hot plug, rather than as normal IDE.
> 
> Point is - the fix wasn't a fix - but it did do the job.  Servers have
> not crashed since.
> 
> RF.
> 
> 
> On 13 Mar 2007, at 2:09 AM, Joe Malicki wrote:
> 
> > I've spent a significant amount of time looking at this over the last 
> > several months.  It seems to me that the problem is that the Linux 
> > kernel's SCSI layer insists on a single timeout for all SCSI requests,
> > and doesn't tolerate high variances in command completion times.  If any
> > single command times out, it resets the whole bus, even if there is
> > still significant activity.
> >
> > These resets are purely in response to excessive completion times when
> > we've monitored them, and don't seem to be caused by any other
> > activity on the SAS bus.
> >
> > Many other similar RAID cards, like HP's CCISS, have block drivers in 
> > Linux that don't report themselves as SCSI so that they avoid some of 
> > the SCSI layer's meddling.
> >
> > Recently, LSI has been releasing new drivers that turn down the number
> > of outstanding commands allowed when they see some commands that take
> > excessively long to complete, which seems like a desirable thing to do
> > to prevent this problem from occurring.
> > Unfortunately the new drivers seem to still be in active peer-review
> > so don't seem ready to try yet (see the linux-scsi mailing list), but
> > look close.
> >
> > Also, I'm glad to hear that Dell will be releasing a new driver in
> > addition to the firmware (I hadn't heard about the driver, which, from
> > looking at the diffs, seems like most of the right fix).
> >
> > One thing the new drivers do, that is purely configuration, is up the
> > SCSI timeout to 120 seconds.  This is reasonable considering the queue
> > depth... when you can have 128 commands outstanding, it's reasonable
> > to expect sufficiently high variances that a couple of them may take
> > that long, occasionally, under high load.  You can change this by:
> >
> > echo 120 > /sys/block/sda/device/timeout
> >
> > If sda was your virtual drive... and similar for other virtual drives.
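
For a server with several virtual drives, the same change could be applied in one pass. The following is only a minimal sketch in Python, assuming the virtual drives show up as sd* devices under /sys/block and that each exposes a device/timeout file as in the echo command above. The function name set_scsi_timeouts and its parameters are made up for illustration, and the default is a dry run so that nothing is written until asked.

```python
from pathlib import Path


def set_scsi_timeouts(seconds=120, sys_block="/sys/block", pattern="sd*", dry_run=True):
    """Set the SCSI command timeout for every block device matching `pattern`.

    Returns {device_name: (old_value, new_value)} for each device that has a
    device/timeout file. With dry_run=True the files are only read.
    """
    changes = {}
    for dev in sorted(Path(sys_block).glob(pattern)):
        timeout_file = dev / "device" / "timeout"
        if not timeout_file.is_file():
            # Not a SCSI-backed device (or no timeout attribute); skip it.
            continue
        old = timeout_file.read_text().strip()
        if not dry_run:
            # Same effect as: echo 120 > /sys/block/sdX/device/timeout
            timeout_file.write_text(f"{seconds}\n")
        changes[dev.name] = (old, str(seconds))
    return changes


if __name__ == "__main__":
    # Show what would change; pass dry_run=False (as root) to apply it.
    for name, (old, new) in set_scsi_timeouts().items():
        print(f"{name}: {old} -> {new}")
```

Values written through sysfs do not survive a reboot, so the call would have to be repeated at boot time (for example from an init script) if the setting is to stick.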
> >
> > -joe
> >
> > Mansell, Gary wrote:
> >
> >> Hi,
> >>
> >> I have been suffering recurring megasas resets with several PE2950s
> >> attached to MD1000 RAID units for the last four months since I bought
> >> the systems. I have also seen that others on this list have been
> >> suffering from the same issue.
> >>
> >> The problem is that, under heavy IO to the MD1000, the system suffers
> >> megaraid SAS resets. This sometimes causes the filesystem to be
> >> transitioned to read only by the OS. I have also seen an ext3 journal
> >> become corrupt because of this.
> >>
> >> I have had a fault call in with Dell UK's gold queue to resolve this
> >> matter and last week they announced that they will finally have a
> >> "fix" for the issue.
> >>
> >> On April 10th they say that they will be releasing a new firmware and
> >> megasas driver which they say will not prevent the resets happening
> >> but will mean that they are handled more effectively so that the
> >> filesystem will not be transitioned to read only.
> >>
> >> I am interested to hear other people's opinions of this as it seems to
> >> me that this is just fixing the symptoms and not the underlying
> >> problem. It seems to me that a megasas reset just should not be
> >> happening. The resets must be an indication of a problem on the SAS
> >> bus - surely one would not expect to see them? Dell also say that I
> >> can be assured that there will be no data loss if I go with this fix
> >> - what do others think?
> >> It just seems wrong to me that I should be getting megasas resets
> >> - if this was a SCSI bus reset then this would be indicative of a
> >> major problem.
> >>
> >> Anyway, I have got to decide whether to go live with a couple of
> >> production fileservers (supporting 200-odd client machines) using
> >> this technology in the next few weeks so any comments/advice would be
> >> gladly received.
> >>
> >> Regards
> >>
> >> Gary Mansell
> >>
> >> _______________________________________________
> >> Linux-PowerEdge mailing list
> >> Linux-PowerEdge at dell.com
> >> http://lists.us.dell.com/mailman/listinfo/linux-poweredge
> >> Please read the FAQ at http://lists.us.dell.com/faq

-- 
------------------------------------------------------------------------------------ 
Gary Mansell 
Team Leader - Technical Computing 
Ricardo UK Ltd. 
Shoreham Technical Centre, Shoreham-By-Sea, West Sussex, BN43 5FG 
Email : Gary.Mansell at ricardo.com
Dial : +44 (0)1273 794485 | Fax :+44 (0)1273 794699  
A subsidiary of Ricardo plc. - www.ricardo.com
------------------------------------------------------------------------------------
