PE8450 with megaraid and data corruption
dtitzer at cablespeed.com
Mon Jun 2 20:55:01 CDT 2003
Our PE8450 is running a RAID5 volume off a PERC3/DC. We're running
SuSE SLES7 and Oracle 126.96.36.199. I'm using dellmgr 5.11 and megamon 2.6.
This system is in production.
Because some versions of megamon caused SCSI timeouts, I was
previously hesitant to try it. Even stopping it required
rebooting the system. I recently installed a known-compatible version
of dellmgr (from Matt's Site) on this system, and, during a
maintenance window, I started up this (2.6) version of megamon. Both
appeared to be working just fine. We did have a flaky drive that
appeared to be okay, but megamon flagged it bad (it WAS bad) early
last week. It was a hotspare, so we didn't lose anything. I swapped the
bad hotspare for a good one, and took that as a Good Sign that megamon
was worth keeping running.
Megamon was started up on 5/25 during the late morning. Since it
defaults to running a consistency check right after midnight on Sunday
mornings, today (6/1) was the first chance it had to do that. During
that consistency check, we began getting corruption errors from
Oracle. The check flagged a drive, dropped the volume into degraded
mode, and began rebuilding onto the hotspare. During this time, though,
we continued to get corruption errors in Oracle. I assume this
corruption has been written to the RAID volume, but I cannot tell
whether the drive failure caused it.
I want to know if there are any issues with dellmgr and/or megamon
that relate to consistency checks. Is there any possibility that the
consistency check is overly aggressive and that the drive was not bad?
Getting the corruption errors seemed odd. We had hoped that using RAID
would prevent corruption errors from affecting operations. Is it
possible that the RAID controller didn't detect the problem fast
More information about the Linux-PowerEdge