Disconcerting journal commit I/O error RHEL+2.6.9-34.0.2
Jason Young
jason.young at extension.org
Fri Aug 4 14:33:30 CDT 2006
It happened to me again yesterday on the primary box, the one where it
has been happening most often (it triggered on a simple mkdir in a
mounted logical volume for /export). OMSA 5.0 had already been removed,
so combined with the other reports, I'm ruling OMSA 5.0 out (hooray -
sorry Dell folks for the drive-by implication there).
The other thing a campus colleague suggested to me was that
"LVM2 snapshots weren't ready for prime time". My colleague didn't
really point me to any hard data or bug reports to back that assertion,
but I generally trust their judgement, and it's another thing that's
slightly out of the "ordinary" WRT the filesystem that I'm grasping
at. I had been backing this box up using LVM snapshots that were
created and removed nightly. I'm stopping that today on the box and
I'll see what happens.
Are you using LVM snapshots at all?
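
If it helps to check quickly, here is a minimal sketch that lists any
snapshot volumes on a box. It assumes the LVM2 "lvs" tool is on the
path and that any LV reporting a non-empty origin is a snapshot. The
function names and the "|" separator are my own choices for
illustration, nothing official.

```python
import subprocess


def parse_lvs_snapshots(lvs_output):
    """Return (vg, lv, origin) tuples for LVs that report an origin.

    Expects the output of:
        lvs --noheadings --separator '|' -o vg_name,lv_name,origin
    Assumption: a non-empty origin field means the LV is a snapshot.
    """
    snapshots = []
    for line in lvs_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        vg, lv, origin = fields
        if origin:
            snapshots.append((vg, lv, origin))
    return snapshots


def list_snapshots():
    """Run lvs (normally needs root) and return the snapshot LVs it reports."""
    result = subprocess.run(
        ["lvs", "--noheadings", "--separator", "|",
         "-o", "vg_name,lv_name,origin"],
        check=True, capture_output=True, text=True,
    )
    return parse_lvs_snapshots(result.stdout)
```

Calling list_snapshots() as root should return an empty list on a box
with no snapshots at the moment; a nightly snapshot would only show up
while the backup is actually running.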
Jason
On Aug 4, 2006, at 4:52 AM, Nicky Peeters wrote:
> Well, I've just had one server go bongo on a similar issue.
>
> It's a PE 2850 with 6 disks in RAID10, running RHEL4 X86_64.
> It's also the only machine I upgraded to kernel 2.6.9-34.0.2.ELsmp
> (22 days ago).
>
> The FS seems RO now, but since I can't get a root shell running
> (bus errors) I need to schedule a datacenter trip to know more.
>
> Dmesg output:
>
> EXT3-fs error (device dm-1) in start_transaction: Journal has aborted
> EXT3-fs error (device dm-1) in start_transaction: Journal has aborted
> EXT3-fs error (device dm-1) in start_transaction: Journal has aborted
> EXT3-fs error (device dm-1) in start_transaction: Journal has aborted
> scsi0 (0:0): rejecting I/O to offline device
> EXT3-fs error (device dm-1): ext3_find_entry: reading directory
> #15450113 offset 0
>
> scsi0 (0:0): rejecting I/O to offline device
> EXT3-fs error (device dm-1): ext3_find_entry: reading directory
> #15450118 offset 0
> ...
>
> Let me know if you know more. I'm beginning to suspect our kernels
> to be the culprits!
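
For comparing reports, a rough sketch like the one below could tally
the message types in a saved dmesg capture. The regular expressions
are written only against the lines quoted above, and the category
names are made up for illustration; other kernels may word these
messages differently.

```python
import re
from collections import Counter

# Patterns derived only from the dmesg lines quoted in this thread.
PATTERNS = {
    "journal_abort": re.compile(
        r"EXT3-fs error \(device (?P<dev>[\w-]+)\).*Journal has aborted"),
    "scsi_offline": re.compile(
        r"(?P<dev>scsi\d+) \(\d+:\d+\): rejecting I/O to offline device"),
    "dir_read_error": re.compile(
        r"EXT3-fs error \(device (?P<dev>[\w-]+)\): "
        r"ext3_find_entry: reading directory"),
}


def summarize(dmesg_text):
    """Count matching lines per (category, device) pair."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group("dev"))] += 1
                break  # each line is counted under one category only
    return counts
```

Seeing "rejecting I/O to offline device" alongside the ext3 errors
would at least show that the SCSI layer had marked the device offline,
which is worth separating from a pure filesystem problem.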
>
> On 28 Jul 2006, at 15:12, Jason Young wrote:
>
>> Hi all,
>>
>> Two weeks ago, right after the Red Hat Kernel update to
>> 2.6.9-34.0.2.ELsmp (RHEL4) I started getting a journal commit I/O
>> error on two of my servers with the srvadmin-all rpm's installed
>> (version 5).
>>
>> One was a 2800 running WS, the other a 2850 running AS - both with the
>> OEM PERC controllers that came with the servers. All firmware/BIOS
>> updates are at the latest release versions available.
>>
>> The error came after a moderate amount of writes (either installing
>> ruby on the freshly reinstalled 2850 or processing some webstats with
>> awstats on the 2800) - and when the journal commit error occurs -
>> every mounted volume goes read only - which obviously wreaks havoc on
>> the running operating system. The problem occurred twice on the
>> 2800, and once on the 2850. It was not (yet) occurring on my other
>> 2850's and 1850's - running RHEL4, both WS and AS - also with the
>> version 5 srvadmin-all rpm's (and the srvadmin-rac4 RPM's where
>> appropriate). Those were/are still running 2.6.9-34.0.1.
>>
>> My filesystem is a normal primary ext3 /boot, and the rest of the
>> RAID (either all RAID5 or a two disk RAID1 and 3 disk RAID5 on the
>> six-drive 2850) is a PV with various sized LVM2 logical volumes for
>> slash, /var, /home, etc.
>>
>> The problem freaked me out more than a little. The two servers it was
>> happening on are not-yet-production, and obviously the last thing I
>> needed was for the problem to spread to production systems. There are
>> no logs, obviously, because /var goes read-only like everything else.
>>
>> Grasping at "what changed" straws - I froze going to kernel
>> 2.6.9-34.0.2 everywhere else - and proceeded to pull the Dell
>> srvadmin RPM's everywhere (I know that openipmi is a kernel module,
>> and didn't want it to be a question mark).
>>
>> - No problems on the 2.6.9-34.0.2 boxes since I pulled openipmi and
>> the other rpm's.
>> - Still no problems with the 2.6.9-34.0.1 boxes. I have a few
>> vmware (esx) VM's that have gone to 2.6.9-34.0.2 without problem, but
>> no other physical servers.
>>
>> I'd like to put the Dell software back, because I like it, and I'm
>> not sure it's the culprit at all. But I'm a bit gunshy at the
>> moment, and like the fact that the filesystems aren't "locking up" on
>> me anymore. But I'm a bit at a loss to troubleshoot the problem
>> since there's nothing that can get logged when it happens. Logs up
>> until it happens didn't give me any indication of a pending problem.
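
One way around the "nothing gets logged" problem might be to send a
message off the box when a volume flips to read-only. The sketch below
is only an idea under assumptions: it reads /proc/mounts, treats any
ext3 mount with the "ro" option as affected, and forwards a line over
UDP syslog to a log host whose name you supply (hypothetical here). It
would have to be a process that is already running, and it may still
fail if the box is too far gone to do anything.

```python
import logging
import logging.handlers


def read_only_mounts(mounts_text, fstypes=("ext3",)):
    """Return (device, mountpoint) pairs mounted read-only.

    mounts_text is the content of /proc/mounts; each line is
    "device mountpoint fstype options dump pass".
    """
    affected = []
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue
        device, mountpoint, fstype, options = parts[:4]
        if fstype in fstypes and "ro" in options.split(","):
            affected.append((device, mountpoint))
    return affected


def report(loghost, mounts_path="/proc/mounts"):
    """Send one syslog line per read-only ext3 mount to loghost (UDP 514).

    loghost is a placeholder for a remote syslog server that is
    configured to accept network messages.
    """
    with open(mounts_path) as handle:
        affected = read_only_mounts(handle.read())
    if affected:
        logger = logging.getLogger("ro-watch")
        handler = logging.handlers.SysLogHandler(address=(loghost, 514))
        logger.addHandler(handler)
        try:
            for device, mountpoint in affected:
                logger.error("ext3 read-only: %s on %s", device, mountpoint)
        finally:
            logger.removeHandler(handler)
            handler.close()
    return affected
```

Calling report() in a loop with a sleep from a small daemon would be
the obvious way to use it. A volume that is deliberately mounted
read-only would also be reported, so the list may need filtering.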
>>
>> Thoughts? ideas?
>>
>> Jason
>> --
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> Jason Young -- Systems Manager, eXtension
>> http://about.extension.org/wiki/Jason_Young
>> ______________________________________
>> _______________________________________________
>> Linux-PowerEdge mailing list
>> Linux-PowerEdge at dell.com
>> http://lists.us.dell.com/mailman/listinfo/linux-poweredge
>> Please read the FAQ at http://lists.us.dell.com/faq
>>
>
>
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason Young -- Systems Manager, eXtension
http://about.extension.org/wiki/Jason_Young
______________________________________