Disconcerting journal commit I/O error RHEL+2.6.9-34.0.2

Jason Young jason.young at eXtension.org
Fri Aug 4 14:42:56 CDT 2006


I will also be taking a closer look at the Blackbird/Genesis firmware
fix mentioned in the other thread on the 1850/2850 RAID and RHEL4
(I also just got a notification about it for another system, through
an email sent to our buyer).

Jason



On Aug 4, 2006, at 3:33 PM, Jason Young wrote:

> I had it happen to me again yesterday on the primary box, the one it
> has been happening on the most (it triggered on a simple mkdir in a
> mounted logical volume for /export). OMSA 5.0 had already been removed, so
> combined with the other reports, I'm ruling OMSA 5.0 out (hooray -
> sorry, Dell folks, for the drive-by implication there).
>
> The other thing suggested to me by a campus colleague was that
> "LVM2 snapshots weren't ready for prime time" - my colleague didn't
> really point me to any hard data/bug reports to back that
> assertion, but I trust their judgement generally, and it's another
> thing that's slightly out of the "ordinary" WRT the filesystem that
> I'm grasping at. I had been backing this box up using LVM snapshots
> that were created and removed nightly; I'm stopping that today on
> the box and I'll see what happens.
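For readers unfamiliar with the pattern, a nightly snapshot backup cycle typically looks something like the sketch below. This is only an illustration, not the actual backup job from this box: the volume group, logical volume, snapshot name, snapshot size and mountpoint are all hypothetical, and the backup step itself is left as a comment. The function only builds the command lines; it does not run them.

```python
def snapshot_cycle(vg, lv, snap_name="nightly_snap", snap_size="1G",
                   mountpoint="/mnt/snap"):
    """Build the commands for one create/backup/remove snapshot cycle.

    All names and sizes are placeholders (assumptions), not values
    taken from the thread.
    """
    origin = f"/dev/{vg}/{lv}"
    snap = f"/dev/{vg}/{snap_name}"
    return [
        # Create a copy-on-write snapshot of the origin volume.
        ["lvcreate", "--snapshot", "--size", snap_size,
         "--name", snap_name, origin],
        # Mount the snapshot read-only so the backup sees a frozen view.
        ["mount", "-o", "ro", snap, mountpoint],
        # (the backup tool of choice would read from `mountpoint` here)
        ["umount", mountpoint],
        # Remove the snapshot once the backup has finished.
        ["lvremove", "-f", snap],
    ]


for cmd in snapshot_cycle("vg0", "export"):
    print(" ".join(cmd))
```

Dropping the `lvcreate`/`lvremove` pair from the nightly job is the change being tested here.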
>
> Are you using LVM snapshots at all?
>
> Jason
>
>
> On Aug 4, 2006, at 4:52 AM, Nicky Peeters wrote:
>
>> Well, I've just had one server go bongo on a similar issue.
>>
>> It's a PE 2850 with 6 disks in RAID10, running RHEL4 x86_64,
>> and it's the only machine I upgraded to kernel 2.6.9-34.0.2.ELsmp
>> (22 days ago).
>>
>> The FS seems to be read-only now, but since I can't get a root shell
>> running (bus errors) I need to schedule a datacenter trip to know more.
>>
>> Dmesg output:
>>
>> EXT3-fs error (device dm-1) in start_transaction: Journal has aborted
>> EXT3-fs error (device dm-1) in start_transaction: Journal has aborted
>> EXT3-fs error (device dm-1) in start_transaction: Journal has aborted
>> EXT3-fs error (device dm-1) in start_transaction: Journal has aborted
>> scsi0 (0:0): rejecting I/O to offline device
>> EXT3-fs error (device dm-1): ext3_find_entry: reading directory #15450113 offset 0
>>
>> scsi0 (0:0): rejecting I/O to offline device
>> EXT3-fs error (device dm-1): ext3_find_entry: reading directory #15450118 offset 0
>> ...
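When comparing output like this across several boxes, it can help to tally the messages per device. The following is a rough sketch that assumes the message wording matches the excerpt above exactly (other kernels may phrase these differently); the labels and the `summarize` helper are names made up for this example.

```python
import re
from collections import Counter

# Patterns based only on the dmesg lines quoted in this thread.
PATTERNS = {
    "journal_aborted": re.compile(
        r"EXT3-fs error \(device (?P<dev>[^)]+)\).*Journal has aborted"),
    "dir_read_error": re.compile(
        r"EXT3-fs error \(device (?P<dev>[^)]+)\): ext3_find_entry: "
        r"reading directory"),
    "offline_device": re.compile(
        r"(?P<dev>scsi\d+) \(\d+:\d+\): rejecting I/O to offline device"),
}


def summarize(dmesg_text):
    """Count matching messages, keyed by (label, device)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(label, match.group("dev"))] += 1
                break  # one label per line is enough
    return counts
```

Feeding it saved dmesg text, for example `summarize(open("dmesg.txt").read())` with a hypothetical file name, gives a quick view of whether the "offline device" rejections and the ext3 errors show up together.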
>>
>> Let me know if you know more; I'm beginning to suspect our kernels
>> are the culprits!
>>
>> On 28 Jul 2006, at 15:12, Jason Young wrote:
>>
>>> Hi all,
>>>
>>> Two weeks ago, right after the Red Hat Kernel update to
>>> 2.6.9-34.0.2.ELsmp (RHEL4) I started getting a journal commit I/O
>>> error on two of my servers with the srvadmin-all rpm's installed
>>> (version 5).
>>>
>>> One was a 2800 running WS, the other a 2850 running AS - both with the
>>> OEM PERC controllers that came with the servers. All firmware/BIOS
>>> updates are at the latest release versions available.
>>>
>>> The error came after a moderate amount of writes (either installing
>>> ruby on the freshly reinstalled 2850 or processing some webstats with
>>> awstats on the 2800) - and when the journal commit error occurs,
>>> every mounted volume goes read-only, which obviously wreaks havoc on
>>> the running operating system. The problem occurred twice on the
>>> 2800, and once on the 2850. It was not (yet) occurring on my other
>>> 2850s and 1850s - running RHEL4, WS and AS both - also with the
>>> version 5 srvadmin-all RPMs (and the srvadmin-rac4 RPMs where
>>> appropriate). Those were/are still running 2.6.9-34.0.1.
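One way to notice the read-only flip quickly is to check the kernel's view of the mount table. The sketch below parses text in the `/proc/mounts` format (device, mountpoint, filesystem type, options) and reports ext3 mounts whose options include `ro`. It is a minimal illustration: the `read_only_mounts` helper is a made-up name, and what to do with the result (and where to report it, given that local disks may no longer be writable) is left open.

```python
def read_only_mounts(mounts_text, fstypes=("ext3",)):
    """Return (device, mountpoint) pairs that are mounted read-only.

    `mounts_text` is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            hits.append((device, mountpoint))
    return hits
```

On a live system this would be called as `read_only_mounts(open("/proc/mounts").read())`.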
>>>
>>> My filesystem layout is a normal primary ext3 /boot, and the rest of the
>>> RAID (either all RAID5, or a two-disk RAID1 and a three-disk RAID5 on the
>>> six-drive 2850) is a PV with various-sized LVM2 logical volumes for
>>> /, /var, /home, etc.
>>>
>>> The problem freaked me out more than a little. The two servers it was
>>> happening on are not-yet-production, and obviously the last thing I
>>> needed was for the problem to spread to production systems. There are no
>>> logs, obviously, because /var goes read-only like everything else.
>>>
>>> Grasping at "what changed" straws, I held off on moving to kernel
>>> 2.6.9-34.0.2 everywhere else, and proceeded to pull the Dell
>>> srvadmin RPMs everywhere (I know that openipmi is a kernel module,
>>> and didn't want it to be a question mark).
>>>
>>> - No problems on the 2.6.9-34.0.2 boxes since I pulled openipmi and
>>> the other RPMs.
>>> - Still no problems with the 2.6.9-34.0.1 boxes. I have a few
>>> VMware (ESX) VMs that have gone to 2.6.9-34.0.2 without problems, but
>>> no other physical servers.
>>>
>>> I'd like to put the Dell software back, because I like it, and I'm
>>> not sure it's the culprit at all. But I'm a bit gun-shy at the
>>> moment, and like the fact that the filesystems aren't "locking up" on
>>> me anymore. I'm also a bit at a loss as to how to troubleshoot the
>>> problem, since nothing can get logged when it happens. Logs up
>>> until it happens didn't give me any indication of a pending problem.
>>>
>>> Thoughts?  ideas?
>>>
>>> Jason
>>> --
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Jason Young --  Systems Manager, eXtension
>>>   http://about.extension.org/wiki/Jason_Young
>>> ______________________________________
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Linux-PowerEdge mailing list
>>> Linux-PowerEdge at dell.com
>>> http://lists.us.dell.com/mailman/listinfo/linux-poweredge
>>> Please read the FAQ at http://lists.us.dell.com/faq
>>>
>>
>>
>
