Megamon produces "too many open files" prob after upgrade to Debian lenny

Timo Veith tv at rz-zw.fh-kl.de
Fri Sep 4 03:08:59 CDT 2009


Hi all,

I just wanted to ask whether anybody has run into a similar problem. I
recently upgraded a server from Debian etch to lenny. The upgrade went
fine, but after a week or so I could not log in, neither via SSH nor
directly on the console. The error was "BASH: Too many open files"
(quoting from memory). The three-finger salute could not bring the
system down safely, so I had to hard power off the system.

At first I thought it was some sort of misconfiguration that I had made
during normal admin work. Nobody is perfect. I reverted my changes, but
after another week or so the same thing happened again.

I found a lot of these log messages:

kernel: [806899.612547] VFS: file-max limit 50215 reached

I googled around for a long time and found a way to raise the kernel
file limit, either with "echo XYZ > /proc/sys/fs/file-max" or via
sysctl. I also watched the value in /proc/sys/fs/file-nr over time. It
was rising continuously.
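
For anyone who wants to watch the same counter, here is a minimal
sketch in Python. It assumes the usual three-field layout of
/proc/sys/fs/file-nr (allocated handles, allocated but unused handles,
maximum). The names parse_file_nr, read_file_nr, usage_ratio and watch
are made up for this example and do not come from any existing tool.

```python
import time
from typing import NamedTuple


class FileNr(NamedTuple):
    allocated: int  # file handles currently allocated
    unused: int     # allocated but unused handles
    maximum: int    # system-wide limit (same value as file-max)


def parse_file_nr(text: str) -> FileNr:
    """Parse the three whitespace-separated integers from file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return FileNr(allocated, unused, maximum)


def read_file_nr(path: str = "/proc/sys/fs/file-nr") -> FileNr:
    """Read and parse the current counters from the kernel."""
    with open(path) as fh:
        return parse_file_nr(fh.read())


def usage_ratio(nr: FileNr) -> float:
    """Fraction of the system-wide limit that is currently allocated."""
    return nr.allocated / nr.maximum


def watch(samples: int, interval: float = 60.0) -> None:
    """Print one timestamped sample per interval so a steady rise shows up."""
    for _ in range(samples):
        nr = read_file_nr()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} allocated={nr.allocated} max={nr.maximum} "
              f"({usage_ratio(nr):.1%})")
        time.sleep(interval)
```

Calling watch(60, 60.0) would print one sample per minute for an hour.
A value that only ever climbs, as it did here, points to a leak rather
than to a limit that is simply too low.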

Because I had to hard power off the system, I ran a filesystem check. I
found that /var was damaged, so I had to stop all daemons and unmount
it for repair. I also stopped megamon. That was the moment when I
suddenly saw the number of open files in file-nr fall back to
"normal".

OK, then I thought there might be some incompatibility between lenny's
PERC driver and megamon. I straced MegaServ, but as I am not good at
interpreting the output of strace, I just decided that I needed an
alternative to megamon. I found megactl (a SourceForge project) through
another post on this list and am now running it from a cron job.
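
A simpler check than strace is to count the entries under
/proc/<pid>/fd for each process and see which one keeps growing. The
following is only a sketch in Python: the function names are invented
for this example, it must run as root to see other users' processes,
and it only counts descriptors held by user-space processes, so it
would not show handles that are leaked inside the kernel.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open descriptors of pid, or None if unreadable."""
    try:
        return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))
    except OSError:
        # Process has exited or we lack permission to look at it.
        return None


def process_name(pid, proc_root="/proc"):
    """Return the first word of the command line (empty for kernel threads)."""
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline")) as fh:
            return fh.read().split("\0")[0]
    except OSError:
        return ""


def top_fd_users(limit=10, proc_root="/proc"):
    """Return (fd_count, pid, name) tuples, largest fd_count first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/sys
        pid = int(entry)
        count = count_open_fds(pid, proc_root)
        if count is not None:
            rows.append((count, pid, process_name(pid, proc_root)))
    rows.sort(reverse=True)
    return rows[:limit]


def print_report(limit=10):
    """Print the processes holding the most descriptors."""
    for count, pid, name in top_fd_users(limit):
        print(f"{count:8d}  {pid:6d}  {name}")
```

Running print_report() a few times over several hours should make a
leaking daemon stand out, because its count keeps rising while the
others stay roughly flat.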

The server is a PowerEdge 1750. It has a PERC4/Di with one logical drive
(RAID1 with a Maxtor 34 GB and a Seagate 69 GB disk). I know I lose
space, but that hasn't been a problem so far. Dellmgr says that the RAID
status is optimal, but the Maxtor disk has one media error. The
consistency check from dellmgr reported nothing. megactl also reports
media error(s), but I am not sure how serious that is. I know that the
media error has been there for a long time, because every time I had to
reboot, I got those well-known emails from megamon.

Here is the output of megarpt:


megaraid health check
---------------------
a0       PERC 4/Di                bios:B109 fw:4.04 chan:2 ldrv:1 rbld:30% batt:good
a0c0t0     MAXTOR ATLAS10K4_36SCA     33GiB  a0d0  online   errs: media:1  other:0
     write errors: corr:  0    delay: 10    rewrit:  0    tot/corr:  0    tot/uncorr:  0
      read errors: corr:  9Mi  delay:  0    reread:  1    tot/corr:  0    tot/uncorr:  1
    verify errors: corr:  0    delay:  0    revrfy:  0    tot/corr:  0    tot/uncorr:  0
    temperature: current:24C threshold:0C


megaraid configuration
----------------------
a0       PERC 4/Di                bios:B109 fw:4.04 chan:2 ldrv:1 rbld:30% batt:good
a0d0       33GiB RAID 1   1x2  optimal
       row  0:  a0c0t0    a0c0t1


--------------
megarpt version: 0.3
megactl version: 0.4.1




Could the "corr:  9Mi" in the read errors line mean that 9 million read
errors have been corrected? Could that somehow be related to the problem
above?

Kind regards,
Timo



More information about the Linux-PowerEdge mailing list