Ubuntu 7.04 and PE SC1435
Ramiro Alba Queipo
raq at cttc.upc.edu
Thu Sep 18 10:15:27 CDT 2008
On Thu, 2008-09-18 at 15:22 +0100, Vanush "Misha" Paturyan wrote:
> Hi Ramiro,
> You need to provide more info on the setup:
> 1) are / and /home local to each node, or are they shared from a
> centralized location somehow?
They are both local to each node.
> 2) what does the jfs error look like (you only provided the ext3 error)
Sorry, I discarded this information. I'll provide it next time.
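To avoid losing the error text again, something like the following could be run on a node when it fails. This is only a minimal sketch: the function names (readonly_mounts, fs_error_lines) are hypothetical, the patterns are taken from the ext3 messages quoted in this thread, and a jfs pattern would have to be added once the actual jfs error is known.

```python
import re

# Patterns copied from the ext3 messages quoted in this thread.
# Assumption: extend this list once the real jfs error text is known.
FS_ERROR = re.compile(
    r"EXT3-fs error|Aborting journal|Remounting filesystem read-only"
)


def readonly_mounts(mounts_text, watch=("/", "/home")):
    """Return watched mount points whose mount options include 'ro'.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if mount_point in watch and "ro" in options:
            found.append(mount_point)
    return found


def fs_error_lines(log_text):
    """Return the kernel log lines that match a known filesystem error."""
    return [line for line in log_text.splitlines() if FS_ERROR.search(line)]
```

Feeding the contents of /proc/mounts to readonly_mounts and the saved dmesg output to fs_error_lines would give the lines worth posting to the list.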
Additional information:
1) uname -a > Linux jff202 2.6.20-16-server #2 SMP Tue Feb 12 02:16:56
UTC 2008 x86_64 GNU/Linux
2) I use InfiniBand as the only network (Ethernet is only used to install
nodes via PXE): OpenMPI communications (libibverbs1) and rsh access to
the compute nodes via IP over IB. In the near future I intend to use
NFS-RDMA over InfiniBand to share the /home space, install a batch queue
system, and disallow direct access to the compute nodes.
Would Ubuntu 8.04 solve the problem (It is reported on Ubuntu Web as
If this were the case, I would prepare automated installations based
on Ubuntu 8.04 amd64.
Are you using Ubuntu with this hardware?
> On 18 Sep 2008, at 11:23, Ramiro Alba Queipo wrote:
> > Hello everybody:
> > We have an InfiniBand cluster built from PE SC1435 servers under
> > Ubuntu 7.04 and using OpenMPI 1.2.5 (not in the official
> > distribution) with Mellanox InfiniBand cards of 20 Gb/s
> > (MT25204 [InfiniHost III Lx HCA]).
> > Both hardware tests (full DELL diagnostics) and software tests (hpl,
> > NPG-MI NAS) seem to be OK, but every now and then the / and/or /home
> > file systems are remounted read only by the system with many files
> > corrupted. Then, the system must be reinstalled.
> > I tried both jfs and ext3 file systems, but the results are similar.
> > In the case of ext3 I got:
> > [46509.378381] EXT3-fs error (device sda1): htree_dirblock_to_tree: bad entry in directory #99737: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> > [46509.378494] Aborting journal on device sda1.
> > [46509.378722] Remounting filesystem read-only
> > This node was reinstalled from scratch the same day it failed.
> > I am quite confused: if this is not a hardware failure (the DELL
> > diagnostics are OK), how can a user process corrupt the / file
> > system?
> > Any comment/advice would be much appreciated.
> > Thanks in advance
> > Regards
> > --
> > This message has been scanned by MailScanner
> > for viruses and other dangerous content,
> > and is believed to be clean.
> > For all your IT requirements visit: http://www.transtec.co.uk
> > _______________________________________________
> > Linux-PowerEdge mailing list
> > Linux-PowerEdge at dell.com
> > http://lists.us.dell.com/mailman/listinfo/linux-poweredge
> > Please read the FAQ at http://lists.us.dell.com/faq
> Vanush "Misha" Paturyan
> Senior Technical Officer
> Computer Science Department
> NUI Maynooth