Ubuntu 7.04 and PE SC1435

Ramiro Alba Queipo raq at cttc.upc.edu
Thu Sep 18 10:15:27 CDT 2008


On Thu, 2008-09-18 at 15:22 +0100, Vanush "Misha" Paturyan wrote:
> Hi Ramiro,
> 
> You need to provide more info on the setup:
> 1) are / and /home local to each node, or are they shared from a
> centralized location somehow?

They are both local to each node.
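
Since both file systems are local, each node can be checked on its own for the read-only remount. What follows is only a minimal sketch, not something run on this cluster: the function name read_only_mounts and the default list of watched mount points are illustrative assumptions, and the input is assumed to be text in the standard /proc/mounts layout (device, mount point, type, options, ...).

```python
def read_only_mounts(mounts_text, watched=("/", "/home")):
    """Return the watched mount points whose option list contains 'ro'.

    mounts_text is expected to hold the contents of /proc/mounts.
    The function name and the 'watched' default are illustrative assumptions.
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # A mount point can be listed more than once; report it only once.
        if mount_point in watched and "ro" in options and mount_point not in flagged:
            flagged.append(mount_point)
    return flagged
```

On a node, the text would come from reading /proc/mounts, for example open("/proc/mounts").read(), and a non-empty result would mean that / or /home has been remounted read-only.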

> 2) what does jfs error look like (you only provided ext3 error)

Sorry, I discarded this information. I'll provide it next time.
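
To avoid losing the messages again, the relevant kernel log lines could be filtered out and saved before a node is reinstalled. This is a rough sketch under stated assumptions: the three patterns are taken from the ext3 excerpt quoted further down, the function name filesystem_error_lines is illustrative, and the JFS message format is not known here, so JFS-specific patterns would still have to be added.

```python
import re

# Patterns taken from the ext3 messages quoted in this thread.
# JFS messages may look different; extending this list is left as an assumption.
FS_ERROR_PATTERN = re.compile(
    r"fs error|Aborting journal|Remounting filesystem read-only",
    re.IGNORECASE,
)


def filesystem_error_lines(log_text):
    """Return the lines of a kernel log that match the patterns above."""
    return [line for line in log_text.splitlines() if FS_ERROR_PATTERN.search(line)]
```

The input could be the saved output of dmesg; writing the returned lines to a file on another machine would keep them available after the reinstall.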

Additional information:

1) uname -a: Linux jff202 2.6.20-16-server #2 SMP Tue Feb 12 02:16:56 UTC
2008 x86_64 GNU/Linux

2) I use InfiniBand as the only network (Ethernet is used only to install
nodes via PXE): OpenMPI communications (libibverbs1) and rsh access to
the calculation nodes via IP over IB. In the near future I intend to use
NFS-RDMA over InfiniBand to share the /home space, install a batch queue
system, and disallow direct access to the calculation nodes.

Would Ubuntu 8.04 solve the problem? This hardware is listed as
validated on the Ubuntu website:
http://webapps.ubuntu.com/certification/hardware/200712-182/

If this were the case, I would prepare automated installations based
on Ubuntu 8.04 amd64.

Are you using Ubuntu with this hardware?

Regards

> 
> Cheers,
> 
> Misha.
> 
> 
> On 18 Sep 2008, at 11:23, Ramiro Alba Queipo wrote:
> 
> > Hello everybody:
> >
> > We have an InfiniBand cluster built from PE SC1435 servers running
> > Ubuntu 7.04 and using OpenMPI 1.2.5 (not in the official distribution)
> > with 20 Gb/s Mellanox InfiniBand cards (MT25204 [InfiniHost III Lx HCA]).
> >
> > Both hardware tests (full DELL diagnostics) and software tests (hpl,
> > NPG-MI NAS) seem to be OK, but every now and then the / and/or /home
> > file systems are remounted read-only by the system, with many files
> > corrupted. The system must then be reinstalled.
> > I tried both the jfs and ext3 file systems, but the results are similar.
> > In the case of ext3 I got:
> >
> > [46509.378381] EXT3-fs error (device sda1): htree_dirblock_to_tree: bad entry in directory #99737: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> > [46509.378494] Aborting journal on device sda1.
> > [46509.378722] Remounting filesystem read-only
> >
> > This node was reinstalled from scratch the same day it failed.
> >
> > I am quite confused: if this is not a hardware failure (the DELL
> > diagnostics are OK), how can a user process corrupt the / file system?
> >
> > Any comment or advice would be much appreciated.
> >
> > Thanks in advance
> >
> > Regards
> >
> >
> > -- 
> > This message has been scanned by MailScanner
> > for viruses and other dangerous content,
> > and is believed to be clean.
> > For all your IT requirements visit: http://www.transtec.co.uk
> >
> > _______________________________________________
> > Linux-PowerEdge mailing list
> > Linux-PowerEdge at dell.com
> > http://lists.us.dell.com/mailman/listinfo/linux-poweredge
> > Please read the FAQ at http://lists.us.dell.com/faq
> 
> Vanush "Misha" Paturyan
> Senior Technical Officer
> Computer Science Department
> NUI Maynooth

