FW: T410 Network Failure

Ryan Pugatch rpug at tripadvisor.com
Tue Sep 1 15:43:32 CDT 2009


FWIW, I am also having the same issue with some R710s.  They are part 
of a hadoop cluster.  Interestingly, only 2 of the 3 servers in that 
cluster have experienced the issue so far.  We also run our corporate 
mail server on an R710, and that has not shown any problems yet (except 
for a weird issue where outgoing TCP connections would intermittently 
fail until we restarted the network interfaces.  I'm not sure if this 
is related; it has only happened once).

We are running CentOS 5.3.  All three hadoop machines are running 
2.6.18-128.2.1.el5 and the mail server is running 2.6.18-128.1.10.el5

It seems that whenever the network dropped, the kernel logged:
	
kernel: bnx2: eth0 NIC Copper Link is Down
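
Since the drops are random, it may help to count how often that message shows up per interface. Here is a minimal sketch, assuming the syslog lines contain the message exactly as quoted above; the function name count_link_down is made up for illustration, and it takes any iterable of log lines:

```python
import re
from collections import Counter

# Matches the bnx2 message quoted above and captures the interface name.
LINK_DOWN = re.compile(r"bnx2: (\S+) NIC Copper Link is Down")

def count_link_down(lines):
    """Return a Counter mapping interface name to number of link-down messages."""
    counts = Counter()
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

On CentOS this could be fed from the syslog file, e.g. count_link_down(open("/var/log/messages")), assuming kernel messages are logged there.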

I'm not sure yet whether the disable_msi option will fix the two hadoop 
machines having the issue, since the problem happens somewhat randomly 
and is not easily reproducible.  That being said, we are no longer 
getting some network-related errors in our hadoop logs that we had been 
getting previously, so I suspect that is a good sign.  Time will tell!
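
One way to confirm that disable_msi actually took effect is to look at the interrupt type listed for the NIC in /proc/interrupts. A rough sketch follows, assuming the interrupt type column contains the string "MSI" when MSI is in use (the exact label varies by kernel version); the helper name irq_uses_msi is hypothetical:

```python
def irq_uses_msi(interrupts_text, iface):
    """Check the /proc/interrupts line for iface.

    Returns True if that line mentions MSI, False if it does not,
    and None if no line lists the interface at all.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Device names sit at the end of the line; shared IRQs are
        # comma-separated, so strip trailing commas before comparing.
        names = [field.rstrip(",") for field in fields]
        if iface in names:
            return any("MSI" in field for field in fields)
    return None
```

For example, irq_uses_msi(open("/proc/interrupts").read(), "eth0") should return False once the driver has been reloaded with MSI disabled, if the assumption about the label holds.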

Is this issue related to the 2.6.28-rc3 regression described here? 
http://lkml.indiana.edu/hypermail/linux/kernel/0811.0/01374.html

I am hoping a fix will make its way to RHEL and downstream to CentOS 
(has anyone heard if that is happening?  I'm having trouble finding a 
Red Hat or CentOS bug logged).

Are there any performance concerns with using disable_msi?  I know that 
the driver from Dell.com should fix the problem, but I'd prefer to use a 
driver provided by upstream.

Ryan Pugatch
Systems Administrator, TripAdvisor


Narendra_K at dell.com wrote:
> Hello,
> 
> Thanks, this info is of great help.
> 
> With regards,
> Narendra K 
> 
> -----Original Message-----
> From: daryl herzmann [mailto:akrherz at iastate.edu] 
> Sent: Thursday, August 13, 2009 7:07 PM
> To: K, Narendra
> Cc: Biligiri, Raghavendra; linux-poweredge-Lists
> Subject: RE: FW: T410 Network Failure
> 
> On Thu, 13 Aug 2009, Narendra_K at Dell.com wrote:
> 
>> Thanks. Top output need not be at the time of failure. It can be any 
>> time, just to get an idea of what the resource utilization is so that 
>> we can replicate it. And general high-level detail about the database 
>> you are using, e.g. is it an Oracle database?
> 
> It is running PostgreSQL 8.4.  sar reports that the average CPU
> utilization for today is 0.44%.  10% of memory is used.  Network
> utilization is only a few kbps.  I suspect when the failures occurred,
> the machine got hit with a few hundred postgresql connections at once,
> but I have no way to prove it.
> 
> sorry again,
>   daryl
> 
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge at lists.us.dell.com
> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq


