FW: T410 Network Failure

Ryan Pugatch rpug at tripadvisor.com
Tue Sep 1 16:02:31 CDT 2009


(Sorry, resending this as I originally sent it from the wrong email address.)

I just checked the logs again, and the copper link down message has not
appeared every time there was a problem, so it may not be related.

Ryan
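
For anyone wanting to repeat the log check described above, here is a minimal
sketch of one way to pull the link-down messages out of a syslog file so their
timestamps can be compared against the times of the outages. The message text
is the one quoted in this thread; the syslog prefix layout (timestamp, then
hostname) and the function name link_down_events are assumptions made for
illustration, not something taken from the original posts.

```python
import re

# The message text comes from the log line quoted in this thread.
# The prefix (month, day, time, hostname) is the common syslog layout
# and is an assumption; adjust the pattern if your logs differ.
LINK_DOWN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"bnx2: (?P<iface>\S+) NIC Copper Link is Down"
)


def link_down_events(lines):
    """Return (timestamp, interface) pairs for each bnx2 link-down message."""
    events = []
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            events.append((match.group("stamp"), match.group("iface")))
    return events
```

The function accepts any iterable of lines, so an open file handle for the
system log (typically /var/log/messages on CentOS 5) can be passed in directly.
If outages show up at times with no matching event, that would support the
observation that the link-down message is not tied to every failure.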


Ryan Pugatch wrote:
> FWIW, I am also having the same issue with some R710s.  They are part 
> of a hadoop cluster.  Interestingly, only 2 of the 3 servers in that 
> cluster have experienced the issue so far.  We also run our corporate 
> mail server on an R710, and that has not shown any problems yet (except 
> for a weird issue where outgoing TCP connections would intermittently 
> fail until we restarted the network interfaces; not sure if this is 
> related, as it has only happened once).
> 
> We are running CentOS 5.3.  All three hadoop machines are running 
> 2.6.18-128.2.1.el5 and the mail server is running 2.6.18-128.1.10.el5
> 
> It seems that when the network would drop it would log:
> 	
> kernel: bnx2: eth0 NIC Copper Link is Down
> 
> I am not sure that the disable_msi option will fix the two hadoop 
> machines having the issue, as the problem happens somewhat randomly and 
> is not easily reproducible.  That being said, we are no longer getting 
> some network-related errors in our hadoop logs that we had been getting 
> previously, so I suspect that is a good sign.  Time will tell!
> 
> Is this issue related to the 2.6.28-rc3 regression described here? 
> http://lkml.indiana.edu/hypermail/linux/kernel/0811.0/01374.html
> 
> I am hoping a fix will make its way to RHEL and downstream to CentOS 
> (has anyone heard if that is happening?  I'm having trouble finding a 
> Red Hat or CentOS bug logged).
> 
> Are there any performance concerns with using disable_msi?  I know that 
> the driver from Dell.com should fix the problem but I'd prefer to use a 
> driver provided from upstream.
> 
> Ryan Pugatch
> Systems Administrator, TripAdvisor
> 
> 
> Narendra_K at dell.com wrote:
>> Hello,
>>
>> Thanks, this info is of great help.
>>
>> With regards,
>> Narendra K 
>>
>> -----Original Message-----
>> From: daryl herzmann [mailto:akrherz at iastate.edu] 
>> Sent: Thursday, August 13, 2009 7:07 PM
>> To: K, Narendra
>> Cc: Biligiri, Raghavendra; linux-poweredge-Lists
>> Subject: RE: FW: T410 Network Failure
>>
>> On Thu, 13 Aug 2009, Narendra_K at Dell.com wrote:
>>
>>> Thanks. The top output need not be from the time of failure. It can be 
>>> from any time, just to get an idea of the resource utilization so that 
>>> we can replicate it. Also, some general high-level detail about the 
>>> database you are using, e.g. is it an Oracle database?
>> It is running PostgreSQL 8.4.  sar reports that the average CPU
>> utilization for today is 0.44%.  10% of memory is used.  Network
>> utilization is only a few kbps.  I suspect that when the failures occurred,
>> the machine got hit with a few hundred postgresql connections at once,
>> but I have no way to prove it.
>>
>> sorry again,
>>   daryl
>>
>> _______________________________________________
>> Linux-PowerEdge mailing list
>> Linux-PowerEdge at lists.us.dell.com
>> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
>> Please read the FAQ at http://lists.us.dell.com/faq
> 


