FW: T410 Network Failure

Narendra_K at Dell.com Narendra_K at Dell.com
Thu Sep 3 11:30:04 CDT 2009


The issue is not seen when the driver from support.dell.com (version
1.8.7b) is used. If this driver is used, there is no need to pass
disable_msi=1.

The RHEL 5.3 native driver should be loaded with disable_msi=1 to avoid
the issue.
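
As a rough illustration of the workaround above, the following Python
sketch checks whether a modprobe configuration already passes
disable_msi=1 to the driver. It assumes the driver module is bnx2 (the
name that appears in the kernel log line quoted later in this thread) and
that the option is set through an "options" line in a modprobe
configuration file such as /etc/modprobe.conf. The helper name and the
sample text are made up for the example.

```python
def module_option(conf_text, module, option):
    """Return the value given to `option` for `module`, or None.

    Only looks at lines of the form: options <module> key=value ...
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        fields = line.split()
        if len(fields) < 3 or fields[0] != "options" or fields[1] != module:
            continue
        for item in fields[2:]:
            key, sep, val = item.partition("=")
            if sep and key == option:
                value = val  # keep the last value seen in the text
    return value


# Hypothetical file content, for illustration only.
sample = """\
alias eth0 bnx2
options bnx2 disable_msi=1
"""

if __name__ == "__main__":
    print(module_option(sample, "bnx2", "disable_msi"))
```

Module parameters are read when the module is loaded, so a change to the
configuration would only take effect after the driver is reloaded or the
system is rebooted.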

With regards,
Narendra K

>-----Original Message-----
>From: Ryan Pugatch [mailto:rpug at tripadvisor.com] 
>Sent: Thursday, September 03, 2009 8:48 PM
>To: K, Narendra
>Cc: akrherz at iastate.edu; linux-poweredge-Lists
>Subject: Re: FW: T410 Network Failure
>Is this issue fixed by using the new driver from 
>support.dell.com and NOT having disable_msi?
>Narendra_K at Dell.com wrote:
>> Hello,
>> Yes, there might not be a link down message every time this issue is
>> seen. In a failed state, you cannot ping the system and you cannot
>> ping from the system. With disable_msi=1 we have not seen the issue.
>> When the issue occurs, apart from the system becoming unreachable,
>> there might not be any logs in dmesg or syslog. The issue is not seen
>> with the upstream kernel. Dell and Red Hat are working on this, and we
>> should know soon what is going on.
>> With regards,
>> Narendra K
>>> -----Original Message-----
>>> From: Ryan Pugatch [mailto:rpug at tripadvisor.com]
>>> Sent: Wednesday, September 02, 2009 2:33 AM
>>> To: K, Narendra
>>> Cc: akrherz at iastate.edu; linux-poweredge-Lists
>>> Subject: Re: FW: T410 Network Failure
>>> (sorry, resent as I sent from wrong email originally)
>>> Just checked logs again and the copper link down message hasn't 
>>> happened every time there was a problem, so that may not be related.
>>> Ryan
>>> Ryan Pugatch wrote:
>>>> FWIW, I am also having the same issue with some R710s. They are
>>>> part of a hadoop cluster. Interestingly enough, so far only 2 out of
>>>> the 3 servers in that cluster have experienced the issue. We
>>>> also run our corporate mail server on an R710 and that has not shown
>>>> any problems yet (except for a weird issue where outgoing TCP
>>>> connections would intermittently fail until we restarted the
>>>> network interfaces; not sure if this is related, as it has only
>>>> happened once).
>>>> We are running CentOS 5.3. All three hadoop machines are running
>>>> 2.6.18-128.2.1.el5 and the mail server is running
>>>> 2.6.18-128.1.10.el5.
>>>> It seems that when the network would drop it would log:
>>>> kernel: bnx2: eth0 NIC Copper Link is Down
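
To check how often that message appears, a short Python sketch like the
one below can count the matching lines per interface. The regular
expression is modeled only on the single log line quoted above; the
function name and the sample lines are invented for illustration. As
noted elsewhere in this thread, the failure can also occur without any
such message being logged, so a count of zero does not rule the problem
out.

```python
import re
from collections import Counter

# Pattern modeled on the one line quoted in this thread:
#   kernel: bnx2: eth0 NIC Copper Link is Down
LINK_DOWN = re.compile(r"bnx2: (\S+) NIC Copper Link is Down")


def count_link_down(lines):
    """Count link-down messages per interface name."""
    counts = Counter()
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines; only the message text comes from the thread.
sample = [
    "kernel: bnx2: eth0 NIC Copper Link is Down",
    "kernel: some unrelated message",
    "kernel: bnx2: eth0 NIC Copper Link is Down",
]

if __name__ == "__main__":
    print(dict(count_link_down(sample)))
```
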
>>>> Not sure that the disable_msi option will fix the two hadoop machines
>>>> having the issue, as the problem happens somewhat randomly and is not
>>>> easily reproducible. That being said, we are no longer getting some
>>>> network-related errors in our hadoop logs that we had been getting
>>>> previously, so I suspect that is a good sign. Time will tell!
>>>> Is this issue related to the 2.6.28-rc3 regression specified here? 
>>>> http://lkml.indiana.edu/hypermail/linux/kernel/0811.0/01374.html
>>>> I am hoping a fix will make its way to RHEL and downstream to CentOS
>>>> (has anyone heard if that is happening? I'm having trouble finding a
>>>> Red Hat or CentOS bug logged).
>>>> Are there any performance concerns with using disable_msi? I know
>>>> that the driver from Dell.com should fix the problem but I'd prefer to
>>>> use a driver provided from upstream.
>>>> Ryan Pugatch
>>>> Systems Administrator, TripAdvisor
>>>> Narendra_K at dell.com wrote:
>>>>> Hello,
>>>>> Thanks, this info is of great help.
>>>>> With regards,
>>>>> Narendra K
>>>>> -----Original Message-----
>>>>> From: daryl herzmann [mailto:akrherz at iastate.edu]
>>>>> Sent: Thursday, August 13, 2009 7:07 PM
>>>>> To: K, Narendra
>>>>> Cc: Biligiri, Raghavendra; linux-poweredge-Lists
>>>>> Subject: RE: FW: T410 Network Failure
>>>>> On Thu, 13 Aug 2009, Narendra_K at Dell.com wrote:
>>>>>> Thanks. Top output need not be at the time of failure. It can be any
>>>>>> time, just to get an idea of the resource utilization so that
>>>>>> we can replicate it. Also, some general high-level detail about the
>>>>>> database you are using would help, e.g. is it an Oracle database?
>>>>> It is running PostgreSQL 8.4. sar reports that the average CPU
>>>>> utilization for today is 0.44%. 10% of memory is used. Network
>>>>> utilization is only a few kbps. I suspect when the failures occurred,
>>>>> the machine got hit with a few hundred postgresql connections at
>>>>> once, but I have no way to prove it.
>>>>> sorry again,
>>>>>   daryl
>>>>> _______________________________________________
>>>>> Linux-PowerEdge mailing list
>>>>> Linux-PowerEdge at lists.us.dell.com
>>>>> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
>>>>> Please read the FAQ at http://lists.us.dell.com/faq
