FW: T410 Network Failure

Narendra_K at Dell.com
Thu Sep 3 11:30:04 CDT 2009


Hello,

The issue is not seen when the driver from support.dell.com (version
1.8.7b) is used. If this driver is used, there is no need to pass
disable_msi=1.

The native RHEL 5.3 driver should be loaded with disable_msi=1 to avoid
the issue.
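
For anyone unsure whether the option is actually set: module options on
RHEL 5.x are typically given on a line such as "options bnx2 disable_msi=1"
in /etc/modprobe.conf. The file location and the helper name below are
assumptions and not something stated in this thread; bnx2 is the driver
named in the quoted log line. A minimal Python sketch that checks
modprobe-style configuration text for that setting might look like this:

```python
# Hypothetical helper (not from this thread): reports whether
# modprobe-style configuration text sets disable_msi=1 for bnx2.
def bnx2_msi_disabled(conf_text):
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        fields = line.split()
        # Expect: options <module> <param>=<value> ...
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == "bnx2":
            if "disable_msi=1" in fields[2:]:
                return True
    return False


if __name__ == "__main__":
    # Assumed path; adjust for your distribution.
    with open("/etc/modprobe.conf") as f:
        print(bnx2_msi_disabled(f.read()))
```

This only inspects the configuration text. It does not confirm that the
running module was loaded with the option, so the driver still has to be
reloaded (or the system rebooted) after changing the file.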

With regards,
Narendra K

>-----Original Message-----
>From: Ryan Pugatch [mailto:rpug at tripadvisor.com] 
>Sent: Thursday, September 03, 2009 8:48 PM
>To: K, Narendra
>Cc: akrherz at iastate.edu; linux-poweredge-Lists
>Subject: Re: FW: T410 Network Failure
>
>Is this issue fixed by using the new driver from 
>support.dell.com and NOT having disable_msi?
>
>Thanks,
>
>Ryan
>
>Narendra_K at Dell.com wrote:
>> Hello,
>> 
>> Yes, there might not be a link down message every time this issue is 
>> seen. In the failed state, you cannot ping the system and you cannot 
>> ping from the system. With disable_msi=1 we have not seen the issue. 
>> When the issue occurs, apart from the system becoming unreachable, 
>> there might not be any logs in dmesg or syslog. The issue is not seen 
>> with the upstream kernel. Dell and Red Hat are working on this, and we 
>> should know soon what is going on.
>> 
>> With regards,
>> Narendra K
>> 
>>> -----Original Message-----
>>> From: Ryan Pugatch [mailto:rpug at tripadvisor.com]
>>> Sent: Wednesday, September 02, 2009 2:33 AM
>>> To: K, Narendra
>>> Cc: akrherz at iastate.edu; linux-poweredge-Lists
>>> Subject: Re: FW: T410 Network Failure
>>>
>>> (sorry, resent as I sent from wrong email originally)
>>>
>>> Just checked logs again and the copper link down message hasn't 
>>> happened every time there was a problem, so that may not be related.
>>>
>>> Ryan
>>>
>>>
>>> Ryan Pugatch wrote:
>>>> FWIW, I am also having the same issue with some R710s. They are
>>>> part of a hadoop cluster. Interestingly enough, only 2 out of the
>>>> 3 servers in that cluster have experienced the issue so far. We
>>>> also run our corporate mail server on an R710 and that has not shown
>>>> any problems yet (except for a weird issue where outgoing TCP
>>>> connections would intermittently fail until we restarted the network
>>>> interfaces; not sure if this is related, as it has only happened once).
>>>>
>>>> We are running CentOS 5.3. All three hadoop machines are running
>>>> 2.6.18-128.2.1.el5 and the mail server is running
>>>> 2.6.18-128.1.10.el5.
>>>>
>>>> It seems that when the network would drop it would log:
>>>> 	
>>>> kernel: bnx2: eth0 NIC Copper Link is Down
>>>>
>>>> Not sure that the disable_msi option will fix the two hadoop machines
>>>> having the issue, as the problem happens somewhat randomly and is not
>>>> easily reproducible. That being said, we are no longer getting some
>>>> network-related errors in our hadoop logs that we had been getting
>>>> previously, so I suspect that is a good sign. Time will tell!
>>>>
>>>> Is this issue related to the 2.6.28-rc3 regression specified here? 
>>>> http://lkml.indiana.edu/hypermail/linux/kernel/0811.0/01374.html
>>>>
>>>> I am hoping a fix will make its way to RHEL and downstream to CentOS
>>>> (has anyone heard if that is happening? I'm having trouble finding a
>>>> Red Hat or CentOS bug logged).
>>>>
>>>> Are there any performance concerns with using disable_msi? I know
>>>> that the driver from Dell.com should fix the problem but I'd prefer to
>>>> use a driver provided from upstream.
>>>>
>>>> Ryan Pugatch
>>>> Systems Administrator, TripAdvisor
>>>>
>>>>
>>>> Narendra_K at dell.com wrote:
>>>>> Hello,
>>>>>
>>>>> Thanks, this info is of great help.
>>>>>
>>>>> With regards,
>>>>> Narendra K
>>>>>
>>>>> -----Original Message-----
>>>>> From: daryl herzmann [mailto:akrherz at iastate.edu]
>>>>> Sent: Thursday, August 13, 2009 7:07 PM
>>>>> To: K, Narendra
>>>>> Cc: Biligiri, Raghavendra; linux-poweredge-Lists
>>>>> Subject: RE: FW: T410 Network Failure
>>>>>
>>>>> On Thu, 13 Aug 2009, Narendra_K at Dell.com wrote:
>>>>>
>>>>>> Thanks. Top output need not be at the time of failure. It can be any
>>>>>> time, just to get an idea of what the resource utilization is so that
>>>>>> we can replicate it. And general high-level detail about the
>>>>>> database you are using, e.g. is it an Oracle database?
>>>>> It is running PostgreSQL 8.4. sar reports that the average CPU
>>>>> utilization for today is 0.44%. 10% of memory is used. Network
>>>>> utilization is only a few kbps. I suspect when the failures occurred,
>>>>> the machine got hit with a few hundred postgresql connections at
>>>>> once, but I have no way to prove it.
>>>>>
>>>>> sorry again,
>>>>>   daryl
>>>>>
>>>>> _______________________________________________
>>>>> Linux-PowerEdge mailing list
>>>>> Linux-PowerEdge at lists.us.dell.com
>>>>> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
>>>>> Please read the FAQ at http://lists.us.dell.com/faq
>>>
>
>


