FW: T410 Network Failure
Narendra_K at Dell.com
Thu Sep 3 11:30:04 CDT 2009
Hello,
The issue is not seen when the driver from support.dell.com (version
1.8.7b) is used. With that driver, there is no need to pass
disable_msi=1.
The native RHEL 5.3 driver should be loaded with disable_msi=1 to avoid
the issue.
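
On RHEL 5 a module option such as this is normally set with a line like
"options bnx2 disable_msi=1" in /etc/modprobe.conf, followed by a reload
of the driver. One way to check whether the option took effect is to look
at the interrupt type listed for the interface in /proc/interrupts. The
following is only a minimal sketch, not a Dell or Red Hat tool: the
function name uses_msi is hypothetical, the interface is assumed to be
eth0, and the interrupt type column is assumed to start with "PCI-MSI"
when MSI is in use.

```python
def uses_msi(proc_interrupts_text, device="eth0"):
    """Report whether `device` appears on a PCI-MSI line of /proc/interrupts.

    Returns True or False for the first line that names the device, or
    None when no line names it. The "PCI-MSI" prefix is an assumption
    about how the kernel labels MSI interrupts in that file.
    """
    for line in proc_interrupts_text.splitlines():
        # Device names can be comma separated when an IRQ is shared.
        tokens = [t.rstrip(",") for t in line.split()]
        if device in tokens:
            return any(t.startswith("PCI-MSI") for t in tokens)
    return None
```

To use it on a live system, pass it the contents of the file, for
example uses_msi(open("/proc/interrupts").read()). With disable_msi=1 in
effect the expected result for eth0 would be False.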
With regards,
Narendra K
>-----Original Message-----
>From: Ryan Pugatch [mailto:rpug at tripadvisor.com]
>Sent: Thursday, September 03, 2009 8:48 PM
>To: K, Narendra
>Cc: akrherz at iastate.edu; linux-poweredge-Lists
>Subject: Re: FW: T410 Network Failure
>
>Is this issue fixed by using the new driver from
>support.dell.com and NOT having disable_msi?
>
>Thanks,
>
>Ryan
>
>Narendra_K at Dell.com wrote:
>> Hello,
>>
>> Yes, there might not be a link down message every time this issue is
>> seen. In the failed state, you cannot ping the system and you cannot
>> ping from the system. With disable_msi=1 we have not seen the issue.
>> When the issue occurs, apart from the system becoming unreachable,
>> there might not be any logs in dmesg or syslog. The issue is not seen
>> with the upstream kernel. Dell and Red Hat are working on this, and we
>> should know soon what is going on.
>>
>> With regards,
>> Narendra K
>>
>>> -----Original Message-----
>>> From: Ryan Pugatch [mailto:rpug at tripadvisor.com]
>>> Sent: Wednesday, September 02, 2009 2:33 AM
>>> To: K, Narendra
>>> Cc: akrherz at iastate.edu; linux-poweredge-Lists
>>> Subject: Re: FW: T410 Network Failure
>>>
>>> (sorry, resent as I sent from wrong email originally)
>>>
>>> Just checked logs again and the copper link down message hasn't
>>> happened every time there was a problem, so that may not be related.
>>>
>>> Ryan
>>>
>>>
>>> Ryan Pugatch wrote:
>>>> FWIW, I am also having the same issue with some R710s. They are
>>>> part of a hadoop cluster. Interestingly enough, only 2 out of the 3
>>>> servers in that cluster have experienced the issue so far. We also
>>>> run our corporate mail server on an R710 and that has not shown any
>>>> problems yet (except for a weird issue where outgoing TCP
>>>> connections would intermittently fail until we restarted the
>>>> network interfaces; not sure if this is related, as it has only
>>>> happened once).
>>>>
>>>> We are running CentOS 5.3. All three hadoop machines are running
>>>> 2.6.18-128.2.1.el5 and the mail server is running
>>>> 2.6.18-128.1.10.el5
>>>>
>>>> It seems that when the network would drop it would log:
>>>>
>>>> kernel: bnx2: eth0 NIC Copper Link is Down
>>>>
>>>> Not sure that the disable_msi option will fix the two hadoop
>>>> machines having the issue, as the problem happens somewhat randomly
>>>> and is not easily reproducible. That being said, we aren't getting
>>>> some network related errors in our hadoop logs that we had been
>>>> getting previously, so I suspect that is a good sign. Time will tell!
>>>>
>>>> Is this issue related to the 2.6.28-rc3 regression specified here?
>>>> http://lkml.indiana.edu/hypermail/linux/kernel/0811.0/01374.html
>>>>
>>>> I am hoping a fix will make its way to RHEL and downstream to CentOS
>>>> (has anyone heard if that is happening? I'm having trouble finding a
>>>> Red Hat or CentOS bug logged).
>>>>
>>>> Are there any performance concerns with using disable_msi? I know
>>>> that the driver from Dell.com should fix the problem but I'd prefer
>>>> to use a driver provided from upstream.
>>>>
>>>> Ryan Pugatch
>>>> Systems Administrator, TripAdvisor
>>>>
>>>>
>>>> Narendra_K at dell.com wrote:
>>>>> Hello,
>>>>>
>>>>> Thanks, this info is of great help.
>>>>>
>>>>> With regards,
>>>>> Narendra K
>>>>>
>>>>> -----Original Message-----
>>>>> From: daryl herzmann [mailto:akrherz at iastate.edu]
>>>>> Sent: Thursday, August 13, 2009 7:07 PM
>>>>> To: K, Narendra
>>>>> Cc: Biligiri, Raghavendra; linux-poweredge-Lists
>>>>> Subject: RE: FW: T410 Network Failure
>>>>>
>>>>> On Thu, 13 Aug 2009, Narendra_K at Dell.com wrote:
>>>>>
>>>>>> Thanks. Top output need not be at the time of failure. It can be
>>>>>> any time, just to get an idea of the resource utilization so that
>>>>>> we can replicate it. And general high level detail about the
>>>>>> database you are using - like is it an Oracle database?
>>>>> It is running PostgreSQL 8.4. sar reports that the average CPU
>>>>> utilization for today is 0.44%. 10% of memory is used. Network
>>>>> utilization is only a few kbps. I suspect when the failures
>>>>> occurred, the machine got hit with a few hundred postgresql
>>>>> connections at once, but I have no way to prove it.
>>>>>
>>>>> sorry again,
>>>>> daryl
>>>>>
>>>>> _______________________________________________
>>>>> Linux-PowerEdge mailing list
>>>>> Linux-PowerEdge at lists.us.dell.com
>>>>> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
>>>>> Please read the FAQ at http://lists.us.dell.com/faq
>>>
>
>