FW: T410 Network Failure
rpug at tripadvisor.com
Tue Sep 1 16:02:31 CDT 2009
(Sorry, resending, as I originally sent this from the wrong email address.)
I just checked the logs again, and the "Copper Link is Down" message was
not logged every time there was a problem, so it may not be related.
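For anyone who wants to run the same check, here is a rough Python sketch
that pulls the bnx2 link-down lines out of a syslog file so they can be
compared against the times of the outages. It assumes the default syslog
layout, where the first 15 characters of each line are the timestamp, and
it matches the message text quoted below. The function name
link_down_events is purely illustrative.

```python
import re
import sys

# Matches the bnx2 link-down message quoted in this thread and
# captures the interface name (e.g. eth0).
LINK_DOWN = re.compile(r"kernel: bnx2: (\S+) NIC Copper Link is Down")


def link_down_events(lines):
    """Return a list of (timestamp, interface) for each link-down line.

    Assumption: default syslog format, so the timestamp is the first
    15 characters of the line (for example 'Sep  1 15:58:02').
    """
    events = []
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            events.append((line[:15], match.group(1)))
    return events


if __name__ == "__main__":
    # Pass one or more log files on the command line,
    # e.g. /var/log/messages on a stock CentOS install.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            for stamp, iface in link_down_events(log):
                print(path, stamp, iface)
```

If an outage has no link-down line near it, that supports the idea that
the message and the failures are not directly related.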
Ryan Pugatch wrote:
> FWIW, I am also having the same issue with some R710s. They are part
> of a hadoop cluster. Interestingly enough, only 2 out of the 3
> servers in that cluster have experienced the issue so far. We also
> run our corporate mail server on an R710 and that has not shown any
> problems yet (except for a weird issue where outgoing TCP connections
> would intermittently fail until we restarted the network interfaces;
> not sure if this is related, as it has only happened once).
> We are running CentOS 5.3. All three hadoop machines are running
> 2.6.18-128.2.1.el5 and the mail server is running 2.6.18-128.1.10.el5.
> It seems that when the network dropped, it would log:
> kernel: bnx2: eth0 NIC Copper Link is Down
> I'm not sure that the disable_msi option will fix the two hadoop
> machines having the issues, as the problem happens somewhat randomly
> and is not easily reproducible. That being said, we are no longer
> getting some network-related errors in our hadoop logs that we had
> been getting previously, so I suspect that is a good sign. Time will tell!
> Is this issue related to the 2.6.28-rc3 regression specified here?
> I am hoping a fix will make its way to RHEL and downstream to CentOS
> (has anyone heard if that is happening? I'm having trouble finding a
> Red Hat or CentOS bug logged).
> Are there any performance concerns with using disable_msi? I know that
> the driver from Dell.com should fix the problem but I'd prefer to use a
> driver provided from upstream.
> Ryan Pugatch
> Systems Administrator, TripAdvisor
> Narendra_K at dell.com wrote:
>> Thanks, this info is of great help.
>> With regards,
>> Narendra K
>> -----Original Message-----
>> From: daryl herzmann [mailto:akrherz at iastate.edu]
>> Sent: Thursday, August 13, 2009 7:07 PM
>> To: K, Narendra
>> Cc: Biligiri, Raghavendra; linux-poweredge-Lists
>> Subject: RE: FW: T410 Network Failure
>> On Thu, 13 Aug 2009, Narendra_K at Dell.com wrote:
>>> Thanks. The top output need not be from the time of failure. It can
>>> be from any time, just to give us an idea of the resource utilization
>>> so that we can replicate it. Some general high-level detail about the
>>> database you are using would also help. For example, is it an Oracle
>>> database?
>> It is running PostgreSQL 8.4. sar reports that the average CPU
>> utilization for today is 0.44%. 10% of memory is used. Network
>> utilization is only a few kbps. I suspect that when the failures
>> occurred, the machine got hit with a few hundred PostgreSQL
>> connections at once, but I have no way to prove it.
>> sorry again,
>> Linux-PowerEdge mailing list
>> Linux-PowerEdge at lists.us.dell.com
>> Please read the FAQ at http://lists.us.dell.com/faq