Dell/Redhat/tg3 problems

Jeff Garzik jgarzik at redhat.com
Thu Jan 23 12:59:00 CST 2003


On Thu, Jan 23, 2003 at 11:45:47AM -0600, Cris Rhea wrote:
> Now, the bigger question to the group (especially those folks at RedHat 
> or Dell):
> 
> I have a Dell 2550 running RedHat 8.0 that I'm trying to make into a critical
> production server (E-mail server for a large group of Researchers at Mayo).
> I have applied all the firmware updates (BIOS/ESM/Backplane) and RH8.0
> kernel Errata (and now, Jeff's "aragorn1" custom patched kernel).
> 
> On this list, I've seen suggestions all over the board:
> 
>    - Switch network drivers to use the Broadcom driver
>    - Use the custom-patched TG3 drivers- they'll perform better than bcxxxx
>    - Disable the on-board NICs and install Intel cards
>    - Upgrade to a kernel several releases beyond RH8.0 (2.4.20-xxx)
> 
> I have spent weeks now (since before Christmas) trying to get this system
> reliable/stable.
> 
> My team lead made an interesting observation the other day:
> 
> I've had to do all sorts of patching and research into how to make 
> this Dell/RedHat system stable. If we had instead purchased a Sun/Solaris 
> system, we would have been reliable/stable within a couple days 
> (Install the OS then load Sun's Security and Recommended patch set).
> We almost never see an OS/hardware combination that makes the base 
> system unstable.
> 
> While the cost/performance of a Dell/RedHat system is great, if it takes
> me a month to get a system to play nice and talk on the Ethernet without
> crashing.... makes my upper management question how "ready for prime time"
> Linux is.
> 
> What's the real answer??

Well, that's a bit of a loaded question ;-)  Do you really expect Dell
or Red Hat to draw anything but favorable conclusions about their
products?  :)

So, with that in mind, I will say:


I strongly believe that Dell hardware and e1000 NICs are a rock-solid,
stable solution.  I personally put our Dell test hardware through
serious abuse here at Red Hat labs, and it passes with flying colors.
I would trust Red Hat's latest errata kernel on Dell's latest hardware
long before I would ever trust Solaris, and that says a lot considering
I was a Solaris admin from the SunOS through Solaris 2.8 days.


That said, let me take off my Red Hat vendor hat, and put on my
hat as Linux net drivers maintainer.  All of the following info is
public, not-NDA'd information, and is my own personal opinion,
NOT my employer's...


The driver situation between bcm5700 and tg3 is very special.  Normally
in Linux, there are not two drivers for the same hardware.  So why
is bcm5700/tg3 special?  Because of Broadcom.  Broadcom continues to
ignore Linux developers when we point out valid, box-crashing bugs
in their driver.  Apparently the maintainer of the Linux net stack,
and the maintainer of Linux net drivers, are clueless in Broadcom's eyes :)
So while Broadcom pushes vendors hard to include the bcm5700 driver
[with its known-for-many-months remote DoS issues], Red Hat pushes
the tg3 driver, which it feels is the better driver due to technical
reasons outlined here and elsewhere.

So the next question is, if tg3 is so great, why do we see crash
reports on this list?  The answer is simple.  Broadcom wants bcm5700
to succeed, so they refuse to share documentation or errata on their
NIC hardware.  Since their NIC hardware is, um, a bit buggy and requires
lots of hardware bug workarounds, tg3 has to constantly play catch-up.

Contrast this with Intel.  Intel emails me e1000 driver patches, I
review them, and either reject with comments (unusual) or send on to
Marcelo/Linus (usual) for inclusion in the kernel.  Just one driver.
And the company works _with_ Linux developers, and listens to them
when they point out bugs.  Result?  A stable, fast driver with no
political contention surrounding it.

I would _love_ to work with Broadcom to resolve these issues, too...

	Jeff






More information about the Linux-PowerEdge mailing list