Dell/Redhat/tg3 problems

Eric Rostetter eric.rostetter at physics.utexas.edu
Thu Jan 23 17:18:01 CST 2003


Quoting crhea at mayo.edu:
 
> Now, the bigger question to the group (especially those folks at RedHat
> or Dell):
> 
> I have a Dell 2550 running RedHat 8.0 that I'm trying to make into a critical
> production server (E-mail server for a large group of Researchers at Mayo).
> I have applied all the firmware updates (BIOS/ESM/Backplane) and RH8.0
> kernel Errata (and now, Jeff's "aragorn1" custom patched kernel).

Did this machine ship with RH 8.0 and the buggy kernel, or did you upgrade
it to those versions?  If you upgraded it yourself, even with packages
from RedHat, then how is it Dell's fault?

Unless you got an official update from Dell that they said would work,
you can't blame Dell.  And you can't really blame RedHat for an update that
doesn't work on a particular Dell server either.

I've not looked, so maybe I'm wrong, but I didn't think Dell had certified
RH 8.0 with the upgraded errata kernel yet.  If they haven't, and you (and
I and others) upgraded to that anyway, then it is our fault, not Dell's.
 
> On this list, I've seen suggestions all over the board:
> 
>    - Switch network drivers to use the Broadcom driver
>    - Use the custom-patched TG3 drivers- they'll perform better than bcxxxx
>    - Disable the on-board NICs and install Intel cards
>    - Upgrade to a kernel several releases beyond RH8.0 (2.4.20-xxx)

The first was suggested because at the time there were no other options,
other than downgrading to a lower kernel version.  Now there are other options.

The second should not be used in a production environment.

The third is rather drastic, but may work if you pick a good NIC to replace
it with.

The last is a good option also, but may not be needed.  BTW, 2.4.19 works
fine AFAIK.

> I have spent weeks now (since before Christmas) trying to get this system
> reliable/stable.

I'm sorry for that.  I was one of the first to find the tg3 bug, and it
only took a few days to find out what the problem was and "fix" it.  For
me the fix was to switch back to the bcm drivers on all but one machine.
On the last machine, I installed the 2.4.19 kernel with the tg3 driver.
Not sure why you had such a rough time.

> My team lead made an interesting observation the other day:
> 
> I've had to do all sorts of patching and research into how to make
> this Dell/RedHat system stable. If we had instead purchased a Sun/Solaris
> system, we would have been reliable/stable within a couple days

Yes, probably true, at least until it got hacked into.  But that is because
Sun has a controlled environment.  They choose the hardware, they write the
software, so it sure better work!

I've had to do firmware upgrades on Suns too, as well as on VAX and Alpha
machines, and so on.  So this is not unique to Dell.

And Sun has shipped me one or two lemons before.  And sold me one which they
wouldn't even ship (had to cancel the order after waiting for a year and 
still not getting the product -- they wouldn't ship because of Q/A problems
with it).

> (Install the OS then load Sun's Security and Recommended patch set).
> We almost never see an OS/hardware combination that makes the base
> system unstable.

Other than firmware upgrades, no.  But that is because they make it all, so
it sure should work together.  And you buy and use what they tell you.

With your Dell, there are two companies.  So mistakes will happen.  Plus,
if you go and upgrade stuff when Dell has yet to certify or support it,
then how can you blame Dell or RedHat?  You have to let Dell run the Q/A
cycle just as you are forced to let Sun run the Q/A cycle.

> While the cost/performance of a Dell/RedHat system is great, if it takes
> me a month to get a system to play nice and talk on the Ethernet without
> crashing.... makes my upper management question how "ready for prime time"
> Linux is.

All the PE servers I bought worked fine as shipped.  All rock solid.  It is
only when I decided to upgrade them without Dell's blessing, while Dell was
telling me they didn't support what I was doing, that I found the tg3 bug.
So how can I blame Dell?

> What's the real answer??

Unless you can show that Dell shipped you the server with the buggy kernel,
I'd say the real answer is you messed up. If Dell did ship you the server
with the buggy kernel, or officially says they now support that kernel, then
I would say Dell messed up.

In any case, the answer depends on what you want.  You can downgrade the kernel,
you can upgrade the kernel, you can substitute the drivers, or you
can swap out the NICs.  Which is right for you depends on your situation.
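
Before picking one of those, it helps to know which kernel and which NIC
driver the box is actually running.  A rough sketch follows, not a supported
procedure.  It assumes the two candidate driver modules are named tg3 and
bcm5700 (the second name is my assumption for the Broadcom-supplied driver;
check what your system really calls it), and the function name is just
something made up for illustration.

```shell
#!/bin/sh
# Report the running kernel and which candidate NIC driver module is loaded.
# ASSUMPTION: the modules are named "tg3" and "bcm5700"; adjust as needed.

# Hypothetical helper: reads lsmod-style lines (module name in column 1)
# on stdin and prints the first candidate driver name it finds.
nic_driver_from_lsmod() {
    awk '$1 == "tg3" || $1 == "bcm5700" { print $1; exit }'
}

echo "kernel: $(uname -r)"

driver=""
if [ -r /proc/modules ]; then
    # /proc/modules lists loaded modules, one per line, name first.
    driver=$(nic_driver_from_lsmod < /proc/modules)
fi
echo "nic driver: ${driver:-neither tg3 nor bcm5700 is loaded}"
```

If it reports tg3 on one of the errata kernels, you are in the situation
discussed in this thread; if it reports the other driver, or a different
kernel, the crashes may have some other cause.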

Just don't expect Dell to support you any better than they already do (which
is pretty darn well) if you install non-supported NICs, non-supported kernels,
etc.

-- 
Eric Rostetter
The Department of Physics
The University of Texas at Austin

Why get even? Get odd!




More information about the Linux-PowerEdge mailing list