Dell/Redhat/tg3 problems

Cris Rhea crhea at mayo.edu
Fri Jan 24 11:51:00 CST 2003


Eric- 

You make some interesting points. I'd like to discuss some of them
and provide some of my own opinions/observations.

Note that I am expressing my own opinions- since I am a customer of 
both RedHat and Dell, folks at these companies might find the feedback 
useful in a "customer satisfaction" sort of way.

I'm not trying to bash either company, only convey the frustration
in trying to solve this problem via the "official" support channels.


> > I have a Dell 2550 running RedHat 8.0 that I'm trying to make into a 
> > critical production server (E-mail server for a large group of 
> > Researchers at Mayo).
> > I have applied all the firmware updates (BIOS/ESM/Backplane) and RH8.0
> > kernel Errata (and now, Jeff's "aragorn1" custom patched kernel).
> 
> Did this machine ship with RH 8.0 and the buggy kernel, or did you upgrade
> them to those versions?  If you upgraded them to those versions, even if
> from RedHat, then how is it Dell's fault?

If you look at what I wrote previously, there is no mention of the word
"fault". I asked how to make this combination of hardware/software 
work successfully.

When I originally purchased this server, it came with RH7.1. By today's 
standards, this is a pretty old release. Since Dell doesn't offer 
a blanket software support contract for Linux (AFAIK), and Dell also
only supports the OS shipped with the system (same as on the Dell/Microsoft
side), I don't really have much choice, do I?


> > I have spent weeks now (since before Christmas) trying to get this system
> > reliable/stable.
> 
> I'm sorry for that.  I was one of the first to find the tg3 bug, and it
> only took a few days to find out what the problem was and "fix" it.  For
> me the fix was to switch back to the bcm drivers on all but one machine.
> On the last machine, I installed the 2.4.19 kernel with the tg3 driver.
> Not sure why you had such a rough time.

[Long history of problem follows... you asked! ;) ]

I had a rough time because I went down the wrong path... My symptoms 
were that the machine would just freeze. No messages on the console, 
nothing in the logs, nothing in the hardware ESM logs. It would freeze
with no load (i.e., I had installed the OS, but the machine was just
sitting there waiting for me to do application install/config).

I have a PE1650 sitting next to it that has never gone down since my
install of RH8.0.

That led me to believe that I most likely had a hardware problem with
my PE2550.

After a session with the extended diags, we uncovered several minor 
hardware problems.  Dell dispatched a tech and replaced a bunch of 
hardware.

Still, the system hung.

Extended diags again uncovered some hardware problems (again, minor, 
but I was grasping at straws trying to find the source of the hang).  
Again, Dell dispatched a tech with parts.

The system still hung. 

Now, since the system had failed diags previously, I wasn't too interested
in continuing to diagnose the system bit-by-bit. This time, Dell dispatched
a tech who replaced almost the entire system. Only the disk drives (one 
drive had been replaced previously due to failing diags), the SCSI backplane 
and the sheet metal remained untouched.

Still, the system hung.

Heavy sigh!  Run the extended diags in loop mode. Check everything, looking
for an intermittent fault of some kind....

Diags pass-  There's GOT to be something else going on here....

NOTES:

1) Extended diags on a PE2550 can take 30+ hours to run.

2) Even though I went through the normal support channels and identified
    the system as running RH8.0 Linux, I never spoke to any support
    person who had a clue about Linux.  

3) There's no useful Linux info on the "official" Dell support web site.

While I had been doing various Internet searches looking for "hang",
"RH8.0" and "Dell 2550", I hadn't turned up anything interesting.

I did some more searching and came across Matt Domsch's Linux page.
I had been there before, but had forgotten about it. His page pointed me
at this list. You folks told me what the real problem was within 30 minutes!


> With your Dell, there are two companies.  So mistakes will happen.  Plus,
> if you go and upgrade stuff when Dell has yet to certify or support it,
> then how can you blame Dell or RedHat?  You have to let Dell run the Q/A
> cycle just as you are forced to let Sun run the Q/A cycle.

While I agree with your comment on the surface, I think there are 
several things Dell could do better:

1.) Since Dell officially says they support Linux (RedHat) on various 
    servers, have customer support staff dedicated to supporting Linux.
    These people should be fluent in Unix/Linux, not just Microsoft 
    weenies who have taken a Linux class.  I want to talk to folks who
    spend 8 hours/day supporting Linux on Dell hardware- not someone
    who spends 99% of their time supporting Microsoft and only takes
    a Linux support call because there's nobody else in the call center.

2.) With regard to Dell Linux Q/A...  Other than giving RH new Dell
    equipment and having a good communications channel between the 
    developers at both companies, what does Dell really do for
    Linux Q/A?

    I've been purchasing Linux on Dell since around the RH6.2 timeframe.
    Most of the time, the "official" Dell release is the normal RedHat
    release (retail) -OR- the RedHat release plus RedHat errata.

    Dell rarely includes any additional documentation about how Linux
    runs on their servers, other than which drivers are needed for their
    internal SCSI/RAID controllers or a note to use the e100 driver
    instead of the eepro100 driver.

    So, what's the value added by Dell's Q/A?

    I'm not trying to be a smart-ass about this- I just don't see
    (as a customer) any output from Dell's (Linux) Q/A process. 

    Contrast this with searching sunsolve.sun.com (with a support
    contract).  I can view any bug reported, developers' comments 
    about the bug and resolutions or patches. I can also see unsolved 
    bugs (hey, this other guy has seen the same problem that 
    I'm seeing...).

    I wish more vendors used this model for allowing customers access
    to their bug/patches info. I've solved a bunch of really strange
    problems by having direct access to this tool.

3.) CLEARLY, there are people inside Dell who knew exactly what my problem
    was and possible solutions for it. Could we PLEASE have a web page
    on the official Dell support site that discusses known problems
    with the various RedHat releases as they apply to the various Dell
    servers? It would have saved me literally days if I could have
    seen a table that said:

    Release   Symptom/Problem   Hardware Affected
    -------   ---------------   -----------------
    RH8.0     System Hang       PE2550, PEXXXX, PEYYYY, ...

    Resolution
    ----------
    RH8.0 is currently undergoing Dell Q/A and is not officially 
    supported at this time. Problem linked to the tg3 Ethernet driver.
    Please refer to the Linux-poweredge mailing list archives
    for more information and possible unofficial solutions.


My point in all of this is:

You folks are great!  Obviously, there are folks from both RedHat and
Dell on this list who care a great deal about their products and go
the extra mile to support customers using those products (in addition 
to the other folks on this list who contribute their ideas and knowledge).

I just wish the official support channels could capture this level of 
support and make it available through the "front door".

Just my $0.02....


-- 
 Cristopher J. Rhea                     Mayo Foundation
 Research Computing Facility             Pavilion 2-25
 crhea at Mayo.EDU                        Rochester, MN 55905
 (507) 284-0587                        Fax: (507) 284-5231



