crhea at mayo.edu
Fri Jan 24 11:51:00 CST 2003
You make some interesting points. I'd like to discuss some of them
and provide some of my own opinions/observations.
Note that I am expressing my own opinions. Since I am a customer of
both RedHat and Dell, folks at these companies might find the feedback
useful in a "customer satisfaction" sort of way.
I'm not trying to bash either company, only convey the frustration
in trying to solve this problem via the "official" support channels.
> > I have a Dell 2550 running RedHat 8.0 that I'm trying to make into a
> > critical production server (E-mail server for a large group of
> > Researchers at Mayo).
> > I have applied all the firmware updates (BIOS/ESM/Backplane) and RH8.0
> > kernel Errata (and now, Jeff's "aragorn1" custom patched kernel).
> Did this machine ship with RH 8.0 and the buggy kernel, or did you upgrade
> them to those versions? If you upgraded them to those versions, even if
> from RedHat, then how is it Dell's fault?
If you look at what I wrote previously, there is no mention of the word
"fault". I asked how to make this combination of hardware/software
When I originally purchased this server, it came with RH7.1. By today's
standards, this is a pretty old release. Since Dell doesn't offer
a blanket software support contract for Linux (AFAIK), and Dell also
only supports the OS shipped with the system (same as on the Dell/Microsoft
side), I don't really have much choice, do I?
> > I have spent weeks now (since before Christmas) trying to get this system
> > reliable/stable.
> I'm sorry for that. I was one of the first to find the tg3 bug, and it
> only took a few days to find out what the problem was and "fix" it. For
> me the fix was to switch back to the bcm drivers on all but one machine.
> On the last machine, I installed the 2.4.19 kernel with the tg3 driver.
> Not sure why you had such a rough time.
[Long history of problem follows... you asked! ;) ]
I had a rough time because I went down the wrong path... My symptoms
were that the machine would just freeze. No messages on the console,
nothing in the logs, nothing in the hardware ESM logs. It would freeze
with no load (i.e., I had installed the OS, but the machine was just
sitting there waiting for me to do application install/config).
I have a PE1650 sitting next to it that has never gone down since my
install of RH8.0.
That led me to believe that I most likely had a hardware problem with
After a session with the extended diags, we uncovered several minor
hardware problems. Dell dispatched a tech and replaced a bunch of
Still, the system hung.
Extended diags again uncovered some hardware problems (again, minor,
but I'm grasping at straws trying to find the source of the hang).
Again, Dell dispatched a tech with parts.
The system still hung.
Now, since the system had failed diags previously, I wasn't too interested
in continuing to diagnose the system bit-by-bit. This time, Dell dispatched
a tech who replaced almost the entire system. Only the disk drives (one
drive had been replaced previously due to failing diags), the SCSI backplane
and the sheet metal remained untouched.
Still, the system hung.
Heavy sigh! Run the extended diags in loop mode. Check everything, looking
for an intermittent fault of some kind....
Diags pass- There's GOT to be something else going on here....
1) Extended diags on a PE2550 can take 30+ hours to run.
2) Even though I went through the normal support channels, identified
the system as running RH8.0 Linux, I never spoke to any support
person who had a clue about Linux.
3) There's no useful Linux info on the "official" Dell support web site.
While I had been doing various Internet searches looking for "hang",
"RH8.0" and "Dell 2550", I hadn't turned up anything interesting.
I did some more searching and came across Matt Domsch's Linux page.
I had been there before, but had forgotten about it. His page pointed me
at this list. You folks told me what the real problem was within 30 minutes!
> With your Dell, there are two companies. So mistakes will happen. Plus,
> if you go and upgrade stuff when Dell has yet to certify or support it,
> then how can you blame Dell or RedHat? You have to let Dell run the Q/A
> cycle just as you are forced to let Sun run the Q/A cycle.
While I agree with your comment on the surface, I think there are
several things Dell could do better:
1.) Since Dell officially says they support Linux (RedHat) on various
servers, have customer support staff dedicated to supporting Linux.
These people should be fluent in Unix/Linux, not just Microsoft
weenies who have taken a Linux class. I want to talk to folks who
spend 8 hours/day supporting Linux on Dell hardware- not someone
who spends 99% of their time supporting Microsoft and only takes
a Linux support call because there's nobody else in the call center.
2.) With regard to Dell Linux Q/A... Other than giving RH new Dell
equipment and having a good communications channel between the
developers at both companies, what does Dell really do for
I've been purchasing Linux on Dell since around the RH6.2 timeframe-
Most of the time, the "official" Dell release is the normal RedHat
release (retail) -OR- the RedHat release plus RedHat errata.
Dell rarely includes any additional documentation about how Linux
runs on their servers other than what drivers are needed for their
internal SCSI/RAID controllers or to use the e100 driver instead
of the eepro100 driver.
So, what's the value added by Dell's Q/A?
I'm not trying to be a smart-ass about this- I just don't see
(as a customer) any output from Dell's (Linux) Q/A process.
Contrast this with searching sunsolve.sun.com (with a support
contract). I can view any bug reported, developers' comments
about the bug and resolutions or patches. I can also see unsolved
bugs (hey, this other guy has seen the same problem that
I wish more vendors used this model for allowing customers access
to their bug/patches info. I've solved a bunch of really strange
problems by having direct access to this tool.
3.) CLEARLY, there are people inside Dell who knew exactly what my problem
was and possible solutions for it. Could we PLEASE have a web page
on the official Dell support site that discusses known problems
with the various RedHat releases as they apply to the various Dell
servers? It would have saved me literally days if I could have
seen a table that said:
    Release  Symptom/Problem  Hardware Affected
    -------  ---------------  ---------------------------
    RH8.0    System Hang      PE2550, PEXXXX, PEYYYY, ...

    RH8.0 is currently undergoing Dell Q/A and is not officially
    supported at this time. Problem linked to the tg3 Ethernet driver.
    Please refer to the Linux-poweredge mailing list archives
    for more information and possible unofficial solutions.
My point to this whole thing is this:
You folks are great! Obviously, there are folks from both RedHat and
Dell on this list who care a great deal about their products and go
the extra mile to support customers using those products (in addition
to the other folks on this list who contribute their ideas and knowledge).
I just wish the official support channels could capture this level of
support and make it available through the "front door".
Just my $0.02....
Cristopher J. Rhea            Mayo Foundation
Research Computing Facility   Pavilion 2-25
crhea at Mayo.EDU             Rochester, MN 55905
(507) 284-0587                Fax: (507) 284-5231