afacli says Number of PRIMARY defects on drive: 5903

Arne Kepp ak at smallworld.no
Wed Aug 28 14:19:01 CDT 2002


Matt_Domsch at Dell.com wrote:

>>scsi : aborting command due to timeout : pid 778229, scsi 3, 
>>channel 0,
>>id 0, lun 0, write (10) 00 00 03 7a 85 00 00 10 00
>>SCSI host 3 abort (pid 778229 timed out - resetting
>>SCSI bus is being reset for host 3 channel 0
>>Kernel panic : scsi_free : Bad offset
>>In interrupt handler - not syncing
>>    
>>
>
>The timeout shouldn't happen, so that's weird.  Any idea what you were doing
>at the time?  Hopefully not a container rebuild?
>  
>
<lots of lines removed>
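
For reference, the command that timed out in the quoted log can be decoded
by hand. This is only a minimal sketch. It assumes the nine hex bytes printed
after "write (10)" are CDB bytes 1 through 9 (the opcode itself being shown
as the name) and that they follow the standard WRITE(10) layout. The function
name decode_write10 is made up for illustration.

```python
def decode_write10(tail_hex):
    """Decode the 9 bytes that follow the opcode of a WRITE(10) CDB.

    Assumption: tail_hex holds CDB bytes 1..9 as space-separated hex,
    as printed in the log line after "write (10)".
    """
    tail = bytes.fromhex(tail_hex)
    if len(tail) != 9:
        raise ValueError("expected 9 bytes after the opcode")
    return {
        "flags": tail[0],                              # byte 1: DPO/FUA etc.
        "lba": int.from_bytes(tail[1:5], "big"),       # bytes 2-5: start block
        "group": tail[5],                              # byte 6: group number
        "blocks": int.from_bytes(tail[6:8], "big"),    # bytes 7-8: length
        "control": tail[8],                            # byte 9: control byte
    }


if __name__ == "__main__":
    cdb = decode_write10("00 00 03 7a 85 00 00 10 00")
    print("start LBA:", cdb["lba"], "blocks:", cdb["blocks"])
```

Under those assumptions the logged command is a write of 16 blocks starting
at logical block 227973, i.e. a small ordinary write rather than anything
exotic.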

Container rebuild: yes and no. Since I'm not an expert, I wanted to make 
sure the stuff I was building on was solid, so I decided to delete and 
recreate the container (after scanning the disk) before putting in my 
mondo-rescue CD.

It has really happened on all sorts of occasions (I estimate a total of 
twenty times over the past year), running everything from RedHat 7.0 to 
7.3 and in between. I've also tried all firmwares from 2.5 up to 2.7, 
and am currently back on 2.5. Looks like I'll be one of the first users 
of the 2.7 kernel then, if this rewrite doesn't make it into 2.6 ; )

The main purpose of the machine is to be a SWMFS (Smallworld, now GE 
Network Solutions) database server, which means handing out raw 
datablocks from the ext3 filesystem. It also runs samba (always the most 
recent version), but there is no real load: memory and cpu are around 5% 
at all times except during backup/rsync/antivirus scans, and disk IO is 
hard to measure. I've never seen it crash during backup/rsync/antivirus, 
which take about three hours every night at full cpu(s). SWMFS is very 
stable and extensively tested, so I doubt it is to blame, but I will 
double check with the helpdesk whether they have heard of other such 
incidents. The box mostly crashes sometime between backup and working 
hours, when there are no swmfs transactions. On the other hand, this 
morning it went down with only one user at the office, who was not doing 
anything special. Our other PE4400, without RAID, has been running an 
almost identical software setup (though it is accessed less frequently) 
for almost two years with no problems whatsoever. I "love" that box : )

Not knowing the details of battery-backed RAID controllers, I'm 
wondering if this could be caused by a bad battery?
I've noticed that the logs I put on the web are incomplete with respect 
to battery information, and I have also seen the machine go down and 
then later display "Battery Charge is now OK". To avoid this I scripted 
reconditioning to happen every other month, which should be 3x as often 
as needed, but today is actually less than a month since the last 
reconditioning.

I'm spending the night running all the Dell 32bit diagnostics again, 
I'll post again if something shows up that could be related...

Thanks again : )
-Arne



