High load average on Ubuntu 7.10/8.04
Bruno Friedmann
bruno at ioda-net.ch
Tue May 13 01:05:25 CDT 2008
Everything looks good, but we need to wait for the long test result.
There are two things to comment on: the temperature is quite high (54°C); my disk runs at around 36-42°C.
And the Reallocated_Event_Count of 1 indicates that sectors have been reallocated (Reallocated_Sector_Ct shows a raw value of 50). That shouldn't be a problem.
Perhaps you could run the diagnostic tools (from Dell and Western Digital, such as a drive fitness test).
And make a backup.
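
To make the two checks above repeatable, here is a minimal Python sketch that reads the attribute rows of a smartctl report and flags the same things: any attribute whose WHEN_FAILED column is not "-", non-zero raw counts on the reallocation and pending-sector attributes, and a high raw temperature. It assumes unwrapped rows in the 10-column layout shown by smartctl 5.37. The function names `parse_attributes` and `review` and the 50°C cutoff are illustrative assumptions, not part of smartmontools or any vendor specification.

```python
# Sample rows copied from the report in this thread (unwrapped).
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   193   193   140    Pre-fail  Always       -       50
190 Temperature_Celsius     0x0022   046   009   045    Old_age   Always   In_the_past 54
194 Temperature_Celsius     0x0022   096   059   000    Old_age   Always       -       54
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Parse attribute rows into a dict keyed by attribute ID (hypothetical helper)."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute row starts with a numeric ID and has at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            raw = int(fields[9])
        except ValueError:
            raw = None  # some drives report non-numeric raw values
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            "raw": raw,
        }
    return attrs


def review(attrs, temp_limit=50):
    """Return human-readable notes; temp_limit is an arbitrary illustrative cutoff."""
    notes = []
    # 1. Attributes that are failing now or crossed their threshold in the past.
    for attr in attrs.values():
        if attr["when_failed"] != "-":
            notes.append(f'{attr["name"]}: threshold crossed ({attr["when_failed"]})')
    # 2. Reallocation-related counters with a non-zero raw value.
    for attr_id in (5, 196, 197, 198):
        attr = attrs.get(attr_id)
        if attr and attr["raw"]:
            notes.append(f'{attr["name"]}: raw value {attr["raw"]}')
    # 3. Current temperature (attribute 194) above the chosen cutoff.
    temp = attrs.get(194)
    if temp and temp["raw"] is not None and temp["raw"] > temp_limit:
        notes.append(f'Temperature_Celsius: {temp["raw"]} C is above {temp_limit} C')
    return notes


for note in review(parse_attributes(SAMPLE)):
    print(note)
```

To use it on a live system, the text could come from the output of smartctl -A /dev/sda instead of the SAMPLE string; that needs root and is not shown here.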
Andrew Carter wrote:
> Bruno Friedmann wrote:
>> Could you check your hard drive with smartctl -a /dev/sda?
>>
>
> Here is the smartctl -a output:
>
> smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6
> Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital Caviar SE (Serial ATA) family
> Device Model: WDC WD2500JS-75NCB1
> Serial Number: WD-WCANK2472851
> Firmware Version: 10.02E01
> User Capacity: 250,000,000,000 bytes
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: 7
> ATA Standard is: Exact ATA specification draft version not indicated
> Local Time is: Mon May 12 14:03:16 2008 PDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> See vendor-specific Attribute list for marginal Attributes.
>
> General SMART Values:
> Offline data collection status: (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine
> completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (8280) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 96) minutes.
> Conveyance self-test routine
> recommended polling time: ( 6) minutes.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0003   211   186   021    Pre-fail  Always       -       4441
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       513
>   5 Reallocated_Sector_Ct   0x0033   193   193   140    Pre-fail  Always       -       50
>   7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10232
>  10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       336
> 190 Temperature_Celsius     0x0022   046   009   045    Old_age   Always   In_the_past 54
> 194 Temperature_Celsius     0x0022   096   059   000    Old_age   Always       -       54
> 196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%         0         -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
>
>
>> And perhaps launch a long test:
>> smartctl -t long /dev/sda
>>
>
> Test result for short:
>
> === START OF READ SMART DATA SECTION ===
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%     10234         -
> # 2  Short offline       Completed without error       00%         0         -
>
> Running a long test now.
>
>
>> Your hard drive perhaps has some internal trouble.
>>
>> lspci -vv could help identify the bus in use,
>> and the smartctl report would identify the hard drive completely.
>>
>
> Here is the SATA portion of lspci:
>
> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI
> Controller (rev 09) (prog-if 01 [AHCI 1.0])
> Subsystem: Dell Unknown device 01c1
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B-
> Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR-
> Latency: 0
> Interrupt: pin C routed to IRQ 20
> Region 0: I/O ports at fe00 [size=8]
> Region 1: I/O ports at fe10 [size=4]
> Region 2: I/O ports at fe20 [size=8]
> Region 3: I/O ports at fe30 [size=4]
> Region 4: I/O ports at fec0 [size=32]
> Region 5: Memory at ff970000 (32-bit, non-prefetchable) [size=1K]
> Capabilities: [70] Power Management version 2
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
> Status: D0 PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [a8] #12 [0010]
>
>
>
> One other thing to note - reads are fine. It only crawls on writes.
> Also, I think at least one other person in the office is hitting the
> same problem (running the same hardware except with less RAM).
>
> Thanks,
> Andrew
>
>
>> Andrew Carter wrote:
>>> I've been dealing with a problem with my Dell Precision 490 for some
>>> time now. I've been using this machine since late last year at work. The
>>> problem is that the load average goes through the roof (around 10) when
>>> the disk is used heavily. For example, tar xvzf on a 4GB tarball
>>> brings the machine to its knees.
>>>
>>> I originally was running Ubuntu 7.10 64-bit. I recently did a clean
>>> upgrade to Ubuntu 8.04 64-bit. I've tried turning off all the compiz
>>> effects, killing as many other processes as I can but it always kills
>>> the machine when the disk is heavily accessed.
>>>
>>> I'm not exactly sure how to diagnose this problem. I'm thinking it could
>>> be a kernel problem, a driver problem for the SATA disk, or a firmware
>>> issue. I believe I have the newest Dell BIOS. I'm not sure about the
>>> whole kernel thing since I don't see widespread talk about it. And my
>>> newer OptiPlex at home hasn't shown any issues (and I mess with lots of
>>> large video and audio on it).
>>>
>>> Specs:
>>> Dell Precision 490
>>> BIOS A07
>>> 4GB RAM
>>> Intel Xeon 1.6GHz (2 dual cores)
>>> Ubuntu 8.04
>>> Kernel Linux 2.6.24-17-generic
>>> 250GB WD SATA hard drive
>
--
Bruno Friedmann