NetBSD Problem Report #56686
From gson@gson.org Fri Feb 4 11:01:39 2022
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id A016A1A9239
for <gnats-bugs@gnats.NetBSD.org>; Fri, 4 Feb 2022 11:01:39 +0000 (UTC)
Message-Id: <20220204110129.1A93E254379@guava.gson.org>
Date: Fri, 4 Feb 2022 13:01:29 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: wd(4) device timeouts
X-Send-Pr-Version: 3.95
>Number: 56686
>Category: kern
>Synopsis: wd(4) device timeouts
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Feb 04 11:05:00 +0000 2022
>Last-Modified: Sun Feb 06 13:30:01 +0000 2022
>Originator: Andreas Gustafsson
>Release: NetBSD 9.2
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
With a workload involving multiple emulator processes running NetBSD
guests running ATF tests on a ST5000LM000-2AN170 disk in a Dell
PowerEdge R630 server, I am seeing bursts of "device timeout" errors
several times a day. This PR is about these device timeout errors
themselves; for the panics they sometimes trigger, see PR 56479.
Here is the console log from one such burst:
[ 45042.1012910] wd3d: device timeout writing fsbn 4431519296 of 4431519296-4431519359 (wd3 bn 4431519296; cn 4396348 tn 8 sn 8), xfer 218, retry 0
[ 45042.2542704] wd3d: device timeout writing fsbn 3779132800 of 3779132800-3779132927 (wd3 bn 3779132800; cn 3749139 tn 10 sn 58), xfer 50, retry 0
[ 45042.4102613] wd3d: device timeout writing fsbn 4418502912 of 4418502912-4418502935 (wd3 bn 4418502912; cn 4383435 tn 6 sn 54), xfer e8, retry 0
[ 45042.5652114] wd3d: device timeout writing fsbn 3778503360 of 3778503360-3778503423 (wd3 bn 3778503360; cn 3748515 tn 3 sn 51), xfer 180, retry 0
[ 45042.7212016] wd3d: device timeout writing fsbn 4654435520 of 4654435520-4654435583 (wd3 bn 4654435520; cn 4617495 tn 8 sn 56), xfer 2b0, retry 0
[ 45042.8771911] wd3d: device timeout writing fsbn 4417615040 of 4417615040-4417615103 (wd3 bn 4417615040; cn 4382554 tn 9 sn 41), xfer 348, retry 0
[ 45043.0331818] wd3d: device timeout writing fsbn 3957636800 of 3957636800-3957636863 (wd3 bn 3957636800; cn 3926226 tn 15 sn 47), xfer 3e0, retry 0
[ 45043.1902111] wd3d: device timeout writing fsbn 4030504640 of 4030504640-4030504703 (wd3 bn 4030504640; cn 3998516 tn 8 sn 8), xfer 478, retry 0
[ 45043.3451615] wd3d: device timeout writing fsbn 3945492160 of 3945492160-3945492223 (wd3 bn 3945492160; cn 3914178 tn 11 sn 43), xfer 510, retry 0
[ 45043.5021917] wd3d: device timeout writing fsbn 3763322560 of 3763322560-3763322623 (wd3 bn 3763322560; cn 3733454 tn 14 sn 46), xfer 5a8, retry 0
[ 45043.6592219] wd3d: device timeout writing fsbn 3881732800 of 3881732800-3881732863 (wd3 bn 3881732800; cn 3850925 tn 6 sn 22), xfer 640, retry 0
[ 45043.8152122] wd3d: device timeout writing fsbn 3778503296 of 3778503296-3778503359 (wd3 bn 3778503296; cn 3748515 tn 2 sn 50), xfer 6d8, retry 0
[ 45043.9712037] wd3d: device timeout writing fsbn 3779132928 of 3779132928-3779133055 (wd3 bn 3779132928; cn 3749139 tn 12 sn 60), xfer 770, retry 0
[ 45044.1282335] wd3d: device timeout writing fsbn 4031996800 of 4031996800-4031996927 (wd3 bn 4031996800; cn 3999996 tn 13 sn 13), xfer 808, retry 0
[ 45044.2852623] wd3d: device timeout writing fsbn 3950007936 of 3950007936-3950007999 (wd3 bn 3950007936; cn 3918658 tn 10 sn 42), xfer 8a0, retry 0
[ 45044.4422926] wd3d: device timeout writing fsbn 3764548672 of 3764548672-3764548735 (wd3 bn 3764548672; cn 3734671 tn 4 sn 52), xfer 938, retry 0
[ 45044.5982845] wd3d: device timeout writing fsbn 3764544800 of 3764544800-3764544831 (wd3 bn 3764544800; cn 3734667 tn 7 sn 23), xfer 9d0, retry 0
[ 45044.7542730] wd3d: device timeout writing fsbn 7124681152 of 7124681152-7124681279 (wd3 bn 7124681152; cn 7068136 tn 1 sn 1), xfer a68, retry 0
[ 45044.9092242] wd3d: device timeout writing fsbn 7124681088 of 7124681088-7124681151 (wd3 bn 7124681088; cn 7068136 tn 0 sn 0), xfer b00, retry 0
[ 45045.0641735] wd3d: device timeout writing fsbn 3951300736 of 3951300736-3951300799 (wd3 bn 3951300736; cn 3919941 tn 3 sn 19), xfer b98, retry 0
[ 45045.2201647] wd3d: device timeout writing fsbn 3958537280 of 3958537280-3958537343 (wd3 bn 3958537280; cn 3927120 tn 5 sn 5), xfer c30, retry 0
[ 45045.3751142] wd3d: device timeout writing fsbn 3883039872 of 3883039872-3883039935 (wd3 bn 3883039872; cn 3852222 tn 1 sn 33), xfer cc8, retry 0
[ 45045.5311042] wd3d: device timeout writing fsbn 3883032192 of 3883032192-3883032255 (wd3 bn 3883032192; cn 3852214 tn 7 sn 39), xfer d60, retry 0
[ 45045.6870949] wd3d: device timeout writing fsbn 4770050496 of 4770050496-4770050559 (wd3 bn 4770050496; cn 4732192 tn 15 sn 15), xfer df8, retry 0
[ 45046.3329589] wd3: soft error (corrected) xfer df8
[ 45046.3904871] wd3: soft error (corrected) xfer d60
[ 45046.4476828] wd3: soft error (corrected) xfer cc8
[ 45046.5048796] wd3: soft error (corrected) xfer c30
[ 45046.5620769] wd3: soft error (corrected) xfer b98
[ 45046.6192729] wd3: soft error (corrected) xfer b00
[ 45046.6764685] wd3: soft error (corrected) xfer a68
[ 45046.7336655] wd3: soft error (corrected) xfer 9d0
[ 45046.7908619] wd3: soft error (corrected) xfer 938
[ 45046.8480580] wd3: soft error (corrected) xfer 8a0
[ 45046.9052551] wd3: soft error (corrected) xfer 808
[ 45046.9624508] wd3: soft error (corrected) xfer 770
[ 45047.0196475] wd3: soft error (corrected) xfer 6d8
[ 45047.0768438] wd3: soft error (corrected) xfer 640
[ 45047.1340403] wd3: soft error (corrected) xfer 5a8
[ 45047.1912370] wd3: soft error (corrected) xfer 510
[ 45047.2484331] wd3: soft error (corrected) xfer 478
[ 45047.3056288] wd3: soft error (corrected) xfer 3e0
[ 45047.3628253] wd3: soft error (corrected) xfer 348
[ 45047.4200222] wd3: soft error (corrected) xfer 2b0
[ 45047.4772193] wd3: soft error (corrected) xfer 180
[ 45047.5344148] wd3: soft error (corrected) xfer e8
[ 45047.5905719] wd3: soft error (corrected) xfer 50
[ 45047.6467280] wd3: soft error (corrected) xfer 218
Here's another one, which interestingly includes an "uncorrectable
data error" followed by a "soft error (corrected)" for the same
transfer:
[ 56845.1934106] wd3d: device timeout reading fsbn 4190850944 of 4190850944-4190850967 (wd3 bn 4190850944; cn 4157590 tn 3 sn 35), xfer 490, retry 0
[ 56845.3474313] wd3d: device timeout reading fsbn 7217102720 of 7217102720-7217102783 (wd3 bn 7217102720; cn 7159824 tn 2 sn 2), xfer 68, retry 0
[ 56845.5013421] wd3d: device timeout writing fsbn 4492000896 of 4492000896-4492000959 (wd3 bn 4492000896; cn 4456350 tn 1 sn 33), xfer 198, retry 0
[ 56845.6573313] wd3d: device timeout writing fsbn 3945492160 of 3945492160-3945492223 (wd3 bn 3945492160; cn 3914178 tn 11 sn 43), xfer 230, retry 0
[ 56845.8143624] wd3d: device timeout writing fsbn 3881732800 of 3881732800-3881732863 (wd3 bn 3881732800; cn 3850925 tn 6 sn 22), xfer 2c8, retry 0
[ 56845.9703522] wd3d: device timeout writing fsbn 4030504640 of 4030504640-4030504703 (wd3 bn 4030504640; cn 3998516 tn 8 sn 8), xfer 360, retry 0
[ 56846.1253022] wd3d: device timeout writing fsbn 3951298944 of 3951298944-3951299007 (wd3 bn 3951298944; cn 3919939 tn 6 sn 54), xfer 100, retry 0
[ 56846.2812920] wd3d: device timeout writing fsbn 3951359872 of 3951359872-3951359935 (wd3 bn 3951359872; cn 3919999 tn 13 sn 61), xfer 3f8, retry 0
[ 56846.4383222] wd3d: device timeout writing fsbn 4031996864 of 4031996864-4031996927 (wd3 bn 4031996864; cn 3999996 tn 14 sn 14), xfer 528, retry 0
[ 56846.5953523] wd3d: device timeout writing fsbn 4492654656 of 4492654656-4492654783 (wd3 bn 4492654656; cn 4456998 tn 10 sn 42), xfer 5c0, retry 0
[ 56846.7523839] wd3d: device timeout reading fsbn 4669633728 of 4669633728-4669633791 (wd3 bn 4669633728; cn 4632573 tn 2 sn 18), xfer 658, retry 0
[ 56846.9083727] wd3d: device timeout writing fsbn 3883017088 of 3883017088-3883017151 (wd3 bn 3883017088; cn 3852199 tn 7 sn 55), xfer 6f0, retry 0
[ 56914.8308581] wd3d: error writing fsbn 3883017088 of 3883017088-3883017151 (wd3 bn 3883017088; cn 3852199 tn 7 sn 55), xfer 6f0, retry 1
[ 56914.9709129] wd3: (aborted command, id not found, uncorrectable data error)
[ 56915.0509450] wd3d: requeue reading fsbn 4669633728 of 4669633728-4669633791 (wd3 bn 4669633728; cn 4632573 tn 2 sn 18), xfer 658, retry 1
[ 56915.2010038] wd3d: requeue writing fsbn 4492654656 of 4492654656-4492654783 (wd3 bn 4492654656; cn 4456998 tn 10 sn 42), xfer 5c0, retry 1
[ 56915.3510630] wd3d: requeue writing fsbn 4031996864 of 4031996864-4031996927 (wd3 bn 4031996864; cn 3999996 tn 14 sn 14), xfer 528, retry 1
[ 56915.5011219] wd3d: requeue writing fsbn 3951359872 of 3951359872-3951359935 (wd3 bn 3951359872; cn 3919999 tn 13 sn 61), xfer 3f8, retry 1
[ 56915.6511814] wd3d: requeue writing fsbn 3951298944 of 3951298944-3951299007 (wd3 bn 3951298944; cn 3919939 tn 6 sn 54), xfer 100, retry 1
[ 56915.8012403] wd3d: requeue writing fsbn 4030504640 of 4030504640-4030504703 (wd3 bn 4030504640; cn 3998516 tn 8 sn 8), xfer 360, retry 1
[ 56915.9512994] wd3d: requeue writing fsbn 3881732800 of 3881732800-3881732863 (wd3 bn 3881732800; cn 3850925 tn 6 sn 22), xfer 2c8, retry 1
[ 56916.1013587] wd3d: requeue writing fsbn 3945492160 of 3945492160-3945492223 (wd3 bn 3945492160; cn 3914178 tn 11 sn 43), xfer 230, retry 1
[ 56916.2514185] wd3d: requeue writing fsbn 4492000896 of 4492000896-4492000959 (wd3 bn 4492000896; cn 4456350 tn 1 sn 33), xfer 198, retry 1
[ 56916.4014769] wd3d: requeue reading fsbn 7217102720 of 7217102720-7217102783 (wd3 bn 7217102720; cn 7159824 tn 2 sn 2), xfer 68, retry 1
[ 56916.5415318] wd3d: requeue reading fsbn 4190850944 of 4190850944-4190850967 (wd3 bn 4190850944; cn 4157590 tn 3 sn 35), xfer 490, retry 1
[ 56916.6915911] wd3: soft error (corrected) xfer 6f0
[ 56916.7548500] wd3: soft error (corrected) xfer 198
[ 56916.8121630] wd3: soft error (corrected) xfer 230
[ 56916.8693600] wd3: soft error (corrected) xfer 2c8
[ 56916.9265564] wd3: soft error (corrected) xfer 360
[ 56916.9837519] wd3: soft error (corrected) xfer 100
[ 56917.0409486] wd3: soft error (corrected) xfer 3f8
[ 56917.0981443] wd3: soft error (corrected) xfer 528
[ 56917.1553414] wd3: soft error (corrected) xfer 5c0
[ 56917.3418506] wd3: soft error (corrected) xfer 490
[ 56918.7023839] wd3: soft error (corrected) xfer 658
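To follow a single transfer through a burst like the ones above, the log lines can be grouped by their xfer value. The following is a minimal sketch, not part of the original report; the regular expressions and the function name group_by_xfer are assumptions written only against the log excerpts in this PR.

```python
import re
from collections import defaultdict

# Patterns for the line shapes seen in the console logs above.
# They are written for this PR's excerpts only, not for wd(4) output in general.
EVENT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+wd\d+\w?:\s+"
    r"(?P<kind>device timeout|error|requeue)\s+(?:reading|writing)\b.*"
    r"xfer (?P<xfer>[0-9a-f]+), retry (?P<retry>\d+)"
)
SOFT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+wd\d+:\s+soft error \(corrected\) "
    r"xfer (?P<xfer>[0-9a-f]+)"
)

def group_by_xfer(lines):
    """Map each xfer value to its (timestamp, event kind) pairs, in log order."""
    events = defaultdict(list)
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            events[m["xfer"]].append((float(m["ts"]), m["kind"]))
            continue
        m = SOFT_RE.search(line)
        if m:
            events[m["xfer"]].append((float(m["ts"]), "soft error"))
    return dict(events)

# The three lines for xfer 6f0, copied from the second burst above.
sample = [
    "[ 56846.9083727] wd3d: device timeout writing fsbn 3883017088 of 3883017088-3883017151 (wd3 bn 3883017088; cn 3852199 tn 7 sn 55), xfer 6f0, retry 0",
    "[ 56914.8308581] wd3d: error writing fsbn 3883017088 of 3883017088-3883017151 (wd3 bn 3883017088; cn 3852199 tn 7 sn 55), xfer 6f0, retry 1",
    "[ 56916.6915911] wd3: soft error (corrected) xfer 6f0",
]
events = group_by_xfer(sample)
print([kind for _, kind in events["6f0"]])
# ['device timeout', 'error', 'soft error']
```

Feeding a whole burst through group_by_xfer makes it easy to spot transfers whose sequence of events differs from the others.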
Here's the dmesg output for the controller and disk again:
[ 1.092336] ahcisata1 at pci0 dev 31 function 2: vendor 8086 product 8d02 (rev. 0x05)
[ 1.092336] ahcisata1: 64-bit DMA
[ 1.092336] ahcisata1: AHCI revision 1.30, 6 ports, 32 slots, CAP 0xcb30ff45<EMS,PSC,SSC,PMD,ISS=0x3=Gen3,SCLO,SAL,SSS,SNCQ,S64A>
[ 1.092336] ahcisata1: interrupting at msi5 vec 0
[ 1.092336] atabus4 at ahcisata1 channel 0
[ 1.092336] atabus5 at ahcisata1 channel 1
[ 1.092336] atabus6 at ahcisata1 channel 2
[ 1.092336] atabus7 at ahcisata1 channel 3
[ 1.092336] atabus8 at ahcisata1 channel 4
[ 1.092336] atabus9 at ahcisata1 channel 5
[ 3.251001] ahcisata1 port 1: device present, speed: 6.0Gb/s
[...]
[ 5.921953] wd3: <ST5000LM000-2AN170>
[ 5.961967] wd3: drive supports 1-sector PIO transfers, LBA48 addressing
[ 5.961967] wd3: 4657 GB, 9690021 cyl, 16 head, 63 sec, 512 bytes/sect x 9767541168 sectors (0 bytes/physsect; first aligned sector: 8)
[ 6.422131] wd3: GPT GUID: 31a830f6-dace-4062-ab89-f6911c261385
[ 6.422131] dk2 at wd3: "65183669-9719-47da-9e2a-09a1f9a7bf6d", 9767538688 blocks at 2048, type: ffs
[ 6.532170] wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags)
[ 6.532170] wd3(ahcisata1:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags)
The SMART log of the disk shows no errors.
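As a sanity check on the log lines, the cn/tn/sn values in the timeout messages appear to be consistent with the "16 head, 63 sec" geometry from the dmesg output. The sketch below is an illustration added for this purpose; the function name lba_to_chs is hypothetical, and the arithmetic is the usual LBA-to-CHS split rather than a copy of the driver's code.

```python
# Geometry reported by dmesg for wd3: 16 heads, 63 sectors per track.
HEADS = 16
SECTORS_PER_TRACK = 63

def lba_to_chs(bn, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Split an absolute block number into (cylinder, track, sector).

    The sector number is zero-based, matching the values printed in the
    log lines (for example "tn 0 sn 0").
    """
    cn, rest = divmod(bn, heads * spt)
    tn, sn = divmod(rest, spt)
    return cn, tn, sn

# Block numbers copied from three of the timeout messages in this PR.
print(lba_to_chs(4431519296))  # (4396348, 8, 8)
print(lba_to_chs(7124681088))  # (7068136, 0, 0)
print(lba_to_chs(3779132800))  # (3749139, 10, 58)
```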
>How-To-Repeat:
>Fix:
>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56686: wd(4) device timeouts
Date: Fri, 4 Feb 2022 12:54:54 -0000 (UTC)
gson@gson.org (Andreas Gustafsson) writes:
> [ 5.921953] wd3: <ST5000LM000-2AN170>
ST5000LM000 is a SMR disk, the timeouts might be real and
the driver might need to wait longer for such hardware.
The 'uncorrectable data error' comes from the drive, so
that's probably real too.
From: Andreas Gustafsson <gson@gson.org>
To: mlelstv@serpens.de (Michael van Elst)
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/56686: wd(4) device timeouts
Date: Fri, 4 Feb 2022 15:45:00 +0200
Michael van Elst wrote:
> > [ 5.921953] wd3: <ST5000LM000-2AN170>
>
> ST5000LM000 is a SMR disk, the timeouts might be real and
> the driver might need to wait longer for such hardware.
Agreed. Where is the timeout defined, and does it take the queue
length into account?
> The 'uncorrectable data error' comes from the drive, so
> that's probably real too.
Probably, but I don't see it in the SMART error log:
smartctl 7.2 2020-12-30 r5155 [NetBSD 9.2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 2.5 5400
Device Model: ST5000LM000-2AN170
Serial Number: WCJ46G52
LU WWN Device Id: 5 000c50 0d44f62ec
Firmware Version: 0001
User Capacity: 5,000,981,078,016 bytes [5.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5526 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Feb 4 15:41:50 2022 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 824) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x30a5) SCT Status supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 072 062 006 Pre-fail Always - 74443112
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 37
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 082 060 045 Pre-fail Always - 166162157
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 7150 (25 38 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 36
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 34360262665
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 072 046 040 Old_age Always - 28 (Min/Max 22/33)
191 G-Sense_Error_Rate 0x0032 099 099 000 Old_age Always - 2867
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 223651
194 Temperature_Celsius 0x0022 028 054 000 Old_age Always - 28 (0 19 0 0 0)
195 Hardware_ECC_Recovered 0x001a 079 064 000 Old_age Always - 74443112
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 2389 (212 48 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 35963005615
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 14641645157
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
--
Andreas Gustafsson, gson@gson.org
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56686: wd(4) device timeouts
Date: Fri, 4 Feb 2022 15:00:53 +0100
On Fri, Feb 04, 2022 at 01:50:01PM +0000, Andreas Gustafsson wrote:
> Probably, but I don't see it in the SMART error log:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x000f 072 062 006 Pre-fail Always - 74443112
[..]
> 195 Hardware_ECC_Recovered 0x001a 079 064 000 Old_age Always - 74443112
I just replaced a similar Barracuda drive with a CMR one. It worked, and
I got all data from it, but it became *really* slow - and subjectively slower
every day.
Those two values were similarly high, but the drive still considered itself
healthy. I'm avoiding SMR.
Martin
From: Michael van Elst <mlelstv@serpens.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/56686: wd(4) device timeouts
Date: Fri, 4 Feb 2022 15:58:51 +0100
On Fri, Feb 04, 2022 at 03:45:00PM +0200, Andreas Gustafsson wrote:
> Michael van Elst wrote:
> > > [ 5.921953] wd3: <ST5000LM000-2AN170>
> >
> > ST5000LM000 is a SMR disk, the timeouts might be real and
> > the driver might need to wait longer for such hardware.
>
> Agreed. Where is the timeout defined, and does it take the queue
> length into account?
There are several kinds of timeouts, but this one should be a command
timeout. It is a callout started when the command is issued to the
controller, and the timeout period is a constant that depends on the
particular command. Regular I/O commands get 10s; that's the
ATA_DELAY constant in
sys/dev/ic/ahcisata_core.c
sys/dev/ata/ata_wdc.c
Other controllers may have something else...
There are a few commands that have their own timeouts, like flushing
the drive cache (used by WAPBL if you let it). They should not run
concurrently with I/O commands, but I'm not sure.
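To illustrate the queue-length question: if every command gets the same constant timeout from the moment it is issued, a deep queue on a slow drive can expire commands that are merely waiting their turn. The following is a toy model only, not the driver's logic. It assumes all commands are issued at the same instant and the drive completes them strictly one after another; the function name and the service times are made up for illustration.

```python
ATA_DELAY_S = 10.0   # regular I/O timeout quoted above
NCQ_TAGS = 31        # "NCQ (31 tags)" from the dmesg output

def timed_out(n_queued, service_time_s, timeout_s=ATA_DELAY_S):
    """Return the 0-based indices of commands whose timer expires first.

    Toy model: all n_queued commands are issued at t=0, each with its own
    timer of timeout_s, and the drive completes them one after another,
    taking service_time_s per command.
    """
    return [k for k in range(n_queued)
            if (k + 1) * service_time_s > timeout_s]

# Hypothetical per-command service times, in seconds.
print(len(timed_out(NCQ_TAGS, 0.2)))  # 0
print(len(timed_out(NCQ_TAGS, 0.5)))  # 11
```

In this model a full queue of 31 commands starts to time out once the average service time exceeds 10/31 s, or roughly a third of a second, even though no single command is slow by itself.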
> > The 'uncorrectable data error' comes from the drive, so
> > that's probably real too.
>
> Probably, but I don't see it in the SMART error log:
It could all be hidden in that number:
> 1 Raw_Read_Error_Rate 0x000f 072 062 006 Pre-fail Always - 74443112
> SMART Self-test log structure revision number 1
> No self-tests have been logged. [To run self-tests, use: smartctl -t]
Maybe a long self-test will log something.
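Some of the large raw values in the attribute table become easier to read when the 48-bit raw field is split into 16-bit words. Whether this drive actually packs separate counters into those words is an assumption; neither the PR nor the smartctl output above confirms it, and what each word would mean is not stated here. The helper name split_raw48 is hypothetical.

```python
def split_raw48(raw):
    """Split a 48-bit SMART raw value into three 16-bit words, high word first."""
    if not 0 <= raw < 1 << 48:
        raise ValueError("raw value does not fit in 48 bits")
    return (raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF

# Raw values copied from the smartctl output in this PR.
print(split_raw48(34360262665))  # Command_Timeout     -> (8, 8, 9)
print(split_raw48(74443112))     # Raw_Read_Error_Rate -> (0, 1135, 59752)
```

smartctl can print a raw value in this form itself when given a vendor attribute option such as -v 188,raw16.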
Greetings,
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: Patrick Welche <prlw1@talktalk.net>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56686: wd(4) device timeouts
Date: Sun, 6 Feb 2022 12:04:03 +0000
On Fri, Feb 04, 2022 at 01:50:01PM +0000, Andreas Gustafsson wrote:
>
> Michael van Elst wrote:
> > > [ 5.921953] wd3: <ST5000LM000-2AN170>
> >
> > ST5000LM000 is a SMR disk, the timeouts might be real and
> > the driver might need to wait longer for such hardware.
>
> Agreed. Where is the timeout defined, and does it take the queue
> length into account?
Might options AHCISATA_EXTRA_DELAY help or isn't this -current?
(c.f.
commit 4cd5be3fe84c428901551ba8bbc784639677d714
Author: jmcneill <jmcneill@NetBSD.org>
Date: Mon Oct 11 12:48:10 2021 +0000
ahcisata: remove excessive delays from drive probe path
There are a handful of inexplicable 500ms delays introduced to the drive
detect path in this driver, slowing boot. They can be re-enabled with
options AHCISATA_EXTRA_DELAY, but should not be enabled for normal kernels.
If a delay does need to be introduced in these places, the value should
either be more carefully selected or the scope limited to hardware that
requires the extra delay.
sys/dev/ic/ahcisata_core.c | 14 ++++++++++++--
)
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
Andreas Gustafsson <gson@gson.org>
Subject: Re: kern/56686: wd(4) device timeouts
Date: Sun, 6 Feb 2022 14:24:58 +0100
On Sun, Feb 06, 2022 at 12:05:01PM +0000, Patrick Welche wrote:
> The following reply was made to PR kern/56686; it has been noted by GNATS.
>
> From: Patrick Welche <prlw1@talktalk.net>
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: kern/56686: wd(4) device timeouts
> Date: Sun, 6 Feb 2022 12:04:03 +0000
>
> On Fri, Feb 04, 2022 at 01:50:01PM +0000, Andreas Gustafsson wrote:
> >
> > Michael van Elst wrote:
> > > > [ 5.921953] wd3: <ST5000LM000-2AN170>
> > >
> > > ST5000LM000 is a SMR disk, the timeouts might be real and
> > > the driver might need to wait longer for such hardware.
> >
> > Agreed. Where is the timeout defined, and does it take the queue
> > length into account?
>
> Might options AHCISATA_EXTRA_DELAY help or isn't this -current?
No, this option is for the probe path only; it's not used for regular I/Os.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference