NetBSD Problem Report #56686

From gson@gson.org  Fri Feb  4 11:01:39 2022
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id A016A1A9239
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  4 Feb 2022 11:01:39 +0000 (UTC)
Message-Id: <20220204110129.1A93E254379@guava.gson.org>
Date: Fri,  4 Feb 2022 13:01:29 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: wd(4) device timeouts
X-Send-Pr-Version: 3.95

>Number:         56686
>Category:       kern
>Synopsis:       wd(4) device timeouts
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 04 11:05:00 +0000 2022
>Last-Modified:  Sun Feb 06 13:30:01 +0000 2022
>Originator:     Andreas Gustafsson
>Release:        NetBSD 9.2
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

With a workload involving multiple emulator processes running NetBSD
guests running ATF tests on a ST5000LM000-2AN170 disk in a Dell
PowerEdge R630 server, I am seeing bursts of "device timeout" errors
several times a day.  This PR is about these device timeout errors
themselves; for the panics they sometimes trigger, see PR 56479.

Here is the console log from one such burst:

[ 45042.1012910] wd3d: device timeout writing fsbn 4431519296 of 4431519296-4431519359 (wd3 bn 4431519296; cn 4396348 tn 8 sn 8), xfer 218, retry 0
[ 45042.2542704] wd3d: device timeout writing fsbn 3779132800 of 3779132800-3779132927 (wd3 bn 3779132800; cn 3749139 tn 10 sn 58), xfer 50, retry 0
[ 45042.4102613] wd3d: device timeout writing fsbn 4418502912 of 4418502912-4418502935 (wd3 bn 4418502912; cn 4383435 tn 6 sn 54), xfer e8, retry 0
[ 45042.5652114] wd3d: device timeout writing fsbn 3778503360 of 3778503360-3778503423 (wd3 bn 3778503360; cn 3748515 tn 3 sn 51), xfer 180, retry 0
[ 45042.7212016] wd3d: device timeout writing fsbn 4654435520 of 4654435520-4654435583 (wd3 bn 4654435520; cn 4617495 tn 8 sn 56), xfer 2b0, retry 0
[ 45042.8771911] wd3d: device timeout writing fsbn 4417615040 of 4417615040-4417615103 (wd3 bn 4417615040; cn 4382554 tn 9 sn 41), xfer 348, retry 0
[ 45043.0331818] wd3d: device timeout writing fsbn 3957636800 of 3957636800-3957636863 (wd3 bn 3957636800; cn 3926226 tn 15 sn 47), xfer 3e0, retry 0
[ 45043.1902111] wd3d: device timeout writing fsbn 4030504640 of 4030504640-4030504703 (wd3 bn 4030504640; cn 3998516 tn 8 sn 8), xfer 478, retry 0
[ 45043.3451615] wd3d: device timeout writing fsbn 3945492160 of 3945492160-3945492223 (wd3 bn 3945492160; cn 3914178 tn 11 sn 43), xfer 510, retry 0
[ 45043.5021917] wd3d: device timeout writing fsbn 3763322560 of 3763322560-3763322623 (wd3 bn 3763322560; cn 3733454 tn 14 sn 46), xfer 5a8, retry 0
[ 45043.6592219] wd3d: device timeout writing fsbn 3881732800 of 3881732800-3881732863 (wd3 bn 3881732800; cn 3850925 tn 6 sn 22), xfer 640, retry 0
[ 45043.8152122] wd3d: device timeout writing fsbn 3778503296 of 3778503296-3778503359 (wd3 bn 3778503296; cn 3748515 tn 2 sn 50), xfer 6d8, retry 0
[ 45043.9712037] wd3d: device timeout writing fsbn 3779132928 of 3779132928-3779133055 (wd3 bn 3779132928; cn 3749139 tn 12 sn 60), xfer 770, retry 0
[ 45044.1282335] wd3d: device timeout writing fsbn 4031996800 of 4031996800-4031996927 (wd3 bn 4031996800; cn 3999996 tn 13 sn 13), xfer 808, retry 0
[ 45044.2852623] wd3d: device timeout writing fsbn 3950007936 of 3950007936-3950007999 (wd3 bn 3950007936; cn 3918658 tn 10 sn 42), xfer 8a0, retry 0
[ 45044.4422926] wd3d: device timeout writing fsbn 3764548672 of 3764548672-3764548735 (wd3 bn 3764548672; cn 3734671 tn 4 sn 52), xfer 938, retry 0
[ 45044.5982845] wd3d: device timeout writing fsbn 3764544800 of 3764544800-3764544831 (wd3 bn 3764544800; cn 3734667 tn 7 sn 23), xfer 9d0, retry 0
[ 45044.7542730] wd3d: device timeout writing fsbn 7124681152 of 7124681152-7124681279 (wd3 bn 7124681152; cn 7068136 tn 1 sn 1), xfer a68, retry 0
[ 45044.9092242] wd3d: device timeout writing fsbn 7124681088 of 7124681088-7124681151 (wd3 bn 7124681088; cn 7068136 tn 0 sn 0), xfer b00, retry 0
[ 45045.0641735] wd3d: device timeout writing fsbn 3951300736 of 3951300736-3951300799 (wd3 bn 3951300736; cn 3919941 tn 3 sn 19), xfer b98, retry 0
[ 45045.2201647] wd3d: device timeout writing fsbn 3958537280 of 3958537280-3958537343 (wd3 bn 3958537280; cn 3927120 tn 5 sn 5), xfer c30, retry 0
[ 45045.3751142] wd3d: device timeout writing fsbn 3883039872 of 3883039872-3883039935 (wd3 bn 3883039872; cn 3852222 tn 1 sn 33), xfer cc8, retry 0
[ 45045.5311042] wd3d: device timeout writing fsbn 3883032192 of 3883032192-3883032255 (wd3 bn 3883032192; cn 3852214 tn 7 sn 39), xfer d60, retry 0
[ 45045.6870949] wd3d: device timeout writing fsbn 4770050496 of 4770050496-4770050559 (wd3 bn 4770050496; cn 4732192 tn 15 sn 15), xfer df8, retry 0
[ 45046.3329589] wd3: soft error (corrected) xfer df8
[ 45046.3904871] wd3: soft error (corrected) xfer d60
[ 45046.4476828] wd3: soft error (corrected) xfer cc8
[ 45046.5048796] wd3: soft error (corrected) xfer c30
[ 45046.5620769] wd3: soft error (corrected) xfer b98
[ 45046.6192729] wd3: soft error (corrected) xfer b00
[ 45046.6764685] wd3: soft error (corrected) xfer a68
[ 45046.7336655] wd3: soft error (corrected) xfer 9d0
[ 45046.7908619] wd3: soft error (corrected) xfer 938
[ 45046.8480580] wd3: soft error (corrected) xfer 8a0
[ 45046.9052551] wd3: soft error (corrected) xfer 808
[ 45046.9624508] wd3: soft error (corrected) xfer 770
[ 45047.0196475] wd3: soft error (corrected) xfer 6d8
[ 45047.0768438] wd3: soft error (corrected) xfer 640
[ 45047.1340403] wd3: soft error (corrected) xfer 5a8
[ 45047.1912370] wd3: soft error (corrected) xfer 510
[ 45047.2484331] wd3: soft error (corrected) xfer 478
[ 45047.3056288] wd3: soft error (corrected) xfer 3e0
[ 45047.3628253] wd3: soft error (corrected) xfer 348
[ 45047.4200222] wd3: soft error (corrected) xfer 2b0
[ 45047.4772193] wd3: soft error (corrected) xfer 180
[ 45047.5344148] wd3: soft error (corrected) xfer e8
[ 45047.5905719] wd3: soft error (corrected) xfer 50
[ 45047.6467280] wd3: soft error (corrected) xfer 218
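
In the burst above, every "device timeout" line carries an xfer token that reappears in a later "soft error (corrected)" line. A minimal sketch for pairing the two and measuring the delay between them follows. It assumes the xfer token can be treated as an opaque key, and the regular expressions are written against the log lines shown in this report rather than against the driver's format strings. The names pair_timeouts, TIMEOUT_RE and CORRECTED_RE are made up for this example.

```python
import re

# Patterns derived from the console lines quoted in this PR (an assumption,
# not taken from the wd(4) source).
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] wd\d+\w: device timeout (?P<op>reading|writing) "
    r"fsbn (?P<fsbn>\d+) .*?, xfer (?P<xfer>[0-9a-f]+), retry (?P<retry>\d+)"
)
CORRECTED_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] wd\d+: soft error \(corrected\) xfer (?P<xfer>[0-9a-f]+)"
)


def pair_timeouts(log_text):
    """Map each xfer token to (timeout_timestamp, corrected_timestamp or None)."""
    pending = {}
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            pending[m["xfer"]] = (float(m["ts"]), None)
            continue
        m = CORRECTED_RE.search(line)
        if m and m["xfer"] in pending:
            pending[m["xfer"]] = (pending[m["xfer"]][0], float(m["ts"]))
    return pending


# Four lines copied from the first burst in this report.
SAMPLE = """\
[ 45042.1012910] wd3d: device timeout writing fsbn 4431519296 of 4431519296-4431519359 (wd3 bn 4431519296; cn 4396348 tn 8 sn 8), xfer 218, retry 0
[ 45042.2542704] wd3d: device timeout writing fsbn 3779132800 of 3779132800-3779132927 (wd3 bn 3779132800; cn 3749139 tn 10 sn 58), xfer 50, retry 0
[ 45047.5905719] wd3: soft error (corrected) xfer 50
[ 45047.6467280] wd3: soft error (corrected) xfer 218
"""

pairs = pair_timeouts(SAMPLE)
for xfer, (t0, t1) in pairs.items():
    outcome = "not corrected in this excerpt" if t1 is None else f"corrected {t1 - t0:.3f} s later"
    print(f"xfer {xfer}: timeout at {t0:.3f}, {outcome}")
```

Run over a full burst, this shows whether any timed-out transfer never gets a matching "corrected" line.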

Here's another one, which interestingly includes an "uncorrectable
data error" followed by a "soft error (corrected)" for the same
transfer:

[ 56845.1934106] wd3d: device timeout reading fsbn 4190850944 of 4190850944-4190850967 (wd3 bn 4190850944; cn 4157590 tn 3 sn 35), xfer 490, retry 0
[ 56845.3474313] wd3d: device timeout reading fsbn 7217102720 of 7217102720-7217102783 (wd3 bn 7217102720; cn 7159824 tn 2 sn 2), xfer 68, retry 0
[ 56845.5013421] wd3d: device timeout writing fsbn 4492000896 of 4492000896-4492000959 (wd3 bn 4492000896; cn 4456350 tn 1 sn 33), xfer 198, retry 0
[ 56845.6573313] wd3d: device timeout writing fsbn 3945492160 of 3945492160-3945492223 (wd3 bn 3945492160; cn 3914178 tn 11 sn 43), xfer 230, retry 0
[ 56845.8143624] wd3d: device timeout writing fsbn 3881732800 of 3881732800-3881732863 (wd3 bn 3881732800; cn 3850925 tn 6 sn 22), xfer 2c8, retry 0
[ 56845.9703522] wd3d: device timeout writing fsbn 4030504640 of 4030504640-4030504703 (wd3 bn 4030504640; cn 3998516 tn 8 sn 8), xfer 360, retry 0
[ 56846.1253022] wd3d: device timeout writing fsbn 3951298944 of 3951298944-3951299007 (wd3 bn 3951298944; cn 3919939 tn 6 sn 54), xfer 100, retry 0
[ 56846.2812920] wd3d: device timeout writing fsbn 3951359872 of 3951359872-3951359935 (wd3 bn 3951359872; cn 3919999 tn 13 sn 61), xfer 3f8, retry 0
[ 56846.4383222] wd3d: device timeout writing fsbn 4031996864 of 4031996864-4031996927 (wd3 bn 4031996864; cn 3999996 tn 14 sn 14), xfer 528, retry 0
[ 56846.5953523] wd3d: device timeout writing fsbn 4492654656 of 4492654656-4492654783 (wd3 bn 4492654656; cn 4456998 tn 10 sn 42), xfer 5c0, retry 0
[ 56846.7523839] wd3d: device timeout reading fsbn 4669633728 of 4669633728-4669633791 (wd3 bn 4669633728; cn 4632573 tn 2 sn 18), xfer 658, retry 0
[ 56846.9083727] wd3d: device timeout writing fsbn 3883017088 of 3883017088-3883017151 (wd3 bn 3883017088; cn 3852199 tn 7 sn 55), xfer 6f0, retry 0
[ 56914.8308581] wd3d: error writing fsbn 3883017088 of 3883017088-3883017151 (wd3 bn 3883017088; cn 3852199 tn 7 sn 55), xfer 6f0, retry 1
[ 56914.9709129] wd3: (aborted command, id not found, uncorrectable data error)
[ 56915.0509450] wd3d: requeue reading fsbn 4669633728 of 4669633728-4669633791 (wd3 bn 4669633728; cn 4632573 tn 2 sn 18), xfer 658, retry 1
[ 56915.2010038] wd3d: requeue writing fsbn 4492654656 of 4492654656-4492654783 (wd3 bn 4492654656; cn 4456998 tn 10 sn 42), xfer 5c0, retry 1
[ 56915.3510630] wd3d: requeue writing fsbn 4031996864 of 4031996864-4031996927 (wd3 bn 4031996864; cn 3999996 tn 14 sn 14), xfer 528, retry 1
[ 56915.5011219] wd3d: requeue writing fsbn 3951359872 of 3951359872-3951359935 (wd3 bn 3951359872; cn 3919999 tn 13 sn 61), xfer 3f8, retry 1
[ 56915.6511814] wd3d: requeue writing fsbn 3951298944 of 3951298944-3951299007 (wd3 bn 3951298944; cn 3919939 tn 6 sn 54), xfer 100, retry 1
[ 56915.8012403] wd3d: requeue writing fsbn 4030504640 of 4030504640-4030504703 (wd3 bn 4030504640; cn 3998516 tn 8 sn 8), xfer 360, retry 1
[ 56915.9512994] wd3d: requeue writing fsbn 3881732800 of 3881732800-3881732863 (wd3 bn 3881732800; cn 3850925 tn 6 sn 22), xfer 2c8, retry 1
[ 56916.1013587] wd3d: requeue writing fsbn 3945492160 of 3945492160-3945492223 (wd3 bn 3945492160; cn 3914178 tn 11 sn 43), xfer 230, retry 1
[ 56916.2514185] wd3d: requeue writing fsbn 4492000896 of 4492000896-4492000959 (wd3 bn 4492000896; cn 4456350 tn 1 sn 33), xfer 198, retry 1
[ 56916.4014769] wd3d: requeue reading fsbn 7217102720 of 7217102720-7217102783 (wd3 bn 7217102720; cn 7159824 tn 2 sn 2), xfer 68, retry 1
[ 56916.5415318] wd3d: requeue reading fsbn 4190850944 of 4190850944-4190850967 (wd3 bn 4190850944; cn 4157590 tn 3 sn 35), xfer 490, retry 1
[ 56916.6915911] wd3: soft error (corrected) xfer 6f0
[ 56916.7548500] wd3: soft error (corrected) xfer 198
[ 56916.8121630] wd3: soft error (corrected) xfer 230
[ 56916.8693600] wd3: soft error (corrected) xfer 2c8
[ 56916.9265564] wd3: soft error (corrected) xfer 360
[ 56916.9837519] wd3: soft error (corrected) xfer 100
[ 56917.0409486] wd3: soft error (corrected) xfer 3f8
[ 56917.0981443] wd3: soft error (corrected) xfer 528
[ 56917.1553414] wd3: soft error (corrected) xfer 5c0
[ 56917.3418506] wd3: soft error (corrected) xfer 490
[ 56918.7023839] wd3: soft error (corrected) xfer 658

Here's the dmesg output for the controller and disk again:

 [     1.092336] ahcisata1 at pci0 dev 31 function 2: vendor 8086 product 8d02 (rev. 0x05)
 [     1.092336] ahcisata1: 64-bit DMA
 [     1.092336] ahcisata1: AHCI revision 1.30, 6 ports, 32 slots, CAP 0xcb30ff45<EMS,PSC,SSC,PMD,ISS=0x3=Gen3,SCLO,SAL,SSS,SNCQ,S64A>
 [     1.092336] ahcisata1: interrupting at msi5 vec 0
 [     1.092336] atabus4 at ahcisata1 channel 0
 [     1.092336] atabus5 at ahcisata1 channel 1
 [     1.092336] atabus6 at ahcisata1 channel 2
 [     1.092336] atabus7 at ahcisata1 channel 3
 [     1.092336] atabus8 at ahcisata1 channel 4
 [     1.092336] atabus9 at ahcisata1 channel 5
 [     3.251001] ahcisata1 port 1: device present, speed: 6.0Gb/s
 [...]
 [     5.921953] wd3: <ST5000LM000-2AN170>
 [     5.961967] wd3: drive supports 1-sector PIO transfers, LBA48 addressing
 [     5.961967] wd3: 4657 GB, 9690021 cyl, 16 head, 63 sec, 512 bytes/sect x 9767541168 sectors (0 bytes/physsect; first aligned sector: 8)
 [     6.422131] wd3: GPT GUID: 31a830f6-dace-4062-ab89-f6911c261385
 [     6.422131] dk2 at wd3: "65183669-9719-47da-9e2a-09a1f9a7bf6d", 9767538688 blocks at 2048, type: ffs
 [     6.532170] wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags)
 [     6.532170] wd3(ahcisata1:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags)
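
The cn/tn/sn values in the timeout messages can be cross-checked against the geometry that wd3 reports above (16 heads, 63 sectors per track, 512-byte sectors). The sketch below is only that arithmetic. The 0-based sector number is inferred from the log lines in this report, not from the driver source, and bn_to_chs is a name made up for this example.

```python
HEADS = 16                # from the wd3 dmesg line
SECTORS_PER_TRACK = 63    # from the wd3 dmesg line
BYTES_PER_SECTOR = 512    # from the wd3 dmesg line
TOTAL_SECTORS = 9767541168


def bn_to_chs(bn):
    """Convert an absolute block number to (cylinder, head, sector), sector 0-based."""
    cn, rem = divmod(bn, HEADS * SECTORS_PER_TRACK)
    tn, sn = divmod(rem, SECTORS_PER_TRACK)
    return cn, tn, sn


print(bn_to_chs(4431519296))   # -> (4396348, 8, 8), as in the first timeout line
print(bn_to_chs(3779132800))   # -> (3749139, 10, 58), as in the second timeout line

capacity_bytes = TOTAL_SECTORS * BYTES_PER_SECTOR
print(capacity_bytes)                                   # -> 5000981078016
print(capacity_bytes // 2**30)                          # -> 4657
print(TOTAL_SECTORS // (HEADS * SECTORS_PER_TRACK))     # -> 9690021
```

The byte count matches the "User Capacity" that smartctl reports for this drive, and the "4657 GB" in dmesg corresponds to units of 2^30 bytes.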

The SMART log of the disk shows no errors.

>How-To-Repeat:

>Fix:

>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56686: wd(4) device timeouts
Date: Fri, 4 Feb 2022 12:54:54 -0000 (UTC)

 gson@gson.org (Andreas Gustafsson) writes:

 > [     5.921953] wd3: <ST5000LM000-2AN170>

 ST5000LM000 is a SMR disk, the timeouts might be real and
 the driver might need to wait longer for such hardware.

 The 'uncorrectable data error' comes from the drive, so
 that's probably real too.

From: Andreas Gustafsson <gson@gson.org>
To: mlelstv@serpens.de (Michael van Elst)
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/56686: wd(4) device timeouts
Date: Fri, 4 Feb 2022 15:45:00 +0200

 Michael van Elst wrote:
 >  > [     5.921953] wd3: <ST5000LM000-2AN170>
 >  
 >  ST5000LM000 is a SMR disk, the timeouts might be real and
 >  the driver might need to wait longer for such hardware.

 Agreed.  Where is the timeout defined, and does it take the queue
 length into account?

 >  The 'uncorrectable data error' comes from the drive, so
 >  that's probably real too.

 Probably, but I don't see it in the SMART error log:

 smartctl 7.2 2020-12-30 r5155 [NetBSD 9.2 amd64] (local build)
 Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF INFORMATION SECTION ===
 Model Family:     Seagate Barracuda 2.5 5400
 Device Model:     ST5000LM000-2AN170
 Serial Number:    WCJ46G52
 LU WWN Device Id: 5 000c50 0d44f62ec
 Firmware Version: 0001
 User Capacity:    5,000,981,078,016 bytes [5.00 TB]
 Sector Sizes:     512 bytes logical, 4096 bytes physical
 Rotation Rate:    5526 rpm
 Form Factor:      2.5 inches
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   ACS-3 T13/2161-D revision 5
 SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
 Local Time is:    Fri Feb  4 15:41:50 2022 EET
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled

 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED

 General SMART Values:
 Offline data collection status:  (0x00)	Offline data collection activity
 					was never started.
 					Auto Offline Data Collection: Disabled.
 Self-test execution status:      (   0)	The previous self-test routine completed
 					without error or no self-test has ever 
 					been run.
 Total time to complete Offline 
 data collection: 		(    0) seconds.
 Offline data collection
 capabilities: 			 (0x73) SMART execute Offline immediate.
 					Auto Offline data collection on/off support.
 					Suspend Offline collection upon new
 					command.
 					No Offline surface scan supported.
 					Self-test supported.
 					Conveyance Self-test supported.
 					Selective Self-test supported.
 SMART capabilities:            (0x0003)	Saves SMART data before entering
 					power-saving mode.
 					Supports SMART auto save timer.
 Error logging capability:        (0x01)	Error logging supported.
 					General Purpose Logging supported.
 Short self-test routine 
 recommended polling time: 	 (   1) minutes.
 Extended self-test routine
 recommended polling time: 	 ( 824) minutes.
 Conveyance self-test routine
 recommended polling time: 	 (   2) minutes.
 SCT capabilities: 	       (0x30a5)	SCT Status supported.
 					SCT Data Table supported.

 SMART Attributes Data Structure revision number: 10
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   072   062   006    Pre-fail  Always       -       74443112
   3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       37
   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
   7 Seek_Error_Rate         0x000f   082   060   045    Pre-fail  Always       -       166162157
   9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7150 (25 38 0)
  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       36
 183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
 188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       34360262665
 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
 190 Airflow_Temperature_Cel 0x0022   072   046   040    Old_age   Always       -       28 (Min/Max 22/33)
 191 G-Sense_Error_Rate      0x0032   099   099   000    Old_age   Always       -       2867
 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       3
 193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       223651
 194 Temperature_Celsius     0x0022   028   054   000    Old_age   Always       -       28 (0 19 0 0 0)
 195 Hardware_ECC_Recovered  0x001a   079   064   000    Old_age   Always       -       74443112
 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
 240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2389 (212 48 0)
 241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       35963005615
 242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       14641645157
 254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

 SMART Error Log Version: 1
 No Errors Logged

 SMART Self-test log structure revision number 1
 No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 SMART Selective self-test log data structure revision number 1
  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
     1        0        0  Not_testing
     2        0        0  Not_testing
     3        0        0  Not_testing
     4        0        0  Not_testing
     5        0        0  Not_testing
 Selective self-test flags (0x0):
   After scanning selected spans, do NOT read-scan remainder of disk.
 If Selective self-test is pending on power-up, resume after 0 minute delay.

 -- 
 Andreas Gustafsson, gson@gson.org

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56686: wd(4) device timeouts
Date: Fri, 4 Feb 2022 15:00:53 +0100

 On Fri, Feb 04, 2022 at 01:50:01PM +0000, Andreas Gustafsson wrote:
 >  Probably, but I don't see it in the SMART error log:

 >  ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 >    1 Raw_Read_Error_Rate     0x000f   072   062   006    Pre-fail  Always       -       74443112
 [..]
 >  195 Hardware_ECC_Recovered  0x001a   079   064   000    Old_age   Always       -       74443112

 I just replaced a similar Barracuda drive with a CMR one. The old drive
 still worked, and I got all data from it, but it became *really* slow - and
 subjectively slower every day.

 Those two values were similarly high, but the drive itself still considered itself
 healthy. I'm avoiding SMR.

 Martin

From: Michael van Elst <mlelstv@serpens.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/56686: wd(4) device timeouts
Date: Fri, 4 Feb 2022 15:58:51 +0100

 On Fri, Feb 04, 2022 at 03:45:00PM +0200, Andreas Gustafsson wrote:
 > Michael van Elst wrote:
 > >  > [     5.921953] wd3: <ST5000LM000-2AN170>
 > >  
 > >  ST5000LM000 is a SMR disk, the timeouts might be real and
 > >  the driver might need to wait longer for such hardware.
 > 
 > Agreed.  Where is the timeout defined, and does it take the queue
 > length into account?

 There are several kinds of timeouts, but this one should be a command
 timeout. It's a callout started when the command is issued to the
 controller; the timeout period is a constant that depends on the
 particular command. Regular I/O commands get 10s, which is the
 ATA_DELAY constant in

 sys/dev/ic/ahcisata_core.c
 sys/dev/ata/ata_wdc.c

 Other controllers may have something else...
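
To illustrate why a constant per-command timeout interacts with queue length, here is a toy model. It is not driver code. It assumes all queued commands are issued at the same moment, each starts its own timer at issue, and the drive completes them one at a time with a fixed service time. Only the 10 s figure comes from the message above; the function names and the service times are made up for this example.

```python
ATA_DELAY_S = 10.0  # regular I/O command timeout quoted in this thread


def timed_out_slots(queue_depth, service_time_s, timeout_s=ATA_DELAY_S):
    """Return the 1-based queue positions whose completion time exceeds timeout_s.

    Toy assumption: all commands are issued at t=0 and served serially, so the
    k-th command completes at k * service_time_s.
    """
    return [k for k in range(1, queue_depth + 1) if k * service_time_s > timeout_s]


def queue_scaled_timeout(queue_depth, per_command_s=ATA_DELAY_S):
    """Hypothetical alternative: a timeout budget that grows with queue depth."""
    return per_command_s * queue_depth


# With 31 NCQ tags in use and a drive that needs 0.5 s per command,
# the commands at the back of the queue exceed a fixed 10 s timer.
late = timed_out_slots(31, 0.5)
print(len(late), late[0], late[-1])   # -> 11 21 31

# At 0.25 s per command, nothing in a 31-deep queue reaches 10 s.
print(timed_out_slots(31, 0.25))      # -> []
```

In this model a fixed timer penalizes commands for time spent waiting behind others in the queue, which is the concern behind the question about queue length.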

 There are a few commands that have their own timeouts, like flushing
 the drive cache (used by WAPBL if you let it). They should not run
 concurrently with I/O commands, but I'm not sure.


 > >  The 'uncorrectable data error' comes from the drive, so
 > >  that's probably real too.
 > 
 > Probably, but I don't see it in the SMART error log:


 It could all be hidden in that number:

 >   1 Raw_Read_Error_Rate     0x000f   072   062   006    Pre-fail  Always       -       74443112
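
Some SMART raw values are widely believed to pack several counters into one 48-bit number. The sketch below only splits the raw values from the smartctl output in this thread into bit fields. The field layouts are an assumption based on commonly repeated readings of Seagate attributes, not on vendor documentation: a 16-bit high field plus a 32-bit low field for Raw_Read_Error_Rate, and three 16-bit words for Command_Timeout. The function names are made up for this example.

```python
def split_hi16_lo32(raw):
    """Split a 48-bit raw value into (upper 16 bits, lower 32 bits)."""
    return (raw >> 32) & 0xFFFF, raw & 0xFFFFFFFF


def split_3x16(raw):
    """Split a 48-bit raw value into three 16-bit words, high word first."""
    return (raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF


# Raw values copied from the smartctl output quoted in this PR.
RAW_READ_ERROR_RATE = 74443112
COMMAND_TIMEOUT = 34360262665

print(split_hi16_lo32(RAW_READ_ERROR_RATE))   # -> (0, 74443112)
print(split_3x16(COMMAND_TIMEOUT))            # -> (8, 8, 9)
```

Under the assumed layout, the upper field of Raw_Read_Error_Rate is 0 here, and Command_Timeout decodes to three small counters rather than one very large one. Whether those fields mean what the common reading says is not confirmed by anything in this thread.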



 > SMART Self-test log structure revision number 1
 > No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 Maybe a long self-test will log something.


 Greetings,
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Patrick Welche <prlw1@talktalk.net>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56686: wd(4) device timeouts
Date: Sun, 6 Feb 2022 12:04:03 +0000

 On Fri, Feb 04, 2022 at 01:50:01PM +0000, Andreas Gustafsson wrote:
 > 
 >  Michael van Elst wrote:
 >  >  > [     5.921953] wd3: <ST5000LM000-2AN170>
 >  >  
 >  >  ST5000LM000 is a SMR disk, the timeouts might be real and
 >  >  the driver might need to wait longer for such hardware.
 >  
 >  Agreed.  Where is the timeout defined, and does it take the queue
 >  length into account?

 Might options AHCISATA_EXTRA_DELAY help or isn't this -current?

 (cf.
 commit 4cd5be3fe84c428901551ba8bbc784639677d714
 Author: jmcneill <jmcneill@NetBSD.org>
 Date:   Mon Oct 11 12:48:10 2021 +0000

     ahcisata: remove excessive delays from drive probe path

     There are a handful of inexplicable 500ms delays introduced to the drive
     detect path in this driver, slowing boot. They can be re-enabled with
     options AHCISATA_EXTRA_DELAY, but should not be enabled for normal kernels.
     If a delay does need to be introduced in these places, the value should
     either be more carefully selected or the scope limited to hardware that
     requires the extra delay.

  sys/dev/ic/ahcisata_core.c | 14 ++++++++++++--
 )

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
        Andreas Gustafsson <gson@gson.org>
Subject: Re: kern/56686: wd(4) device timeouts
Date: Sun, 6 Feb 2022 14:24:58 +0100

 On Sun, Feb 06, 2022 at 12:05:01PM +0000, Patrick Welche wrote:
 > The following reply was made to PR kern/56686; it has been noted by GNATS.
 > 
 > From: Patrick Welche <prlw1@talktalk.net>
 > To: gnats-bugs@netbsd.org
 > Cc: 
 > Subject: Re: kern/56686: wd(4) device timeouts
 > Date: Sun, 6 Feb 2022 12:04:03 +0000
 > 
 >  On Fri, Feb 04, 2022 at 01:50:01PM +0000, Andreas Gustafsson wrote:
 >  > 
 >  >  Michael van Elst wrote:
 >  >  >  > [     5.921953] wd3: <ST5000LM000-2AN170>
 >  >  >  
 >  >  >  ST5000LM000 is a SMR disk, the timeouts might be real and
 >  >  >  the driver might need to wait longer for such hardware.
 >  >  
 >  >  Agreed.  Where is the timeout defined, and does it take the queue
 >  >  length into account?
 >  
 >  Might options AHCISATA_EXTRA_DELAY help or isn't this -current?

 No, this option is for the probe path only; it's not used for regular I/Os

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

