NetBSD Problem Report #54790

From tsutsui@ceres.dti.ne.jp  Fri Dec 20 21:52:47 2019
Return-Path: <tsutsui@ceres.dti.ne.jp>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 7E7977A18E
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 20 Dec 2019 21:52:47 +0000 (UTC)
Message-Id: <201912202152.xBKLqaTN001886@ceres.dti.ne.jp>
Date: Sat, 21 Dec 2019 06:52:36 +0900 (JST)
From: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
Reply-To: tsutsui@ceres.dti.ne.jp
To: gnats-bugs@NetBSD.org
Cc: tsutsui@ceres.dti.ne.jp
Subject: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)
X-Send-Pr-Version: 3.95

>Number:         54790
>Category:       kern
>Synopsis:       9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    jdolecek
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Dec 20 21:55:00 +0000 2019
>Closed-Date:    Tue Feb 08 18:29:10 +0000 2022
>Last-Modified:  Tue Feb 08 18:29:10 +0000 2022
>Originator:     Izumi Tsutsui
>Release:        NetBSD 9.0_RC1
>Organization:
>Environment:
System: NetBSD 9.0_RC1 (GENERIC) #0: Wed Nov 27 16:14:52 UTC 2019
    mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/i386/compile/GENERIC
Architecture: i386
Machine: i386
>Description:
I'm getting a reproducible kernel fault in ata_recovery_resume()
on my 9.0_RC1 i386 machines.  It looks like it is triggered by SSD errors,
but I'm not sure whether the errors are a real hardware failure.
(Not seen on the 8.1 kernel.)

ddb says (typed from screen pic):
---
kernel: supervisor trap page fault, code=0
Stopped in pid 0.41 (system) at netbsd:ata_recovery_resume+0xe3:       movzwl  8(%eax),%edx
db{0}> bt
ata_recovery_resume(c51abb88,0,8441,8,c08a608e,0,8441,8000,c51abb88,8441) at netbsd:ata_recovery_resume+0xe3
ahci_channel_recover(c51abb88,8,8441,c0fc4238,1277b90,c51ab000,8,c51abb88,c4d488c0,0) at netbsd:ahci_channel_recover+0x82
ata_thread_run(c51abb88,8,8000,8441,c51abb90,6,c51abc98,c5197080,c01813fc,c509fc00) at netbsd:ata_thread_run+0x1f3
atabus_thread(c5197080,1540000,154a000,0,c01003fd,0,0,0,0,0) at netbsd:atabus_thread+0x228
>db{0}>
---

dmesg at the ddb prompt says (timestamps are omitted to save typing):
---
 :
ahcisata0 at pci0 dev 18 function 0: vendor 1002 product 4380 (rev. 0x00)
ahcisata0: ignoring broken port multiplier support
ahcisata0: AHCI revision 1.10, 4 ports, 32 slots, CAP 0xf3209f83<CCCS,PMD,ISS=0x2=Gen2,SCLO,SAL,SMPS,SSNTF,SNCQ,S64A>
ahcisata0: interrupting at ioapic0 pin 22
atabus0 at ahcisata0 channel 0
atabus1 at ahcisata0 channel 1
atabus2 at ahcisata0 channel 2
atabus3 at ahcisata0 channel 3
 :
ixpide0 at pci0 dev 20 function 1: ATI Technologies IXP IDE Controller (rev. 0x00)
ixpide0: bus-master DMA support present
ixpide0: primary channel configured to compatibility mode
ixpide0: primary channel interrupting at ioapic0 pin 14
atabus4 at ixpide0 channel 0
ixpide0: secondary channel configured to compatibility mode
ixpide0: secondary channel interrupting at ioapic0 pin 15
atabus5 at ixpide0 channel 1
 :
ahcisata0 port 0: device present, speed: 3.0Gb/s
ahcisata0 port 1: device present, speed: 3.0Gb/s
ahcisata0 port 2: device present, speed: 3.0Gb/s
ahcisata0 port 3: device present, speed: 1.5Gb/s
 :
wd0 at atabus0 drive 0
wd0: <Hitachi HDS5C3020ALA632>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 1863 GB, 3876021 cyl, 16 head, 63 sec, 512 bytes/sect x 3907029168 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags) w/PRIO
wd0(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags) w/PRIO
wd1 at atabus1 drive 0
wd1: <Hitachi HDS5C3020ALA632>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 1863 GB, 3876021 cyl, 16 head, 63 sec, 512 bytes/sect x 3907029168 sectors
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags) w/PRIO
wd1(ahcisata0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags) w/PRIO
wd2 at atabus2 drive 0
wd2: <Samsung SSD 860 EVO 500GB>
wd2: drive supports 1-sector PIO transfers, LBA48 addressing
wd2: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags)
wd2(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags)
atapibus0 at atabus3: 1 targets
cd0 at atapibus0 drive 0: <HL-DT-ST DVDRAM GH24NSD5, KLUIBRA1411, LJ00> cdrom removable
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
cd0(ahcisata0:3:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
 :
wsmux1: connecting to wsdisplay0
cd0(ahcisata0:3:0):  DEFERRED ERROR, key = 0x2
wsdisplay0: screen 1 added (default, vt100 emulation)
wsdisplay0: screen 2 added (default, vt100 emulation)
wsdisplay0: screen 3 added (default, vt100 emulation)
wsdisplay0: screen 4 added (default, vt100 emulation)
cd0(ahcisata0:3:0):  DEFERRED ERROR, key = 0x2
wd2a: device timeout reading fsbn 343200640 of 343200640-343200647 (wd2 bn 343200640; cn 167578 tn 14 sn 0), xfer dcc, retry 0
wd2a: device timeout writing fsbn 479102685 of 479102605-479102719 (wd2 bn 479102685; cn 233936 tn 27 sn 29), xfer 7f0, retry 0
 :
[many similar errors]
 :
uvm_fault(0xc13737e0, 0, 1) -> 0xe
fatal page fault in supervisor mode
trap type 6 code 0 eip 0xc018305f cs 0x8 eflags 0x10286 cr2 0x8 ilevel 0 esp 0xc51abb88
curlwp 0xc509fc00 pid 0 lid 41 lowest kstack 0xdc7da2c0
db{0}> 
---

"0xc018305f" is here:
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_recovery.c?r=1.2#240
---
    234 	/* Requeue all unfinished commands for same drive as failed command */
    235 	for (slot = 0; slot < ch_openings; slot++) {
    236 		if ((ata_queue_active(chp) & (1U << slot)) == 0)
    237 			continue;
    238 
    239 		xfer = ata_queue_hwslot_to_xfer(chp, slot);
->  240 		if (drive != xfer->c_drive) 
    241 			continue;
    242 
    243 		xfer->ops->c_kill_xfer(chp, xfer,
    244 		    (error == 0) ? KILL_REQUEUE : KILL_RESET);
    245 	}
---
Per simple printf debugging, "xfer" is actually NULL at the fault.
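
The faulting instruction (movzwl 8(%eax),%edx) together with cr2 0x8 is
consistent with a load at offset 8 through a NULL pointer, which matches
the printf result. The following is a minimal user-space sketch of how a
slot bitmask and a slot-to-xfer table can disagree, not the kernel code.
All chq_* names and the NSLOTS value are hypothetical; only the loop shape
is taken from the excerpt above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHQ_NSLOTS 32	/* assumption: one entry per AHCI command slot */

struct chq_xfer {
	int c_drive;	/* drive the command was issued to */
	int c_slot;	/* hardware slot the command occupies */
};

struct chq_model {
	struct chq_xfer *slots[CHQ_NSLOTS];	/* slot -> xfer, NULL if free */
};

/*
 * Walk the slots whose bit is set in `mask` and count the commands that
 * belong to `drive`.  `*stale` receives the number of slots that are
 * marked active in the mask but have no xfer recorded, which is the
 * state the original loop dereferenced without checking.
 */
static int
chq_count_for_drive(const struct chq_model *q, uint32_t mask, int drive,
    int *stale)
{
	int slot, n = 0;

	assert(q != NULL && stale != NULL);
	*stale = 0;
	for (slot = 0; slot < CHQ_NSLOTS; slot++) {
		const struct chq_xfer *xfer;

		if ((mask & (UINT32_C(1) << slot)) == 0)
			continue;
		xfer = q->slots[slot];
		if (xfer == NULL) {	/* mask and table disagree */
			(*stale)++;
			continue;
		}
		if (xfer->c_drive == drive)
			n++;
	}
	return n;
}
```

Passing the mask separately from the table lets a caller model the two
going out of sync. A NULL check like this only hides the symptom; it does
not explain why the slot was released while still marked active.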

>How-To-Repeat:
~100% reproducible under load with the Samsung SSD on my main machine
(ASRock M3A UCC http://www.asrock.com/mb/AMD/M3A%20UCC/index.jp.asp ),
but I'm not sure whether it can happen on other machines.

>Fix:
No idea.
Would it be worth having a kernel config option to disable NCQ,
if the crash is triggered by that feature?

---
Izumi Tsutsui

>Release-Note:

>Audit-Trail:
From: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
To: gnats-bugs@netbsd.org
Cc: tsutsui@ceres.dti.ne.jp
Subject: Re: kern/54790: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ
	 support?)
Date: Mon, 23 Dec 2019 02:03:03 +0900

 After I set "sysctl -w hw.wd2.use_ncq=0", the problem
 (not only the kernel fault but also the timeouts and read/write errors)
 no longer happens on the same machine and configuration.

 So I guess something is wrong in the new NCQ support,
 especially with fast SSDs?

 I also wonder whether this quirk message
 > ahcisata0: ignoring broken port multiplier support
 affects the NCQ implementation.

 ---
 Izumi Tsutsui

From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/54790: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)
Date: Sun, 22 Dec 2019 22:48:41 +0100

 Can you try kernel with DEBUG+DIAGNOSTIC?

 There is KASSERTMSG() which should trigger if the xfer is no longer active
 - after ata_queue_active() returns the slot as active, it should never
 actually happen the ata_queue_hwslot_to_xfer() returns NULL.

 Jaromir

From: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
To: gnats-bugs@netbsd.org
Cc: jaromir.dolecek@gmail.com, tsutsui@ceres.dti.ne.jp
Subject: Re: kern/54790: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ
	 support?)
Date: Sat, 28 Dec 2019 21:28:11 +0900

 >  Can you try kernel with DEBUG+DIAGNOSTIC?
 >  
 >  There is KASSERTMSG() which should trigger if the xfer is no longer active
 >  - after ata_queue_active() returns the slot as active, it should never
 >  actually happen the ata_queue_hwslot_to_xfer() returns NULL.

 KASSERT log from NetBSD/i386 9.0_RC1 GENERIC + options DIAGNOSTIC
 + options DEBUG + options LOCKDEBUG:

  https://gist.github.com/tsutsui/9e4fd770c6207c2af64ffb3c65c29b24

 Typed manually from the above one (for future search):
 ---
 panic: kernel diagnostic assertion "(chq->active_xfers_used & __BIT(xfer->c_slot)) != 0" failed: file "../../../../dev/ata/ata.c", line 1332
 cpu0: Begin traceback...
 vpanic(c10beea0,dc723eec,dc723f0c,c0189467,c10beea0,c10bee07,c10d4440,c10d3d81,534,20) at netbsd:vpanic+0x12d
 kern_assert(c10beea0,c10bee07,c10d4440,c10d3d81,534,20,c555a4d0,c5396b88,c5396000,dc723f3c) at netbsd:kern_assert+0x23
 ata_deactivate_xfer(c5396b88,c555a4d0,ba0,20,a,0,6b88,c555a4d0,c555ae30,2) at netbsd:ata_deactivate_xfer+0x5f
 ahci_bio_complete(c5396b88,c555a4d0,0,c5396b98,c14cee00,c018c03d,dc723f94,c0963cce,c5396b88,c4d78860) at netbsd:ahci_bio_complete+0x137
 ata_timeout(c5396b88,c4d78860,dc71320c,c4d5a040,c0937f18,c14cee04,c5396b88,0,c14cee60,dc713074) at netbsd:ata_timeout+0x66
 callout_softclock(0,dc713074,c0100400,16a7000,16b0010,30,c0100010,10,0,13b4c0) at netbsd:callout_softclock+0x37e
 softint_dispatch(c4d78020,2,0,0,0,0,dc726ff0,dc726e44,dc726e98,80050033) at netbsd:softint_dispatch+0xcc
 Bad frame pointer: 0xdc712f24
 cpu0: End traceback...
 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 eip 0xc0115a94 cs 0x8 eflags 0x202 cr2 0xb6529020 ilevel 0x8 esp 0xdc723ed0
 curlwp 0xc4d78860 pid 0 lid 5 lowest kstack 0xdc7212c0

 ---
 Izumi Tsutsui
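
The assertion in the log checks that the xfer's slot bit is still set in
active_xfers_used when ata_deactivate_xfer() runs from the timeout path.
One way such a check can fail is that another path has already released
the slot. The following is a minimal sketch of that situation under that
assumption. The dq_* names are hypothetical; only the field names
active_xfers_used and c_slot are taken from the assertion text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct dq_queue {
	uint32_t active_xfers_used;	/* bit n set => slot n is active */
};

struct dq_xfer {
	int c_slot;	/* hardware slot, 0..31 */
};

/* Mark the xfer's slot as active. */
static void
dq_activate(struct dq_queue *q, const struct dq_xfer *x)
{
	assert(x->c_slot >= 0 && x->c_slot < 32);
	q->active_xfers_used |= UINT32_C(1) << x->c_slot;
}

/*
 * Release the xfer's slot.  Returns false instead of asserting when the
 * slot is already inactive, which is the condition the kernel
 * diagnostic assertion reported.
 */
static bool
dq_deactivate(struct dq_queue *q, const struct dq_xfer *x)
{
	uint32_t bit;

	assert(x->c_slot >= 0 && x->c_slot < 32);
	bit = UINT32_C(1) << x->c_slot;
	if ((q->active_xfers_used & bit) == 0)
		return false;	/* second release of the same slot */
	q->active_xfers_used &= ~bit;
	return true;
}
```

Calling dq_deactivate() twice for the same xfer returns true and then
false; in a DIAGNOSTIC kernel the second call would correspond to the
panic shown above.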

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sat, 04 Jan 2020 21:17:38 +0000
Responsible-Changed-Why:
I'll look at this.


State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 13 Jan 2020 21:22:26 +0000
State-Changed-Why:
-current with sys/dev/ata/wd.c rev. 1.454 should now disable NCQ by default
for the affected Samsung SSD 860 EVO models; can you check whether that
fixes the problem for you?


From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54790 CVS commit: src/sys/dev/ata
Date: Mon, 13 Jan 2020 21:20:17 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Mon Jan 13 21:20:17 UTC 2020

 Modified Files:
 	src/sys/dev/ata: wd.c

 Log Message:
 disable NCQ by default for "Samsung SSD 860 EVO 1TB" and
 "Samsung SSD 860 EVO 500GB" - these drives have known broken NCQ support
 particularly when used with AMD SB710/750 chipsets, problem occur also
 under Linux and Windows

 https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813
 https://bugzilla.kernel.org/show_bug.cgi?id=201693

 It seems there is no Samsung firmware update to fix this even.

 Disable NCQ regardless of the controller, it's likely same problem
 exists with other controllers too.

 This should fix PR kern/54790 and PR kern/54855


 To generate a diff of this commit:
 cvs rdiff -u -r1.453 -r1.454 src/sys/dev/ata/wd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
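
A per-model default like the one described in this log message can be
sketched as a lookup against a list of model strings. This is a minimal
illustration only; the nq_* names are hypothetical, the matching in wd.c
may differ (for example in how the model string is compared), and only
the two model strings come from the log message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Models for which NCQ is off by default (strings from the commit log). */
static const char *const nq_bad_ncq_models[] = {
	"Samsung SSD 860 EVO 1TB",
	"Samsung SSD 860 EVO 500GB",
};

/*
 * Return the default NCQ setting for a drive model string: false for a
 * listed model, true otherwise.  Exact string comparison is an
 * assumption made for this sketch.
 */
static bool
nq_ncq_default(const char *model)
{
	size_t i;

	assert(model != NULL);
	for (i = 0; i < sizeof(nq_bad_ncq_models) /
	    sizeof(nq_bad_ncq_models[0]); i++) {
		if (strcmp(model, nq_bad_ncq_models[i]) == 0)
			return false;
	}
	return true;
}
```

With the drives from the dmesg in this report, the Samsung SSD would get
NCQ disabled by default while the Hitachi drives would keep it.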

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54790 CVS commit: src/share/man/man4
Date: Mon, 13 Jan 2020 21:43:06 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Mon Jan 13 21:43:06 UTC 2020

 Modified Files:
 	src/share/man/man4: wd.4

 Log Message:
 document the wd(4) sysctl nodes, and add the note about the Samsung EVO drives

 part of fix for PR kern/54790 and PR kern/54855


 To generate a diff of this commit:
 cvs rdiff -u -r1.20 -r1.21 src/share/man/man4/wd.4

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: feedback->analyzed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Fri, 24 Jan 2020 19:39:35 +0000
State-Changed-Why:
The crash happens because the recovery is eventually run via the kernel
thread, which, contrary to the exception handler, doesn't block further
interrupts during the processing. I'm working on making the recovery
path handle this correctly.


State-Changed-From-To: analyzed->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sat, 04 Apr 2020 22:31:29 +0000
State-Changed-Why:
Can you check whether rev. 1.3 of ata_recovery.c + rev. 1.9 of ata_subr.c
fix the problem?


From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54790 CVS commit: src/sys/dev/ata
Date: Sat, 4 Apr 2020 22:30:03 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Sat Apr  4 22:30:03 UTC 2020

 Modified Files:
 	src/sys/dev/ata: ata_recovery.c ata_subr.c

 Log Message:
 stop xfer timeouts during recovery, all xfers will be requeued anyway

 this avoids race with the timeout routine when processing the xfers
 for requeueing

 should fix PR kern/54790 by Izumi Tsutsui


 To generate a diff of this commit:
 cvs rdiff -u -r1.2 -r1.3 src/sys/dev/ata/ata_recovery.c
 cvs rdiff -u -r1.8 -r1.9 src/sys/dev/ata/ata_subr.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
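
The idea in this log message, suppressing per-xfer timeouts while
recovery requeues everything, can be modelled in user space as below.
This is a sketch under stated assumptions, not the committed change: the
rq_* names and the timeouts_stopped flag are hypothetical, and the real
code may stop the timeouts by a different mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RQ_NSLOTS 32

struct rq_chan {
	uint32_t active;	/* bit n set => slot n has a command in flight */
	bool timeouts_stopped;	/* hypothetical: set while recovery runs */
};

/*
 * Timeout handler model: completes (releases) the slot unless recovery
 * has stopped timeouts.  Returns true if it released the slot.
 */
static bool
rq_timeout(struct rq_chan *c, int slot)
{
	uint32_t bit;

	assert(slot >= 0 && slot < RQ_NSLOTS);
	bit = UINT32_C(1) << slot;
	if (c->timeouts_stopped || (c->active & bit) == 0)
		return false;
	c->active &= ~bit;
	return true;
}

/*
 * Recovery model: stop timeouts, let a timeout for `racing_slot` fire
 * in the middle (it must be ignored), then requeue every active slot.
 * Returns the number of slots requeued.
 */
static int
rq_recover(struct rq_chan *c, int racing_slot)
{
	int slot, requeued = 0;

	c->timeouts_stopped = true;
	(void)rq_timeout(c, racing_slot);	/* ignored while stopped */
	for (slot = 0; slot < RQ_NSLOTS; slot++) {
		uint32_t bit = UINT32_C(1) << slot;

		if ((c->active & bit) == 0)
			continue;
		c->active &= ~bit;	/* hand the command back to the queue */
		requeued++;
	}
	c->timeouts_stopped = false;
	return requeued;
}
```

Without the flag, the timeout in the middle of rq_recover() would release
a slot that the requeue loop still expects to own, which is the kind of
race the log message describes.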

State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 29 Jun 2020 11:04:06 +0000
State-Changed-Why:
Problem fixed. Thanks for the report.


State-Changed-From-To: closed->needs-pullups
State-Changed-By: gson@NetBSD.org
State-Changed-When: Wed, 02 Feb 2022 16:46:16 +0000
State-Changed-Why:
When fixing a critical bug reported on a release branch,
pullups are of the essence.


From: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
To: gnats-bugs@netbsd.org
Cc: jdolecek@netbsd.org, gson@NetBSD.org, tsutsui@ceres.dti.ne.jp
Subject: Re: kern/54790 (9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ
	 support?))
Date: Fri, 4 Feb 2022 02:50:03 +0900

 > State-Changed-Why:
 > When fixing a critical bug reported on a release branch,
 > pullups are of the essence.

 In my environment (ATI SB600) this problem was worked around by the
 AHCI_QUIRK_BADNCQ quirk:
  https://mail-index.netbsd.org/source-changes/2020/01/18/msg112983.html
 so it was not trivial to test the later fix. (Sorry for the late feedback.)

 ---
 Izumi Tsutsui

From: Andreas Gustafsson <gson@NetBSD.org>
To: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
Cc: gnats-bugs@netbsd.org, jdolecek@netbsd.org
Subject: Re: kern/54790 (9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ
	 support?))
Date: Fri, 4 Feb 2022 13:14:15 +0200

 Izumi Tsutsui wrote:
 > On my environment (ATI SB600) this problem was workaround by
 > AHCI_QUIRK_BADNCQ quirk:
 >  https://mail-index.netbsd.org/source-changes/2020/01/18/msg112983.html
 > so it was not trivial to test the later fix. (sorry for a late feedback)

 It's possible we have been suffering from two separate bugs that both
 cause device timeouts, and in both cases the timeouts were triggering
 the same third bug causing panics during recovery.

 Since my PR 56479 was specifically about the panics, I have now filed
 a separate PR 56686 about the timeouts.  If your timeout problem was
 indeed specific to AMD chipsets, mine must be unrelated because my
 machine has an Intel chipset.
 -- 
 Andreas Gustafsson, gson@NetBSD.org

State-Changed-From-To: needs-pullups->pending-pullups
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 07 Feb 2022 21:46:27 +0000
State-Changed-Why:
Ticket #1426


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54790 CVS commit: [netbsd-9] src/sys/dev/ata
Date: Tue, 8 Feb 2022 14:45:00 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue Feb  8 14:45:00 UTC 2022

 Modified Files:
 	src/sys/dev/ata [netbsd-9]: ata_recovery.c ata_subr.c

 Log Message:
 Pull up following revision(s) (requested by jdolecek in ticket #1426):

 	sys/dev/ata/ata_recovery.c: revision 1.3
 	sys/dev/ata/ata_subr.c: revision 1.9

 stop xfer timeouts during recovery, all xfers will be requeued anyway
 this avoids race with the timeout routine when processing the xfers
 for requeueing

 should fix PR kern/54790 by Izumi Tsutsui


 To generate a diff of this commit:
 cvs rdiff -u -r1.2 -r1.2.8.1 src/sys/dev/ata/ata_recovery.c
 cvs rdiff -u -r1.8 -r1.8.4.1 src/sys/dev/ata/ata_subr.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 08 Feb 2022 18:29:10 +0000
State-Changed-Why:
Pullup to netbsd-9 done now too.


>Unformatted: