NetBSD Problem Report #54790
From tsutsui@ceres.dti.ne.jp Fri Dec 20 21:52:47 2019
Return-Path: <tsutsui@ceres.dti.ne.jp>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 7E7977A18E
for <gnats-bugs@gnats.NetBSD.org>; Fri, 20 Dec 2019 21:52:47 +0000 (UTC)
Message-Id: <201912202152.xBKLqaTN001886@ceres.dti.ne.jp>
Date: Sat, 21 Dec 2019 06:52:36 +0900 (JST)
From: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
Reply-To: tsutsui@ceres.dti.ne.jp
To: gnats-bugs@NetBSD.org
Cc: tsutsui@ceres.dti.ne.jp
Subject: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)
X-Send-Pr-Version: 3.95
>Number: 54790
>Category: kern
>Synopsis: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: jdolecek
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Dec 20 21:55:00 +0000 2019
>Closed-Date: Tue Feb 08 18:29:10 +0000 2022
>Last-Modified: Tue Feb 08 18:29:10 +0000 2022
>Originator: Izumi Tsutsui
>Release: NetBSD 9.0_RC1
>Organization:
>Environment:
System: NetBSD 9.0_RC1 (GENERIC) #0: Wed Nov 27 16:14:52 UTC 2019
mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/i386/compile/GENERIC
Architecture: i386
Machine: i386
>Description:
I'm getting a reproducible kernel fault in ata_recovery_resume()
on my 9.0_RC1 i386 machines. It looks triggered by an SSD error,
but I wonder whether the errors are a real hardware failure or not.
(not seen on 8.1 kernel)
ddb says (typed from screen pic):
---
kernel: supervisor trap page fault, code=0
Stopped in pid 0.41 (system) at netbsd:ata_recovery_resume+0xe3: movzwl 8(%eax),%edx
db{0}> bt
ata_recovery_resume(c51abb88,0,8441,8,c08a608e,0,8441,8000,c51abb88,8441) at netbsd:ata_recovery_resume+0xe3
ahci_channel_recover(c51abb88,8,8441,c0fc4238,1277b90,c51ab000,8,c51abb88,c4d488c0,0) at netbsd:ahci_channel_recover+0x82
ata_thread_run(c51abb88,8,8000,8441,c51abb90,6,c51abc98,c5197080,c01813fc,c509fc00) at netbsd:ata_thread_run+0x1f3
atabus_thread(c5197080,1540000,154a000,0,c01003fd,0,0,0,0,0) at netbsd:atabus_thread+0x228
>db{0}>
---
dmesg at the ddb prompt says (timestamps are omitted to save typing):
---
:
ahcisata0 at pci0 dev 18 function 0: vendor 1002 product 4380 (rev. 0x00)
ahcisata0: ignoring broken port multiplier support
ahcisata0: AHCI revision 1.10, 4 ports, 32 slots, CAP 0xf3209f83<CCCS,PMD,ISS=0x2=Gen2,SCLO,SAL,SMPS,SSNTF,SNCQ,S64A>
ahcisata0: interrupting at ioapic0 pin 22
atabus0 at ahcisata0 channel 0
atabus1 at ahcisata0 channel 1
atabus2 at ahcisata0 channel 2
atabus3 at ahcisata0 channel 3
:
ixpide0 at pci0 dev 20 function 1: ATI Technologies IXP IDE Controller (rev. 0x00)
ixpide0: bus-master DMA support present
ixpide0: primary channel configured to compatibility mode
ixpide0: primary channel interrupting at ioapic0 pin 14
atabus4 at ixpide0 channel 0
ixpide0: secondary channel configured to compatibility mode
ixpide0: secondary channel interrupting at ioapic0 pin 15
atabus5 at ixpide0 channel 1
:
ahcisata0 port 0: device present, speed: 3.0Gb/s
ahcisata0 port 1: device present, speed: 3.0Gb/s
ahcisata0 port 2: device present, speed: 3.0Gb/s
ahcisata0 port 3: device present, speed: 1.5Gb/s
:
wd0 at atabus0 drive 0
wd0: <Hitachi HDS5C3020ALA632>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 1863 GB, 3876021 cyl, 16 head, 63 sec, 512 bytes/sect x 3907029168 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags) w/PRIO
wd0(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags) w/PRIO
wd1 at atabus1 drive 0
wd1: <Hitachi HDS5C3020ALA632>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 1863 GB, 3876021 cyl, 16 head, 63 sec, 512 bytes/sect x 3907029168 sectors
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags) w/PRIO
wd1(ahcisata0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags) w/PRIO
wd2 at atabus2 drive 0
wd2: <Samsung SSD 860 EVO 500GB>
wd2: drive supports 1-sector PIO transfers, LBA48 addressing
wd2: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags)
wd2(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags)
atapibus0 at atabus3: 1 targets
cd0 at atapibus0 drive 0: <HL-DT-ST DVDRAM GH24NSD5, KLUIBRA1411, LJ00> cdrom removable
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
cd0(ahcisata0:3:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
:
wsmux1: connecting to wsdisplay0
cd0(ahcisata0:3:0): DEFERRED ERROR, key = 0x2
wsdisplay0: screen 1 added (default, vt100 emulation)
wsdisplay0: screen 2 added (default, vt100 emulation)
wsdisplay0: screen 3 added (default, vt100 emulation)
wsdisplay0: screen 4 added (default, vt100 emulation)
cd0(ahcisata0:3:0): DEFERRED ERROR, key = 0x2
wd2a: device timeout reading fsbn 343200640 of 343200640-343200647 (wd2 bn 343200640; cn 167578 tn 14 sn 0), xfer dcc, retry 0
wd2a: device timeout writing fsbn 479102685 of 479102605-479102719 (wd2 bn 479102685; cn 233936 tn 27 sn 29), xfer 7f0, retry 0
:
[many similar errors]
:
uvm_fault(0xc13737e0, 0, 1) -> 0xe
fatal page fault in supervisor mode
trap type 6 code 0 eip 0xc018305f cs 0x8 eflags 0x10286 cr2 0x8 ilevel 0 esp 0xc51abb88
curlwp 0xc509fc00 pid 0 lid 41 lowest kstack 0xdc7da2c0
db{0}>
---
"0xc018305f" is here:
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_recovery.c?r=1.2#240
---
    234         /* Requeue all unfinished commands for same drive as failed command */
    235         for (slot = 0; slot < ch_openings; slot++) {
    236                 if ((ata_queue_active(chp) & (1U << slot)) == 0)
    237                         continue;
    238
    239                 xfer = ata_queue_hwslot_to_xfer(chp, slot);
->  240                 if (drive != xfer->c_drive)
    241                         continue;
    242
    243                 xfer->ops->c_kill_xfer(chp, xfer,
    244                     (error == 0) ? KILL_REQUEUE : KILL_RESET);
    245         }
---
Per crude printf debugging, "xfer" is actually NULL at the fault.
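The fault address (cr2 0x8) matches a load from offset 8 of a NULL pointer, which fits the "movzwl 8(%eax),%edx" instruction and a NULL "xfer". The following user-space sketch models that state: a slot bit is still set in the active mask while the slot no longer maps to an xfer. All names (model_xfer, model_queue, model_requeue) are hypothetical stand-ins, not the kernel's structures, and the NULL check only illustrates the failure mode; it is not the fix that was later committed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SLOTS 32

/* Hypothetical, simplified stand-ins for the kernel's xfer and queue. */
struct model_xfer {
	int drive;	/* drive the command was issued to */
	int requeued;	/* set when the model marks the xfer for requeue */
};

struct model_queue {
	uint32_t active;			/* bit N set: slot N in flight */
	struct model_xfer *slot[MODEL_SLOTS];	/* slot -> xfer, NULL if empty */
};

/*
 * Walk all slots like the loop quoted above, but skip a slot whose xfer
 * pointer is already gone instead of dereferencing it.
 * Returns the number of xfers marked for requeue.
 */
int
model_requeue(struct model_queue *q, int drive)
{
	int n = 0;

	assert(q != NULL);
	for (int s = 0; s < MODEL_SLOTS; s++) {
		if ((q->active & (UINT32_C(1) << s)) == 0)
			continue;
		struct model_xfer *xfer = q->slot[s];
		if (xfer == NULL)	/* the state that faulted in the report */
			continue;
		if (xfer->drive != drive)
			continue;
		xfer->requeued = 1;
		n++;
	}
	return n;
}

/* Build an inconsistent queue: slot 3 is marked active but has no xfer. */
int
model_requeue_demo(void)
{
	struct model_xfer a = { 2, 0 };	/* drive 2 */
	struct model_xfer b = { 0, 0 };	/* drive 0 */
	struct model_queue q = { 0 };

	q.active = (UINT32_C(1) << 0) | (UINT32_C(1) << 1) | (UINT32_C(1) << 3);
	q.slot[0] = &a;
	q.slot[1] = &b;
	return model_requeue(&q, 2);
}
```

In this model, model_requeue_demo() requeues only the drive-2 command in slot 0; without the NULL check, slot 3 would be dereferenced the same way line 240 was.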
>How-To-Repeat:
~100% reproducible on my Samsung SSD under load on my main machine
(ASRock M3A UCC http://www.asrock.com/mb/AMD/M3A%20UCC/index.jp.asp )
but I am not sure whether it can happen on other machines.
>Fix:
No idea.
Would it be worth having a kernel config option to disable NCQ,
if the crash is triggered by that feature?
---
Izumi Tsutsui
>Release-Note:
>Audit-Trail:
From: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
To: gnats-bugs@netbsd.org
Cc: tsutsui@ceres.dti.ne.jp
Subject: Re: kern/54790: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ
support?)
Date: Mon, 23 Dec 2019 02:03:03 +0900
After I set "sysctl -w hw.wd2.use_ncq=0", the problem
(not only the kernel fault but also the timeouts and read/write errors)
no longer happens on the same machine and configuration.
So I guess something is wrong in the new NCQ support,
especially with fast SSDs?
I also wonder whether the quirk behind this message
> ahcisata0: ignoring broken port multiplier support
affects the NCQ implementation or not.
---
Izumi Tsutsui
From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc:
Subject: Re: kern/54790: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)
Date: Sun, 22 Dec 2019 22:48:41 +0100
Can you try a kernel with DEBUG+DIAGNOSTIC?
There is a KASSERTMSG() which should trigger if the xfer is no longer active:
after ata_queue_active() reports the slot as active, it should never
actually happen that ata_queue_hwslot_to_xfer() returns NULL.
Jaromir
From: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
To: gnats-bugs@netbsd.org
Cc: jaromir.dolecek@gmail.com, tsutsui@ceres.dti.ne.jp
Subject: Re: kern/54790: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ
support?)
Date: Sat, 28 Dec 2019 21:28:11 +0900
> Can you try a kernel with DEBUG+DIAGNOSTIC?
>
> There is a KASSERTMSG() which should trigger if the xfer is no longer active:
> after ata_queue_active() reports the slot as active, it should never
> actually happen that ata_queue_hwslot_to_xfer() returns NULL.
KASSERT log from NetBSD/i386 9.0_RC1 GENERIC + options DIAGNOSTIC
+ options DEBUG + options LOCKDEBUG:
https://gist.github.com/tsutsui/9e4fd770c6207c2af64ffb3c65c29b24
Typed manually from the above one (for future search):
---
panic: kernel diagnostic assertion "(chq->active_xfers_used & __BIT(xfer->c_slot)) != 0" failed: file "../../../../dev/ata/ata.c", line 1332
cpu0: Begin traceback...
vpanic(c10beea0,dc723eec,dc723f0c,c0189467,c10beea0,c10bee07,c10d4440,c10d3d81,534,20) at netbsd:vpanic+0x12d
kern_assert(c10beea0,c10bee07,c10d4440,c10d3d81,534,20,c555a4d0,c5396b88,c5396000,dc723f3c) at netbsd:kern_assert+0x23
ata_deactivate_xfer(c5396b88,c555a4d0,ba0,20,a,0,6b88,c555a4d0,c555ae30,2) at netbsd:ata_deactivate_xfer+0x5f
ahci_bio_complete(c5396b88,c555a4d0,0,c5396b98,c14cee00,c018c03d,dc723f94,c0963cce,c5396b88,c4d78860) at netbsd:ahci_bio_complete+0x137
ata_timeout(c5396b88,c4d78860,dc71320c,c4d5a040,c0937f18,c14cee04,c5396b88,0,c14cee60,dc713074) at netbsd:ata_timeout+0x66
callout_softclock(0,dc713074,c0100400,16a7000,16b0010,30,c0100010,10,0,13b4c0) at netbsd:callout_softclock+0x37e
softint_dispatch(c4d78020,2,0,0,0,0,dc726ff0,dc726e44,dc726e98,80050033) at netbsd:softint_dispatch+0xcc
Bad frame pointer: 0xdc712f24
cpu0: End traceback...
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip 0xc0115a94 cs 0x8 eflags 0x202 cr2 0xb6529020 ilevel 0x8 esp 0xdc723ed0
curlwp 0xc4d78860 pid 0 lid 5 lowest kstack 0xdc7212c0
---
Izumi Tsutsui
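The assertion in the log checks that the slot's bit is still set in the active mask when an xfer is deactivated. As a rough illustration only, the sketch below models one way such a check can fail: two paths (for instance recovery and a late timeout) both finish the same slot. The names model_chq, model_activate and model_deactivate are hypothetical, and the model returns -1 where the kernel would panic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the active-slot bitmask checked by the KASSERT. */
struct model_chq {
	uint32_t active_xfers_used;
};

void
model_activate(struct model_chq *chq, unsigned slot)
{
	assert(slot < 32);
	chq->active_xfers_used |= UINT32_C(1) << slot;
}

/* Returns 0 on success, -1 where the kernel assertion would have fired. */
int
model_deactivate(struct model_chq *chq, unsigned slot)
{
	assert(slot < 32);
	if ((chq->active_xfers_used & (UINT32_C(1) << slot)) == 0)
		return -1;
	chq->active_xfers_used &= ~(UINT32_C(1) << slot);
	return 0;
}

/* Two paths finish the same slot; the second one trips the check. */
int
model_double_deactivate(void)
{
	struct model_chq chq = { 0 };

	model_activate(&chq, 5);
	int first = model_deactivate(&chq, 5);	/* bit set: succeeds */
	int second = model_deactivate(&chq, 5);	/* bit already clear */
	return first == 0 && second == -1;
}
```

Whether this is exactly the sequence behind the panic above is an assumption; the backtrace only shows that the timeout path reached ata_deactivate_xfer() with the bit already clear.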
Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sat, 04 Jan 2020 21:17:38 +0000
Responsible-Changed-Why:
I'll look at this.
State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 13 Jan 2020 21:22:26 +0000
State-Changed-Why:
-current with sys/dev/ata/wd.c rev.1.454 should now disable NCQ by default;
can you check whether that fixes the problem for you?
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54790 CVS commit: src/sys/dev/ata
Date: Mon, 13 Jan 2020 21:20:17 +0000
Module Name: src
Committed By: jdolecek
Date: Mon Jan 13 21:20:17 UTC 2020
Modified Files:
src/sys/dev/ata: wd.c
Log Message:
disable NCQ by default for "Samsung SSD 860 EVO 1TB" and
"Samsung SSD 860 EVO 500GB" - these drives have known broken NCQ support,
particularly when used with AMD SB710/750 chipsets; the problem also occurs
under Linux and Windows
https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813
https://bugzilla.kernel.org/show_bug.cgi?id=201693
It seems there is not even a Samsung firmware update to fix this.
Disable NCQ regardless of the controller, it's likely the same problem
exists with other controllers too.
This should fix PR kern/54790 and PR kern/54855
To generate a diff of this commit:
cvs rdiff -u -r1.453 -r1.454 src/sys/dev/ata/wd.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
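The commit above turns NCQ off by default based on the drive's model string. A minimal sketch of that idea, assuming a simple exact-match table, could look like this. The flag name MODEL_QUIRK_BAD_NCQ, the table layout and the function names are hypothetical and are not taken from wd.c; only the two model strings come from the log message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_QUIRK_BAD_NCQ 0x1	/* hypothetical flag name */

struct model_quirk {
	const char *model;	/* model string as reported by the drive */
	int flags;
};

/* The two models named in the commit message. */
static const struct model_quirk model_quirks[] = {
	{ "Samsung SSD 860 EVO 1TB",   MODEL_QUIRK_BAD_NCQ },
	{ "Samsung SSD 860 EVO 500GB", MODEL_QUIRK_BAD_NCQ },
};

/* Return the quirk flags for a model string, or 0 if it is not listed. */
int
model_lookup_quirks(const char *model)
{
	assert(model != NULL);
	for (size_t i = 0; i < sizeof(model_quirks) / sizeof(model_quirks[0]); i++) {
		if (strcmp(model, model_quirks[i].model) == 0)
			return model_quirks[i].flags;
	}
	return 0;
}

/* NCQ default: on, unless the drive is listed as having broken NCQ. */
int
model_use_ncq_default(const char *model)
{
	return (model_lookup_quirks(model) & MODEL_QUIRK_BAD_NCQ) == 0;
}
```

With this table, model_use_ncq_default("Samsung SSD 860 EVO 500GB") yields 0, while an unlisted drive such as "Hitachi HDS5C3020ALA632" keeps NCQ enabled. The default can still be overridden per drive with the hw.wd2.use_ncq sysctl mentioned earlier in this report.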
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54790 CVS commit: src/share/man/man4
Date: Mon, 13 Jan 2020 21:43:06 +0000
Module Name: src
Committed By: jdolecek
Date: Mon Jan 13 21:43:06 UTC 2020
Modified Files:
src/share/man/man4: wd.4
Log Message:
document the wd(4) sysctl nodes, and add the note about the Samsung EVO drives
part of fix for PR kern/54790 and PR kern/54855
To generate a diff of this commit:
cvs rdiff -u -r1.20 -r1.21 src/share/man/man4/wd.4
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: feedback->analyzed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Fri, 24 Jan 2020 19:39:35 +0000
State-Changed-Why:
The crash happens because the recovery is eventually run via the kernel
thread, which, contrary to the exception handler, doesn't block further
interrupts during the processing. I'm working on making the recovery
path handle this correctly.
State-Changed-From-To: analyzed->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sat, 04 Apr 2020 22:31:29 +0000
State-Changed-Why:
Can you check whether rev. 1.3 of ata_recovery.c + rev. 1.9 of ata_subr.c
fixes the problem?
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54790 CVS commit: src/sys/dev/ata
Date: Sat, 4 Apr 2020 22:30:03 +0000
Module Name: src
Committed By: jdolecek
Date: Sat Apr 4 22:30:03 UTC 2020
Modified Files:
src/sys/dev/ata: ata_recovery.c ata_subr.c
Log Message:
stop xfer timeouts during recovery, all xfers will be requeued anyway
this avoids race with the timeout routine when processing the xfers
for requeueing
should fix PR kern/54790 by Izumi Tsutsui
To generate a diff of this commit:
cvs rdiff -u -r1.2 -r1.3 src/sys/dev/ata/ata_recovery.c
cvs rdiff -u -r1.8 -r1.9 src/sys/dev/ata/ata_subr.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
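The log message above describes the ordering of the fix: stop the xfer timeouts first, then requeue. The single-threaded sketch below only illustrates why that ordering makes a late timeout harmless. It is a model under stated assumptions, not the committed code: the names model_cmd, model_timeout_fire and model_recover are hypothetical, and the real kernel uses callouts and locking that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NSLOTS 4

struct model_cmd {
	bool active;		/* command is in flight */
	bool timeout_armed;	/* a timeout may still fire for it */
	bool requeued;		/* recovery put it back on the queue */
};

/* Timeout handler: completes the command only if its timer is still armed. */
bool
model_timeout_fire(struct model_cmd *c)
{
	assert(c != NULL);
	if (!c->timeout_armed)
		return false;	/* stopped by recovery: nothing to do */
	c->timeout_armed = false;
	c->active = false;
	return true;
}

/* Recovery: first stop every timer, then requeue what is still active. */
int
model_recover(struct model_cmd *cmds, size_t n)
{
	int requeued = 0;

	for (size_t i = 0; i < n; i++)
		cmds[i].timeout_armed = false;
	for (size_t i = 0; i < n; i++) {
		if (!cmds[i].active)
			continue;
		cmds[i].active = false;
		cmds[i].requeued = true;
		requeued++;
	}
	return requeued;
}

/* Three commands in flight; a timeout arriving after recovery is a no-op. */
bool
model_recover_demo(void)
{
	struct model_cmd cmds[MODEL_NSLOTS] = {
		{ true, true, false }, { false, false, false },
		{ true, true, false }, { true, true, false },
	};
	int n = model_recover(cmds, MODEL_NSLOTS);
	bool late = model_timeout_fire(&cmds[0]);

	return n == 3 && !late && cmds[0].requeued;
}
```

If the timers were left armed, model_timeout_fire() could complete a command in the middle of the requeue pass, which corresponds to the race with the timeout routine that the log message mentions.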
State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 29 Jun 2020 11:04:06 +0000
State-Changed-Why:
Problem fixed. Thanks for the report.
State-Changed-From-To: closed->needs-pullups
State-Changed-By: gson@NetBSD.org
State-Changed-When: Wed, 02 Feb 2022 16:46:16 +0000
State-Changed-Why:
When fixing a critical bug reported on a release branch,
pullups are of the essence.
From: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
To: gnats-bugs@netbsd.org
Cc: jdolecek@netbsd.org, gson@NetBSD.org, tsutsui@ceres.dti.ne.jp
Subject: Re: kern/54790 (9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ
support?))
Date: Fri, 4 Feb 2022 02:50:03 +0900
> State-Changed-Why:
> When fixing a critical bug reported on a release branch,
> pullups are of the essence.
In my environment (ATI SB600) this problem was worked around by the
AHCI_QUIRK_BADNCQ quirk:
https://mail-index.netbsd.org/source-changes/2020/01/18/msg112983.html
so it was not trivial to test the later fix. (Sorry for the late feedback.)
---
Izumi Tsutsui
From: Andreas Gustafsson <gson@NetBSD.org>
To: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
Cc: gnats-bugs@netbsd.org, jdolecek@netbsd.org
Subject: Re: kern/54790 (9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ
support?))
Date: Fri, 4 Feb 2022 13:14:15 +0200
Izumi Tsutsui wrote:
> In my environment (ATI SB600) this problem was worked around by the
> AHCI_QUIRK_BADNCQ quirk:
> https://mail-index.netbsd.org/source-changes/2020/01/18/msg112983.html
> so it was not trivial to test the later fix. (Sorry for the late feedback.)
It's possible we have been suffering from two separate bugs that both
cause device timeouts, and in both cases the timeouts were triggering
the same third bug causing panics during recovery.
Since my PR 56479 was specifically about the panics, I have now filed
a separate PR 56686 about the timeouts. If your timeout problem was
indeed specific to AMD chipsets, mine must be unrelated because my
machine has an Intel chipset.
--
Andreas Gustafsson, gson@NetBSD.org
State-Changed-From-To: needs-pullups->pending-pullups
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 07 Feb 2022 21:46:27 +0000
State-Changed-Why:
Ticket #1426
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54790 CVS commit: [netbsd-9] src/sys/dev/ata
Date: Tue, 8 Feb 2022 14:45:00 +0000
Module Name: src
Committed By: martin
Date: Tue Feb 8 14:45:00 UTC 2022
Modified Files:
src/sys/dev/ata [netbsd-9]: ata_recovery.c ata_subr.c
Log Message:
Pull up following revision(s) (requested by jdolecek in ticket #1426):
sys/dev/ata/ata_recovery.c: revision 1.3
sys/dev/ata/ata_subr.c: revision 1.9
stop xfer timeouts during recovery, all xfers will be requeued anyway
this avoids race with the timeout routine when processing the xfers
for requeueing
should fix PR kern/54790 by Izumi Tsutsui
To generate a diff of this commit:
cvs rdiff -u -r1.2 -r1.2.8.1 src/sys/dev/ata/ata_recovery.c
cvs rdiff -u -r1.8 -r1.8.4.1 src/sys/dev/ata/ata_subr.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: pending-pullups->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 08 Feb 2022 18:29:10 +0000
State-Changed-Why:
Pullup to netbsd-9 done now too.
>Unformatted:
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.