NetBSD Problem Report #54503
From www@netbsd.org Thu Aug 29 07:52:28 2019
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 7FD297A1A1
for <gnats-bugs@gnats.NetBSD.org>; Thu, 29 Aug 2019 07:52:28 +0000 (UTC)
Message-Id: <20190829075227.72F147A1C4@mollari.NetBSD.org>
Date: Thu, 29 Aug 2019 07:52:27 +0000 (UTC)
From: rokuyama.rk@gmail.com
Reply-To: rokuyama.rk@gmail.com
To: gnats-bugs@NetBSD.org
Subject: Panic during attaching nvme(4) when # of logical CPUs >= 32 ?
X-Send-Pr-Version: www-1.0
>Number: 54503
>Category: kern
>Synopsis: Panic during attaching nvme(4) when # of logical CPUs >= 32 ?
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: jdolecek
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Aug 29 07:55:00 +0000 2019
>Closed-Date: Mon Sep 23 05:05:12 +0000 2019
>Last-Modified: Wed Sep 25 15:50:00 +0000 2019
>Originator: Rin Okuyama
>Release: 9.99.10
>Organization:
Department of Physics, Meiji University
>Environment:
NetBSD kobrpd02 9.99.10 NetBSD 9.99.10 (GENERIC) #1: Thu Aug 29 12:07:14 JST 2019 rin@latipes:/build/work/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
Panic occurs while attaching nvme(4) when the number of logical CPUs is 48
(24 cores x 2 threads per core):
>> NetBSD/x86 BIOS Boot, Revision 5.11 (Tue Aug 27 14:53:16 UTC 2019) (from NetBSD 9.99.10)
>> Memory: 629/1702608 k
> boot -v
...
NetBSD 9.99.10 (GENERIC) #1: Thu Aug 29 12:07:14 JST 2019
...
cpu0 at mainbus0 apid 0
timecounter: Timecounter "lapic" frequency 24972680 Hz quality -100
cpu0: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz, id 0x50654
cpu0: package 0, core 0, smt 0
...
cpu47 at mainbus0 apid 59
cpu47: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz, id 0x50654
cpu47: package 1, core 13, smt 1
...
ppb3 at pci5 dev 0 function 0: vendor 8086 product 2030 (rev. 0x04)
ppb3: PCI Express capability version 2 <Root Port of PCI-E Root Complex> x16 @ 8.0GT/s
ppb3: link is x4 @ 8.0GT/s
pci6 at ppb3 bus 59
pci6: i/o space, memory space enabled, rd/line, wr/inv ok
nvme0 at pci6 dev 0 function 0: vendor 8086 product 2700 (rev. 0x00)
nvme0: NVMe 1.0
allocated pic msix4 type edge pin 0 level 6 to cpu0 slot 20 idt entry 101
nvme0: for admin queue interrupting at msix4 vec 0
nvme0: INTEL SSDPED1D280GA, firmware E2010325, serial PHMB742101WX280CGN
allocated pic msix4 type edge pin 1 level 6 to cpu0 slot 21 idt entry 102
nvme0: for io queue 1 interrupting at msix4 vec 1 affinity to cpu0
allocated pic msix4 type edge pin 2 level 6 to cpu0 slot 22 idt entry 103
nvme0: for io queue 2 interrupting at msix4 vec 2 affinity to cpu1
...
allocated pic msix4 type edge pin 31 level 6 to cpu0 slot 22 idt entry 136
nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
prevented execution of 0x0 (SMEP)
fatal page fault in supervisor mode
trap type 6 code 0x10 rip 0 cs 0x8 rflags 0x10202 cr2 0 ilevel 0x8 rsp 0xffffffff81ae8318
curlwp 0xffffffff8165cc20 pid 0.1 lowest kstack 0xffffffff81ae42c0
kernel: page fault trap, code=0
Stopped in pid 0.1 (system) at 0:uvm_fault(0xffffffff817856e0, 0xffff8d4680000000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip 0xffffffff8021e5b6 cs 0x8 rflags 0x10a06 cr2 0xffff8d4680000000 ilevel 0x8 rsp 0xffffffff81ae7f00
curlwp 0xffffffff8165cc20 pid 0.1 lowest kstack 0xffffffff81ae42c0
kernel: page fault trap, code=0
Stopped in pid 0.1 (system) at netbsd:db_disasm+0xec: movq 0(%rdx,%rcx,8),%rcx
db{0}> bt
db_disasm() at netbsd:db_disasm+0xec
db_trap() at netbsd:db_trap+0xf4
kdb_trap() at netbsd:kdb_trap+0xe1
trap() at netbsd:trap+0x327
--- trap (number 6) ---
?() at 0
nvme_poll() at netbsd:nvme_poll+0x104 <--- nvme.c:1261
nvme_attach() at netbsd:nvme_attach+0x927 <--- nvme.c:1509
nvme_pci_attach() at netbsd:nvme_pci_attach+0x309 <--- nvme_pci.c:228
config_attach_loc() at netbsd:config_attach_loc+0x1a5
...
Full dmesg (with serial console) and netbsd.gdb are provided here
(this is a GENERIC kernel with MSGBUFSIZE=1048576):
http://www.netbsd.org/~rin/nvme_panic_20190829/dmesg
http://www.netbsd.org/~rin/nvme_panic_20190829/netbsd.gdb.gz (CAUTION HUGE!!)
When hyper threading is disabled in the BIOS, i.e., # of logical CPUs =
# of cores = 24, the system boots fine. Here are dmesg -t, intrctl list,
pcictl pci6 dump -d 0, and acpidump -d:
http://www.netbsd.org/~rin/nvme_panic_20190829/HT_disabled/dmesg
http://www.netbsd.org/~rin/nvme_panic_20190829/HT_disabled/intrctl
http://www.netbsd.org/~rin/nvme_panic_20190829/HT_disabled/pcictl
http://www.netbsd.org/~rin/nvme_panic_20190829/HT_disabled/acpidump
The system also boots fine if nvme* is disabled in userconf(4), even if
hyper threading is enabled in the BIOS. Here are dmesg -t, intrctl list,
pcictl pci6 dump -d 0, and acpidump -d:
http://www.netbsd.org/~rin/nvme_panic_20190829/nvme_disabled/dmesg
http://www.netbsd.org/~rin/nvme_panic_20190829/nvme_disabled/intrctl
http://www.netbsd.org/~rin/nvme_panic_20190829/nvme_disabled/pcictl
http://www.netbsd.org/~rin/nvme_panic_20190829/nvme_disabled/acpidump
Can I provide any other information? Since this machine is located at
my previous office, I cannot access its console or BIOS immediately.
For now, it runs with hyper threading disabled.
>How-To-Repeat:
Boot that machine with nvme* and hyper threading enabled.
>Fix:
N/A
>Release-Note:
>Audit-Trail:
From: Thomas Klausner <wiz@NetBSD.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical
CPUs >= 32 ?
Date: Thu, 29 Aug 2019 10:40:31 +0200
On Thu, Aug 29, 2019 at 07:55:00AM +0000, rokuyama.rk@gmail.com wrote:
> >Number: 54503
> >Category: kern
> >Synopsis: Panic during attaching nvme(4) when # of logical CPUs >= 32 ?
Interesting workaround!
This looks similar to my PR 54275, where the workaround was:
Index: nvme_pci.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/nvme_pci.c,v
retrieving revision 1.26
diff -u -r1.26 nvme_pci.c
--- nvme_pci.c 23 Jan 2019 06:56:19 -0000 1.26
+++ nvme_pci.c 10 Jun 2019 08:18:33 -0000
@@ -64,7 +64,7 @@
#include <dev/ic/nvmereg.h>
#include <dev/ic/nvmevar.h>
-int nvme_pci_force_intx = 0;
+int nvme_pci_force_intx = 1;
int nvme_pci_mpsafe = 1;
int nvme_pci_mq = 1; /* INTx: ioq=1, MSI/MSI-X: ioq=ncpu */
Thomas
From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical
CPUs >= 32 ?
Date: Thu, 29 Aug 2019 10:56:17 +0000
on irc jared points out
arch/x86/include/intrdefs.h
50:#define MAX_INTR_SOURCES 32
Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Thu, 29 Aug 2019 15:03:32 +0000
Responsible-Changed-Why:
I'll look at this.
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: Jaromír Doleček <jaromir.dolecek@gmail.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Fri, 30 Aug 2019 19:36:37 +0900
On 2019/08/30 0:12, Jaromír Doleček wrote:
> On Thu, Aug 29, 2019 at 13:00, <coypu@sdf.org> wrote:
>> on irc jared points out
>>
>> arch/x86/include/intrdefs.h
>> 50:#define MAX_INTR_SOURCES 32
>
> I think this is only a per-CPU limit.
I think so too. Actually, ixg(4) uses MSI-X interrupts with 49 vectors
when hyper threading is enabled:
http://www.netbsd.org/~rin/nvme_panic_20190829/nvme_disabled/dmesg
Considering wiz@'s workaround, I guess that something goes wrong
when MSI-X interrupts are established in the nvme(4) code.
Thanks,
rin
From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: Rin Okuyama <rokuyama.rk@gmail.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Fri, 30 Aug 2019 16:28:31 +0200
Can you please try a kernel with nvme_q_complete() marked __noinline,
to see where exactly inside that function the code panics? I've
reviewed the code and I don't see any particular reason why it would
fail while setting up 32nd queue.
Can you also send the output of 'nvmectl identify nvme0' when booted
successfully, i.e. with HT disabled?
Jaromir
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: Jaromír Doleček <jaromir.dolecek@gmail.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Fri, 30 Aug 2019 23:43:54 +0900
On 2019/08/30 23:28, Jaromír Doleček wrote:
> Can you please try a kernel with nvme_q_complete() marked __noinline,
> to see where exactly inside that function the code panics? I've
> reviewed the code and I don't see any particular reason why it would
> fail while setting up 32nd queue.
>
> Can you also send output of 'nvmectl identify nvme0' when booted
> successfully i.e. with HT disabled.
Thank you very much for working on this!
Since this machine is located at my previous office, I cannot access
its BIOS/console immediately. I will go there tomorrow to try it.
Here's the output of nvmectl with HT disabled:
----
% sudo nvmectl identify nvme0
Controller Capabilities/Features
================================
Vendor ID: 8086
Subsystem Vendor ID: 8086
Serial Number: PHMB742101WX280CGN
Model Number: INTEL SSDPED1D280GA
Firmware Version: E2010325
Recommended Arb Burst: 0
IEEE OUI Identifier: e4 d2 5c
Multi-Interface Cap: 00
Max Data Transfer Size: 131072
Controller ID: 0x00
Admin Command Set Attributes
============================
Security Send/Receive: Supported
Format NVM: Supported
Firmware Activate/Download: Supported
Namespace Managment: Not Supported
Abort Command Limit: 4
Async Event Request Limit: 4
Number of Firmware Slots: 1
Firmware Slot 1 Read-Only: No
Per-Namespace SMART Log: No
Error Log Page Entries: 64
Number of Power States: 1
NVM Command Set Attributes
==========================
Submission Queue Entry Size
Max: 64
Min: 64
Completion Queue Entry Size
Max: 16
Min: 16
Number of Namespaces: 1
Compare Command: Not Supported
Write Uncorrectable Command: Supported
Dataset Management Command: Supported
Write Zeroes Command: Not Supported
Features Save/Select Field: Not Supported
Reservation: Not Supported
Volatile Write Cache: Not Present
----
Thanks,
rin
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: Jaromír Doleček <jaromir.dolecek@gmail.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Sat, 31 Aug 2019 18:52:18 +0900
On 2019/08/30 23:28, Jaromír Doleček wrote:
> Can you please try a kernel with nvme_q_complete() marked __noinline,
> to see where exactly inside that function the code panics? I've
> reviewed the code and I don't see any particular reason why it would
> fail while setting up 32nd queue.
nvme_q_complete() is not inlined even without __noinline.
As for the instruction at the fault address:
allocated pic msix4 type edge pin 31 level 6 to cpu0 slot 22 idt entry 136
nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
prevented execution of 0x0 (SMEP)
fatal page fault in supervisor mode
...
db{0}> bt
db_disasm() at netbsd:db_disasm+0xcd
db_trap() at netbsd:db_trap+0x16b
kdb_trap() at netbsd:kdb_trap+0x12a
trap() at netbsd:trap+0x49d
--- trap (number 6) ---
?() at 0
nvme_poll() at netbsd:nvme_poll+0x154
nvme_attach() at netbsd:nvme_attach+0x12d2
nvme_pci_attach() at netbsd:nvme_pci_attach+0x6fe
...
nvme_poll+0x154 is "test %eax,%eax" just after returning from
nvme_q_complete() in nvme.c:1261.
https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1261
1240 static int
1241 nvme_poll(struct nvme_softc *sc, struct nvme_queue *q, struct nvme_ccb *ccb,
1242 void (*fill)(struct nvme_queue *, struct nvme_ccb *, void *), int timo_sec)
1243 {
....
1261 if (nvme_q_complete(sc, q) == 0)
0000000000004903 <nvme_poll>:
....
4a52: e8 ee c4 ff ff callq f45 <nvme_q_complete>
4a57: 85 c0 test %eax,%eax
....
I don't understand why "execution of NULL" is triggered by such an
instruction; maybe this is a spurious report. Using a kernel built
with option KUBSAN, I found:
nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
prevented execution of 0x0 (SMEP)
fatal page fault in supervisor mode
...
db{0}> reboot
UBSan: Undefined Behavior in ../../../../dev/ic/nvme.c:1306:11, member access within null pointer of type 'struct nvme_poll_state'
fatal page fault in supervisor mode
trap type 6 code 0x2 rip 0xffffffff8102487c cs 0x8 rflags 0x10246 cr2 0x40 ilevel 0x3 rsp 0xffffd0b57e9b5fc0
curlwp 0xffffc02d225038c0 pid 0.4 lowest kstack 0xffffd0b57e9b22c0
kernel: page fault trap, code=0
Stopped in pid 0.4 (system) at netbsd:nvme_poll_done+0x7a: movq %rsi,40(%rbx)
This should be the real cause of the panic. The backtrace then reads:
db{0}> bt
nvme_poll_done() at netbsd:nvme_poll_done+0x7a <--- nvme.c:1306
nvme_q_complete() at netbsd:nvme_q_complete+0x259 <--- nvme.c:1374
softint_dispatch() at netbsd:softint_dispatch+0x20b
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xffffd0b57e9b60f0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
db{0}>
where
https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1306
1299 static void
1300 nvme_poll_done(struct nvme_queue *q, struct nvme_ccb *ccb,
1301 struct nvme_cqe *cqe)
1302 {
1303 struct nvme_poll_state *state = ccb->ccb_cookie;
....
1306 state->c = *cqe;
https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1374
1327 static int
1328 nvme_q_complete(struct nvme_softc *sc, struct nvme_queue *q)
1329 {
....
1372 mutex_exit(&q->q_cq_mtx);
1373 ccb->ccb_done(q, ccb, cqe);
1374 mutex_enter(&q->q_cq_mtx);
Do you have any idea why ccb->ccb_cookie becomes NULL? I uploaded the
full dmesg and DDB log at:
http://www.netbsd.org/~rin/nvme_panic_20190829/dmesg.20190831
Thanks,
rin
From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical
CPUs >= 32 ?
Date: Sun, 8 Sep 2019 10:00:27 +0000
This is a little violent, but perhaps as a stopgap measure, we can force
INTx if too many CPUs are detected.
From: Kimihiro Nonaka <nonakap@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: jdolecek@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
rokuyama.rk@gmail.com
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Thu, 19 Sep 2019 01:46:27 +0900
Please try the attached patch.
[Attachment: nvme.diff]
diff --git a/sys/dev/ic/nvme.c b/sys/dev/ic/nvme.c
index f93377abe55..5b1f7d99480 100644
--- a/sys/dev/ic/nvme.c
+++ b/sys/dev/ic/nvme.c
@@ -1302,8 +1302,8 @@ nvme_poll_done(struct nvme_queue *q, struct nvme_ccb *ccb,
 {
 	struct nvme_poll_state *state = ccb->ccb_cookie;
 
-	SET(cqe->flags, htole16(NVME_CQE_PHASE));
 	state->c = *cqe;
+	SET(state->c.flags, htole16(NVME_CQE_PHASE));
 
 	ccb->ccb_cookie = state->cookie;
 	state->done(q, ccb, &state->c);
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: Kimihiro Nonaka <nonakap@gmail.com>,
"gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: jdolecek@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Fri, 20 Sep 2019 09:45:47 +0900
On 2019/09/19 1:46, Kimihiro Nonaka wrote:
> Please try the attached patch.
Thank you for working on this. With your patch, it boots successfully
with hyper threading enabled:
cpu0 at mainbus0 apid 0
timecounter: Timecounter "lapic" frequency 24973379 Hz quality -100
cpu0: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz, id 0x50654
cpu0: package 0, core 0, smt 0
...
cpu47 at mainbus0 apid 59
cpu47: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz, id 0x50654
cpu47: package 1, core 13, smt 1
...
nvme0 at pci6 dev 0 function 0: vendor 8086 product 2700 (rev. 0x00)
nvme0: NVMe 1.0
allocated pic msix4 type edge pin 0 level 6 to cpu0 slot 20 idt entry 102
nvme0: for admin queue interrupting at msix4 vec 0
nvme0: INTEL SSDPED1D280GA, firmware E2010325, serial PHMB742101WX280CGN
allocated pic msix4 type edge pin 1 level 6 to cpu0 slot 21 idt entry 103
nvme0: for io queue 1 interrupting at msix4 vec 1 affinity to cpu0
...
allocated pic msix4 type edge pin 31 level 6 to cpu0 slot 22 idt entry 137
nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
ld0 at nvme0 nsid 1
ld0: 260 GB, 34049 cyl, 255 head, 63 sec, 512 bytes/sect x 547002288 sectors
Here's the full dmesg of the GENERIC kernel (with bumped MSGBUFSIZE and boot -v):
http://www.netbsd.org/~rin/nvme_panic_20190829/dmesg.20190920
Let me thank you again for your efforts!
rin
From: "NONAKA Kimihiro" <nonaka@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54503 CVS commit: src/sys/dev/ic
Date: Fri, 20 Sep 2019 05:32:42 +0000
Module Name: src
Committed By: nonaka
Date: Fri Sep 20 05:32:42 UTC 2019
Modified Files:
src/sys/dev/ic: nvme.c
Log Message:
Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().
A new completion queue entry check incorrectly determined that there was
a Completion Queue entry for a command that was not submitted.
Fix PR kern/54275, PR kern/54503, PR kern/54532.
To generate a diff of this commit:
cvs rdiff -u -r1.44 -r1.45 src/sys/dev/ic/nvme.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
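To see concretely how setting the Phase Tag bit on the shared ring entry makes a stale entry look new, here is a small host-side model of phase-tag scanning. Ring size, helper names, and the lockstep driver loop are invented for this sketch; only the phase-tag convention (host expectation toggles on each wrap, as in nvme_q_complete()) follows the NVMe scheme:

```c
#include <assert.h>
#include <stdbool.h>

#define RING	4
#define PHASE	0x1		/* stands in for NVME_CQE_PHASE */

struct cqe { unsigned flags; };

/* Controller side: post one completion; its phase bit toggles on
 * every pass around the ring. */
static void ctrl_post(struct cqe *ring, unsigned *tail, unsigned *cphase)
{
	ring[*tail].flags = *cphase;
	if (++*tail == RING) {
		*tail = 0;
		*cphase ^= PHASE;
	}
}

/* Host side: an entry is "new" iff its phase bit matches the phase
 * the host currently expects. */
static bool host_sees_new(const struct cqe *ring, unsigned head, unsigned hphase)
{
	return (ring[head].flags & PHASE) == hphase;
}

static void host_advance(unsigned *head, unsigned *hphase)
{
	if (++*head == RING) {
		*head = 0;
		*hphase ^= PHASE;
	}
}

/* Consume the entry at head.  buggy = pre-1.45 behaviour: force the
 * phase bit on the SHARED ring entry; fixed: set it only on a copy. */
static void consume(struct cqe *ring, unsigned head, bool buggy)
{
	struct cqe copy = ring[head];

	if (buggy)
		ring[head].flags |= PHASE;	/* old nvme_poll_done() */
	else
		copy.flags |= PHASE;		/* rev. 1.45: copy only */
	(void)copy;
}

/* Poll n commands one at a time; return true if the host ever sees a
 * "new" entry although the controller has posted nothing. */
static bool spurious_completion(int n, bool buggy)
{
	struct cqe ring[RING] = {{0}};
	unsigned tail = 0, cphase = PHASE, head = 0, hphase = PHASE;

	for (int i = 0; i < n; i++) {
		ctrl_post(ring, &tail, &cphase);
		assert(host_sees_new(ring, head, hphase));
		consume(ring, head, buggy);
		host_advance(&head, &hphase);
		if (host_sees_new(ring, head, hphase))
			return true;	/* stale entry mistaken for new */
	}
	return false;
}
```

With the buggy consumer, the forced phase bit left behind during the second pass matches the host's expectation again on the third pass, so a stale entry is reported as a fresh completion; with the fixed consumer the ring is never dirtied and no spurious entry is ever seen. This matches the panic only appearing after enough polled admin commands (many I/O queue creations, i.e. many CPUs) to wrap the admin CQ more than once.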
State-Changed-From-To: open->pending-pullups
State-Changed-By: rin@NetBSD.org
State-Changed-When: Fri, 20 Sep 2019 21:52:38 +0000
State-Changed-Why:
[pullup-9 #218] by nonaka@
https://releng.netbsd.org/cgi-bin/req-9.cgi?show=218
Note that netbsd-9 with nvme.c rev 1.45 works fine on my machine.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54503 CVS commit: [netbsd-9] src/sys/dev/ic
Date: Sun, 22 Sep 2019 12:18:56 +0000
Module Name: src
Committed By: martin
Date: Sun Sep 22 12:18:56 UTC 2019
Modified Files:
src/sys/dev/ic [netbsd-9]: nvme.c
Log Message:
Pull up following revision(s) (requested by nonaka in ticket #218):
sys/dev/ic/nvme.c: revision 1.45
Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().
A new completion queue entry check incorrectly determined that there was
a Completion Queue entry for a command that was not submitted.
Fix PR kern/54275, PR kern/54503, PR kern/54532.
To generate a diff of this commit:
cvs rdiff -u -r1.44 -r1.44.2.1 src/sys/dev/ic/nvme.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: pending-pullups->closed
State-Changed-By: rin@NetBSD.org
State-Changed-When: Mon, 23 Sep 2019 05:05:12 +0000
State-Changed-Why:
Pullup to netbsd-9 done. Thank you nonaka for fixing it!
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54503 CVS commit: [netbsd-8] src/sys/dev/ic
Date: Wed, 25 Sep 2019 15:49:17 +0000
Module Name: src
Committed By: martin
Date: Wed Sep 25 15:49:17 UTC 2019
Modified Files:
src/sys/dev/ic [netbsd-8]: nvme.c
Log Message:
Pull up following revision(s) (requested by nonaka in ticket #1386):
sys/dev/ic/nvme.c: revision 1.45
Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().
A new completion queue entry check incorrectly determined that there was
a Completion Queue entry for a command that was not submitted.
Fix PR kern/54275, PR kern/54503, PR kern/54532.
To generate a diff of this commit:
cvs rdiff -u -r1.30.2.4 -r1.30.2.5 src/sys/dev/ic/nvme.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted: