NetBSD Problem Report #54503
From www@netbsd.org Thu Aug 29 07:52:28 2019
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 7FD297A1A1
for <gnats-bugs@gnats.NetBSD.org>; Thu, 29 Aug 2019 07:52:28 +0000 (UTC)
Message-Id: <20190829075227.72F147A1C4@mollari.NetBSD.org>
Date: Thu, 29 Aug 2019 07:52:27 +0000 (UTC)
From: rokuyama.rk@gmail.com
Reply-To: rokuyama.rk@gmail.com
To: gnats-bugs@NetBSD.org
Subject: Panic during attaching nvme(4) when # of logical CPUs >= 32 ?
X-Send-Pr-Version: www-1.0
>Number: 54503
>Category: kern
>Synopsis: Panic during attaching nvme(4) when # of logical CPUs >= 32 ?
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: jdolecek
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Aug 29 07:55:00 +0000 2019
>Closed-Date: Mon Sep 23 05:05:12 +0000 2019
>Last-Modified: Wed Sep 25 15:50:00 +0000 2019
>Originator: Rin Okuyama
>Release: 9.99.10
>Organization:
Department of Physics, Meiji University
>Environment:
NetBSD kobrpd02 9.99.10 NetBSD 9.99.10 (GENERIC) #1: Thu Aug 29 12:07:14 JST 2019 rin@latipes:/build/work/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
Panic occurs while attaching nvme(4) when the number of logical CPUs is 48
(24 cores x 2 threads per core):
>> NetBSD/x86 BIOS Boot, Revision 5.11 (Tue Aug 27 14:53:16 UTC 2019) (from NetBSD 9.99.10)
>> Memory: 629/1702608 k
> boot -v
...
NetBSD 9.99.10 (GENERIC) #1: Thu Aug 29 12:07:14 JST 2019
...
cpu0 at mainbus0 apid 0
timecounter: Timecounter "lapic" frequency 24972680 Hz quality -100
cpu0: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz, id 0x50654
cpu0: package 0, core 0, smt 0
...
cpu47 at mainbus0 apid 59
cpu47: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz, id 0x50654
cpu47: package 1, core 13, smt 1
...
ppb3 at pci5 dev 0 function 0: vendor 8086 product 2030 (rev. 0x04)
ppb3: PCI Express capability version 2 <Root Port of PCI-E Root Complex> x16 @ 8.0GT/s
ppb3: link is x4 @ 8.0GT/s
pci6 at ppb3 bus 59
pci6: i/o space, memory space enabled, rd/line, wr/inv ok
nvme0 at pci6 dev 0 function 0: vendor 8086 product 2700 (rev. 0x00)
nvme0: NVMe 1.0
allocated pic msix4 type edge pin 0 level 6 to cpu0 slot 20 idt entry 101
nvme0: for admin queue interrupting at msix4 vec 0
nvme0: INTEL SSDPED1D280GA, firmware E2010325, serial PHMB742101WX280CGN
allocated pic msix4 type edge pin 1 level 6 to cpu0 slot 21 idt entry 102
nvme0: for io queue 1 interrupting at msix4 vec 1 affinity to cpu0
allocated pic msix4 type edge pin 2 level 6 to cpu0 slot 22 idt entry 103
nvme0: for io queue 2 interrupting at msix4 vec 2 affinity to cpu1
...
allocated pic msix4 type edge pin 31 level 6 to cpu0 slot 22 idt entry 136
nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
prevented execution of 0x0 (SMEP)
fatal page fault in supervisor mode
trap type 6 code 0x10 rip 0 cs 0x8 rflags 0x10202 cr2 0 ilevel 0x8 rsp 0xffffffff81ae8318
curlwp 0xffffffff8165cc20 pid 0.1 lowest kstack 0xffffffff81ae42c0
kernel: page fault trap, code=0
Stopped in pid 0.1 (system) at 0:uvm_fault(0xffffffff817856e0, 0xffff8d4680000000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip 0xffffffff8021e5b6 cs 0x8 rflags 0x10a06 cr2 0xffff8d4680000000 ilevel 0x8 rsp 0xffffffff81ae7f00
curlwp 0xffffffff8165cc20 pid 0.1 lowest kstack 0xffffffff81ae42c0
kernel: page fault trap, code=0
Stopped in pid 0.1 (system) at netbsd:db_disasm+0xec: movq 0(%rdx,%rcx,8),%rcx
db{0}> bt
db_disasm() at netbsd:db_disasm+0xec
db_trap() at netbsd:db_trap+0xf4
kdb_trap() at netbsd:kdb_trap+0xe1
trap() at netbsd:trap+0x327
--- trap (number 6) ---
?() at 0
nvme_poll() at netbsd:nvme_poll+0x104 <--- nvme.c:1261
nvme_attach() at netbsd:nvme_attach+0x927 <--- nvme.c:1509
nvme_pci_attach() at netbsd:nvme_pci_attach+0x309 <--- nvme_pci.c:228
config_attach_loc() at netbsd:config_attach_loc+0x1a5
...
Full dmesg (with serial console) and netbsd.gdb are provided here
(this is a GENERIC kernel with MSGBUFSIZE=1048576):
http://www.netbsd.org/~rin/nvme_panic_20190829/dmesg
http://www.netbsd.org/~rin/nvme_panic_20190829/netbsd.gdb.gz (CAUTION HUGE!!)
When hyper threading is disabled in the BIOS, i.e., # of logical CPUs =
# of cores = 24, the system boots fine. Here are dmesg -t, intrctl list,
pcictl pci6 dump -d 0, and acpidump -d:
http://www.netbsd.org/~rin/nvme_panic_20190829/HT_disabled/dmesg
http://www.netbsd.org/~rin/nvme_panic_20190829/HT_disabled/intrctl
http://www.netbsd.org/~rin/nvme_panic_20190829/HT_disabled/pcictl
http://www.netbsd.org/~rin/nvme_panic_20190829/HT_disabled/acpidump
The system also boots fine if nvme* is disabled in userconf(4), even if
hyper threading is enabled in the BIOS. Here are dmesg -t, intrctl list,
pcictl pci6 dump -d 0, and acpidump -d:
http://www.netbsd.org/~rin/nvme_panic_20190829/nvme_disabled/dmesg
http://www.netbsd.org/~rin/nvme_panic_20190829/nvme_disabled/intrctl
http://www.netbsd.org/~rin/nvme_panic_20190829/nvme_disabled/pcictl
http://www.netbsd.org/~rin/nvme_panic_20190829/nvme_disabled/acpidump
Can I provide any other information? Since this machine is located at
my previous office, I cannot access its console or BIOS immediately.
For now, it runs with hyper threading disabled.
>How-To-Repeat:
Boot that machine with nvme* and hyper threading enabled.
>Fix:
N/A
>Release-Note:
>Audit-Trail:
From: Thomas Klausner <wiz@NetBSD.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical
CPUs >= 32 ?
Date: Thu, 29 Aug 2019 10:40:31 +0200
On Thu, Aug 29, 2019 at 07:55:00AM +0000, rokuyama.rk@gmail.com wrote:
> >Number: 54503
> >Category: kern
> >Synopsis: Panic during attaching nvme(4) when # of logical CPUs >= 32 ?
Interesting workaround!
This looks similar to my PR 54275, where the workaround was:
Index: nvme_pci.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/nvme_pci.c,v
retrieving revision 1.26
diff -u -r1.26 nvme_pci.c
--- nvme_pci.c 23 Jan 2019 06:56:19 -0000 1.26
+++ nvme_pci.c 10 Jun 2019 08:18:33 -0000
@@ -64,7 +64,7 @@
#include <dev/ic/nvmereg.h>
#include <dev/ic/nvmevar.h>
-int nvme_pci_force_intx = 0;
+int nvme_pci_force_intx = 1;
int nvme_pci_mpsafe = 1;
int nvme_pci_mq = 1; /* INTx: ioq=1, MSI/MSI-X: ioq=ncpu */
Thomas
From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical
CPUs >= 32 ?
Date: Thu, 29 Aug 2019 10:56:17 +0000
on irc jared points out
arch/x86/include/intrdefs.h
50:#define MAX_INTR_SOURCES 32
Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Thu, 29 Aug 2019 15:03:32 +0000
Responsible-Changed-Why:
I'll look at this.
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: Jaromír Doleček <jaromir.dolecek@gmail.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Fri, 30 Aug 2019 19:36:37 +0900
On 2019/08/30 0:12, Jaromír Doleček wrote:
> On Thu, Aug 29, 2019 at 13:00, <coypu@sdf.org> wrote:
>> on irc jared points out
>>
>> arch/x86/include/intrdefs.h
>> 50:#define MAX_INTR_SOURCES 32
>
> I think this is only a per-CPU limit.
I think so too. Actually, ixg(4) uses MSI-X interrupts with 49 vectors
when hyper threading is enabled:
http://www.netbsd.org/~rin/nvme_panic_20190829/nvme_disabled/dmesg
Considering wiz@'s workaround, I guess that something goes wrong
when MSI-X interrupts are established in the nvme(4) code.
Thanks,
rin
From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: Rin Okuyama <rokuyama.rk@gmail.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Fri, 30 Aug 2019 16:28:31 +0200
Can you please try a kernel with nvme_q_complete() marked __noinline,
to see where exactly inside that function the code panics? I've
reviewed the code and I don't see any particular reason why it would
fail while setting up 32nd queue.
Can you also send the output of 'nvmectl identify nvme0' when booted
successfully, i.e. with HT disabled?
Jaromir
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: Jaromír Doleček <jaromir.dolecek@gmail.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Fri, 30 Aug 2019 23:43:54 +0900
On 2019/08/30 23:28, Jaromír Doleček wrote:
> Can you please try a kernel with nvme_q_complete() marked __noinline,
> to see where exactly inside that function the code panics? I've
> reviewed the code and I don't see any particular reason why it would
> fail while setting up 32nd queue.
>
> Can you also send output of 'nvmectl identify nvme0' when booted
> successfully i.e. with HT disabled.
Thank you very much for working on this!
Since this machine is located at my previous office, I cannot access
its BIOS/console immediately. I will go there tomorrow to try it.
Here's the output of nvmectl with HT disabled:
----
% sudo nvmectl identify nvme0
Controller Capabilities/Features
================================
Vendor ID: 8086
Subsystem Vendor ID: 8086
Serial Number: PHMB742101WX280CGN
Model Number: INTEL SSDPED1D280GA
Firmware Version: E2010325
Recommended Arb Burst: 0
IEEE OUI Identifier: e4 d2 5c
Multi-Interface Cap: 00
Max Data Transfer Size: 131072
Controller ID: 0x00
Admin Command Set Attributes
============================
Security Send/Receive: Supported
Format NVM: Supported
Firmware Activate/Download: Supported
Namespace Managment: Not Supported
Abort Command Limit: 4
Async Event Request Limit: 4
Number of Firmware Slots: 1
Firmware Slot 1 Read-Only: No
Per-Namespace SMART Log: No
Error Log Page Entries: 64
Number of Power States: 1
NVM Command Set Attributes
==========================
Submission Queue Entry Size
Max: 64
Min: 64
Completion Queue Entry Size
Max: 16
Min: 16
Number of Namespaces: 1
Compare Command: Not Supported
Write Uncorrectable Command: Supported
Dataset Management Command: Supported
Write Zeroes Command: Not Supported
Features Save/Select Field: Not Supported
Reservation: Not Supported
Volatile Write Cache: Not Present
----
Thanks,
rin
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: Jaromír Doleček <jaromir.dolecek@gmail.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Sat, 31 Aug 2019 18:52:18 +0900
On 2019/08/30 23:28, Jaromír Doleček wrote:
> Can you please try a kernel with nvme_q_complete() marked __noinline,
> to see where exactly inside that function the code panics? I've
> reviewed the code and I don't see any particular reason why it would
> fail while setting up 32nd queue.
nvme_q_complete() is not inlined even without __noinline.
As for the instruction at the fault address:
allocated pic msix4 type edge pin 31 level 6 to cpu0 slot 22 idt entry 136
nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
prevented execution of 0x0 (SMEP)
fatal page fault in supervisor mode
...
db{0}> bt
db_disasm() at netbsd:db_disasm+0xcd
db_trap() at netbsd:db_trap+0x16b
kdb_trap() at netbsd:kdb_trap+0x12a
trap() at netbsd:trap+0x49d
--- trap (number 6) ---
?() at 0
nvme_poll() at netbsd:nvme_poll+0x154
nvme_attach() at netbsd:nvme_attach+0x12d2
nvme_pci_attach() at netbsd:nvme_pci_attach+0x6fe
...
nvme_poll+0x154 is "test %eax,%eax" just after returning from
nvme_q_complete() in nvme.c:1261.
https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1261
1240 static int
1241 nvme_poll(struct nvme_softc *sc, struct nvme_queue *q, struct nvme_ccb *ccb,
1242 void (*fill)(struct nvme_queue *, struct nvme_ccb *, void *), int timo_sec)
1243 {
....
1261 if (nvme_q_complete(sc, q) == 0)
0000000000004903 <nvme_poll>:
....
4a52: e8 ee c4 ff ff callq f45 <nvme_q_complete>
4a57: 85 c0 test %eax,%eax
....
I don't understand why "execution of NULL" is triggered by such an
instruction; maybe this is a spurious report. Using a kernel built
with option KUBSAN, I found:
nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
prevented execution of 0x0 (SMEP)
fatal page fault in supervisor mode
...
db{0}> reboot
UBSan: Undefined Behavior in ../../../../dev/ic/nvme.c:1306:11, member access within null pointer of type 'struct nvme_poll_state'
fatal page fault in supervisor mode
trap type 6 code 0x2 rip 0xffffffff8102487c cs 0x8 rflags 0x10246 cr2 0x40 ilevel 0x3 rsp 0xffffd0b57e9b5fc0
curlwp 0xffffc02d225038c0 pid 0.4 lowest kstack 0xffffd0b57e9b22c0
kernel: page fault trap, code=0
Stopped in pid 0.4 (system) at netbsd:nvme_poll_done+0x7a: movq %rsi,40(%rbx)
This should be the real cause of the panic. The backtrace then reads:
db{0}> bt
nvme_poll_done() at netbsd:nvme_poll_done+0x7a <--- nvme.c:1306
nvme_q_complete() at netbsd:nvme_q_complete+0x259 <--- nvme.c:1374
softint_dispatch() at netbsd:softint_dispatch+0x20b
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xffffd0b57e9b60f0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
db{0}>
where
https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1306
1299 static void
1300 nvme_poll_done(struct nvme_queue *q, struct nvme_ccb *ccb,
1301 struct nvme_cqe *cqe)
1302 {
1303 struct nvme_poll_state *state = ccb->ccb_cookie;
....
1306 state->c = *cqe;
https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1374
1327 static int
1328 nvme_q_complete(struct nvme_softc *sc, struct nvme_queue *q)
1329 {
....
1372 mutex_exit(&q->q_cq_mtx);
1373 ccb->ccb_done(q, ccb, cqe);
1374 mutex_enter(&q->q_cq_mtx);
Do you have any idea why ccb->ccb_cookie becomes NULL? I uploaded the
full dmesg and DDB log at:
http://www.netbsd.org/~rin/nvme_panic_20190829/dmesg.20190831
Thanks,
rin
From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical
CPUs >= 32 ?
Date: Sun, 8 Sep 2019 10:00:27 +0000
This is a little violent, but perhaps as a stopgap measure, we can force
INTx if too many CPUs are detected.
From: Kimihiro Nonaka <nonakap@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: jdolecek@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
rokuyama.rk@gmail.com
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Thu, 19 Sep 2019 01:46:27 +0900
Please try the attached patch.
[Attachment: nvme.diff]
diff --git a/sys/dev/ic/nvme.c b/sys/dev/ic/nvme.c
index f93377abe55..5b1f7d99480 100644
--- a/sys/dev/ic/nvme.c
+++ b/sys/dev/ic/nvme.c
@@ -1302,8 +1302,8 @@ nvme_poll_done(struct nvme_queue *q, struct nvme_ccb *ccb,
 {
 	struct nvme_poll_state *state = ccb->ccb_cookie;
 
-	SET(cqe->flags, htole16(NVME_CQE_PHASE));
 	state->c = *cqe;
+	SET(state->c.flags, htole16(NVME_CQE_PHASE));
 
 	ccb->ccb_cookie = state->cookie;
 	state->done(q, ccb, &state->c);
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: Kimihiro Nonaka <nonakap@gmail.com>,
"gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: jdolecek@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Fri, 20 Sep 2019 09:45:47 +0900
On 2019/09/19 1:46, Kimihiro Nonaka wrote:
> Please try the attached patch.
Thank you for working on this. With your patch, it boots successfully
with hyper threading enabled:
cpu0 at mainbus0 apid 0
timecounter: Timecounter "lapic" frequency 24973379 Hz quality -100
cpu0: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz, id 0x50654
cpu0: package 0, core 0, smt 0
...
cpu47 at mainbus0 apid 59
cpu47: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz, id 0x50654
cpu47: package 1, core 13, smt 1
...
nvme0 at pci6 dev 0 function 0: vendor 8086 product 2700 (rev. 0x00)
nvme0: NVMe 1.0
allocated pic msix4 type edge pin 0 level 6 to cpu0 slot 20 idt entry 102
nvme0: for admin queue interrupting at msix4 vec 0
nvme0: INTEL SSDPED1D280GA, firmware E2010325, serial PHMB742101WX280CGN
allocated pic msix4 type edge pin 1 level 6 to cpu0 slot 21 idt entry 103
nvme0: for io queue 1 interrupting at msix4 vec 1 affinity to cpu0
...
allocated pic msix4 type edge pin 31 level 6 to cpu0 slot 22 idt entry 137
nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
ld0 at nvme0 nsid 1
ld0: 260 GB, 34049 cyl, 255 head, 63 sec, 512 bytes/sect x 547002288 sectors
Here's the full dmesg of the GENERIC kernel (with bumped MSGBUFSIZE and boot -v):
http://www.netbsd.org/~rin/nvme_panic_20190829/dmesg.20190920
Let me thank you again for your efforts!
rin
From: "NONAKA Kimihiro" <nonaka@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54503 CVS commit: src/sys/dev/ic
Date: Fri, 20 Sep 2019 05:32:42 +0000
Module Name: src
Committed By: nonaka
Date: Fri Sep 20 05:32:42 UTC 2019
Modified Files:
src/sys/dev/ic: nvme.c
Log Message:
Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().
A new completion queue entry check incorrectly determined that there was
a Completion Queue entry for a command that was not submitted.
Fix PR kern/54275, PR kern/54503, PR kern/54532.
To generate a diff of this commit:
cvs rdiff -u -r1.44 -r1.45 src/sys/dev/ic/nvme.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
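To see concretely how setting the Phase Tag bit on the shared ring entry makes a stale entry look new, here is a small host-side model of phase-tag scanning. Ring size, helper names, and the lockstep driver loop are invented for this sketch; only the phase-tag convention (host expectation toggles on each wrap, as in nvme_q_complete()) follows the NVMe scheme:

```c
#include <assert.h>
#include <stdbool.h>

#define RING	4
#define PHASE	0x1		/* stands in for NVME_CQE_PHASE */

struct cqe { unsigned flags; };

/* Controller side: post one completion; its phase bit toggles on
 * every pass around the ring. */
static void ctrl_post(struct cqe *ring, unsigned *tail, unsigned *cphase)
{
	ring[*tail].flags = *cphase;
	if (++*tail == RING) {
		*tail = 0;
		*cphase ^= PHASE;
	}
}

/* Host side: an entry is "new" iff its phase bit matches the phase
 * the host currently expects. */
static bool host_sees_new(const struct cqe *ring, unsigned head, unsigned hphase)
{
	return (ring[head].flags & PHASE) == hphase;
}

static void host_advance(unsigned *head, unsigned *hphase)
{
	if (++*head == RING) {
		*head = 0;
		*hphase ^= PHASE;
	}
}

/* Consume the entry at head.  buggy = pre-1.45 behaviour: force the
 * phase bit on the SHARED ring entry; fixed: set it only on a copy. */
static void consume(struct cqe *ring, unsigned head, bool buggy)
{
	struct cqe copy = ring[head];

	if (buggy)
		ring[head].flags |= PHASE;	/* old nvme_poll_done() */
	else
		copy.flags |= PHASE;		/* rev. 1.45: copy only */
	(void)copy;
}

/* Poll n commands one at a time; return true if the host ever sees a
 * "new" entry although the controller has posted nothing. */
static bool spurious_completion(int n, bool buggy)
{
	struct cqe ring[RING] = {{0}};
	unsigned tail = 0, cphase = PHASE, head = 0, hphase = PHASE;

	for (int i = 0; i < n; i++) {
		ctrl_post(ring, &tail, &cphase);
		assert(host_sees_new(ring, head, hphase));
		consume(ring, head, buggy);
		host_advance(&head, &hphase);
		if (host_sees_new(ring, head, hphase))
			return true;	/* stale entry mistaken for new */
	}
	return false;
}
```

With the buggy consumer, the forced phase bit left behind during the second pass matches the host's expectation again on the third pass, so a stale entry is reported as a fresh completion; with the fixed consumer the ring is never dirtied and no spurious entry is ever seen. This matches the panic only appearing after enough polled admin commands (many I/O queue creations, i.e. many CPUs) to wrap the admin CQ more than once.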
State-Changed-From-To: open->pending-pullups
State-Changed-By: rin@NetBSD.org
State-Changed-When: Fri, 20 Sep 2019 21:52:38 +0000
State-Changed-Why:
[pullup-9 #218] by nonaka@
https://releng.netbsd.org/cgi-bin/req-9.cgi?show=218
Note that netbsd-9 with nvme.c rev 1.45 works fine on my machine.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54503 CVS commit: [netbsd-9] src/sys/dev/ic
Date: Sun, 22 Sep 2019 12:18:56 +0000
Module Name: src
Committed By: martin
Date: Sun Sep 22 12:18:56 UTC 2019
Modified Files:
src/sys/dev/ic [netbsd-9]: nvme.c
Log Message:
Pull up following revision(s) (requested by nonaka in ticket #218):
sys/dev/ic/nvme.c: revision 1.45
Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().
A new completion queue entry check incorrectly determined that there was
a Completion Queue entry for a command that was not submitted.
Fix PR kern/54275, PR kern/54503, PR kern/54532.
To generate a diff of this commit:
cvs rdiff -u -r1.44 -r1.44.2.1 src/sys/dev/ic/nvme.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: pending-pullups->closed
State-Changed-By: rin@NetBSD.org
State-Changed-When: Mon, 23 Sep 2019 05:05:12 +0000
State-Changed-Why:
Pullup to netbsd-9 done. Thank you nonaka for fixing it!
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54503 CVS commit: [netbsd-8] src/sys/dev/ic
Date: Wed, 25 Sep 2019 15:49:17 +0000
Module Name: src
Committed By: martin
Date: Wed Sep 25 15:49:17 UTC 2019
Modified Files:
src/sys/dev/ic [netbsd-8]: nvme.c
Log Message:
Pull up following revision(s) (requested by nonaka in ticket #1386):
sys/dev/ic/nvme.c: revision 1.45
Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().
A new completion queue entry check incorrectly determined that there was
a Completion Queue entry for a command that was not submitted.
Fix PR kern/54275, PR kern/54503, PR kern/54532.
To generate a diff of this commit:
cvs rdiff -u -r1.30.2.4 -r1.30.2.5 src/sys/dev/ic/nvme.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted: