NetBSD Problem Report #54275

From wiz@yt.nih.at  Wed Jun  5 23:09:10 2019
Return-Path: <wiz@yt.nih.at>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id DBAB97A1C2
	for <gnats-bugs@gnats.NetBSD.org>; Wed,  5 Jun 2019 23:09:10 +0000 (UTC)
Message-Id: <20190605214340.677602AC0FC@yt.nih.at>
Date: Wed,  5 Jun 2019 23:43:40 +0200 (CEST)
From: Thomas Klausner <wiz@NetBSD.org>
Reply-To: Thomas Klausner <wiz@NetBSD.org>
To: gnats-bugs@NetBSD.org
Subject: nvme_poll() called too early?
X-Send-Pr-Version: 3.95

>Number:         54275
>Category:       kern
>Synopsis:       nvme_poll() called too early?
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jun 05 23:10:00 +0000 2019
>Closed-Date:    Sun Sep 22 19:13:33 +0000 2019
>Last-Modified:  Wed Sep 25 15:50:03 +0000 2019
>Originator:     Thomas Klausner
>Release:        NetBSD 8.99.41
>Organization:
Curiosity is the very basis of education and if you tell me that
curiosity killed the cat, I say only that the cat died nobly.
- Arnold Edinborough
>Environment:


Architecture: x86_64
Machine: amd64
>Description:
Booting the amd64 USB install image from
https://nyftp.netbsd.org/pub/NetBSD-daily/HEAD/201905242230Z/images/
on a new Threadripper system with a Samsung SSD 1TB 970 Evo plus, I
see (handcopied):

panic: prevented jump to null instruction pointer (SMEP)
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x160
snprintf() at netbsd:snprintf
trap() at netbsd:trap+0x2ab
--- trap (number 6) ---
?() at 0
nvme_poll() at netbsd:nvme_poll+0x104
nvme_attach() at netbsd:nvme_atached+0x927
nvme_pci_attach() at netbsd_nvme_pci_attach+0x309
config_attach_loc()...
config_found_sm_loc()...
pci_probe_device()...
pci_enumerate_bus()...
pcirescan()...
pciattach()...
config_attach_loc()...
config_found_sm_loc()...
mp_pci_scan()...
amd64_mainbus_attach()...
mainbus_attach()...
config_attach_loc()...
cpu_configure()...
main()...
cpu0: End traceback

Disabling nvme with "userconf disable nvme*" makes the kernel boot successfully.

OpenBSD 6.5 and FreeBSD 12.0 boot successfully on the same system, including
the nvme device.

dmesg output for the nvme device (on NetBSD):

nvme0 at pci0 dev 0 function 0: vendor 144d product a808 (rev. 0x00) (not sure if it's 808 or 000 or something inbetween, bad screenshot)
nvme0: NVMe 1.3
nvme0: for admin queue interrupting at msix3 vec 0
nvme0: for io queue ... interrupting at msix3 vec ... affinity to cpu ... (where ... goes from 0 to 31)


>How-To-Repeat:
Boot on the system with nvme enabled (default on).
>Fix:
Please!

>Release-Note:

>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org, wiz@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: kern/54275: nvme_poll() called too early?
Date: Thu, 06 Jun 2019 11:26:10 +1000

 can you test this change?

   https://www.netbsd.org/~mrg/nvme_pci.diff

 this crashed when i booted an i386 kernel, but it's needed on the
 arm64 systems currently as interrupt routing isn't available when
 the interrupt controller isn't yet attached.

 (still doesn't work -- i do not get interrupts.)


 .mrg.

From: Thomas Klausner <wiz@netbsd.org>
To: matthew green <mrg@eterna.com.au>
Cc: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Subject: Re: kern/54275: nvme_poll() called too early?
Date: Thu, 6 Jun 2019 07:53:41 +0200

 Thanks for the patch!

 It makes the machine panic differently:

 (handcopied)

 nvme0: for admin queue interrupting at msix9 vec 0
 nvme0: Samsung SSD 970 EVO Plus 1TB, firmware 1B2QEXM7, serial SOMETHING
 ahcisata0 port 5: device present, speed: 6.GB/s
 ahcisata0 port 5: clearing WDCIL_RSI failed for drive 15 (THIS LINE DOES NOT HAVE A TIMESTAMP)
 ahcisata0 port 5: drive 15 reset failed, sdisabling PMP (THIS LINE DOES NOT HAVE A TIMESTAMP)
 ahcisata0 port 5: device present, speed: 6.GB/s
 uhidev...
 ukbd0...
 wskbd...
 uhidev...
 uhid...
 wd0...
 nvme0: for io queue 1 interrupting at msix9 vec 1 affinity to cpu0
 (repeat with increasing queue, vec, and cpuX)
 nvme0: for io queue 31 interrupting at msix9 vec 31 affinity to cpu30
 fatal page fault in supervisor mode
 trap type 6 code 0x10 rip 0 cs 0x8 rflags 0x10202 cr2 0 ilevel 0x3 rsp 0xffffac090732b028
 curlwp 0xfffffa619cdec980 pid 0.22 lowest kstack 0xffffac09073272c0
 kernel: page fault trap, code=0
 Stopped in pid 0.22 (system) at 0:fatal page fault in supervisor mode
 trap type 6 code 0x10 rip 0xffffffff8021e646 cs 0x8 rflags 0x10216 cr2 0xffff964b00000000 ilevel 0x8 rsp 0xffffac090732ac10
 curlwp 0xfffffa619cdec980 pid 0.22 lowest kstrack 0xffffac09073272c0
 kernel: page fault trap, code=0
 Stopped in pid 0.22 (system) at netbsd:db_disasm+0xec: movq 0(%rdx,%rcx,8),%
 rcx
 db{1}> bt
 db_disasm()
 db_trap()
 kdb_trap()
 trap()
 --- trap (number 6) ---
 ?() at 0
 softint_dispatch()
 DDB lost frame for netbsd:Xsoftintr
 Xsoftintr()
 --- interrupt ---
 1bce138d178e1389:
 db{1}>

 Cheers,
  Thomas

From: Thomas Klausner <wiz@NetBSD.org>
To: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: kern/54275: nvme_poll() called too early?
Date: Mon, 10 Jun 2019 10:21:09 +0200

 The following diff makes it boot with nvme for me:

 Index: nvme_pci.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/pci/nvme_pci.c,v
 retrieving revision 1.26
 diff -u -r1.26 nvme_pci.c
 --- nvme_pci.c	23 Jan 2019 06:56:19 -0000	1.26
 +++ nvme_pci.c	10 Jun 2019 08:18:33 -0000
 @@ -64,7 +64,7 @@
  #include <dev/ic/nvmereg.h>
  #include <dev/ic/nvmevar.h>

 -int nvme_pci_force_intx = 0;
 +int nvme_pci_force_intx = 1;
  int nvme_pci_mpsafe = 1;
  int nvme_pci_mq = 1;		/* INTx: ioq=1, MSI/MSI-X: ioq=ncpu */



 There doesn't seem to be a quirk for this. How do we want to handle this?
  Thomas

From: "NONAKA Kimihiro" <nonaka@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54275 CVS commit: src/sys/dev/ic
Date: Fri, 20 Sep 2019 05:32:42 +0000

 Module Name:	src
 Committed By:	nonaka
 Date:		Fri Sep 20 05:32:42 UTC 2019

 Modified Files:
 	src/sys/dev/ic: nvme.c

 Log Message:
 Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().

 A new completion queue entry check incorrectly determined that there was
 a Completion Queue entry for a command that was not submitted.

 Fix PR kern/54275, PR kern/54503, PR kern/54532.


 To generate a diff of this commit:
 cvs rdiff -u -r1.44 -r1.45 src/sys/dev/ic/nvme.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54275 CVS commit: [netbsd-9] src/sys/dev/ic
Date: Sun, 22 Sep 2019 12:18:56 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Sun Sep 22 12:18:56 UTC 2019

 Modified Files:
 	src/sys/dev/ic [netbsd-9]: nvme.c

 Log Message:
 Pull up following revision(s) (requested by nonaka in ticket #218):

 	sys/dev/ic/nvme.c: revision 1.45

 Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().

 A new completion queue entry check incorrectly determined that there was
 a Completion Queue entry for a command that was not submitted.

 Fix PR kern/54275, PR kern/54503, PR kern/54532.


 To generate a diff of this commit:
 cvs rdiff -u -r1.44 -r1.44.2.1 src/sys/dev/ic/nvme.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: wiz@NetBSD.org
State-Changed-When: Sun, 22 Sep 2019 19:13:33 +0000
State-Changed-Why:
nonaka's patch seems to work fine.
The kernel boots without issues, and it has survived building NetBSD
and a limited bulk build - much further than it usually got.
Thank you, nonaka!


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54275 CVS commit: [netbsd-8] src/sys/dev/ic
Date: Wed, 25 Sep 2019 15:49:17 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Wed Sep 25 15:49:17 UTC 2019

 Modified Files:
 	src/sys/dev/ic [netbsd-8]: nvme.c

 Log Message:
 Pull up following revision(s) (requested by nonaka in ticket #1386):

 	sys/dev/ic/nvme.c: revision 1.45

 Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().

 A new completion queue entry check incorrectly determined that there was
 a Completion Queue entry for a command that was not submitted.

 Fix PR kern/54275, PR kern/54503, PR kern/54532.


 To generate a diff of this commit:
 cvs rdiff -u -r1.30.2.4 -r1.30.2.5 src/sys/dev/ic/nvme.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.