NetBSD Problem Report #54275
From wiz@yt.nih.at Wed Jun 5 23:09:10 2019
Return-Path: <wiz@yt.nih.at>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id DBAB97A1C2
for <gnats-bugs@gnats.NetBSD.org>; Wed, 5 Jun 2019 23:09:10 +0000 (UTC)
Message-Id: <20190605214340.677602AC0FC@yt.nih.at>
Date: Wed, 5 Jun 2019 23:43:40 +0200 (CEST)
From: Thomas Klausner <wiz@NetBSD.org>
Reply-To: Thomas Klausner <wiz@NetBSD.org>
To: gnats-bugs@NetBSD.org
Subject: nvme_poll() called too early?
X-Send-Pr-Version: 3.95
>Number: 54275
>Category: kern
>Synopsis: nvme_poll() called too early?
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jun 05 23:10:00 +0000 2019
>Closed-Date: Sun Sep 22 19:13:33 +0000 2019
>Last-Modified: Wed Sep 25 15:50:03 +0000 2019
>Originator: Thomas Klausner
>Release: NetBSD 8.99.41
>Organization:
Curiosity is the very basis of education and if you tell me that
curiosity killed the cat, I say only that the cat died nobly.
- Arnold Edinborough
>Environment:
Architecture: x86_64
Machine: amd64
>Description:
Booting the amd64 USB install image from
https://nyftp.netbsd.org/pub/NetBSD-daily/HEAD/201905242230Z/images/
on a new Threadripper system with a Samsung SSD 1TB 970 Evo plus, I
see (handcopied):
panic: prevented jump to null instruction pointer (SMEP)
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x160
snprintf() at netbsd:snprintf
trap() at netbsd:trap+0x2ab
--- trap (number 6) ---
?() at 0
nvme_poll() at netbsd:nvme_poll+0x104
nvme_attach() at netbsd:nvme_atached+0x927
nvme_pci_attach() at netbsd_nvme_pci_attach+0x309
config_attach_loc()...
config_found_sm_loc()...
pci_probe_device()...
pci_enumerate_bus()...
pcirescan()...
pciattach()...
config_attach_loc()...
config_found_sm_loc()...
mp_pci_scan()...
amd64_mainbus_attach()...
mainbus_attach()...
config_attach_loc()...
cpu_configure()...
main()...
cpu0: End traceback
Disabling nvme with "userconf disable nvme*" makes the kernel boot successfully.
OpenBSD 6.5 and FreeBSD 12.0 boot successfully on the same system, including
the nvme device.
dmesg output for the nvme device (on NetBSD):
nvme0 at pci0 dev 0 function 0: vendor 144d product a808 (rev. 0x00) (not sure if it's 808 or 000 or something inbetween, bad screenshot)
nvme0: NVMe 1.3
nvme0: for admin queue interrupting at msix3 vec 0
nvme0: for io queue ... interrupting at msix3 vec ... affinity to cpu ... (where ... goes from 0 to 31)
>How-To-Repeat:
Boot on the system with nvme enabled (default on).
>Fix:
Please!
>Release-Note:
>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org, wiz@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: re: kern/54275: nvme_poll() called too early?
Date: Thu, 06 Jun 2019 11:26:10 +1000
can you test this change?
https://www.netbsd.org/~mrg/nvme_pci.diff
this crashed when i booted an i386 kernel, but it's needed on the
arm64 systems currently as interrupt routing isn't available when
the interrupt controller isn't yet attached.
(still doesn't work -- i do not get interrupts.)
.mrg.
From: Thomas Klausner <wiz@netbsd.org>
To: matthew green <mrg@eterna.com.au>
Cc: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Subject: Re: kern/54275: nvme_poll() called too early?
Date: Thu, 6 Jun 2019 07:53:41 +0200
Thanks for the patch!
It makes the machine panic differently:
(handcopied)
nvme0: for admin queue interrupting at msix9 vec 0
nvme0: Samsung SSD 970 EVO Plus 1TB, firmware 1B2QEXM7, serial SOMETHING
ahcisata0 port 5: device present, speed: 6.GB/s
ahcisata0 port 5: clearing WDCIL_RSI failed for drive 15 (THIS LINE DOES NOT HAVE A TIMESTAMP)
ahcisata0 port 5: drive 15 reset failed, sdisabling PMP (THIS LINE DOES NOT HAVE A TIMESTAMP)
ahcisata0 port 5: device present, speed: 6.GB/s
uhidev...
ukbd0...
wskbd...
uhidev...
uhid...
wd0...
nvme0: for io queue 1 interrupting at msix9 vec 1 affinity to cpu0
(repeat with increasing queue, vec, and cpuX)
nvme0: for io queue 31 interrupting at msix9 vec 31 affinity to cpu30
fatal page fault in supervisor mode
trap type 6 code 0x10 rip 0 cs 0x8 rflags 0x10202 cr2 0 ilevel 0x3 rsp 0xffffac090732b028
curlwp 0xfffffa619cdec980 pid 0.22 lowest kstack 0xffffac09073272c0
kernel: page fault trap, code=0
Stopped in pid 0.22 (system) at 0:fatal page fault in supervisor mode
trap type 6 code 0x10 rip 0xffffffff8021e646 cs 0x8 rflags 0x10216 cr2 0xffff964b00000000 ilevel 0x8 rsp 0xffffac090732ac10
curlwp 0xfffffa619cdec980 pid 0.22 lowest kstrack 0xffffac09073272c0
kernel: page fault trap, code=0
Stopped in pid 0.22 (system) at netbsd:db_disasm+0xec: movq 0(%rdx,%rcx,8),%
rcx
db{1}> bt
db_disasm()
db_trap()
kdb_trap()
trap()
--- trap (number 6) ---
?() at 0
softint_dispatch()
DDB lost frame for netbsd:Xsoftintr
Xsoftintr()
--- interrupt ---
1bce138d178e1389:
db{1}>
Cheers,
Thomas
From: Thomas Klausner <wiz@NetBSD.org>
To: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Cc:
Subject: Re: kern/54275: nvme_poll() called too early?
Date: Mon, 10 Jun 2019 10:21:09 +0200
The following diff makes it boot with nvme for me:
Index: nvme_pci.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/nvme_pci.c,v
retrieving revision 1.26
diff -u -r1.26 nvme_pci.c
--- nvme_pci.c 23 Jan 2019 06:56:19 -0000 1.26
+++ nvme_pci.c 10 Jun 2019 08:18:33 -0000
@@ -64,7 +64,7 @@
#include <dev/ic/nvmereg.h>
#include <dev/ic/nvmevar.h>
-int nvme_pci_force_intx = 0;
+int nvme_pci_force_intx = 1;
int nvme_pci_mpsafe = 1;
int nvme_pci_mq = 1; /* INTx: ioq=1, MSI/MSI-X: ioq=ncpu */
There doesn't seem to be a quirk for this. How do we want to handle this?
Thomas
From: "NONAKA Kimihiro" <nonaka@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54275 CVS commit: src/sys/dev/ic
Date: Fri, 20 Sep 2019 05:32:42 +0000
Module Name: src
Committed By: nonaka
Date: Fri Sep 20 05:32:42 UTC 2019
Modified Files:
src/sys/dev/ic: nvme.c
Log Message:
Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().
A new completion queue entry check incorrectly determined that there was
a Completion Queue entry for a command that was not submitted.
Fix PR kern/54275, PR kern/54503, PR kern/54532.
To generate a diff of this commit:
cvs rdiff -u -r1.44 -r1.45 src/sys/dev/ic/nvme.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54275 CVS commit: [netbsd-9] src/sys/dev/ic
Date: Sun, 22 Sep 2019 12:18:56 +0000
Module Name: src
Committed By: martin
Date: Sun Sep 22 12:18:56 UTC 2019
Modified Files:
src/sys/dev/ic [netbsd-9]: nvme.c
Log Message:
Pull up following revision(s) (requested by nonaka in ticket #218):
sys/dev/ic/nvme.c: revision 1.45
Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().
A new completion queue entry check incorrectly determined that there was
a Completion Queue entry for a command that was not submitted.
Fix PR kern/54275, PR kern/54503, PR kern/54532.
To generate a diff of this commit:
cvs rdiff -u -r1.44 -r1.44.2.1 src/sys/dev/ic/nvme.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: wiz@NetBSD.org
State-Changed-When: Sun, 22 Sep 2019 19:13:33 +0000
State-Changed-Why:
nonaka's patch seems to work fine.
The kernel boots without issues, and it has survived building NetBSD
and a limited bulk build - much further than it usually got.
Thank you, nonaka!
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54275 CVS commit: [netbsd-8] src/sys/dev/ic
Date: Wed, 25 Sep 2019 15:49:17 +0000
Module Name: src
Committed By: martin
Date: Wed Sep 25 15:49:17 UTC 2019
Modified Files:
src/sys/dev/ic [netbsd-8]: nvme.c
Log Message:
Pull up following revision(s) (requested by nonaka in ticket #1386):
sys/dev/ic/nvme.c: revision 1.45
Don't set Phase Tag bit of Completion Queue entry at nvme_poll_done().
A new completion queue entry check incorrectly determined that there was
a Completion Queue entry for a command that was not submitted.
Fix PR kern/54275, PR kern/54503, PR kern/54532.
To generate a diff of this commit:
cvs rdiff -u -r1.30.2.4 -r1.30.2.5 src/sys/dev/ic/nvme.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.