NetBSD Problem Report #52034
From www@NetBSD.org Sun Mar 5 11:19:23 2017
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id B14917A16E
for <gnats-bugs@gnats.NetBSD.org>; Sun, 5 Mar 2017 11:19:23 +0000 (UTC)
Message-Id: <20170305111922.BAEA07A1FC@mollari.NetBSD.org>
Date: Sun, 5 Mar 2017 11:19:22 +0000 (UTC)
From: bsiegert@NetBSD.org
Reply-To: bsiegert@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: Kernel panic on Google Compute Engine
X-Send-Pr-Version: www-1.0
>Number: 52034
>Category: kern
>Synopsis: Kernel panic on Google Compute Engine
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: jdolecek
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Mar 05 11:20:00 +0000 2017
>Closed-Date: Sun Oct 29 12:18:42 +0000 2017
>Last-Modified: Sun Oct 29 12:18:42 +0000 2017
>Originator: Benny Siegert
>Release: NetBSD 7.1_RC2
>Organization:
The NetBSD Foundation
>Environment:
>Description:
My NetBSD (amd64) test machine on Google Compute Engine tends to crash whenever there is significant disk activity, such as installing packages or unpacking tarballs.
I managed to grab this backtrace from the serial console:
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff80113ac4 cs 8 rflags 10202 cr2 ffff80008b58c000 ilevel 0 rsp fffffe810e367c08
curlwp 0xfffffe81104279a0 pid 790.1 lowest kstack 0xfffffe810e3652c0
panic: trap
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
uiomove() at netbsd:uiomove+0x91
ubc_uiomove() at netbsd:ubc_uiomove+0xa6
ffs_read() at netbsd:ffs_read+0x372
VOP_READ() at netbsd:VOP_READ+0x37
vn_read() at netbsd:vn_read+0x94
dofileread() at netbsd:dofileread+0x90
sys_read() at netbsd:sys_read+0x5f
syscall() at netbsd:syscall+0x9a
--- syscall (number 3) ---
7f7ff6c3c3ba:
cpu0: End traceback...
>How-To-Repeat:
Launch a NetBSD instance on GCE (https://github.com/google/netbsd-gce/ may help).
Use the disk.
>Fix:
?
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: bsiegert@NetBSD.org
Responsible-Changed-When: Mon, 06 Mar 2017 17:06:35 +0000
Responsible-Changed-Why:
You hacked on the vioscsi driver relatively recently. Could you take a look please?
From: Benny Siegert <bsiegert@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52034: Kernel panic on Google Compute Engine
Date: Mon, 6 Mar 2017 18:01:10 +0100
Maya suggested a DIAGNOSTIC kernel. A 7.1_RC2 DIAGNOSTIC kernel does
not even make it through the boot. Instead, it crashes with an
assertion in the vioscsi driver:
acpicpu0 at cpu0: ACPI CPU
acpicpu1 at cpu1: ACPI CPU
panic: kernel diagnostic assertion "(!cpu_intr_p() &&
!cpu_softintr_p()) || (pc->pc_pool.pr_ipl != IPL_NONE || cold ||
panicstr != NULL)" failed: file "/usr/src/sys/kern/subr_pool.c", line
2211 pool 'vmmpepl' is IPL_NONE, but called from interrupt context
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff80289b9d cs 8 rflags 246 cr2 0 ilevel 6
rsp fffffe810da90bc8
curlwp 0xfffffe821f766440 pid 0.6 lowest kstack 0xfffffe810da8d2c0
Stopped in pid 0.6 (system) at netbsd:breakpoint+0x5: leave
db{0}> bt
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x13c
kern_assert() at netbsd:kern_assert+0x4f
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x12c
uvm_mapent_alloc.isra.2() at netbsd:uvm_mapent_alloc.isra.2+0x20
uvm_map_clip_start() at netbsd:uvm_map_clip_start+0x1b
uvm_unmap_remove() at netbsd:uvm_unmap_remove+0x2fe
uvm_unmap1() at netbsd:uvm_unmap1+0x35
_bus_dmamap_destroy.isra.11() at netbsd:_bus_dmamap_destroy.isra.11+0x41
vioscsi_req_put() at netbsd:vioscsi_req_put+0x51
vioscsi_vq_done() at netbsd:vioscsi_vq_done+0xaf
virtio_vq_intr() at netbsd:virtio_vq_intr+0x70
virtio_intr() at netbsd:virtio_intr+0x38
intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x19
Xintr_ioapic_level3() at netbsd:Xintr_ioapic_level3+0xf2
--- interrupt ---
Xspllower() at netbsd:Xspllower+0xe
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810da90ff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
db{0}> show procs
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
0 34 2 0 200 fffffe810e315560 vmem_rehash
0 25 3 0 200 fffffe810db360e0 scsibus0 xscmd
0 24 3 0 200 fffffe810db36500 lnxsyswq lnxsyswq
0 23 3 0 200 fffffe810db36920 pms0 pmsreset
0 22 3 1 200 fffffe810dad50c0 xcall/1 xcall
0 21 1 1 200 fffffe810dad54e0 softser/1
0 20 1 1 200 fffffe810dad5900 softclk/1
0 19 1 1 200 fffffe810dabe0a0 softbio/1
0 18 1 1 200 fffffe810dabe4c0 softnet/1
0 > 17 7 1 201 fffffe810dabe8e0 idle/1
0 16 3 0 200 fffffe821db3d080 sysmon smtaskq
0 15 3 0 200 fffffe821db3d4a0 pmfsuspend pmfsuspend
0 14 3 0 200 fffffe821db3d8c0 pmfevent pmfevent
0 13 3 0 200 fffffe821eb57060 sopendfree sopendfr
0 12 3 0 200 fffffe821eb57480 nfssilly nfssilly
0 11 3 0 200 fffffe821eb578a0 cachegc cachegc
0 10 3 0 200 fffffe821f75f040 vrele vrele
0 9 3 0 200 fffffe821f75f460 vdrain vdrain
0 8 3 0 200 fffffe821f75f880 modunload mod_unld
0 7 3 0 200 fffffe821f766020 xcall/0 xcall
0 6 1 0 200 fffffe821f766440 softser/0
0 5 1 0 200 fffffe821f766860 softclk/0
0 4 1 0 200 fffffe821f76f000 softbio/0
0 3 1 0 200 fffffe821f76f420 softnet/0
0 2 1 0 201 fffffe821f76f840 idle/0
0 > 1 7 0 200 ffffffff810c25e0 swapper
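For readers following the assertion above, here is a minimal user-space sketch of the predicate it checks. It is not the code in subr_pool.c; the struct, the function name and the SKETCH_IPL_* values are hypothetical stand-ins for cpu_intr_p(), cpu_softintr_p(), cold, panicstr and the pool's pr_ipl.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel IPL constants. */
enum { SKETCH_IPL_NONE = 0, SKETCH_IPL_VM = 1 };

/* Hypothetical snapshot of the state the assertion looks at. */
struct sketch_ctx {
	bool in_hardintr;	/* models cpu_intr_p() */
	bool in_softintr;	/* models cpu_softintr_p() */
	bool cold;		/* models the kernel "cold" flag */
	bool panicking;		/* models panicstr != NULL */
};

/*
 * Mirrors the shape of the asserted expression: allocation is fine
 * from thread context, or from interrupt context when the pool was
 * created with an IPL other than IPL_NONE (or during early boot or
 * while panicking).
 */
static bool
sketch_pool_get_allowed(const struct sketch_ctx *c, int pool_ipl)
{
	return (!c->in_hardintr && !c->in_softintr) ||
	    (pool_ipl != SKETCH_IPL_NONE || c->cold || c->panicking);
}
```

In the backtrace above the allocation is reached from virtio_intr(), so the first half of the expression is false, and the message says the pool is IPL_NONE, so the second half is false as well and the assertion fires.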
State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 06 Mar 2017 21:21:59 +0000
State-Changed-Why:
A bugfix for the vioscsi driver was committed to -current several months
ago. It addressed a resource leak on SCSI errors and might fix your issue.
Can you test with a -current kernel?
From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52034 (Kernel panic on Google Compute Engine)
Date: Mon, 6 Mar 2017 22:28:09 +0100
I see the patch was pulled up to netbsd-7 branch, so actually -current
should behave the same. Still, can you please check it out?
Jaromir
2017-03-06 22:21 GMT+01:00 <jdolecek@netbsd.org>:
> Synopsis: Kernel panic on Google Compute Engine
>
> State-Changed-From-To: open->feedback
> State-Changed-By: jdolecek@NetBSD.org
> State-Changed-When: Mon, 06 Mar 2017 21:21:59 +0000
> State-Changed-Why:
> There was a bugfix committed several months back for virtscsi driver
> on -current. It was related to resource leak on SCSI errors, potentially
> might fix your issue. Can you test with -current kernel?
>
>
>
State-Changed-From-To: feedback->analyzed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 07 Mar 2017 19:23:21 +0000
State-Changed-Why:
I'll refactor the driver to avoid creating/destroying the dmamaps in each
request, which will also fix this crash.
State-Changed-From-To: analyzed->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 07 Mar 2017 22:05:04 +0000
State-Changed-Why:
A fix was committed to -current; can you try it out?
FWIW, I tried booting a 7.1_RC2 kernel on a 'micro' Google Compute instance
and it worked, as did a -current kernel from before my change. I guess
it depends on the amount of RAM in the VM.
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/52034 CVS commit: src/sys/dev/pci
Date: Tue, 7 Mar 2017 22:03:04 +0000
Module Name: src
Committed By: jdolecek
Date: Tue Mar 7 22:03:04 UTC 2017
Modified Files:
src/sys/dev/pci: vioscsi.c
Log Message:
allocate bus dma maps during attachment, rather than creating and destroying
them for each request; besides being faster, bus_dmamap_destroy() is not
safe to be called from interrupt context
addresses PR kern/52034 by Benny Siegert
To generate a diff of this commit:
cvs rdiff -u -r1.8 -r1.9 src/sys/dev/pci/vioscsi.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
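The idea in the log message above can be sketched in user space as follows. This is not the code from vioscsi.c: every name here (struct sketch_softc, sketch_attach, sketch_req_get, sketch_req_put, SKETCH_NREQ) is hypothetical, and the fake map only stands in for a real bus_dmamap_t. The point is the lifecycle: maps are created once at attach time, and the completion path, which runs from the interrupt handler, only unloads and releases a slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NREQ 4		/* hypothetical number of request slots */

/* Stand-in for a DMA map; the kernel would use bus_dmamap_t. */
struct sketch_dmamap {
	bool created;
	bool loaded;
};

struct sketch_req {
	struct sketch_dmamap map;
	bool in_use;
};

struct sketch_softc {
	struct sketch_req slots[SKETCH_NREQ];
	int maps_created;	/* counts the expensive create operations */
};

/* Attach time, thread context: create every map exactly once. */
static void
sketch_attach(struct sketch_softc *sc)
{
	for (size_t i = 0; i < SKETCH_NREQ; i++) {
		sc->slots[i].map.created = true;
		sc->slots[i].map.loaded = false;
		sc->slots[i].in_use = false;
		sc->maps_created++;
	}
}

/* Request start: take a free slot and load its existing map. */
static struct sketch_req *
sketch_req_get(struct sketch_softc *sc)
{
	for (size_t i = 0; i < SKETCH_NREQ; i++) {
		struct sketch_req *r = &sc->slots[i];
		if (!r->in_use) {
			r->in_use = true;
			r->map.loaded = true;
			return r;
		}
	}
	return NULL;		/* all slots busy */
}

/* Completion, interrupt context: unload only, never destroy. */
static void
sketch_req_put(struct sketch_req *r)
{
	assert(r->in_use && r->map.created);
	r->map.loaded = false;
	r->in_use = false;
}
```

With this shape no map is created or destroyed per request, so the completion path has no reason to enter the allocator or unmap code from interrupt context.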
From: Benny Siegert <bsiegert@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: jdolecek@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/52034 (Kernel panic on Google Compute Engine)
Date: Wed, 8 Mar 2017 20:21:51 +0100
With a non-DIAGNOSTIC kernel, there is no error or anything, it just
tends to panic if you do disk-intensive workloads, e.g. unpacking a
pkgsrc tarball.
I tried a DIAGNOSTIC kernel on -current (just before your fix) and had
no problem. Will try the kernel with the refactored driver next.
From: Benny Siegert <bsiegert@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: jdolecek@NetBSD.org,
gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org,
bsiegert@NetBSD.org
Subject: Re: kern/52034 (Kernel panic on Google Compute Engine)
Date: Sun, 2 Jul 2017 12:32:26 +0100
> With a non-DIAGNOSTIC kernel, there is no error or anything, it just
> tends to panic if you do disk-intensive workloads, e.g. unpacking a
> pkgsrc tarball.
>
> I tried a DIAGNOSTIC kernel on -current (just before your fix) and had
> no problem. Will try the kernel with the refactored driver next.
I tried the following:
I downloaded
http://nyftp.netbsd.org/pub/NetBSD-daily/netbsd-7-1/201706301130Z/amd64/binary/kernel/netbsd-GENERIC.gz.
This identifies as NetBSD 7.1.0_PATCH (GENERIC.201706301130Z).
Unpacking a pkgsrc tarball makes the whole machine hang after a few
minutes; I can see disk activity going to 0 in external monitoring.
Breaking into ddb at this point shows that the machine is „idle“ and all
processes with disk access are in tstile:
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff8028259d cs 8 rflags 202 cr2 ffff80008af4e000 ilevel 8 rsp fffffe810da84d00
curlwp 0xfffffe821f770840 pid 0.2 lowest kstack 0xfffffe810da822c0
Stopped in pid 0.2 (system) at netbsd:breakpoint+0x5: leave
db{0}> bt
breakpoint() at netbsd:breakpoint+0x5
comintr() at netbsd:comintr+0x524
Xintr_ioapic_edge4() at netbsd:Xintr_ioapic_edge4+0xea
--- interrupt ---
x86_stihlt() at netbsd:x86_stihlt+0x6
acpicpu_cstate_idle_enter() at netbsd:acpicpu_cstate_idle_enter+0xc2
acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0x6d
idle_loop() at netbsd:idle_loop+0xe8
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
107 1 3 1 0 fffffe8117b405c0 cron tstile
826 1 3 0 0 fffffe821d9b6660 tar tstile
831 1 3 0 80 fffffe8117b401a0 xzcat pipe_wr
510 1 3 1 80 fffffe810e23e120 ksh pause
623 1 3 0 80 fffffe810e23e540 sshd select
661 1 3 0 80 fffffe810e20d940 sshd select
677 1 3 1 0 fffffe810e217980 getty tstile
804 1 3 1 80 fffffe821dae7600 cron nanoslp
636 1 3 0 80 fffffe821be8dac0 inetd kqueue
461 1 3 1 80 fffffe8167ed81c0 qmgr kqueue
605 1 3 0 80 fffffe821d9b3200 pickup kqueue
753 1 3 1 80 fffffe821bba4260 master kqueue
507 1 3 0 80 fffffe821da84220 sshd select
435 1 3 1 80 fffffe821dae7a20 powerd kqueue
295 1 3 1 0 fffffe821dae71e0 syslogd tstile
249 1 3 0 80 fffffe8167ed8a00 dhcpcd kqueue
1 1 3 1 80 fffffe810e39d9a0 init wait
0 42 3 0 200 fffffe810eade9c0 physiod physiod
0 41 3 1 200 fffffe8117b409e0 aiodoned aiodoned
0 40 3 1 200 fffffe810eade180 ioflush tstile
0 39 3 1 200 fffffe810eade5a0 pgdaemon pgdaemon
State-Changed-From-To: feedback->open
State-Changed-By: bsiegert@NetBSD.org
State-Changed-When: Sun, 02 Jul 2017 12:10:13 +0000
State-Changed-Why:
Feedback sent.
From: Benny Siegert <bsiegert@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: jdolecek@NetBSD.org,
gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org,
bsiegert@NetBSD.org
Subject: Re: kern/52034 (Kernel panic on Google Compute Engine)
Date: Sun, 2 Jul 2017 13:08:33 +0100
With a daily netbsd-7 kernel
(nyftp.netbsd.org/pub/NetBSD-daily/netbsd-7/201706292340Z/amd64/binary/kernel/netbsd-GENERIC.gz),
identifying as NetBSD 7.1_STABLE (GENERIC.201706292340Z), I get an actual
kernel panic when unpacking the pkgsrc archive:
login: fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff80113aa4 cs 8 rflags 10207 cr2 ffff80008b2fc000 ilevel 0 rsp fffffe810e2e1c08
curlwp 0xfffffe810e23e960 pid 40.1 lowest kstack 0xfffffe810e2df2c0
panic: trap
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
uiomove() at netbsd:uiomove+0x91
ubc_uiomove() at netbsd:ubc_uiomove+0xa6
ffs_read() at netbsd:ffs_read+0x372
VOP_READ() at netbsd:VOP_READ+0x37
vn_read() at netbsd:vn_read+0x94
dofileread() at netbsd:dofileread+0x90
sys_read() at netbsd:sys_read+0x5f
syscall() at netbsd:syscall+0x9a
--- syscall (number 3) ---
7f7ff703c3ba:
cpu0: End traceback...
State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sun, 22 Oct 2017 13:27:14 +0000
State-Changed-Why:
vioscsi has some fixes on HEAD and netbsd-8 branches. Can you try that instead
of netbsd-7?
State-Changed-From-To: feedback->closed
State-Changed-By: bsiegert@NetBSD.org
State-Changed-When: Sun, 29 Oct 2017 12:18:42 +0000
State-Changed-Why:
The problem no longer occurs on either netbsd-8 or HEAD.
It is a pity that -7 does not have these patches, but recommending
NetBSD-8 is a usable workaround. Thanks!
>Unformatted: