NetBSD Problem Report #52034

From www@NetBSD.org  Sun Mar  5 11:19:23 2017
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id B14917A16E
	for <gnats-bugs@gnats.NetBSD.org>; Sun,  5 Mar 2017 11:19:23 +0000 (UTC)
Message-Id: <20170305111922.BAEA07A1FC@mollari.NetBSD.org>
Date: Sun,  5 Mar 2017 11:19:22 +0000 (UTC)
From: bsiegert@NetBSD.org
Reply-To: bsiegert@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: Kernel panic on Google Compute Engine
X-Send-Pr-Version: www-1.0

>Number:         52034
>Category:       kern
>Synopsis:       Kernel panic on Google Compute Engine
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    jdolecek
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Mar 05 11:20:00 +0000 2017
>Closed-Date:    Sun Oct 29 12:18:42 +0000 2017
>Last-Modified:  Sun Oct 29 12:18:42 +0000 2017
>Originator:     Benny Siegert
>Release:        NetBSD 7.1_RC2
>Organization:
The NetBSD Foundation
>Environment:
>Description:
My NetBSD (amd64) test machine on Google Compute Engine tends to crash whenever there is significant disk activity, such as installing packages or unpacking tarballs.

I managed to grab this backtrace from the serial console:


fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff80113ac4 cs 8 rflags 10202 cr2 ffff80008b58c000 ilevel 0 rsp fffffe810e367c08
curlwp 0xfffffe81104279a0 pid 790.1 lowest kstack 0xfffffe810e3652c0
panic: trap
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
uiomove() at netbsd:uiomove+0x91
ubc_uiomove() at netbsd:ubc_uiomove+0xa6
ffs_read() at netbsd:ffs_read+0x372
VOP_READ() at netbsd:VOP_READ+0x37
vn_read() at netbsd:vn_read+0x94
dofileread() at netbsd:dofileread+0x90
sys_read() at netbsd:sys_read+0x5f
syscall() at netbsd:syscall+0x9a
--- syscall (number 3) ---
7f7ff6c3c3ba:
cpu0: End traceback...
>How-To-Repeat:
Launch a NetBSD instance on GCE (https://github.com/google/netbsd-gce/ may help).

Generate significant disk activity, e.g. install packages or unpack a pkgsrc tarball.
>Fix:
?

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: bsiegert@NetBSD.org
Responsible-Changed-When: Mon, 06 Mar 2017 17:06:35 +0000
Responsible-Changed-Why:
You hacked on the vioscsi driver relatively recently. Could you take a look please?


From: Benny Siegert <bsiegert@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52034: Kernel panic on Google Compute Engine
Date: Mon, 6 Mar 2017 18:01:10 +0100

 Maya suggested a DIAGNOSTIC kernel. A 7.1_RC2 DIAGNOSTIC kernel does
 not even make it through the boot. Instead, it crashes with an
 assertion in the vioscsi driver:

 acpicpu0 at cpu0: ACPI CPU
 acpicpu1 at cpu1: ACPI CPU
 panic: kernel diagnostic assertion "(!cpu_intr_p() &&
 !cpu_softintr_p()) || (pc->pc_pool.pr_ipl != IPL_NONE || cold ||
 panicstr != NULL)" failed: file "/usr/src/sys/kern/subr_pool.c", line
 2211 pool 'vmmpepl' is IPL_NONE, but called from interrupt context

 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip ffffffff80289b9d cs 8 rflags 246 cr2 0 ilevel 6
 rsp fffffe810da90bc8
 curlwp 0xfffffe821f766440 pid 0.6 lowest kstack 0xfffffe810da8d2c0
 Stopped in pid 0.6 (system) at  netbsd:breakpoint+0x5:  leave
 db{0}> bt
 breakpoint() at netbsd:breakpoint+0x5
 vpanic() at netbsd:vpanic+0x13c
 kern_assert() at netbsd:kern_assert+0x4f
 pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x12c
 uvm_mapent_alloc.isra.2() at netbsd:uvm_mapent_alloc.isra.2+0x20
 uvm_map_clip_start() at netbsd:uvm_map_clip_start+0x1b
 uvm_unmap_remove() at netbsd:uvm_unmap_remove+0x2fe
 uvm_unmap1() at netbsd:uvm_unmap1+0x35
 _bus_dmamap_destroy.isra.11() at netbsd:_bus_dmamap_destroy.isra.11+0x41
 vioscsi_req_put() at netbsd:vioscsi_req_put+0x51
 vioscsi_vq_done() at netbsd:vioscsi_vq_done+0xaf
 virtio_vq_intr() at netbsd:virtio_vq_intr+0x70
 virtio_intr() at netbsd:virtio_intr+0x38
 intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x19
 Xintr_ioapic_level3() at netbsd:Xintr_ioapic_level3+0xf2
 --- interrupt ---
 Xspllower() at netbsd:Xspllower+0xe
 DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810da90ff0
 Xsoftintr() at netbsd:Xsoftintr+0x4f
 --- interrupt ---

 db{0}> show procs
 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 0       34 2   0       200   fffffe810e315560        vmem_rehash
 0       25 3   0       200   fffffe810db360e0           scsibus0 xscmd
 0       24 3   0       200   fffffe810db36500           lnxsyswq lnxsyswq
 0       23 3   0       200   fffffe810db36920               pms0 pmsreset
 0       22 3   1       200   fffffe810dad50c0            xcall/1 xcall
 0       21 1   1       200   fffffe810dad54e0          softser/1
 0       20 1   1       200   fffffe810dad5900          softclk/1
 0       19 1   1       200   fffffe810dabe0a0          softbio/1
 0       18 1   1       200   fffffe810dabe4c0          softnet/1
 0    >  17 7   1       201   fffffe810dabe8e0             idle/1
 0       16 3   0       200   fffffe821db3d080             sysmon smtaskq
 0       15 3   0       200   fffffe821db3d4a0         pmfsuspend pmfsuspend
 0       14 3   0       200   fffffe821db3d8c0           pmfevent pmfevent
 0       13 3   0       200   fffffe821eb57060         sopendfree sopendfr
 0       12 3   0       200   fffffe821eb57480           nfssilly nfssilly
 0       11 3   0       200   fffffe821eb578a0            cachegc cachegc
 0       10 3   0       200   fffffe821f75f040              vrele vrele
 0        9 3   0       200   fffffe821f75f460             vdrain vdrain
 0        8 3   0       200   fffffe821f75f880          modunload mod_unld
 0        7 3   0       200   fffffe821f766020            xcall/0 xcall
 0        6 1   0       200   fffffe821f766440          softser/0
 0        5 1   0       200   fffffe821f766860          softclk/0
 0        4 1   0       200   fffffe821f76f000          softbio/0
 0        3 1   0       200   fffffe821f76f420          softnet/0
 0        2 1   0       201   fffffe821f76f840             idle/0
 0    >   1 7   0       200   ffffffff810c25e0            swapper
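
The backtrace above shows vioscsi_req_put() calling bus_dmamap_destroy() from the virtio interrupt handler, which ends in pool_cache_get_paddr() on a pool created with IPL_NONE. The following is a minimal userland model of the condition quoted in the panic message, not the subr_pool.c code itself. All names prefixed with model_ or MODEL_ are hypothetical stand-ins for the kernel state the assertion reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel IPL values; not NetBSD definitions. */
enum model_ipl { MODEL_IPL_NONE = 0, MODEL_IPL_BIO = 1 };

/* Hypothetical snapshot of the state the assertion inspects. */
struct model_ctx {
	bool in_hardintr;     /* models cpu_intr_p() */
	bool in_softintr;     /* models cpu_softintr_p() */
	bool cold;            /* models the early-boot "cold" flag */
	const char *panicstr; /* non-NULL once a panic is in progress */
};

/*
 * Mirrors the shape of the asserted expression:
 *   (!intr && !softintr) || (ipl != IPL_NONE || cold || panicstr != NULL)
 * Returns true when the pool get would pass the assertion.
 */
static bool
model_pool_get_allowed(const struct model_ctx *ctx, enum model_ipl pool_ipl)
{
	bool thread_context = !ctx->in_hardintr && !ctx->in_softintr;

	return thread_context ||
	    (pool_ipl != MODEL_IPL_NONE || ctx->cold || ctx->panicstr != NULL);
}
```

In this model, a context with in_hardintr set and a MODEL_IPL_NONE pool is rejected, which corresponds to the "is IPL_NONE, but called from interrupt context" message. The same pool is accepted from thread context.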

State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 06 Mar 2017 21:21:59 +0000
State-Changed-Why:
A bugfix for the vioscsi driver was committed to -current several months ago.
It fixed a resource leak on SCSI errors and might also fix your issue.
Can you test with a -current kernel?


From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52034 (Kernel panic on Google Compute Engine)
Date: Mon, 6 Mar 2017 22:28:09 +0100

 I see the patch was pulled up to netbsd-7 branch, so actually -current
 should behave the same. Still, can you please check it out?

 Jaromir

 2017-03-06 22:21 GMT+01:00  <jdolecek@netbsd.org>:
 > Synopsis: Kernel panic on Google Compute Engine
 >
 > State-Changed-From-To: open->feedback
 > State-Changed-By: jdolecek@NetBSD.org
 > State-Changed-When: Mon, 06 Mar 2017 21:21:59 +0000
 > State-Changed-Why:
 > A bugfix for the vioscsi driver was committed to -current several months ago.
 > It fixed a resource leak on SCSI errors and might also fix your issue.
 > Can you test with a -current kernel?
 >
 >
 >

State-Changed-From-To: feedback->analyzed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 07 Mar 2017 19:23:21 +0000
State-Changed-Why:
I'll refactor the driver to avoid creating/destroying the dmamaps in each
request, which will also fix this crash.


State-Changed-From-To: analyzed->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 07 Mar 2017 22:05:04 +0000
State-Changed-Why:
A fix was committed to -current, can you try it out?

FWIW, I tried booting a 7.1_RC2 kernel on a 'micro' Google Compute Engine
instance and it worked, as did a -current kernel from before my change. I guess
it depends on the amount of RAM in the VM.


From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52034 CVS commit: src/sys/dev/pci
Date: Tue, 7 Mar 2017 22:03:04 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Tue Mar  7 22:03:04 UTC 2017

 Modified Files:
 	src/sys/dev/pci: vioscsi.c

 Log Message:
 allocate bus dma maps during attachment, rather than creating and destroying
 them for each request; besides being faster, this avoids calling
 bus_dmamap_destroy() from interrupt context, where it is not safe

 addresses PR kern/52034 by Benny Siegert


 To generate a diff of this commit:
 cvs rdiff -u -r1.8 -r1.9 src/sys/dev/pci/vioscsi.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
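
The idea in the log message, creating the maps once at attach time and only reusing them per request, can be sketched in plain userland C. This is an illustrative model under stated assumptions, not the actual vioscsi.c change: every model_ or MODEL_ name is hypothetical, and malloc()/free() merely stand in for the map create and destroy calls that are unsafe in interrupt context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NREQ 4	/* hypothetical number of request slots */

struct model_dmamap { size_t maxsize; };	/* stand-in for a DMA map */

struct model_req {
	struct model_dmamap *map;	/* created once at attach */
	bool busy;
};

struct model_softc {
	struct model_req reqs[MODEL_NREQ];
	unsigned maps_created;		/* counts create calls */
};

/* Detach path (thread context): the only place maps are destroyed. */
static void
model_detach(struct model_softc *sc)
{
	for (size_t i = 0; i < MODEL_NREQ; i++) {
		free(sc->reqs[i].map);
		sc->reqs[i].map = NULL;
	}
}

/* Attach path (thread context): the only place maps are created. */
static int
model_attach(struct model_softc *sc, size_t maxsize)
{
	sc->maps_created = 0;
	for (size_t i = 0; i < MODEL_NREQ; i++) {
		sc->reqs[i].busy = false;
		sc->reqs[i].map = NULL;
	}
	for (size_t i = 0; i < MODEL_NREQ; i++) {
		sc->reqs[i].map = malloc(sizeof(*sc->reqs[i].map));
		if (sc->reqs[i].map == NULL) {
			model_detach(sc);	/* free(NULL) is harmless */
			return -1;
		}
		sc->reqs[i].map->maxsize = maxsize;
		sc->maps_created++;
	}
	return 0;
}

/* Request start: claim a free slot; nothing is allocated here. */
static struct model_req *
model_req_get(struct model_softc *sc)
{
	for (size_t i = 0; i < MODEL_NREQ; i++) {
		if (!sc->reqs[i].busy) {
			sc->reqs[i].busy = true;
			return &sc->reqs[i];
		}
	}
	return NULL;	/* all slots in use */
}

/* Completion path (interrupt context in a real driver): keeps the map. */
static void
model_req_put(struct model_req *req)
{
	req->busy = false;
}
```

In this model the completion path only clears a flag, so nothing that may allocate or sleep runs there, and a slot handed out again carries the same map it was given at attach.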

From: Benny Siegert <bsiegert@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: jdolecek@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/52034 (Kernel panic on Google Compute Engine)
Date: Wed, 8 Mar 2017 20:21:51 +0100

 With a non-DIAGNOSTIC kernel, there is no error or anything, it just
 tends to panic if you do disk-intensive workloads, e.g. unpacking a
 pkgsrc tarball.

 I tried a DIAGNOSTIC kernel on -current (just before your fix) and had
 no problem. Will try the kernel with the refactored driver next.

From: Benny Siegert <bsiegert@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: jdolecek@NetBSD.org,
 gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org,
 bsiegert@NetBSD.org
Subject: Re: kern/52034 (Kernel panic on Google Compute Engine)
Date: Sun, 2 Jul 2017 12:32:26 +0100

 > With a non-DIAGNOSTIC kernel, there is no error or anything, it just
 > tends to panic if you do disk-intensive workloads, e.g. unpacking a
 > pkgsrc tarball.
 >
 > I tried a DIAGNOSTIC kernel on -current (just before your fix) and had
 > no problem. Will try the kernel with the refactored driver next.

 I tried the following:

 I downloaded
 http://nyftp.netbsd.org/pub/NetBSD-daily/netbsd-7-1/201706301130Z/amd64/binary/kernel/netbsd-GENERIC.gz.
 This identifies as NetBSD 7.1.0_PATCH (GENERIC.201706301130Z).

 Unpacking a pkgsrc tarball makes the whole machine hang after a few
 minutes; I can see disk activity going to 0 in external monitoring.
 Breaking into ddb at this point shows that the machine is „idle“ and all
 processes with disk access are in tstile:

 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip ffffffff8028259d cs 8 rflags 202 cr2 ffff80008af4e000
 ilevel 8 rsp fffffe810da84d00
 curlwp 0xfffffe821f770840 pid 0.2 lowest kstack 0xfffffe810da822c0
 Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x5:  leave
 db{0}> bt
 breakpoint() at netbsd:breakpoint+0x5
 comintr() at netbsd:comintr+0x524
 Xintr_ioapic_edge4() at netbsd:Xintr_ioapic_edge4+0xea
 --- interrupt ---
 x86_stihlt() at netbsd:x86_stihlt+0x6
 acpicpu_cstate_idle_enter() at netbsd:acpicpu_cstate_idle_enter+0xc2
 acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0x6d
 idle_loop() at netbsd:idle_loop+0xe8
 db{0}> ps
 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 107      1 3   1         0   fffffe8117b405c0               cron tstile
 826      1 3   0         0   fffffe821d9b6660                tar tstile
 831      1 3   0        80   fffffe8117b401a0              xzcat pipe_wr
 510      1 3   1        80   fffffe810e23e120                ksh pause
 623      1 3   0        80   fffffe810e23e540               sshd select
 661      1 3   0        80   fffffe810e20d940               sshd select
 677      1 3   1         0   fffffe810e217980              getty tstile
 804      1 3   1        80   fffffe821dae7600               cron nanoslp
 636      1 3   0        80   fffffe821be8dac0              inetd kqueue
 461      1 3   1        80   fffffe8167ed81c0               qmgr kqueue
 605      1 3   0        80   fffffe821d9b3200             pickup kqueue
 753      1 3   1        80   fffffe821bba4260             master kqueue
 507      1 3   0        80   fffffe821da84220               sshd select
 435      1 3   1        80   fffffe821dae7a20             powerd kqueue
 295      1 3   1         0   fffffe821dae71e0            syslogd tstile
 249      1 3   0        80   fffffe8167ed8a00             dhcpcd kqueue
 1        1 3   1        80   fffffe810e39d9a0               init wait
 0       42 3   0       200   fffffe810eade9c0            physiod physiod
 0       41 3   1       200   fffffe8117b409e0           aiodoned aiodoned
 0       40 3   1       200   fffffe810eade180            ioflush tstile
 0       39 3   1       200   fffffe810eade5a0           pgdaemon pgdaemon

State-Changed-From-To: feedback->open
State-Changed-By: bsiegert@NetBSD.org
State-Changed-When: Sun, 02 Jul 2017 12:10:13 +0000
State-Changed-Why:
Feedback sent.


From: Benny Siegert <bsiegert@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: jdolecek@NetBSD.org,
 gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org,
 bsiegert@NetBSD.org
Subject: Re: kern/52034 (Kernel panic on Google Compute Engine)
Date: Sun, 2 Jul 2017 13:08:33 +0100

 With a daily netbsd-7 kernel
 (nyftp.netbsd.org/pub/NetBSD-daily/netbsd-7/201706292340Z/amd64/binary/kernel/netbsd-GENERIC.gz),
 identifying as NetBSD 7.1_STABLE (GENERIC.201706292340Z), I get an actual
 kernel panic when unpacking the pkgsrc archive:

 login: fatal page fault in supervisor mode
 trap type 6 code 0 rip ffffffff80113aa4 cs 8 rflags 10207 cr2 ffff80008b2fc000
 ilevel 0 rsp fffffe810e2e1c08
 curlwp 0xfffffe810e23e960 pid 40.1 lowest kstack 0xfffffe810e2df2c0
 panic: trap
 cpu0: Begin traceback...
 vpanic() at netbsd:vpanic+0x13c
 snprintf() at netbsd:snprintf
 startlwp() at netbsd:startlwp
 alltraps() at netbsd:alltraps+0x96
 uiomove() at netbsd:uiomove+0x91
 ubc_uiomove() at netbsd:ubc_uiomove+0xa6
 ffs_read() at netbsd:ffs_read+0x372
 VOP_READ() at netbsd:VOP_READ+0x37
 vn_read() at netbsd:vn_read+0x94
 dofileread() at netbsd:dofileread+0x90
 sys_read() at netbsd:sys_read+0x5f
 syscall() at netbsd:syscall+0x9a
 --- syscall (number 3) ---
 7f7ff703c3ba:
 cpu0: End traceback...


State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sun, 22 Oct 2017 13:27:14 +0000
State-Changed-Why:
vioscsi has some fixes on HEAD and netbsd-8 branches. Can you try that instead
of netbsd-7?


State-Changed-From-To: feedback->closed
State-Changed-By: bsiegert@NetBSD.org
State-Changed-When: Sun, 29 Oct 2017 12:18:42 +0000
State-Changed-Why:
The problem no longer occurs on either netbsd-8 or HEAD.
It is a pity that -7 does not have these patches, but recommending
NetBSD-8 is a usable workaround. Thanks!


>Unformatted:
