NetBSD Problem Report #55182
From john@ziaspace.com Wed Apr 15 16:59:25 2020
Return-Path: <john@ziaspace.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 9EA521A9219
for <gnats-bugs@gnats.NetBSD.org>; Wed, 15 Apr 2020 16:59:25 +0000 (UTC)
Message-Id: <202004151659.03FGxLNZ000170@athena.zia.io>
Date: Wed, 15 Apr 2020 16:59:21 GMT
From: john@ziaspace.com
Reply-To: john@ziaspace.com
To: gnats-bugs@NetBSD.org
Subject: NPF on NetBSD 9 can lock / panic machine
X-Send-Pr-Version: 3.95
>Number: 55182
>Category: kern
>Synopsis: NPF on NetBSD 9 can lock / panic machine
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: rmind
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Apr 15 17:00:01 +0000 2020
>Closed-Date: Thu May 28 14:45:54 +0000 2020
>Last-Modified: Thu May 28 14:45:54 +0000 2020
>Originator: John Klos
>Release: NetBSD 9.0_STABLE
>Organization:
>Environment:
System: NetBSD athena.zia.io 9.0_STABLE NetBSD 9.0_STABLE (ATHENA-$Revision: 9.0q $) #1: Fri Apr 10 06:29:38 UTC 2020 john@athena.zia.io:/home/obj-alpha/sys/arch/alpha/compile/ATHENA alpha
Architecture: alpha
Machine: alpha
>Description:
NPF on NetBSD 9 has caused a direct panic on Alpha and a complete lockup
on amd64 by running:
echo "199.233.217.205" >> /etc/npf_blacklist ; /etc/rc.d/npf reload
The panic on Alpha gave:
[ 4656449.519341] CPU 0: fatal kernel trap:
[ 4656449.521294] CPU 0 trap entry = 0x2 (memory management fault)
[ 4656449.522270] CPU 0 a0 = 0x0
[ 4656449.523247] CPU 0 a1 = 0x1
[ 4656449.524224] CPU 0 a2 = 0x0
[ 4656449.525200] CPU 0 pc = 0xfffffc0000bc9048
[ 4656449.526177] CPU 0 ra = 0xfffffc0000bc2d0c
[ 4656449.527153] CPU 0 pv = 0xfffffc0000bc9030
[ 4656449.528130] CPU 0 curlwp = 0xfffffc0150c7c580
[ 4656449.529106] CPU 0 pid = 29135, comm = npfctl
[ 4656449.531059] panic: trap
The amd64 system had to be power cycled (it is remote).
On other amd64 NetBSD 9 systems, I have observed:
Plenty of NAT traffic, fine.
Tons of network connections and work, fine.
With both together, I can lock up a machine pretty reliably within an hour.
One issue is that, even when this has happened locally, I cannot get
into the kernel debugger on amd64 after a lockup.
The configurations are essentially the same as in the NPF documentation.
This happens both with GENERIC kernels and with kernels that have NPF compiled in.
>How-To-Repeat:
Set up a machine to do NAT via NPF for a somewhat busy network using the
example configurations in the NPF documentation.
While the network is reasonably busy, either change the NPF configuration and run
"/etc/rc.d/npf reload", or generate lots of network traffic on the machine
running NPF. There is a non-trivial chance of a lockup or panic.
>Fix:
>Release-Note:
>Audit-Trail:
From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55182: NPF on NetBSD 9 can lock / panic machine
Date: Thu, 16 Apr 2020 05:07:42 +0000
Got any kernel core dump or backtrace? The panic isn't super informative.
I see that crash(8) is somewhat limited on alpha (so I'm not sure whether
kernel core dumps work there at all), but on amd64 you should be able to do:
cd /var/crash
And if there's a netbsd.1.gz there:
gunzip netbsd.1*
crash -M netbsd.1 -N netbsd.1.core
crash> bt
Also if you do:
gdb /netbsd # even better if it's the netbsd.gdb built from the same
# sources. It's in the kernel build directory.
# e.g. obj/sys/arch/alpha/compile/GENERIC/netbsd.gdb
(gdb) info line *(0xfffffc0000bc2d0c)
^^ This is the value reported as 'ra'.
It should be the function that called into the
faulting one, so it is part of the backtrace.
What function is it?
From: Timo Buhrmester <fstd.lkml@gmail.com>
To: john@ziaspace.com
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55182: NPF on NetBSD 9 can lock / panic machine
Date: Fri, 17 Apr 2020 22:48:38 +0200
> The configurations are essentially the same as in the NPF documentation.
Could you reveal your exact npf.conf anyway, please? I'm a little
confused as to why NAT works for you but not for me (PR 53962)
and I'd like to investigate in that direction.
Responsible-Changed-From-To: kern-bug-people->rmind
Responsible-Changed-By: rmind@NetBSD.org
Responsible-Changed-When: Mon, 20 Apr 2020 14:28:39 +0000
Responsible-Changed-Why:
Take. Need more information, though. It might be the same thmap-related issue.
Can you please try to obtain the backtrace?
From: "Mindaugas Rasiukevicius" <rmind@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55182 CVS commit: src
Date: Sat, 23 May 2020 19:56:00 +0000
Module Name: src
Committed By: rmind
Date: Sat May 23 19:56:00 UTC 2020
Modified Files:
src/sys/net/npf: npf_conf.c npf_conn.c npf_conn.h npf_conndb.c
npf_inet.c npf_nat.c
src/usr.sbin/npf/npfctl: npf_build.c npf_show.c npfctl.h
Log Message:
Backport selected NPF fixes from the upstream (to be pulled up):
- npf_conndb_lookup: protect the connection lookup with pserialize(9),
instead of incorrectly assuming that the handler always runs at IPL_SOFTNET.
Should fix crashes reported on high load (PR/55182).
- npf_config_destroy: handle partially initialized config; fixes crashes
with some invalid configurations.
- NAT policy creation / destruction: set the initial reference and do not
wait for reference draining on destruction; destroy the policy on the
last reference drop instead. Fixes a lockup with the dynamic NAT rules.
- npf_nat_{export,import}: fix a regression since dynamic NAT rules.
- npfctl: fix a regression and restore the default group behaviour.
- Add npf_cache_tcp() and validate the TCP data offset (from maxv@).
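The NAT policy item above replaces "wait for references to drain, then destroy" with "the creator holds an initial reference, and whoever drops the last reference destroys the policy". A minimal userland sketch of that pattern follows, assuming C11 atomics; the type `nat_policy_t` and the functions `natpolicy_create`, `natpolicy_acquire` and `natpolicy_release` are hypothetical illustrations, not the actual code in npf_nat.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an NPF NAT policy (not the real NPF type). */
typedef struct {
	atomic_uint refcnt;
	int id;
} nat_policy_t;

/* Demo-only counter so the destruction can be observed (not thread-safe). */
static unsigned destroyed = 0;

static nat_policy_t *
natpolicy_create(int id)
{
	nat_policy_t *np = malloc(sizeof(*np));
	if (np == NULL)
		return NULL;
	np->id = id;
	/* The creator (e.g. the loaded config) holds the initial reference. */
	atomic_init(&np->refcnt, 1);
	return np;
}

static void
natpolicy_acquire(nat_policy_t *np)
{
	/* Caller already holds a reference, so relaxed ordering suffices. */
	atomic_fetch_add_explicit(&np->refcnt, 1, memory_order_relaxed);
}

/* Drop one reference; returns true if this call destroyed the policy. */
static bool
natpolicy_release(nat_policy_t *np)
{
	/* acq_rel: all earlier uses of np happen-before the free below. */
	unsigned old = atomic_fetch_sub_explicit(&np->refcnt, 1,
	    memory_order_acq_rel);
	assert(old > 0);	/* catch reference-count underflow */
	if (old == 1) {
		/* Last reference: nobody else can reach np, free it now. */
		free(np);
		destroyed++;
		return true;
	}
	return false;
}
```

In this scheme a config reload simply drops its reference and returns; the policy stays alive while connections still point at it and is freed by the last one to let go, so no thread ever blocks waiting for a drain.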
To generate a diff of this commit:
cvs rdiff -u -r1.15 -r1.16 src/sys/net/npf/npf_conf.c
cvs rdiff -u -r1.30 -r1.31 src/sys/net/npf/npf_conn.c
cvs rdiff -u -r1.18 -r1.19 src/sys/net/npf/npf_conn.h
cvs rdiff -u -r1.7 -r1.8 src/sys/net/npf/npf_conndb.c
cvs rdiff -u -r1.55 -r1.56 src/sys/net/npf/npf_inet.c
cvs rdiff -u -r1.48 -r1.49 src/sys/net/npf/npf_nat.c
cvs rdiff -u -r1.53 -r1.54 src/usr.sbin/npf/npfctl/npf_build.c
cvs rdiff -u -r1.30 -r1.31 src/usr.sbin/npf/npfctl/npf_show.c
cvs rdiff -u -r1.51 -r1.52 src/usr.sbin/npf/npfctl/npfctl.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55182 CVS commit: [netbsd-9] src
Date: Mon, 25 May 2020 17:25:28 +0000
Module Name: src
Committed By: martin
Date: Mon May 25 17:25:28 UTC 2020
Modified Files:
src/sys/net/npf [netbsd-9]: npf_conf.c npf_conn.c npf_conn.h
npf_conndb.c npf_inet.c npf_nat.c
src/usr.sbin/npf/npfctl [netbsd-9]: npf_build.c npf_show.c npfctl.h
Log Message:
Pull up following revision(s) (requested by rmind in ticket #930):
usr.sbin/npf/npfctl/npf_build.c: revision 1.54
sys/net/npf/npf_conn.h: revision 1.19
usr.sbin/npf/npfctl/npfctl.h: revision 1.52
usr.sbin/npf/npfctl/npf_show.c: revision 1.31
sys/net/npf/npf_conf.c: revision 1.16
sys/net/npf/npf_nat.c: revision 1.49
sys/net/npf/npf_inet.c: revision 1.56
sys/net/npf/npf_conndb.c: revision 1.8
sys/net/npf/npf_conn.c: revision 1.31
Backport selected NPF fixes from the upstream (to be pulled up):
- npf_conndb_lookup: protect the connection lookup with pserialize(9),
instead of incorrectly assuming that the handler always runs at IPL_SOFTNET.
Should fix crashes reported on high load (PR/55182).
- npf_config_destroy: handle partially initialized config; fixes crashes
with some invalid configurations.
- NAT policy creation / destruction: set the initial reference and do not
wait for reference draining on destruction; destroy the policy on the
last reference drop instead. Fixes a lockup with the dynamic NAT rules.
- npf_nat_{export,import}: fix a regression since dynamic NAT rules.
- npfctl: fix a regression and restore the default group behaviour.
- Add npf_cache_tcp() and validate the TCP data offset (from maxv@).
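The last item validates the TCP data offset: the upper four bits of header byte 12 give the header length in 32-bit words, so a sane value lies between 5 (20 bytes) and 15 (60 bytes) and must not point past the bytes actually available. A minimal sketch of such a check follows; `tcp_hdrlen_checked` is a hypothetical helper operating on raw bytes, not the real npf_cache_tcp().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimum TCP header length in bytes (no options). */
#define TCP_MIN_HDRLEN 20

/*
 * Hypothetical helper: th points at the start of a TCP header and avail is
 * the number of contiguous bytes available from there.  Returns the header
 * length in bytes, or 0 if the header or its data offset is invalid.
 */
static size_t
tcp_hdrlen_checked(const uint8_t *th, size_t avail)
{
	assert(th != NULL);
	if (avail < TCP_MIN_HDRLEN)
		return 0;	/* not even a fixed-size header */
	/* Data offset: high nibble of byte 12, counted in 32-bit words. */
	size_t hlen = (size_t)(th[12] >> 4) * 4;
	if (hlen < TCP_MIN_HDRLEN)
		return 0;	/* offset claims less than the fixed header */
	if (hlen > avail)
		return 0;	/* options would run past the captured data */
	return hlen;
}
```

Without such a check, a crafted offset could make later option parsing read beyond the packet.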
To generate a diff of this commit:
cvs rdiff -u -r1.13.2.2 -r1.13.2.3 src/sys/net/npf/npf_conf.c
cvs rdiff -u -r1.27.2.2 -r1.27.2.3 src/sys/net/npf/npf_conn.c
cvs rdiff -u -r1.16.2.2 -r1.16.2.3 src/sys/net/npf/npf_conn.h
cvs rdiff -u -r1.6 -r1.6.2.1 src/sys/net/npf/npf_conndb.c
cvs rdiff -u -r1.54.2.1 -r1.54.2.2 src/sys/net/npf/npf_inet.c
cvs rdiff -u -r1.46.2.2 -r1.46.2.3 src/sys/net/npf/npf_nat.c
cvs rdiff -u -r1.50.2.2 -r1.50.2.3 src/usr.sbin/npf/npfctl/npf_build.c
cvs rdiff -u -r1.28.2.1 -r1.28.2.2 src/usr.sbin/npf/npfctl/npf_show.c
cvs rdiff -u -r1.48.2.2 -r1.48.2.3 src/usr.sbin/npf/npfctl/npfctl.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Thu, 28 May 2020 14:45:54 +0000
State-Changed-Why:
Fixed.
>Unformatted:
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.