NetBSD Problem Report #55182

From john@ziaspace.com  Wed Apr 15 16:59:25 2020
Return-Path: <john@ziaspace.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 9EA521A9219
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 15 Apr 2020 16:59:25 +0000 (UTC)
Message-Id: <202004151659.03FGxLNZ000170@athena.zia.io>
Date: Wed, 15 Apr 2020 16:59:21 GMT
From: john@ziaspace.com
Reply-To: john@ziaspace.com
To: gnats-bugs@NetBSD.org
Subject: NPF on NetBSD 9 can lock / panic machine
X-Send-Pr-Version: 3.95

>Number:         55182
>Category:       kern
>Synopsis:       NPF on NetBSD 9 can lock / panic machine
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    rmind
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Apr 15 17:00:01 +0000 2020
>Closed-Date:    Thu May 28 14:45:54 +0000 2020
>Last-Modified:  Thu May 28 14:45:54 +0000 2020
>Originator:     John Klos
>Release:        NetBSD 9.0_STABLE
>Organization:

>Environment:


System: NetBSD athena.zia.io 9.0_STABLE NetBSD 9.0_STABLE (ATHENA-$Revision: 9.0q $) #1: Fri Apr 10 06:29:38 UTC 2020 john@athena.zia.io:/home/obj-alpha/sys/arch/alpha/compile/ATHENA alpha
Architecture: alpha
Machine: alpha
>Description:

NPF on NetBSD 9 caused an immediate panic on Alpha and a complete lockup
on amd64 when I ran:

echo "199.233.217.205" >> /etc/npf_blacklist ; /etc/rc.d/npf reload

The panic on Alpha gave:

[ 4656449.519341] CPU 0: fatal kernel trap:

[ 4656449.521294] CPU 0    trap entry = 0x2 (memory management fault)
[ 4656449.522270] CPU 0    a0         = 0x0
[ 4656449.523247] CPU 0    a1         = 0x1
[ 4656449.524224] CPU 0    a2         = 0x0
[ 4656449.525200] CPU 0    pc         = 0xfffffc0000bc9048
[ 4656449.526177] CPU 0    ra         = 0xfffffc0000bc2d0c
[ 4656449.527153] CPU 0    pv         = 0xfffffc0000bc9030
[ 4656449.528130] CPU 0    curlwp     = 0xfffffc0150c7c580
[ 4656449.529106] CPU 0        pid = 29135, comm = npfctl

[ 4656449.531059] panic: trap

The amd64 system had to be power cycled (it is remote).

On other amd64 NetBSD 9 systems, I have observed:

Plenty of NAT traffic alone: fine.
Tons of network connections and work alone: fine.
Both together: I can lock up a machine pretty reliably within an hour.

One issue is that even when I've had this happen locally, I cannot get
into the kernel debugger on amd64 after a lockup.

The configurations are essentially the same as in the NPF documentation.
This is both with GENERIC kernels and with kernels with NPF compiled in.

>How-To-Repeat:


Set up a machine to do NAT via NPF for a somewhat busy network using the 
example configurations in the NPF documentation.

While the network is reasonably busy, either make changes to NPF and run
"/etc/rc.d/npf reload", or create lots of network traffic on the machine 
running NPF. There will be a non-trivial chance of lockup or panic.

>Fix:


>Release-Note:

>Audit-Trail:
From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55182: NPF on NetBSD 9 can lock / panic machine
Date: Thu, 16 Apr 2020 05:07:42 +0000

 Got any kernel coredump or backtrace? The panic isn't super informative.
 I see that crash(8) is somewhat limited on alpha (so not sure if it does
 kernel core dumps at all), but on amd64 you should be able to do:

 cd /var/crash
 And if there's a netbsd.1.gz there:
 gunzip netbsd.1*
 crash -M netbsd.1.core -N netbsd.1
 crash> bt


 Also if you do:
 gdb /netbsd	# even better if it's the netbsd.gdb built from the same
 		# sources. It's in the kernel build directory.
 		# e.g. obj/sys/arch/alpha/compile/GENERIC/netbsd.gdb

 (gdb) info line *(0xfffffc0000bc2d0c)
 			^^ this is the value mentioned as 'ra'.
 			   it should be the function calling into this
 			   one, so it's part of the backtrace

 What function is it?

From: Timo Buhrmester <fstd.lkml@gmail.com>
To: john@ziaspace.com
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55182: NPF on NetBSD 9 can lock / panic machine
Date: Fri, 17 Apr 2020 22:48:38 +0200

 > The configurations are essentially the same as in the NPF documentation.

 Could you reveal your exact npf.conf anyway, please?  I'm a little
 confused as to why NAT works for you but not for myself (PR 53962)
 and I'd like to investigate in that direction.

Responsible-Changed-From-To: kern-bug-people->rmind
Responsible-Changed-By: rmind@NetBSD.org
Responsible-Changed-When: Mon, 20 Apr 2020 14:28:39 +0000
Responsible-Changed-Why:
Take.  Need more information, though.  It might be the same thmap-related issue.
Can you please try to obtain the backtrace?


From: "Mindaugas Rasiukevicius" <rmind@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55182 CVS commit: src
Date: Sat, 23 May 2020 19:56:00 +0000

 Module Name:	src
 Committed By:	rmind
 Date:		Sat May 23 19:56:00 UTC 2020

 Modified Files:
 	src/sys/net/npf: npf_conf.c npf_conn.c npf_conn.h npf_conndb.c
 	    npf_inet.c npf_nat.c
 	src/usr.sbin/npf/npfctl: npf_build.c npf_show.c npfctl.h

 Log Message:
 Backport selected NPF fixes from the upstream (to be pulled up):

 - npf_conndb_lookup: protect the connection lookup with pserialize(9),
   instead of incorrectly assuming that the handler always runs at IPL_SOFTNET.
   Should fix crashes reported on high load (PR/55182).

 - npf_config_destroy: handle partially initialized config; fixes crashes
   with some invalid configurations.

 - NAT policy creation / destruction: set the initial reference and do not
   wait for reference draining on destruction; destroy the policy on the
   last reference drop instead.  Fixes a lockup with the dynamic NAT rules.

 - npf_nat_{export,import}: fix a regression since dynamic NAT rules.

 - npfctl: fix a regression and restore the default group behaviour.

 - Add npf_cache_tcp() and validate the TCP data offset (from maxv@).
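
The NAT policy item above (set an initial reference, destroy on the last
reference drop rather than waiting for references to drain) can be
illustrated with a minimal user-space sketch. This is not the NPF code:
struct policy, policy_create(), policy_hold() and policy_release() are
hypothetical names, and C11 atomics stand in for the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a NAT policy; not the NPF structure. */
struct policy {
	atomic_uint refcnt;
};

/* Create a policy that starts with one reference, owned by the creator. */
struct policy *
policy_create(void)
{
	struct policy *p = malloc(sizeof(*p));

	if (p == NULL)
		return NULL;
	atomic_init(&p->refcnt, 1);
	return p;
}

/* Take an additional reference, e.g. for a connection using the policy. */
void
policy_hold(struct policy *p)
{
	atomic_fetch_add_explicit(&p->refcnt, 1, memory_order_relaxed);
}

/*
 * Drop one reference.  Whoever drops the last one frees the object, so
 * nobody has to block waiting for the count to drain to zero.
 * Returns true if this call destroyed the policy.
 */
bool
policy_release(struct policy *p)
{
	unsigned old = atomic_fetch_sub_explicit(&p->refcnt, 1,
	    memory_order_acq_rel);

	assert(old > 0);	/* releasing more often than holding is a bug */
	if (old != 1)
		return false;
	free(p);
	return true;
}
```

With this shape, the configuration code drops its initial reference on
unload and returns at once; the object goes away when the last user
releases it.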


 To generate a diff of this commit:
 cvs rdiff -u -r1.15 -r1.16 src/sys/net/npf/npf_conf.c
 cvs rdiff -u -r1.30 -r1.31 src/sys/net/npf/npf_conn.c
 cvs rdiff -u -r1.18 -r1.19 src/sys/net/npf/npf_conn.h
 cvs rdiff -u -r1.7 -r1.8 src/sys/net/npf/npf_conndb.c
 cvs rdiff -u -r1.55 -r1.56 src/sys/net/npf/npf_inet.c
 cvs rdiff -u -r1.48 -r1.49 src/sys/net/npf/npf_nat.c
 cvs rdiff -u -r1.53 -r1.54 src/usr.sbin/npf/npfctl/npf_build.c
 cvs rdiff -u -r1.30 -r1.31 src/usr.sbin/npf/npfctl/npf_show.c
 cvs rdiff -u -r1.51 -r1.52 src/usr.sbin/npf/npfctl/npfctl.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55182 CVS commit: [netbsd-9] src
Date: Mon, 25 May 2020 17:25:28 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Mon May 25 17:25:28 UTC 2020

 Modified Files:
 	src/sys/net/npf [netbsd-9]: npf_conf.c npf_conn.c npf_conn.h
 	    npf_conndb.c npf_inet.c npf_nat.c
 	src/usr.sbin/npf/npfctl [netbsd-9]: npf_build.c npf_show.c npfctl.h

 Log Message:
 Pull up following revision(s) (requested by rmind in ticket #930):

 	usr.sbin/npf/npfctl/npf_build.c: revision 1.54
 	sys/net/npf/npf_conn.h: revision 1.19
 	usr.sbin/npf/npfctl/npfctl.h: revision 1.52
 	usr.sbin/npf/npfctl/npf_show.c: revision 1.31
 	sys/net/npf/npf_conf.c: revision 1.16
 	sys/net/npf/npf_nat.c: revision 1.49
 	sys/net/npf/npf_inet.c: revision 1.56
 	sys/net/npf/npf_conndb.c: revision 1.8
 	sys/net/npf/npf_conn.c: revision 1.31

 Backport selected NPF fixes from the upstream (to be pulled up):

 - npf_conndb_lookup: protect the connection lookup with pserialize(9),
   instead of incorrectly assuming that the handler always runs at IPL_SOFTNET.
   Should fix crashes reported on high load (PR/55182).

 - npf_config_destroy: handle partially initialized config; fixes crashes
   with some invalid configurations.

 - NAT policy creation / destruction: set the initial reference and do not
   wait for reference draining on destruction; destroy the policy on the
   last reference drop instead.  Fixes a lockup with the dynamic NAT rules.

 - npf_nat_{export,import}: fix a regression since dynamic NAT rules.

 - npfctl: fix a regression and restore the default group behaviour.

 - Add npf_cache_tcp() and validate the TCP data offset (from maxv@).
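
As a rough illustration of what validating the TCP data offset involves,
here is a minimal user-space sketch. It is not npf_cache_tcp():
tcp_header_len() and TCP_MIN_HDR_LEN are hypothetical names, and the
function works on a flat byte buffer instead of an mbuf chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCP_MIN_HDR_LEN	20u	/* fixed part of the TCP header, in bytes */

/*
 * Return the TCP header length in bytes, or 0 if the data offset field
 * is invalid for a segment of seglen bytes starting at seg.
 */
size_t
tcp_header_len(const uint8_t *seg, size_t seglen)
{
	size_t hlen;

	if (seg == NULL || seglen < TCP_MIN_HDR_LEN)
		return 0;

	/* Data offset: upper 4 bits of byte 12, counted in 32-bit words. */
	hlen = (size_t)(seg[12] >> 4) * 4;
	assert(hlen <= 60);	/* a 4-bit field cannot encode more */

	/* Reject headers shorter than the fixed part or past the data. */
	if (hlen < TCP_MIN_HDR_LEN || hlen > seglen)
		return 0;
	return hlen;
}
```

A caller would treat a return value of 0 as a malformed segment and skip
any option parsing that relies on the header length.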


 To generate a diff of this commit:
 cvs rdiff -u -r1.13.2.2 -r1.13.2.3 src/sys/net/npf/npf_conf.c
 cvs rdiff -u -r1.27.2.2 -r1.27.2.3 src/sys/net/npf/npf_conn.c
 cvs rdiff -u -r1.16.2.2 -r1.16.2.3 src/sys/net/npf/npf_conn.h
 cvs rdiff -u -r1.6 -r1.6.2.1 src/sys/net/npf/npf_conndb.c
 cvs rdiff -u -r1.54.2.1 -r1.54.2.2 src/sys/net/npf/npf_inet.c
 cvs rdiff -u -r1.46.2.2 -r1.46.2.3 src/sys/net/npf/npf_nat.c
 cvs rdiff -u -r1.50.2.2 -r1.50.2.3 src/usr.sbin/npf/npfctl/npf_build.c
 cvs rdiff -u -r1.28.2.1 -r1.28.2.2 src/usr.sbin/npf/npfctl/npf_show.c
 cvs rdiff -u -r1.48.2.2 -r1.48.2.3 src/usr.sbin/npf/npfctl/npfctl.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Thu, 28 May 2020 14:45:54 +0000
State-Changed-Why:
Fixed.


>Unformatted:
