NetBSD Problem Report #52043

From www@NetBSD.org  Tue Mar  7 05:45:14 2017
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 9A07D7A262
	for <gnats-bugs@gnats.NetBSD.org>; Tue,  7 Mar 2017 05:45:14 +0000 (UTC)
Message-Id: <20170307054513.486187A276@mollari.NetBSD.org>
Date: Tue,  7 Mar 2017 05:45:13 +0000 (UTC)
From: djlambe11@earlham.edu
Reply-To: djlambe11@earlham.edu
To: gnats-bugs@NetBSD.org
Subject: npf kernel panic on sparc64
X-Send-Pr-Version: www-1.0

>Number:         52043
>Category:       kern
>Synopsis:       npf kernel panic on sparc64
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Mar 07 05:50:00 +0000 2017
>Closed-Date:    Fri Mar 23 09:53:01 +0000 2018
>Last-Modified:  Fri Mar 23 09:53:01 +0000 2018
>Originator:     Dakotah Lambert
>Release:        NetBSD 7.0.2
>Organization:
Earlham College
>Environment:
NetBSD lutra.lutras-hacking.ddns.net 7.0.2 NetBSD 7.0.2 (GENERIC.DEBUG) #0: Mon Mar 6 19:35:56 EST 2017 root@:/var/src/sys/arch/sparc64/compile/GENERIC.DEBUG sparc64
>Description:
I have a Sun Netra T1 AC200 server, UltraSPARC IIe at 500MHz with 1 GB of RAM, and two hard drives.  I run an SSH server on the machine (public-key only, no passwords), and the SSH log tends to fill up with failed authentication attempts from what I assume are bots.  Since SCSI drives are hard to find, I installed fail2ban and configured npf in hopes of reducing the volume of data that gets dumped into this log.

The contents of /etc/npf.conf are:

---

set bpf.jit off
$ext_if=gem0
$local_net=192.168.2.0/25

table <fail2ban> type tree dynamic

group "external" on $ext_if {
        pass in final from $local_net
        block in final from <fail2ban>
        pass out final all
        pass all
}

group default {
        pass final on lo0 all
        block all
}

---

The "set bpf.jit off" was added because npf told me to put it in.  The "gem0" is one of the two built-in Ethernet interfaces of the server.

Before configuring npf and allowing the module (the only LKM I use) to load, the server never went down unexpectedly.  Unfortunately, its reliability fell from "constantly up" to "crashes after a couple hours to a day" after having made this change.

Since the crash appears in ptree_insert_node_common (backtrace at end of section), I am tempted to believe that changing my table from "tree" to "hash" might act as a work-around, but I have not tested this yet.

$ ident /netbsd | grep ptree.c
     $NetBSD: ptree.c,v 1.10 2012/10/06 22:15:09 matt Exp $
$ ident /stand/sparc64/7.0/modules/npf | grep npf_tableset.c
     $NetBSD: npf_tableset.c,v 1.22 2014/08/11 01:54:12 rmind Exp $
$ ident /stand/sparc64/7.0/modules/npf | grep npf_ctl.c
     $NetBSD: npf_ctl.c,v 1.38.2.3 2015/06/10 16:57:58 snj Exp $

I am not sure where "line 501" comes from, as the assertion that failed appears to be at line 450 in the actual C code.

But following the backtrace, it looks like npf_table_insert has its third parameter set to 0.  From npf_ctl.c:

        case NPF_CMD_TABLE_ADD:
                error = npf_table_insert(t, nct->nct_data.ent.alen,
                    &nct->nct_data.ent.addr, nct->nct_data.ent.mask);
                break;

If so, "&nct->nct_data.ent.addr" is evaluating to 0 (NULL).  Might that be the problem?
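
One caveat on that reading, offered as a sketch and not as a diagnosis: an expression of the form "&nct->nct_data.ent.addr" is the address of a member inside the structure that nct points to, so in C it equals the structure's base address plus a fixed offset, and it should only come out as 0 if nct itself were bogus.  The 0 shown in the backtrace may therefore be a register value that was reused after the call; I have not verified this.  The structures below are hypothetical stand-ins with a similar shape, not the real npf definitions, and the demo_* names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the real npf ioctl structure definitions. */
struct demo_ent {
	int	alen;
	uint8_t	addr[16];
	uint8_t	mask;
};

struct demo_table_req {
	int		nct_cmd;
	const char	*nct_name;
	union {
		struct demo_ent	ent;
	} nct_data;
};

/* Same shape as the expression passed as the third argument in npf_ctl.c. */
static const void *
demo_addr_arg(const struct demo_table_req *nct)
{
	return &nct->nct_data.ent.addr;
}

/* Returns 1 when the member address is non-NULL and equals base + offset. */
static int
demo_addr_arg_is_base_plus_offset(void)
{
	struct demo_table_req req = { 0 };
	const char *base = (const char *)&req;
	size_t off = offsetof(struct demo_table_req, nct_data) +
	    offsetof(struct demo_ent, addr);
	const void *p = demo_addr_arg(&req);

	/* The member sits after nct_cmd and nct_name, so the offset is nonzero. */
	assert(off > 0);
	return p != NULL && (const char *)p == base + off;
}
```

Calling demo_addr_arg_is_base_plus_offset() from a small test program is enough to exercise it; for any valid structure pointer the member address cannot be NULL.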

---

panic: kernel diagnostic assertion "PTN_LEAF_POSITION(ptn) == id.id_parent_slot" failed: file "../../../../../../lib/libkern/../../../common/lib/libc/gen/ptree.c", line 501
cpu0: Begin traceback...
cpu0: End traceback...
Stopped in pid 1426.1 (npfctl) at       netbsd:cpu_Debugger+0x4:        nop
db{0}> bt
db{0}> sync
Frame pointer is at 0x12dbbc411
Call traceback:
 netbsd:cpu_reboot+0x208(a, 1c99748, 0, 1c99400, 1cd4b60, 1c93800) fp = 12dbbc4d1
 netbsd:db_sync_cmd+0x20(100, 0, 1c19c00, 1cb3000, f, 102d3c960) fp = 12dbbc581
 netbsd:db_command+0x94(10f7144, 0, ffffffffffffffff, 12dbbcef8, 2, 73) fp = 12dbbc631
 netbsd:db_command_loop+0x118(1c16be0, 1c16c40, 0, 1c9b000, 1c16800, 16a3fe8) fp = 12dbbc771
 netbsd:db_trap+0x100(10f7148, 0, 18787e0, 1c19c00, 1c16be0, 1c9b000) fp = 12dbbc851
 netbsd:kdb_trap+0xdc(101, 0, 1838ac0, e0048000, 1cb0000, 0) fp = 12dbbc911
 netbsd:trap+0x4a0(101, 12dbbd3c0, 4, 1c19c00, 1c00000, 1cf3400) fp = 12dbbc9c1
 netbsd:1010e40+0(12dbbd3c0, 101, 10f7140, 441d0006, 14bdc60, 1cf36e0) fp = 12dbbcb11
 netbsd:vpanic+0x16c(18787e0, 1cf35b0, 1825548, e0048000, 1c19c00, 1c19c00) fp = 12dbbccf1
 netbsd:kern_assert+0x34(1825548, 12dbbd6e8, 1cf2000, 1cf35b0, 1cf3400, 104) fp = 12dbbcda1
 netbsd:ptree_insert_node_common+0x308(1825548, 1825580, 18c00c0, 18bfcb8, 1f5, 10109ef90) fp = 12dbbce61
 npf:npf_table_insert+0x198(100f1c908, 102e43e80, 0, 7fff, 2014000, 16203a0) fp = 12dbbcf41
 npf:npfctl_table+0xc8(100f1c908, 4, 12dbbdc94, ff, 0, 16) fp = 12dbbd001
 netbsd:cdev_ioctl+0x68(12dbbdc80, 80284e67, 12dbbdc80, 1, 102d3c960, 0) fp = 12dbbd0d1
 netbsd:VOP_IOCTL+0x38(c600, 80284e67, 12dbbdc80, 1, 102d3c960, 203bad0) fp = 12dbbd181
 netbsd:vn_ioctl+0xa4(1019553a0, 80284e67, 12dbbdc80, 1, 100ee3ec0, 0) fp = 12dbbd261
 netbsd:sys_ioctl+0x254(10270c400, 80284e67, 12dbbdc80, 12dbba000, 1, 1019553a0) fp = 12dbbd3c1
 netbsd:syscall+0x3a8(0, 12dbbdde0, 1020907d0, 0, 10270c400, 80284e67) fp = 12dbbd501
 netbsd:101106c+0(12dbbded0, 4e, fffffffffe559700, 36, 12dbbdf40, 102d3c960) fp = 12dbbd621
 netbsd:10cca0+0(3, 80284e67, ffffffffffffbac8, ffffffffffffbadc, 2c, ffffffffffffbadc) fp = ffffffffffffb1c1

dumping to dev 7,1 offset 2098887
dump succeeded
cpu0: rebooting
>How-To-Repeat:
1) Boot server
2) Enable npf and fail2ban
3) Wait
4) After a few hours, the system has crashed
>Fix:
Workaround: Do not load the npf module.  This is not satisfactory.

>Release-Note:

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/52043: npf kernel panic on sparc64
Date: Tue, 7 Mar 2017 10:06:59 +0100

 Would it be possible to try a kernel and modules from NetBSD 7.1 RC2?

 There have been quite a few pullups of NPF changes to netbsd-7, but none
 to netbsd-7-0 (which you are using). If those crashes would be fixed
 by the pullups to netbsd-7, we could re-evaluate the netbsd-7-0 pullup
 decisions.

 NPF in -current is even better, I think (but I am not asking you to try that
 on a production machine).

 Martin

From: Dakotah Lambert <djlambe11@earlham.edu>
To: gnats-bugs@NetBSD.org
Cc: djlambe11@earlham.edu
Subject: Re: kern/52043: npf kernel panic on sparc64
Date: Wed, 8 Mar 2017 11:21:10 -0500

 On 07 Mar 2017 09:10, Martin Husemann wrote:
 >  Would it be possible to try a kernel and modules from NetBSD
 >  7.1 RC2?
 >  
 A 500 MHz machine is not a particularly fast compiler, but after
 letting a build run throughout yesterday I have a 7.1 RC2 kernel
 and modules running.

 >  There have been quite a few pullups of NPF changes to netbsd-7,
 >  but none to netbsd-7-0 (which you are using). If those crashes
 >  would be fixed by the pullups to netbsd-7, we could re-evaluate
 >  the netbsd-7-0 pullup decisions.
 >  
 At the time of this writing, my uptime is 11:36 under the 7.1 RC2
 kernel.  The files that have changed in the NPF module between the
 two versions:

      $NetBSD: lpm.c,v 1.1.2.4 2016/12/27 07:03:52 snj Exp $
      $NetBSD: npf_mbuf.c,v 1.13.2.2 2016/01/26 01:27:21 riz Exp $
      $NetBSD: npf_tableset.c,v 1.22.2.1 2016/12/18 07:40:50 snj Exp $

 Is there any other information I can give you at this time?
 -- 
 Dakotah Lambert

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, djlambe11@earlham.edu
Cc: 
Subject: Re: kern/52043: npf kernel panic on sparc64
Date: Wed, 8 Mar 2017 12:35:32 -0500

 On Mar 8,  4:25pm, djlambe11@earlham.edu (Dakotah Lambert) wrote:
 -- Subject: Re: kern/52043: npf kernel panic on sparc64

 |  At the time of this writing, my uptime is 11:36 under the 7.1 RC2
 |  kernel.  The files that have changed in the NPF module between the
 |  two versions:
 |  
 |       $NetBSD: lpm.c,v 1.1.2.4 2016/12/27 07:03:52 snj Exp $
 |       $NetBSD: npf_mbuf.c,v 1.13.2.2 2016/01/26 01:27:21 riz Exp $
 |       $NetBSD: npf_tableset.c,v 1.22.2.1 2016/12/18 07:40:50 snj Exp $
 |  
 |  Is there any other information I can give you at this time?

 How quickly did it crash before?

 christos

From: Dakotah Lambert <djlambe11@earlham.edu>
To: gnats-bugs@NetBSD.org
Cc: djlambe11@earlham.edu
Subject: Re: kern/52043: npf kernel panic on sparc64
Date: Wed, 8 Mar 2017 12:59:40 -0500

 On 08 Mar 2017 17:40, Christos Zoulas wrote:
 >  How quickly did it crash before?
 >  
 Right.  It has lasted

 * Shortest:  2h36+/-10m
 * Usually:   7hxx
 * Longest:  22hxx

 So right now it is at nearly double its average uptime, but it has
 not yet reached the maximum recorded one.
 -- 
 Dakotah Lambert

From: Dakotah Lambert <djlambe11@earlham.edu>
To: gnats-bugs@NetBSD.org
Cc: djlambe11@earlham.edu
Subject: Re: kern/52043: npf kernel panic on sparc64
Date: Tue, 14 Mar 2017 00:17:19 -0400

 Further update: the server has been up for five days with the same
 frequency of insertions and removals into the dynamic table, and I
 am comfortable saying it is no longer crashing.  The only oddities
 in the logs are about the RNG failing statistical tests, which
 presumably has nothing to do with this issue.
 -- 
 Dakotah Lambert

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, djlambe11@earlham.edu
Subject: Re: kern/52043: npf kernel panic on sparc64
Date: Tue, 14 Mar 2017 10:05:54 +0100

 Great - would going with 7.1 be an option for you, as it already has proven
 to be stable in your environment?

 Martin

From: Dakotah Lambert <djlambe11@earlham.edu>
To: gnats-bugs@NetBSD.org
Cc: djlambe11@earlham.edu
Subject: Re: kern/52043: npf kernel panic on sparc64
Date: Tue, 14 Mar 2017 16:10:35 -0400

 On 14 Mar 2017 09:10, Martin Husemann wrote:
 >  Great - would going with 7.1 be an option for you, as it
 >  already has proven to be stable in your environment?
 >  
 It would, yes.
 -- 
 Dakotah Lambert

State-Changed-From-To: open->closed
State-Changed-By: maxv@NetBSD.org
State-Changed-When: Fri, 23 Mar 2018 09:53:01 +0000
State-Changed-Why:
Fixed.


>Unformatted:
