NetBSD Problem Report #55661

From www@netbsd.org  Tue Sep 15 07:08:02 2020
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 777501A9239
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 15 Sep 2020 07:08:02 +0000 (UTC)
Message-Id: <20200915070801.634621A923A@mollari.NetBSD.org>
Date: Tue, 15 Sep 2020 07:08:01 +0000 (UTC)
From: pr@xn--rvztrtkrfrgp-bbb7j2b8f0b9d7a21oft.com
Reply-To: pr@xn--rvztrtkrfrgp-bbb7j2b8f0b9d7a21oft.com
To: gnats-bugs@NetBSD.org
Subject: pppoe renegotiation timeout causing panic
X-Send-Pr-Version: www-1.0

>Number:         55661
>Category:       port-amd64
>Synopsis:       pppoe renegotiation timeout causing panic
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-amd64-maintainer
>State:          feedback
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Sep 15 07:10:00 +0000 2020
>Closed-Date:    
>Last-Modified:  Fri Nov 27 09:18:57 +0000 2020
>Originator:     Ben Gergely
>Release:        9.99.72
>Organization:
>Environment:
NetBSD 9.99.72 amd64
>Description:
Kinda hard to debug because it requires my ISP to fall on its face.

I didn't manage to grab the other panic, which mentioned an xcall; it didn't get tee'd to the log.

I'm assuming the panics come from trying to renegotiate a pppoe session after a PADI timeout. I've not got a pppd log to check specifically what happened just before, but a timeout is an educated guess based on what was happening with the ISP.



Sep 15 03:46:27 sludge /usr/sbin/ifwatchd[731]: watching interface pppoe0
Sep 15 03:46:27 sludge npfd: reopening pcap socket
Sep 15 03:46:27 sludge npfd: 167 packets read from `/var/log/npflog0.pcap'
Sep 15 03:46:27 sludge savecore: reboot after panic: [   7.6760811] panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0 || panicstr != NULL" failed: file "/usr/src/sys/kern/kern_condvar.c", line 126 
Sep 15 03:46:27 sludge savecore: system went down at Tue Sep 15 03:45:45 2020

Sep 15 04:47:55 sludge /usr/sbin/ifwatchd[731]: watching interface pppoe0
Sep 15 04:47:55 sludge npfd: reopening pcap socket
Sep 15 04:47:55 sludge npfd: 176 packets read from `/var/log/npflog0.pcap'
Sep 15 04:47:55 sludge savecore: reboot after panic: [   9.0854538] panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0 || panicstr != NULL" failed: file "/usr/src/sys/kern/kern_condvar.c", line 126 
Sep 15 04:47:55 sludge savecore: system went down at Tue Sep 15 04:47:14 2020

Sep 15 05:11:25 sludge savecore: reboot after panic: [   7.6538159] panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0 || panicstr != NULL" failed: file "/usr/src/sys/kern/kern_condvar.c", line 126 
Sep 15 05:11:25 sludge savecore: system went down at Tue Sep 15 05:10:44 2020

Sep 15 06:24:33 sludge /usr/sbin/ifwatchd[731]: watching interface pppoe0
Sep 15 06:24:33 sludge npfd: reopening pcap socket
Sep 15 06:24:33 sludge npfd: 176 packets read from `/var/log/npflog0.pcap'
Sep 15 06:24:33 sludge savecore: reboot after panic: [   7.8855546] panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0 || panicstr != NULL" failed: file "/usr/src/sys/kern/kern_condvar.c", line 126 
Sep 15 06:24:33 sludge savecore: system went down at Tue Sep 15 06:23:52 2020

Sep 15 07:01:02 sludge /usr/sbin/ifwatchd[731]: watching interface pppoe0
Sep 15 07:01:02 sludge npfd: reopening pcap socket
Sep 15 07:01:02 sludge npfd: 177 packets read from `/var/log/npflog0.pcap'
Sep 15 07:01:02 sludge savecore: reboot after panic: [   7.2022737] panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0 || panicstr != NULL" failed: file "/usr/src/sys/kern/kern_condvar.c", line 126 
Sep 15 07:01:02 sludge savecore: system went down at Tue Sep 15 07:00:21 2020
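
The assertion text in the panics above can be read as a simple predicate. The following is a minimal userland sketch of that condition, not the kernel source: `struct sim_lwp`, `SIM_LP_INTR`, and `sim_cv_wait_allowed()` are hypothetical stand-ins, and the flag's bit value is an assumption chosen for illustration. It assumes that LP_INTR marks an LWP running as a soft interrupt handler, so the check fails when such an LWP tries to wait on a condition variable while no panic is in progress.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's LP_INTR flag; the bit value
 * here is an assumption for illustration only. */
#define SIM_LP_INTR 0x40u

/* Hypothetical, reduced stand-in for the kernel's per-LWP structure. */
struct sim_lwp {
    unsigned l_pflag;
};

/* Same shape as the logged assertion:
 *   (l->l_pflag & LP_INTR) == 0 || panicstr != NULL
 * Returns true when waiting would be permitted. */
static bool sim_cv_wait_allowed(const struct sim_lwp *l, const char *panicstr)
{
    return (l->l_pflag & SIM_LP_INTR) == 0 || panicstr != NULL;
}
```

Under these assumptions, an LWP with the flag clear passes the check, an LWP with the flag set fails it, and the check is waived once a panic string has been set.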

>How-To-Repeat:
Get your ISP's backhaul to fail so pppoe sessions time out and endlessly try to renegotiate.

Might work on a tunnel, not tried yet.
>Fix:

>Release-Note:

>Audit-Trail:
From: Kengo NAKAHARA <k-nakahara@iij.ad.jp>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing panic
Date: Wed, 16 Sep 2020 10:37:29 +0900

 Hi,

 It seems that your system saved a core dump.  Could you show the backtrace?
 For example, run the following commands:
 # crash /var/crash/netbsd.X.core.gz /var/crash/netbsd.X.gz  # "X" is the latest number
 crash> bt


 Thanks,

 -- 
 //////////////////////////////////////////////////////////////////////
 Internet Initiative Japan Inc.

 Device Engineering Section,
 Product Development Department,
 Product Division,
 Technology Unit

 Kengo NAKAHARA <k-nakahara@iij.ad.jp>

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing panic
Date: Wed, 16 Sep 2020 07:27:22 +0200

 I think the if_down "abuse" in pppoe should be removed and an internal flag
 be set to stop reconnect attempts, to be cleared with pppoectl. But this
 is a user visible change of behaviour - so may not be good to pullup to
 active branches.

 A workaround would be to move LCP timeout handling from a callout to a
 workqueue, so the if_down() call would happen in thread context (which now
 is required).

 Martin
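
The workaround described above (defer the if_down() call from the timeout handler to a worker that runs in thread context) can be sketched in plain userland C. This is a single-threaded simulation under stated assumptions, not the pppoe(4) or sppp code and not the kernel's callout or workqueue API: every name (`sim_if_down`, `sim_lcp_timeout`, `sim_worker_run`, `struct sim_work`) is hypothetical, and the "context" is just a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not NetBSD kernel APIs. */

static bool in_softint;     /* true while a simulated timeout handler runs */
static int  if_down_calls;  /* how many times the link was taken down */

/* Stand-in for an operation that may sleep, so it needs thread context. */
static void sim_if_down(void)
{
    assert(!in_softint);    /* same spirit as the failed kernel assertion */
    if_down_calls++;
}

/* A one-slot "work queue": just enough to show the deferral. */
struct sim_work {
    void (*func)(void);
    bool pending;
};

static struct sim_work lcp_timeout_work = { sim_if_down, false };

/* Timeout handler: runs in simulated soft interrupt context.
 * It does not call sim_if_down() directly; it only marks work pending. */
static void sim_lcp_timeout(void)
{
    in_softint = true;
    lcp_timeout_work.pending = true;
    in_softint = false;
}

/* Worker: runs in simulated thread context and drains the queue.
 * Returns the number of work items processed. */
static int sim_worker_run(void)
{
    int done = 0;

    if (lcp_timeout_work.pending) {
        lcp_timeout_work.pending = false;
        lcp_timeout_work.func();
        done++;
    }
    return done;
}
```

Calling `sim_lcp_timeout()` alone leaves `if_down_calls` unchanged; the link is only taken down when `sim_worker_run()` later processes the pending item. Calling `sim_if_down()` from inside the handler would trip the assertion instead, which is the failure mode the workaround is meant to avoid.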

From: Kengo NAKAHARA <k-nakahara@iij.ad.jp>
To: gnats-bugs@netbsd.org, port-amd64-maintainer@netbsd.org,
        gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
        pr@xn--rvztrtkrfrgp-bbb7j2b8f0b9d7a21oft.com
Cc: 
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing panic
Date: Thu, 17 Sep 2020 14:33:23 +0900

 Hi,

 Thank you for pointing this out.

 On 2020/09/16 14:30, Martin Husemann wrote:
 >   I think the if_down "abuse" in pppoe should be removed and an internal flag
 >   be set to stop reconnect attempts, to be cleared with pppoectl. But this
 >   is a user visible change of behaviour - so may not be good to pullup to
 >   active branches.
 >   
 >   A workaround would be to move LCP timeout handling from a callout to a
 >   workqueue, so the if_down() call would happen in thread context (which now
 >   is required).

 I'm pretty sure that yamaguchi@ wrote that workqueue code.  I will ask
 him to commit the code after testing.


 Thanks,

 -- 
 //////////////////////////////////////////////////////////////////////
 Internet Initiative Japan Inc.

 Device Engineering Section,
 Product Development Department,
 Product Division,
 Technology Unit

 Kengo NAKAHARA <k-nakahara@iij.ad.jp>

State-Changed-From-To: open->closed
State-Changed-By: yamaguchi@NetBSD.org
State-Changed-When: Wed, 23 Sep 2020 01:35:14 +0000
State-Changed-Why:
fixed


State-Changed-From-To: closed->open
State-Changed-By: roy@NetBSD.org
State-Changed-When: Thu, 24 Sep 2020 02:57:05 +0000
State-Changed-Why:
I can reliably panic this still.


From: Roy Marples <roy@marples.name>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing pani
Date: Thu, 24 Sep 2020 03:56:03 +0100

 I can reliably reproduce this on NetBSD-9.99.73 without an ISP.

 I have set up a pppoe(4) interface with LINK0 and enabled PPPOE server in the kernel.
 I have a VM that in turn is a pppoe client for the above.
 Upon shutting down the VM, the NetBSD host panics.

 Slightly blurred backtrace here:
 https://photos.app.goo.gl/6KRiUe3ifdhRDgy19

 Riastradh | sppp_keepalive needs to defer if_down to workqueue
 Riastradh | not fully fixed, please reopen and share stack trace with yamaguchi-san

 Roy

From: Benedek Gergely <pr@xn--rvztrtkrfrgp-bbb7j2b8f0b9d7a21oft.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing panic
Date: Sun, 4 Oct 2020 03:16:25 +0100

 I didn't include the backtrace from the dump as it's empty:

 sludge# crash netbsd.89.core.gz netbsd.89.gz
 Crash version 9.99.73, image version 9.99.73.
 Kernel compiled without options LOCKDEBUG.
 Output from a running system is unreliable.
 crash> bt
 0:
 crash> 

 It's also not stopping in ddb or I'd just grab a backtrace
 over serial.

From: s ymgch <s.ymgch228@gmail.com>
To: gnats-bugs@netbsd.org, roy@marples.name
Cc: port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing pani
Date: Thu, 26 Nov 2020 16:17:10 +0900

 Hi,

 I committed some fixes around pppoe(4), and these may have fixed the panic.
 Could you check?

 -- yamaguchi


State-Changed-From-To: open->feedback
State-Changed-By: yamaguchi@NetBSD.org
State-Changed-When: Fri, 27 Nov 2020 09:18:57 +0000
State-Changed-Why:


>Unformatted:
