NetBSD Problem Report #55661
From www@netbsd.org Tue Sep 15 07:08:02 2020
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 777501A9239
for <gnats-bugs@gnats.NetBSD.org>; Tue, 15 Sep 2020 07:08:02 +0000 (UTC)
Message-Id: <20200915070801.634621A923A@mollari.NetBSD.org>
Date: Tue, 15 Sep 2020 07:08:01 +0000 (UTC)
From: pr@xn--rvztrtkrfrgp-bbb7j2b8f0b9d7a21oft.com
Reply-To: pr@xn--rvztrtkrfrgp-bbb7j2b8f0b9d7a21oft.com
To: gnats-bugs@NetBSD.org
Subject: pppoe renegotiation timeout causing panic
X-Send-Pr-Version: www-1.0
>Number: 55661
>Category: port-amd64
>Synopsis: pppoe renegotiation timeout causing panic
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: port-amd64-maintainer
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Sep 15 07:10:00 +0000 2020
>Closed-Date: Sat Nov 20 05:50:20 +0000 2021
>Last-Modified: Sat Nov 20 05:50:20 +0000 2021
>Originator: Ben Gergely
>Release: 9.99.72
>Organization:
>Environment:
NetBSD 9.99.72 amd64
>Description:
Kinda hard to debug because it requires my ISP to fall on its face.
I didn't manage to grab the other panic that mentioned an xcall, which didn't get tee'd to the log.
I'm assuming the panics are from trying to renegotiate a pppoe session after a PADI timeout. I've not got a pppd log to check specifically what happened just before, but a timeout is an educated guess based on what was happening with the ISP.
Sep 15 03:46:27 sludge /usr/sbin/ifwatchd[731]: watching interface pppoe0
Sep 15 03:46:27 sludge npfd: reopening pcap socket
Sep 15 03:46:27 sludge npfd: 167 packets read from `/var/log/npflog0.pcap'
Sep 15 03:46:27 sludge savecore: reboot after panic: [ 7.6760811] panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0 || panicstr != NULL" failed: file "/usr/src/sys/kern/kern_condvar.c", line 126
Sep 15 03:46:27 sludge savecore: system went down at Tue Sep 15 03:45:45 2020
Sep 15 04:47:55 sludge /usr/sbin/ifwatchd[731]: watching interface pppoe0
Sep 15 04:47:55 sludge npfd: reopening pcap socket
Sep 15 04:47:55 sludge npfd: 176 packets read from `/var/log/npflog0.pcap'
Sep 15 04:47:55 sludge savecore: reboot after panic: [ 9.0854538] panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0 || panicstr != NULL" failed: file "/usr/src/sys/kern/kern_condvar.c", line 126
Sep 15 04:47:55 sludge savecore: system went down at Tue Sep 15 04:47:14 2020
Sep 15 05:11:25 sludge savecore: reboot after panic: [ 7.6538159] panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0 || panicstr != NULL" failed: file "/usr/src/sys/kern/kern_condvar.c", line 126
Sep 15 05:11:25 sludge savecore: system went down at Tue Sep 15 05:10:44 2020
Sep 15 06:24:33 sludge /usr/sbin/ifwatchd[731]: watching interface pppoe0
Sep 15 06:24:33 sludge npfd: reopening pcap socket
Sep 15 06:24:33 sludge npfd: 176 packets read from `/var/log/npflog0.pcap'
Sep 15 06:24:33 sludge savecore: reboot after panic: [ 7.8855546] panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0 || panicstr != NULL" failed: file "/usr/src/sys/kern/kern_condvar.c", line 126
Sep 15 06:24:33 sludge savecore: system went down at Tue Sep 15 06:23:52 2020
Sep 15 07:01:02 sludge /usr/sbin/ifwatchd[731]: watching interface pppoe0
Sep 15 07:01:02 sludge npfd: reopening pcap socket
Sep 15 07:01:02 sludge npfd: 177 packets read from `/var/log/npflog0.pcap'
Sep 15 07:01:02 sludge savecore: reboot after panic: [ 7.2022737] panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0 || panicstr != NULL" failed: file "/usr/src/sys/kern/kern_condvar.c", line 126
Sep 15 07:01:02 sludge savecore: system went down at Tue Sep 15 07:00:21 2020
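The assertion quoted in these log lines can be read as a precondition on the caller. The following is a minimal userland sketch of that reading, not the kernel source: the struct, the flag value, and the function names are hypothetical stand-ins, and only the boolean expression itself is taken from the log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's LP_INTR flag; the real value may differ. */
#define LP_INTR 0x1u

/* Hypothetical, stripped-down model of an LWP: only the field the assertion reads. */
struct lwp_model {
	unsigned int l_pflag;
};

/*
 * Mirrors the asserted expression from the log:
 *   (l->l_pflag & LP_INTR) == 0 || panicstr != NULL
 * The check passes when the caller is not flagged as an interrupt
 * handler, or when the system is already panicking.
 */
static bool
may_sleep_on_condvar(const struct lwp_model *l, const char *panicstr)
{
	return (l->l_pflag & LP_INTR) == 0 || panicstr != NULL;
}

/* Hypothetical model of a sleeping primitive guarded by that check. */
static void
model_condvar_sleep(const struct lwp_model *l, const char *panicstr)
{
	assert(may_sleep_on_condvar(l, panicstr));
	/* A real implementation would block here; the model only checks. */
}
```

Under this reading, the repeated panics would mean that something on the pppoe path tried to sleep while flagged as an interrupt handler and while no panic was yet in progress.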
>How-To-Repeat:
Get your ISP's backhaul to fail so pppoe sessions time out and endlessly try to renegotiate.
Might work on a tunnel, not tried yet.
>Fix:
>Release-Note:
>Audit-Trail:
From: Kengo NAKAHARA <k-nakahara@iij.ad.jp>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing panic
Date: Wed, 16 Sep 2020 10:37:29 +0900
Hi,
It seems that your system saved a core dump. Could you show the backtrace?
For example, run the following command:
# crash /var/crash/netbsd.X.core.gz /var/crash/netbsd.X.gz # "X" is the latest number
crash> bt
Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.
Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit
Kengo NAKAHARA <k-nakahara@iij.ad.jp>
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing panic
Date: Wed, 16 Sep 2020 07:27:22 +0200
I think the if_down "abuse" in pppoe should be removed and an internal flag
be set to stop reconnect attempts, to be cleared with pppoectl. But this
is a user visible change of behaviour - so may not be good to pullup to
active branches.
A workaround would be to move LCP timeout handling from a callout to a
workqueue, so the if_down() call would happen in thread context (which now
is required).
Martin
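The workaround described above, deferring work from a timer handler to thread context, can be sketched in userland roughly as follows. This is an illustrative model only: every name in it is hypothetical, it is single-threaded, and it does not use or reproduce the kernel's callout or workqueue interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; none are NetBSD kernel APIs. */

static bool in_interrupt;	/* models "running in timer/interrupt context" */
static bool link_is_up = true;	/* models the interface state */

/* Models if_down(): may only run in thread context. */
static void
model_if_down(void)
{
	assert(!in_interrupt);
	link_is_up = false;
}

/* A one-slot work queue: at most one pending function pointer. */
typedef void (*work_fn)(void);
static work_fn pending_work;

static void
model_enqueue(work_fn fn)
{
	pending_work = fn;
}

/* Models the LCP timeout handler: it only schedules the work. */
static void
model_lcp_timeout(void)
{
	model_enqueue(model_if_down);	/* defer instead of calling directly */
}

/* Models the timer firing: the handler runs in interrupt context. */
static void
model_fire_timer(void (*handler)(void))
{
	in_interrupt = true;
	handler();
	in_interrupt = false;
}

/* Models the worker thread: runs queued work in thread context. */
static bool
model_worker_run(void)
{
	work_fn fn = pending_work;

	if (fn == NULL)
		return false;
	pending_work = NULL;
	fn();
	return true;
}
```

Calling `model_fire_timer(model_lcp_timeout)` leaves the link up and only queues the work; a later `model_worker_run()` performs the down transition outside interrupt context. Calling `model_if_down()` directly from the handler would trip the assertion, which is the shape of failure the workaround is meant to avoid.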
From: Kengo NAKAHARA <k-nakahara@iij.ad.jp>
To: gnats-bugs@netbsd.org, port-amd64-maintainer@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
pr@xn--rvztrtkrfrgp-bbb7j2b8f0b9d7a21oft.com
Cc:
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing panic
Date: Thu, 17 Sep 2020 14:33:23 +0900
Hi,
Thank you for pointing that out.
On 2020/09/16 14:30, Martin Husemann wrote:
> The following reply was made to PR port-amd64/55661; it has been noted by GNATS.
>
> From: Martin Husemann <martin@duskware.de>
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing panic
> Date: Wed, 16 Sep 2020 07:27:22 +0200
>
> I think the if_down "abuse" in pppoe should be removed and an internal flag
> be set to stop reconnect attempts, to be cleared with pppoectl. But this
> is a user visible change of behaviour - so may not be good to pullup to
> active branches.
>
> A workaround would be to move LCP timeout handling from a callout to a
> workqueue, so the if_down() call would happen in thread context (which now
> is required).
I'm pretty sure that yamaguchi@ wrote that workqueue code. I will ask
him to commit the code after testing.
Thanks,
State-Changed-From-To: open->closed
State-Changed-By: yamaguchi@NetBSD.org
State-Changed-When: Wed, 23 Sep 2020 01:35:14 +0000
State-Changed-Why:
fixed
State-Changed-From-To: closed->open
State-Changed-By: roy@NetBSD.org
State-Changed-When: Thu, 24 Sep 2020 02:57:05 +0000
State-Changed-Why:
I can reliably panic this still.
From: Roy Marples <roy@marples.name>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing pani
Date: Thu, 24 Sep 2020 03:56:03 +0100
I can reliably reproduce this on NetBSD-9.99.73 without an ISP.
I have set up a pppoe(4) interface with LINK0 and enabled PPPOE server in the kernel.
I have a VM that in turn is a pppoe client for the above.
Upon shutting down the VM, the NetBSD host panics.
Slightly blurred backtrace here:
https://photos.app.goo.gl/6KRiUe3ifdhRDgy19
Riastradh | sppp_keepalive needs to defer if_down to workqueue
Riastradh | not fully fixed, please reopen and share stack trace with yamaguchi-san
Roy
From: Benedek Gergely <pr@xn--rvztrtkrfrgp-bbb7j2b8f0b9d7a21oft.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing panic
Date: Sun, 4 Oct 2020 03:16:25 +0100
I didn't include the backtrace from the dump as it's empty:
sludge# crash netbsd.89.core.gz netbsd.89.gz
Crash version 9.99.73, image version 9.99.73.
Kernel compiled without options LOCKDEBUG.
Output from a running system is unreliable.
crash> bt
0:
crash>
It's also not stopping in ddb or I'd just grab a backtrace
over serial.
From: s ymgch <s.ymgch228@gmail.com>
To: gnats-bugs@netbsd.org, roy@marples.name
Cc: port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: port-amd64/55661: pppoe renegotiation timeout causing pani
Date: Thu, 26 Nov 2020 16:17:10 +0900
Hi,
I committed some fixes around pppoe(4), and these may have fixed the panic.
Can you check?
-- yamaguchi
State-Changed-From-To: open->feedback
State-Changed-By: yamaguchi@NetBSD.org
State-Changed-When: Fri, 27 Nov 2020 09:18:57 +0000
State-Changed-Why:
State-Changed-From-To: feedback->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 20 Nov 2021 05:50:20 +0000
State-Changed-Why:
feedback mail is bouncing
>Unformatted: