NetBSD Problem Report #54009
From www@NetBSD.org Sat Feb 23 19:45:17 2019
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 02F537A156
for <gnats-bugs@gnats.NetBSD.org>; Sat, 23 Feb 2019 19:45:17 +0000 (UTC)
Message-Id: <20190223194516.32B717A1DF@mollari.NetBSD.org>
Date: Sat, 23 Feb 2019 19:45:16 +0000 (UTC)
From: alnsn@yandex.ru
Reply-To: alnsn@yandex.ru
To: gnats-bugs@NetBSD.org
Subject: "l->l_pcu_cpu[id] == NULL" panic on aarch64
X-Send-Pr-Version: www-1.0
>Number: 54009
>Category: kern
>Synopsis: "l->l_pcu_cpu[id] == NULL" panic on aarch64
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: ryo
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Feb 23 19:50:00 +0000 2019
>Closed-Date: Thu Sep 05 09:30:52 +0000 2019
>Last-Modified: Fri Sep 06 19:40:01 +0000 2019
>Originator: Alexander Nasonov
>Release: NetBSD 8.99.34 aarch64
>Organization:
XMM SWAP LTD
>Environment:
NetBSD nebula 8.99.34 NetBSD 8.99.34 (GENERIC64) #0: Sat Feb 16 15:17:35 GMT 2019 alnsn@nebeda:/home/alnsn/netbsd-current/sljit/src/sys/arch/evbarm/compile/obj/GENERIC64 evbarm
>Description:
My Scaleway aarch64 cloud server running two tor relays crashes randomly.
Sometimes it takes several days (or even a couple of weeks) before it
crashes. Other times, it crashes several times a day.
Stack traces aren't always available because the Scaleway console doesn't always
connect properly to a crashed VM, but I managed to get two stack traces:
panic: kernel diagnostic assertion "l->l_pcu_cpu[id] == NULL", file /home/alnsn/netbsd-current/sljit/src/sys/kern/subr_pcu.c, line 339
cpu9: Begin traceback...
trace fp ffffffc1252bfcc0
fp ffffffc1252bfce0 vpanic() at ffffffc00045fa08 netbsd:vpanic+0x198
fp ffffffc1252bfdb4 kern_assert() at ffffffc00058f6a4 netbsd:kern_assert+0x5c
fp ffffffc1252bfdd0 pcu_load() at ffffffc0004585ac netbsd:pcu_load+0x21c
fp ffffffc1252bfe70 trap_el0_sync() at ffffffc00006a918 netbsd:trap_el0_sync+0x108
fp ffffffc1252bfed0 el0_trap() at ffffffc0000685d0 netbsd:el0_trap
dmesg of the system: http://dmesgd.nycbug.org/index.cgi?do=view&id=4787
The system is built with:
MKPIE= yes
MKSLJIT= yes
MKDEBUG= yes
It runs on a fully encrypted disk (mounted as /altroot).
# sysctl.conf
ddb.onpanic=1
kern.defcorename=/var/crash/%u/%n.core
kern.maxfiles=98304
security.pax.mprotect.enabled=1
security.pax.mprotect.global=1
security.pax.aslr.enabled=1
security.pax.aslr.global=1
security.pax.segvguard.enabled=1
security.pax.segvguard.global=0
security.pax.segvguard.max_crashes=10
security.pax.segvguard.expiry_timeout=120
security.pax.segvguard.suspend_timeout=600
net.bpf.jit=0
vfs.generic.magiclinks=1
hw.firmware.path=/libdata/firmware:/usr/libdata/firmware:/altroot/usr/pkg/libdata/firmware:/altroot/usr/pkg/libdata
>How-To-Repeat:
Run a tor relay for a while. Once it gets enough activity, the system will start crashing randomly.
>Fix:
Not known.
>Release-Note:
>Audit-Trail:
From: Ryo Shimizu <ryo@nerv.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/54009: "l->l_pcu_cpu[id] == NULL" panic on aarch64
Date: Mon, 26 Aug 2019 04:35:35 +0900
I guess the cause is the lack of a memory barrier.
Will the following patch fix it?
cvs -q diff -aup .
Index: subr_ipi.c
===================================================================
RCS file: /src/cvs/cvsroot-netbsd/src/sys/kern/subr_ipi.c,v
retrieving revision 1.4
diff -a -u -p -r1.4 subr_ipi.c
--- subr_ipi.c 6 Apr 2019 02:59:05 -0000 1.4
+++ subr_ipi.c 25 Aug 2019 19:23:47 -0000
@@ -331,6 +331,9 @@ ipi_msg_cpu_handler(void *arg __unused)
msg->func(msg->arg);
/* Ack the request. */
+#ifndef __HAVE_ATOMIC_AS_MEMBAR
+ membar_producer();
+#endif
atomic_dec_uint(&msg->_pending);
}
}
--
ryo shimizu
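
The ordering that the patch above is after can be illustrated in user space. What follows is only a minimal sketch under stated assumptions, not the kernel code: `demo_msg`, `demo_handler`, `demo_clear_state` and `demo_send_and_wait` are hypothetical names, a pthread stands in for the remote CPU, and a C11 release fence stands in for `membar_producer()` (the two are similar in intent but not identical). The acquire fence on the waiting side is likewise an assumption of the sketch; the patch itself only touches the handler side.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an IPI message; not the NetBSD structure. */
struct demo_msg {
	void (*func)(void *);
	void *arg;
	atomic_uint pending;	/* handlers that have not acked yet */
};

/* The "work" done on the target: a plain, non-atomic store. */
static void
demo_clear_state(void *arg)
{
	*(int *)arg = 0;
}

/* Target side: run the request, then ack it. */
static void *
demo_handler(void *cookie)
{
	struct demo_msg *msg = cookie;

	msg->func(msg->arg);
	/* Make the stores done by func visible before the ack is. */
	atomic_thread_fence(memory_order_release);
	atomic_fetch_sub_explicit(&msg->pending, 1, memory_order_relaxed);
	return NULL;
}

/* Sender side: post the request, wait for the ack, read the result. */
int
demo_send_and_wait(void)
{
	int state = 1;
	struct demo_msg msg = { .func = demo_clear_state, .arg = &state };
	pthread_t t;
	int rc, seen;

	atomic_init(&msg.pending, 1);
	rc = pthread_create(&t, NULL, demo_handler, &msg);
	assert(rc == 0);
	(void)rc;

	/* Spin until the handler has acked. */
	while (atomic_load_explicit(&msg.pending, memory_order_relaxed) != 0)
		continue;
	/* Pairs with the release fence in demo_handler(). */
	atomic_thread_fence(memory_order_acquire);
	seen = state;

	pthread_join(t, NULL);
	return seen;
}
```

With both fences in place, the waiter is guaranteed to see `state == 0` once it observes the ack. Without the release fence, a weakly ordered CPU such as aarch64 would be allowed to make the decrement visible before the handler's store, which matches the kind of stale read reported in this PR.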
From: Alexander Nasonov <alnsn@yandex.ru>
To: Ryo Shimizu <ryo@nerv.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/54009: "l->l_pcu_cpu[id] == NULL" panic on aarch64
Date: Sun, 25 Aug 2019 21:31:17 +0100
Ryo Shimizu wrote:
>
> I guess the cause is the lack of memory barrier.
> Will the following patches fix it?
The bug annoyed me so much that I turned that server off.
But I recently turned it back on to test 9.0_BETA.
Two tor relays running on the server are still in a ramp-up phase
and it will take about a month to get them running at full speed.
Once they run at full speed, the chance of hitting the panic will
be much higher.
--
Alex
From: Ryo Shimizu <ryo@nerv.org>
To: Alexander Nasonov <alnsn@yandex.ru>
Cc: Ryo Shimizu <ryo@nerv.org>, gnats-bugs@NetBSD.org,
kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/54009: "l->l_pcu_cpu[id] == NULL" panic on aarch64
Date: Thu, 05 Sep 2019 18:08:40 +0900
>> I guess the cause is the lack of memory barrier.
>> Will the following patches fix it?
>
>The bug annoyed me so much that I turned that server off.
>But I recently turned it back on to test 9.0_BETA.
>
>Two tor relays running on the server are still in a ramp-up phase
>and it will take about a month to get them running at full speed.
>Once they run at full speed, the chance of hitting the panic will
>be much higher.
With only the following verification patch applied, the assertion failure was confirmed to be a false positive:
cvs -q diff -aup .
Index: subr_pcu.c
===================================================================
RCS file: /src/cvs/cvsroot-netbsd/src/sys/kern/subr_pcu.c,v
retrieving revision 1.21
diff -a -u -p -r1.21 subr_pcu.c
--- subr_pcu.c 16 Oct 2017 15:03:57 -0000 1.21
+++ subr_pcu.c 29 Aug 2019 05:53:35 -0000
@@ -336,6 +336,13 @@ pcu_load(const pcu_ops_t *pcu)
s = splpcu();
curci = curcpu();
}
+#if 1
+ if (l->l_pcu_cpu[id] != NULL) {
+ printf("false positive?: l->l_pcu_cpu[id] == NULL? id=%u, l=%p, l->l_pcu_cpu[id]=%p\n", id, l, l->l_pcu_cpu[id]);
+ __asm __volatile ("dsb sy");
+ printf("check again: l->l_pcu_cpu[id] == NULL? id=%u, l=%p, l->l_pcu_cpu[id]=%p\n", id, l, l->l_pcu_cpu[id]);
+ }
+#endif
KASSERT(l->l_pcu_cpu[id] == NULL);
/* Save the PCU state on the current CPU, if there is any. */
[ 46.812281] false positive?: l->l_pcu_cpu[id] == NULL? id=0, l=0xffffffc004ba2300, l->l_pcu_cpu[id]=0xffffffc000a29580
[ 46.812281] check again: l->l_pcu_cpu[id] == NULL? id=0, l=0xffffffc004ba2300, l->l_pcu_cpu[id]=0x0
It's almost certainly a memory barrier problem.
I'll commit the fix. If you still reproduce it, please let me know.
Thanks,
--
ryo shimizu
From: "Ryo Shimizu" <ryo@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54009 CVS commit: src/sys/kern
Date: Thu, 5 Sep 2019 09:20:05 +0000
Module Name: src
Committed By: ryo
Date: Thu Sep 5 09:20:05 UTC 2019
Modified Files:
src/sys/kern: subr_ipi.c
Log Message:
requires memory barrier before IPI ack.
Problem was seen on the aarch64 cpus.
Fixes PR/54009
To generate a diff of this commit:
cvs rdiff -u -r1.4 -r1.5 src/sys/kern/subr_ipi.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
Responsible-Changed-From-To: kern-bug-people->ryo
Responsible-Changed-By: ryo@NetBSD.org
Responsible-Changed-When: Thu, 05 Sep 2019 09:30:52 +0000
Responsible-Changed-Why:
fixed
State-Changed-From-To: open->closed
State-Changed-By: ryo@NetBSD.org
State-Changed-When: Thu, 05 Sep 2019 09:30:52 +0000
State-Changed-Why:
From: Jason Thorpe <thorpej@me.com>
To: ryo@netbsd.org
Cc: kern-bug-people@netbsd.org,
netbsd-bugs@netbsd.org,
gnats-admin@netbsd.org,
alnsn@yandex.ru,
gnats-bugs@netbsd.org
Subject: Re: kern/54009 ("l->l_pcu_cpu[id] == NULL" panic on aarch64)
Date: Thu, 5 Sep 2019 07:33:42 -0700
This fix should get pulled up to netbsd-9 (and possibly netbsd-8).
> On Sep 5, 2019, at 2:30 AM, ryo@netbsd.org <ryo@NetBSD.org> wrote:
>
> Synopsis: "l->l_pcu_cpu[id] == NULL" panic on aarch64
>
> Responsible-Changed-From-To: kern-bug-people->ryo
> Responsible-Changed-By: ryo@NetBSD.org
> Responsible-Changed-When: Thu, 05 Sep 2019 09:30:52 +0000
> Responsible-Changed-Why:
> fixed
>
>
> State-Changed-From-To: open->closed
> State-Changed-By: ryo@NetBSD.org
> State-Changed-When: Thu, 05 Sep 2019 09:30:52 +0000
> State-Changed-Why:
>
>
>
-- thorpej
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54009 CVS commit: [netbsd-9] src/sys/kern
Date: Fri, 6 Sep 2019 19:37:52 +0000
Module Name: src
Committed By: martin
Date: Fri Sep 6 19:37:51 UTC 2019
Modified Files:
src/sys/kern [netbsd-9]: subr_ipi.c
Log Message:
Pull up following revision(s) (requested by ryo in ticket #181):
sys/kern/subr_ipi.c: revision 1.5
Requires memory barrier before IPI ack.
Problem was seen on the aarch64 cpus.
Fixes PR/54009
To generate a diff of this commit:
cvs rdiff -u -r1.4 -r1.4.4.1 src/sys/kern/subr_ipi.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted:
Copyright © 1994-2017
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.