NetBSD Problem Report #54009

From www@NetBSD.org  Sat Feb 23 19:45:17 2019
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 02F537A156
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 23 Feb 2019 19:45:17 +0000 (UTC)
Message-Id: <20190223194516.32B717A1DF@mollari.NetBSD.org>
Date: Sat, 23 Feb 2019 19:45:16 +0000 (UTC)
From: alnsn@yandex.ru
Reply-To: alnsn@yandex.ru
To: gnats-bugs@NetBSD.org
Subject: "l->l_pcu_cpu[id] == NULL" panic on aarch64
X-Send-Pr-Version: www-1.0

>Number:         54009
>Category:       kern
>Synopsis:       "l->l_pcu_cpu[id] == NULL" panic on aarch64
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    ryo
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Feb 23 19:50:00 +0000 2019
>Closed-Date:    Thu Sep 05 09:30:52 +0000 2019
>Last-Modified:  Fri Sep 06 19:40:01 +0000 2019
>Originator:     Alexander Nasonov
>Release:        NetBSD 8.99.34 aarch64
>Organization:
XMM SWAP LTD
>Environment:
NetBSD nebula 8.99.34 NetBSD 8.99.34 (GENERIC64) #0: Sat Feb 16 15:17:35 GMT 2019  alnsn@nebeda:/home/alnsn/netbsd-current/sljit/src/sys/arch/evbarm/compile/obj/GENERIC64 evbarm

>Description:
My Scaleway aarch64 cloud server running two tor relays crashes randomly.
Sometimes it takes several days (or even a couple of weeks) before it
crashes. Other times, it crashes several times a day.

Stack traces aren't always available because the Scaleway console doesn't
always connect properly to a crashed VM, but I managed to get two stack traces:

panic: kernel diagnostic assertion "l->l_pcu_cpu[id] == NULL", file /home/alnsn/netbsd-current/sljit/src/sys/kern/subr_pcu.c, line 339
cpu9: Begin traceback...
trace fp ffffffc1252bfcc0
fp ffffffc1252bfce0 vpanic() at ffffffc00045fa08 netbsd:vpanic+0x198
fp ffffffc1252bfdb4 kern_assert() at ffffffc00058f6a4 netbsd:kern_assert+0x5c
fp ffffffc1252bfdd0 pcu_load() at ffffffc0004585ac netbsd:pcu_load+0x21c
fp ffffffc1252bfe70 trap_el0_sync() at ffffffc00006a918 netbsd:trap_el0_sync+0x108
fp ffffffc1252bfed0 el0_trap() at ffffffc0000685d0 netbsd:el0_trap

dmesg of the system: http://dmesgd.nycbug.org/index.cgi?do=view&id=4787

The system is built with:

MKPIE=          yes
MKSLJIT=        yes
MKDEBUG=        yes

It runs on a fully encrypted disk (mounted as /altroot).

# sysctl.conf
ddb.onpanic=1
kern.defcorename=/var/crash/%u/%n.core
kern.maxfiles=98304
security.pax.mprotect.enabled=1
security.pax.mprotect.global=1
security.pax.aslr.enabled=1
security.pax.aslr.global=1
security.pax.segvguard.enabled=1
security.pax.segvguard.global=0
security.pax.segvguard.max_crashes=10
security.pax.segvguard.expiry_timeout=120
security.pax.segvguard.suspend_timeout=600
net.bpf.jit=0
vfs.generic.magiclinks=1
hw.firmware.path=/libdata/firmware:/usr/libdata/firmware:/altroot/usr/pkg/libdata/firmware:/altroot/usr/pkg/libdata
>How-To-Repeat:
Run a tor relay for a while. Once it gets enough activity, the system will start crashing randomly.
>Fix:
Not known.

>Release-Note:

>Audit-Trail:
From: Ryo Shimizu <ryo@nerv.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: Re: kern/54009: "l->l_pcu_cpu[id] == NULL" panic on aarch64
Date: Mon, 26 Aug 2019 04:35:35 +0900

 I guess the cause is a missing memory barrier.
 Will the following patch fix it?


 cvs -q diff -aup .
 Index: subr_ipi.c
 ===================================================================
 RCS file: /src/cvs/cvsroot-netbsd/src/sys/kern/subr_ipi.c,v
 retrieving revision 1.4
 diff -a -u -p -r1.4 subr_ipi.c
 --- subr_ipi.c	6 Apr 2019 02:59:05 -0000	1.4
 +++ subr_ipi.c	25 Aug 2019 19:23:47 -0000
 @@ -331,6 +331,9 @@ ipi_msg_cpu_handler(void *arg __unused)
  		msg->func(msg->arg);

  		/* Ack the request. */
 +#ifndef __HAVE_ATOMIC_AS_MEMBAR
 +		membar_producer();
 +#endif
  		atomic_dec_uint(&msg->_pending);
  	}
  }
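
The ordering that the patch above enforces can be sketched in userspace with C11 atomics. This is only an analogy under stated assumptions, not the kernel code: `demo_msg`, `demo_handler` and `demo_send_and_wait` are hypothetical names, a pthread stands in for the remote CPU, and a C11 release fence stands in for `membar_producer()` (the two are not identical primitives). The point it illustrates is that the handler's stores must become visible before the decrement of the pending counter, otherwise the waiter may observe the ack and still read stale data.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an IPI message; not the kernel's structure. */
struct demo_msg {
	int		result;		/* plain data written by the handler */
	atomic_uint	pending;	/* outstanding acks, like msg->_pending */
};

static void *
demo_handler(void *arg)
{
	struct demo_msg *msg = arg;

	/* Side effect of the "remote" function call. */
	msg->result = 42;

	/* Make the store above visible before the ack below. */
	atomic_thread_fence(memory_order_release);

	/* Ack the request. */
	atomic_fetch_sub_explicit(&msg->pending, 1, memory_order_relaxed);
	return NULL;
}

/*
 * Send one "message" and wait for the ack.
 * Returns the result as seen by the waiter, or -1 if the thread
 * could not be created.
 */
int
demo_send_and_wait(void)
{
	struct demo_msg msg;
	pthread_t t;
	int seen;

	msg.result = 0;
	atomic_init(&msg.pending, 1);

	if (pthread_create(&t, NULL, demo_handler, &msg) != 0)
		return -1;

	/* Spin until the handler has acked. */
	while (atomic_load_explicit(&msg.pending, memory_order_relaxed) != 0)
		continue;

	/* Pairs with the release fence in the handler. */
	atomic_thread_fence(memory_order_acquire);

	seen = msg.result;
	pthread_join(t, NULL);
	return seen;
}
```

Calling `demo_send_and_wait()` from a test program should return 42. If the release fence is removed, the C11 memory model no longer guarantees that, and a weakly ordered machine such as aarch64 may let the waiter see the ack before the store to `result`.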

 -- 
 ryo shimizu

From: Alexander Nasonov <alnsn@yandex.ru>
To: Ryo Shimizu <ryo@nerv.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/54009: "l->l_pcu_cpu[id] == NULL" panic on aarch64
Date: Sun, 25 Aug 2019 21:31:17 +0100

 Ryo Shimizu wrote:
 > 
 > I guess the cause is the lack of memory barrier.
 > Will the following patches fix it?

 The bug annoyed me so much that I turned that server off.
 But I recently turned it back on to test 9.0_BETA.

 The two tor relays running on the server are still in a ramp-up phase
 and it will take about a month to get them running at full speed.
 Once they run at full speed, the chance of hitting the panic will
 be much higher.

 -- 
 Alex

From: Ryo Shimizu <ryo@nerv.org>
To: Alexander Nasonov <alnsn@yandex.ru>
Cc: Ryo Shimizu <ryo@nerv.org>, gnats-bugs@NetBSD.org,
    kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: Re: kern/54009: "l->l_pcu_cpu[id] == NULL" panic on aarch64
Date: Thu, 05 Sep 2019 18:08:40 +0900

 >> I guess the cause is the lack of memory barrier.
 >> Will the following patches fix it?
 >
 >The bug annoyed me so much that I turned that server off.
 >But I recently turned it back on to test 9.0_BETA.
 >
 >The two tor relays running on the server are still in a ramp-up phase
 >and it will take about a month to get them running at full speed.
 >Once they run at full speed, the chance of hitting the panic will
 >be much higher.


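 With only the following verification patch applied, the assertion was
 confirmed to be a false positive: the value reads as non-NULL at first,
 but as NULL once a barrier has been executed.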

 cvs -q diff -aup .
 Index: subr_pcu.c
 ===================================================================
 RCS file: /src/cvs/cvsroot-netbsd/src/sys/kern/subr_pcu.c,v
 retrieving revision 1.21
 diff -a -u -p -r1.21 subr_pcu.c
 --- subr_pcu.c	16 Oct 2017 15:03:57 -0000	1.21
 +++ subr_pcu.c	29 Aug 2019 05:53:35 -0000
 @@ -336,6 +336,13 @@ pcu_load(const pcu_ops_t *pcu)
  		s = splpcu();
  		curci = curcpu();
  	}
 +#if 1
 +	if (l->l_pcu_cpu[id] != NULL) {
 +		printf("false positive?: l->l_pcu_cpu[id] == NULL? id=%u, l=%p, l->l_pcu_cpu[id]=%p\n", id, l, l->l_pcu_cpu[id]);
 +		__asm __volatile ("dsb sy");
 +		printf("check again: l->l_pcu_cpu[id] == NULL? id=%u, l=%p, l->l_pcu_cpu[id]=%p\n", id, l, l->l_pcu_cpu[id]);
 +	}
 +#endif
  	KASSERT(l->l_pcu_cpu[id] == NULL);

  	/* Save the PCU state on the current CPU, if there is any. */


 [    46.812281] false positive?: l->l_pcu_cpu[id] == NULL? id=0, l=0xffffffc004ba2300, l->l_pcu_cpu[id]=0xffffffc000a29580
 [    46.812281] check again: l->l_pcu_cpu[id] == NULL? id=0, l=0xffffffc004ba2300, l->l_pcu_cpu[id]=0x0
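
The read, barrier, read-again check used by the verification patch can be written as a small portable helper. This is a minimal sketch under stated assumptions: `demo_stale_check` is a hypothetical name, and a C11 sequentially consistent fence stands in for the aarch64 `dsb sy` instruction, which it does not exactly match. Run in a single thread it never reports a stale value; it only shows the shape of the diagnostic.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: read *slot, issue a full fence, read it again.
 * Returns 1 if the first read was non-NULL but the second was NULL
 * (the "false positive" pattern from the log), 0 otherwise.
 */
int
demo_stale_check(void *volatile *slot)
{
	void *first, *second;

	first = *slot;
	if (first == NULL)
		return 0;	/* the assertion would have passed */

	fprintf(stderr, "false positive?: slot=%p\n", first);

	/* Stand-in for "dsb sy"; an assumption, not an equivalent. */
	atomic_thread_fence(memory_order_seq_cst);

	second = *slot;
	fprintf(stderr, "check again: slot=%p\n", second);

	return second == NULL;
}
```

A return value of 1 corresponds to the two log lines above, where the pointer is non-NULL on the first read and NULL on the second.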

 It's almost certainly a memory barrier problem.
 I'll commit the fix. If you can still reproduce it, please let me know.

 Thanks,
 -- 
 ryo shimizu

From: "Ryo Shimizu" <ryo@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54009 CVS commit: src/sys/kern
Date: Thu, 5 Sep 2019 09:20:05 +0000

 Module Name:	src
 Committed By:	ryo
 Date:		Thu Sep  5 09:20:05 UTC 2019

 Modified Files:
 	src/sys/kern: subr_ipi.c

 Log Message:
 requires memory barrier before IPI ack.
 Problem was seen on the aarch64 cpus.

 Fixes PR/54009


 To generate a diff of this commit:
 cvs rdiff -u -r1.4 -r1.5 src/sys/kern/subr_ipi.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: kern-bug-people->ryo
Responsible-Changed-By: ryo@NetBSD.org
Responsible-Changed-When: Thu, 05 Sep 2019 09:30:52 +0000
Responsible-Changed-Why:
fixed


State-Changed-From-To: open->closed
State-Changed-By: ryo@NetBSD.org
State-Changed-When: Thu, 05 Sep 2019 09:30:52 +0000
State-Changed-Why:


From: Jason Thorpe <thorpej@me.com>
To: ryo@netbsd.org
Cc: kern-bug-people@netbsd.org,
 netbsd-bugs@netbsd.org,
 gnats-admin@netbsd.org,
 alnsn@yandex.ru,
 gnats-bugs@netbsd.org
Subject: Re: kern/54009 ("l->l_pcu_cpu[id] == NULL" panic on aarch64)
Date: Thu, 5 Sep 2019 07:33:42 -0700

 This fix should get pulled up to netbsd-9 (and possibly netbsd-8).

 > On Sep 5, 2019, at 2:30 AM, ryo@netbsd.org <ryo@NetBSD.org> wrote:
 > 
 > Synopsis: "l->l_pcu_cpu[id] == NULL" panic on aarch64
 > 
 > Responsible-Changed-From-To: kern-bug-people->ryo
 > Responsible-Changed-By: ryo@NetBSD.org
 > Responsible-Changed-When: Thu, 05 Sep 2019 09:30:52 +0000
 > Responsible-Changed-Why:
 > fixed
 > 
 > 
 > State-Changed-From-To: open->closed
 > State-Changed-By: ryo@NetBSD.org
 > State-Changed-When: Thu, 05 Sep 2019 09:30:52 +0000
 > State-Changed-Why:
 > 
 > 
 > 

 -- thorpej

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54009 CVS commit: [netbsd-9] src/sys/kern
Date: Fri, 6 Sep 2019 19:37:52 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Fri Sep  6 19:37:51 UTC 2019

 Modified Files:
 	src/sys/kern [netbsd-9]: subr_ipi.c

 Log Message:
 Pull up following revision(s) (requested by ryo in ticket #181):

 	sys/kern/subr_ipi.c: revision 1.5

 Requires memory barrier before IPI ack.
 Problem was seen on the aarch64 cpus.
 Fixes PR/54009


 To generate a diff of this commit:
 cvs rdiff -u -r1.4 -r1.4.4.1 src/sys/kern/subr_ipi.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:
