NetBSD Problem Report #56264

From www@netbsd.org  Mon Jun 21 09:55:34 2021
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 6556E1A921F
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 21 Jun 2021 09:55:34 +0000 (UTC)
Message-Id: <20210621095532.F1A201A923D@mollari.NetBSD.org>
Date: Mon, 21 Jun 2021 09:55:32 +0000 (UTC)
From: rokuyama.rk@gmail.com
Reply-To: rokuyama.rk@gmail.com
To: gnats-bugs@NetBSD.org
Subject: diagnostic assertion "l->l_stat == LSONPROC" failed on RPI3
X-Send-Pr-Version: www-1.0

>Number:         56264
>Category:       port-arm
>Synopsis:       diagnostic assertion "l->l_stat == LSONPROC" failed on RPI3
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    skrll
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Jun 21 10:00:00 +0000 2021
>Closed-Date:    Thu Nov 04 07:00:54 +0000 2021
>Last-Modified:  Thu Nov 04 07:00:54 +0000 2021
>Originator:     Rin Okuyama
>Release:        9.99.85
>Organization:
Department of Physics, Meiji University
>Environment:
GENERIC64 kernel for NetBSD/aarch64{,eb} 9.99.85 as of 2021-06-21
GENERIC and RPI2 kernels for 32-bit arm are also potentially affected.
>Description:
The GENERIC64 kernel for aarch64{,eb} fails to boot on RPI3 due to

----
panic: kernel diagnostic assertion "l->l_stat == LSONPROC" failed: file "../../../../kern/kern_sleepq.c", line 227
----

The backtrace is

----
[   1.6097589] cpu1: Begin traceback...
[   1.6097589] trace fp ffffc0006eba79b0
[   1.6197592] fp ffffc0006eba79e0 vpanic() at ffffc0000052cacc netbsd:vpanic+0x14c
[   1.6197592] fp ffffc0006eba7a40 kern_assert() at ffffc0000078a6c8 netbsd:kern_assert+0x58
[   1.6297644] fp ffffc0006eba7ad0 sleepq_enqueue() at ffffc000004f5724 netbsd:sleepq_enqueue+0x174
[   1.6397642] fp ffffc0006eba7b10 cv_enter() at ffffc000004bbda0 netbsd:cv_enter+0xf0
[   1.6497665] fp ffffc0006eba7b50 cv_wait() at ffffc000004bbfc8 netbsd:cv_wait+0x38
[   1.6597679] fp ffffc0006eba7b80 xc_wait() at ffffc00000536490 netbsd:xc_wait+0xb0
[   1.6697755] fp ffffc0006eba7bc0 percpu_backend_alloc() at ffffc000005237f4 netbsd:percpu_backend_alloc+0x154
[   1.6797713] fp ffffc0006eba7c40 vmem_xalloc() at ffffc000005345b8 netbsd:vmem_xalloc+0x578
[   1.6897752] fp ffffc0006eba7d10 vmem_alloc() at ffffc00000534ac4 netbsd:vmem_alloc+0x84
[   1.6997755] fp ffffc0006eba7d70 percpu_create() at ffffc00000523cc0 netbsd:percpu_create+0x40
[   1.7097788] fp ffffc0006eba7e00 pic_add() at ffffc0000000314c netbsd:pic_add+0xfc
[   1.7097788] fp ffffc0006eba7e40 bcm2836mp_intr_init() at ffffc0000001adb0 netbsd:bcm2836mp_intr_init+0x90
[   1.7297831] fp ffffc0006eba7e90 arm_fdt_cpu_hatch() at ffffc00000068ef8 netbsd:arm_fdt_cpu_hatch+0x24
[   1.7397826] fp ffffc0006eba7eb0 cpu_hatch() at ffffc0000009a0dc netbsd:cpu_hatch+0xbc
[   1.7397826] fp 0000000000000000 cpu_mpstart() at ffffc00000001a88 netbsd:cpu_mpstart+0x19c
[   1.7497887] cpu1: End traceback...
----

Although this failure was reported as kern/55889

----
http://gnats.netbsd.org/55889
----

and was fixed for a while, it started to fail again with this commit:

----
http://www.nerv.org/netbsd/changeset.cgi?id=20210609T232251Z.095141ac330a067f4506a48b81678d3ed2bf99f9#src/sys/dev/dev_verbose.h
----

If dev_verbose.h is reverted to rev 1.5, the kernel boots to multiuser.
Still, I don't think that dev_verbose.h itself is at fault.

The GENERIC kernel for evbarmv7hf{,eb} boots just fine on RPI3, but if this
KASSERT is inserted into bcm2836mp_intr_init():

----
Index: sys/arch/arm/broadcom/bcm2835_intr.c
===================================================================
RCS file: /home/netbsd/src/sys/arch/arm/broadcom/bcm2835_intr.c,v
retrieving revision 1.38
diff -p -u -r1.38 bcm2835_intr.c
--- sys/arch/arm/broadcom/bcm2835_intr.c	8 Mar 2021 14:22:42 -0000	1.38
+++ sys/arch/arm/broadcom/bcm2835_intr.c	21 Jun 2021 09:23:06 -0000
@@ -867,6 +867,8 @@ bcm2836mp_intr_init(void *priv, struct c
 	const cpuid_t cpuid = ci->ci_core_id;
 	struct pic_softc * const pic = &bcm2836mp_pic[cpuid];

+	KASSERT(curlwp->l_stat == LSONPROC);
+
 	KASSERT(cpuid < BCM2836_NCPUS);

 #if defined(MULTIPROCESSOR)
----

it also fires for the 32-bit kernel. So the only reason the 32-bit kernel
works and the 64-bit kernel does not is that percpu_create() *happens* not
to block there. This may also be why old versions of dev_verbose.h
happen to work, I guess.

Therefore, I think that pic_add() should not be called from
arm_fdt_cpu_hatch() via bcm2836mp_intr_init(). But I'm not sure where
it should be called instead...
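
A minimal userland sketch of the reasoning above, not kernel code. It
assumes only what the inserted KASSERT shows: the LWP running the hatch
path is not LSONPROC. Every name here (lwp_model, sleep_is_legal,
percpu_create_model, the M_ prefixed states) is hypothetical, and M_LSIDL
is just a placeholder for "some state other than LSONPROC".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified LWP states. Only the names echo NetBSD; the values are arbitrary. */
enum lwp_stat_model { M_LSIDL, M_LSRUN, M_LSSLEEP, M_LSONPROC };

struct lwp_model {
	enum lwp_stat_model l_stat;
};

/* Models the check in sleepq_enqueue(): only an on-processor LWP may sleep. */
static bool
sleep_is_legal(const struct lwp_model *l)
{
	assert(l != NULL);
	return l->l_stat == M_LSONPROC;
}

/*
 * Models percpu_create(). Returns true if the call completes, false if the
 * diagnostic assertion would fire. The sleep path is taken only when the
 * allocation has to wait for a cross-call.
 */
static bool
percpu_create_model(const struct lwp_model *l, bool must_wait_for_xcall)
{
	if (!must_wait_for_xcall)
		return true;	/* fast path: the sleep queue is never touched */
	return sleep_is_legal(l);
}
```

With an LWP in M_LSIDL, percpu_create_model() succeeds when
must_wait_for_xcall is false and fails when it is true, which is the
"happens not to block" difference between the two kernels.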
>How-To-Repeat:
Boot the GENERIC64 kernel (for either aarch64 or aarch64eb) on RPI3.
>Fix:
N/A

>Release-Note:

>Audit-Trail:
From: "Rin Okuyama" <rin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56264 CVS commit: src/sys/arch/arm/broadcom
Date: Wed, 1 Sep 2021 03:08:08 +0000

 Module Name:	src
 Committed By:	rin
 Date:		Wed Sep  1 03:08:08 UTC 2021

 Modified Files:
 	src/sys/arch/arm/broadcom: bcm2835_intr.c

 Log Message:
 PR port-arm/56264

 Register all PICs when bcmicu1 is attached, in order to avoid calling
 pic_add() from cpu_hatch(), which blocks for the aarch64 kernel on RPI3.
 This prevented the MP kernel from booting due to the KASSERT failure
 described in the PR.

 This is a kind of a workaround; the real fix should be to

 (a) reorganize cpu_hatch() for aarch64 and arm:
 http://mail-index.netbsd.org/port-arm/2021/06/21/msg007320.html

 (b) or change MI abstraction of ``MP ready'':
 http://mail-index.netbsd.org/port-arm/2021/06/22/msg007327.html

 However, still, this fix does not bring about any penalty, and it is
 not good to leave RPI3 broken for months...

 Tested on RPI3 (aarch64 MP, armv7hf MP) as well as RPI1 (armv6hf UP).
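
 The idea of the workaround can be sketched in userland C as below. This is
 not the committed diff; pic_model, pic_add_model(), icu_attach_model(),
 cpu_hatch_model() and NCPUS_MODEL are hypothetical names, and the only
 thing taken from the log is the ordering: the possibly sleeping
 registration moves to attach time, so the hatch path has nothing left
 that can block.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS_MODEL 4	/* hypothetical CPU count for the sketch */

struct pic_model {
	bool registered;	/* set by the (possibly sleeping) registration step */
	bool enabled;		/* set by the per-CPU, non-sleeping step */
};

static struct pic_model pics[NCPUS_MODEL];

/* Stands in for pic_add(), which may sleep in the real kernel. */
static void
pic_add_model(struct pic_model *pic)
{
	pic->registered = true;
}

/* Attach time on the boot CPU: sleeping is fine, so register every PIC here. */
static void
icu_attach_model(void)
{
	for (int i = 0; i < NCPUS_MODEL; i++)
		pic_add_model(&pics[i]);
}

/* Hatch time on a secondary CPU: only touch state that is already registered. */
static bool
cpu_hatch_model(int cpu)
{
	assert(cpu >= 0 && cpu < NCPUS_MODEL);
	if (!pics[cpu].registered)
		return false;	/* would have needed the sleeping path */
	pics[cpu].enabled = true;
	return true;
}
```

 Calling cpu_hatch_model() before icu_attach_model() returns false; after
 it, every CPU index succeeds without going through pic_add_model().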


 To generate a diff of this commit:
 cvs rdiff -u -r1.38 -r1.39 src/sys/arch/arm/broadcom/bcm2835_intr.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->analyzed
State-Changed-By: rin@NetBSD.org
State-Changed-When: Sun, 19 Sep 2021 04:39:53 +0000
State-Changed-Why:
Workaround committed.


Responsible-Changed-From-To: port-arm-maintainer->skrll
Responsible-Changed-By: skrll@NetBSD.org
Responsible-Changed-When: Mon, 18 Oct 2021 07:33:22 +0000
Responsible-Changed-Why:
take


From: Rin Okuyama <rokuyama.rk@gmail.com>
To: Nick Hudson <nick.hudson@gmx.co.uk>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>
Subject: Re: port-arm/56264: diagnostic assertion "l->l_stat == LSONPROC"
 failed on RPI3
Date: Tue, 26 Oct 2021 10:47:24 +0900

 Fix has been suggested on port-arm:

 http://mail-index.netbsd.org/port-arm/2021/10/21/msg007460.html

 Nick, can you please look into it?

 Thanks,
 rin

From: "Nick Hudson" <skrll@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56264 CVS commit: src/sys/arch
Date: Sun, 31 Oct 2021 16:23:48 +0000

 Module Name:	src
 Committed By:	skrll
 Date:		Sun Oct 31 16:23:48 UTC 2021

 Modified Files:
 	src/sys/arch/aarch64/aarch64: aarch64_machdep.c cpu.c cpufunc.c
 	    db_machdep.c locore.S
 	src/sys/arch/aarch64/include: cpu.h cpufunc.h db_machdep.h
 	src/sys/arch/arm/apple: apple_intc.c
 	src/sys/arch/arm/arm: cpu_subr.c undefined.c
 	src/sys/arch/arm/arm32: arm32_boot.c arm32_machdep.c cpu.c
 	src/sys/arch/arm/broadcom: bcm2835_intr.c
 	src/sys/arch/arm/cortex: gicv3_its.c gicv3_its.h gtmr.c
 	src/sys/arch/arm/include: cpu.h locore.h undefined.h
 	src/sys/arch/arm/pic: pic.c
 	src/sys/arch/arm/vfp: vfp_init.c

 Log Message:
 Rework Arm (32bit and 64bit) AP startup so that cpu_hatch doesn't sleep.

 The AP initialisation code in cpu_init_secondary_processor will read and
 initialise the required system registers and state for the BP to attach
 and report.

 Rework the interrupt handler code for this new sequence. Thankfully,
 this removes a bunch of code for bcm2836mp.

 The VFP detection handler on <= armv7 relies on the global undefined
 handler being in place until the BP attaches vfp. That is, after the
 APs have been spun up.

 gicv3_its.c has a serialisation issue which is protected against in
 the gicv3_its_cpu_init, which is called from cpu_hatch, with a spin
 lock. The serialisation issue needs addressing more completely.

 Tested on RPI3, Apple M1, QEMU, and lx2k

 Fixes PR port-arm/56264:
    diagnostic assertion "l->l_stat == LSONPROC" failed on RPI3
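
 A rough userland sketch of the split described in this log, assuming only
 what the message says: the AP records its own state without sleeping, and
 the BP attaches and reports later. The struct, the function names and the
 single id_reg field are hypothetical, and C11 atomics stand in for
 whatever synchronisation the kernel actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_info_model {
	uint64_t id_reg;	/* placeholder for the system registers an AP reads */
	atomic_bool hatched;	/* published by the AP, consumed by the BP */
};

/* AP side: record local state and publish it. No allocation, no sleeping. */
static void
ap_init_model(struct cpu_info_model *ci, uint64_t id_reg)
{
	assert(ci != NULL);
	ci->id_reg = id_reg;
	atomic_store_explicit(&ci->hatched, true, memory_order_release);
}

/* BP side: attach and report once the AP has published its state. */
static bool
bp_attach_model(struct cpu_info_model *ci, uint64_t *id_reg_out)
{
	assert(ci != NULL && id_reg_out != NULL);
	if (!atomic_load_explicit(&ci->hatched, memory_order_acquire))
		return false;	/* AP not up yet; the caller may retry later */
	*id_reg_out = ci->id_reg;
	return true;
}
```

 The release store paired with the acquire load is what lets the BP read
 id_reg safely after it sees hatched set; the value passed to
 ap_init_model() is arbitrary.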


 To generate a diff of this commit:
 cvs rdiff -u -r1.62 -r1.63 src/sys/arch/aarch64/aarch64/aarch64_machdep.c
 cvs rdiff -u -r1.66 -r1.67 src/sys/arch/aarch64/aarch64/cpu.c
 cvs rdiff -u -r1.31 -r1.32 src/sys/arch/aarch64/aarch64/cpufunc.c
 cvs rdiff -u -r1.41 -r1.42 src/sys/arch/aarch64/aarch64/db_machdep.c
 cvs rdiff -u -r1.81 -r1.82 src/sys/arch/aarch64/aarch64/locore.S
 cvs rdiff -u -r1.42 -r1.43 src/sys/arch/aarch64/include/cpu.h
 cvs rdiff -u -r1.21 -r1.22 src/sys/arch/aarch64/include/cpufunc.h
 cvs rdiff -u -r1.14 -r1.15 src/sys/arch/aarch64/include/db_machdep.h
 cvs rdiff -u -r1.3 -r1.4 src/sys/arch/arm/apple/apple_intc.c
 cvs rdiff -u -r1.3 -r1.4 src/sys/arch/arm/arm/cpu_subr.c
 cvs rdiff -u -r1.71 -r1.72 src/sys/arch/arm/arm/undefined.c
 cvs rdiff -u -r1.43 -r1.44 src/sys/arch/arm/arm32/arm32_boot.c
 cvs rdiff -u -r1.140 -r1.141 src/sys/arch/arm/arm32/arm32_machdep.c
 cvs rdiff -u -r1.151 -r1.152 src/sys/arch/arm/arm32/cpu.c
 cvs rdiff -u -r1.41 -r1.42 src/sys/arch/arm/broadcom/bcm2835_intr.c
 cvs rdiff -u -r1.32 -r1.33 src/sys/arch/arm/cortex/gicv3_its.c
 cvs rdiff -u -r1.7 -r1.8 src/sys/arch/arm/cortex/gicv3_its.h
 cvs rdiff -u -r1.45 -r1.46 src/sys/arch/arm/cortex/gtmr.c
 cvs rdiff -u -r1.119 -r1.120 src/sys/arch/arm/include/cpu.h
 cvs rdiff -u -r1.36 -r1.37 src/sys/arch/arm/include/locore.h
 cvs rdiff -u -r1.14 -r1.15 src/sys/arch/arm/include/undefined.h
 cvs rdiff -u -r1.72 -r1.73 src/sys/arch/arm/pic/pic.c
 cvs rdiff -u -r1.75 -r1.76 src/sys/arch/arm/vfp/vfp_init.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: analyzed->feedback
State-Changed-By: skrll@NetBSD.org
State-Changed-When: Sun, 31 Oct 2021 16:34:42 +0000
State-Changed-Why:
fix committed. ok to close?


State-Changed-From-To: feedback->closed
State-Changed-By: rin@NetBSD.org
State-Changed-When: Thu, 04 Nov 2021 07:00:54 +0000
State-Changed-Why:
Nick, thank you so much for your great work!

I've confirmed that the fix works just fine for
earmv7hfeb (RPI3, Cubietruck) and aarch64eb (RPI3, ROCKPro64).


>Unformatted:
