NetBSD Problem Report #53809
From martin@duskware.de Tue Dec 25 11:00:25 2018
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 9C9707A16A
for <gnats-bugs@gnats.NetBSD.org>; Tue, 25 Dec 2018 11:00:25 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: kernel locks up
X-Send-Pr-Version: 3.95
>Number: 53809
>Category: port-alpha
>Synopsis: kernel locks up
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: thorpej
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Dec 25 11:05:00 +0000 2018
>Closed-Date: Sun Jul 25 21:53:30 +0000 2021
>Last-Modified: Sun Jul 25 21:53:30 +0000 2021
>Originator: Martin Husemann
>Release: NetBSD 8.99.29
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD gemini.duskware.de 8.99.29 NetBSD 8.99.29 (GENERIC-$Revision: 1.386 $) #114: Tue Dec 25 10:34:46 CET 2018 martin@seven-days-to-the-wolves.aprisoft.de:/work/src/sys/arch/alpha/compile/GENERIC.MP alpha
Architecture: alpha
Machine: alpha
>Description:
Sometime in the last three weeks my test machine started locking up randomly
and very frequently. Often I can just boot it, let it sit at the login prompt
for a few minutes, and it hangs. I cannot get into ddb, so useful information
is hard to extract.
I have no idea what change triggered it. Booting a netbsd-8 kernel always
works and does not lock up, so I am pretty sure it is not my hardware
failing.
Full dmesg of this machine at https://netbsd.org/~martin/alpha-atf/dmesg.txt
>How-To-Repeat:
s/a
>Fix:
n/a
>Release-Note:
>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-alpha/53809: kernel locks up
Date: Wed, 2 Jan 2019 07:10:33 +0100
With a DEBUG kernel I get:
[ 2334.1883937] panic: pmap_emulate_reference: !write but not FOR|FOE
[ 2334.2606609] cpu1: Begin traceback...
[ 2334.3046063] alpha trace requires known PC =eject=
[ 2334.3612473] cpu1: End traceback...
Martin
From: Jason Thorpe <thorpej@me.com>
To: "gnats-bugs@netbsd.org" <gnats-bugs@NetBSD.org>
Cc: port-alpha-maintainer@netbsd.org,
gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org,
"martin@netbsd.org" <martin@NetBSD.org>
Subject: Re: port-alpha/53809: kernel locks up
Date: Tue, 1 Jan 2019 23:34:10 -0800
> On Jan 1, 2019, at 10:15 PM, Martin Husemann <martin@duskware.de> wrote:
>
> The following reply was made to PR port-alpha/53809; it has been noted by GNATS.
>
> From: Martin Husemann <martin@duskware.de>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: port-alpha/53809: kernel locks up
> Date: Wed, 2 Jan 2019 07:10:33 +0100
>
> With a DEBUG kernel I get:
>
> [ 2334.1883937] panic: pmap_emulate_reference: !write but not FOR|FOE
What this panic indicates is that pmap_emulate_reference() was called
with either ALPHA_MMCSR_FOR ("fault on read") or ALPHA_MMCSR_FOE ("fault
on execute"), but that the PTE for the faulting address does not have
the FOR or FOE bits set. This is, of course, an inconsistency... but
looking more closely, I think that this particular DEBUG check is racy
on an MP system and thus probably tripping unnecessarily. Consider:
Process A (cpu0)                    Process B (cpu1)
Exec libc page with printf (FOE)
Performs FOE DEBUG check            Exec libc page with printf (FOE)
pmap_changebit()'s FOE to "off"     Performs FOE DEBUG check
                                    BOOM
If the pmap_changebit() call happens to clear the FOE bit in process B's
PTE before cpu1 performs the DEBUG check, then it will fire needlessly.
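The interleaving above can be replayed deterministically in user space. The following is only a minimal sketch, not the kernel code: the `sk_` names, the bit values, and the two-entry page structure are hypothetical stand-ins, and it assumes (as the explanation above implies) that pmap_changebit() clears the bit in every PTE mapping the page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values for this sketch; not the real alpha PG_* constants. */
#define SK_PG_FOR 0x1u
#define SK_PG_FOE 0x2u

#define SK_NMAPPINGS 2 /* pte[0]: process A's mapping, pte[1]: process B's */

/* One physical page (the libc text page) and every PTE that maps it. */
struct sk_page {
    uint32_t pte[SK_NMAPPINGS];
};

/* The DEBUG test from the thread: a read/execute fault expects FOR or FOE. */
static bool
sk_debug_check_passes(uint32_t pte)
{
    return (pte & (SK_PG_FOR | SK_PG_FOE)) != 0;
}

/* Stand-in for pmap_changebit(): clear a bit in all mappings of the page. */
static void
sk_changebit_clear(struct sk_page *pg, uint32_t bit)
{
    for (size_t i = 0; i < SK_NMAPPINGS; i++)
        pg->pte[i] &= ~bit;
}

/*
 * Replay both CPUs taking an FOE fault on the same page.
 * a_clears_first selects whether cpu0 clears FOE before or after
 * cpu1 runs its DEBUG check.
 * Returns true if cpu1's DEBUG check would have panicked.
 */
bool
sk_b_check_fires(bool a_clears_first)
{
    struct sk_page pg = { { SK_PG_FOE, SK_PG_FOE } };
    bool b_ok;

    /* cpu0's own check passes: FOE is still set at this point. */
    if (!sk_debug_check_passes(pg.pte[0]))
        return false;

    if (a_clears_first) {
        sk_changebit_clear(&pg, SK_PG_FOE);      /* cpu0 */
        b_ok = sk_debug_check_passes(pg.pte[1]); /* cpu1: too late */
    } else {
        b_ok = sk_debug_check_passes(pg.pte[1]); /* cpu1: still set */
        sk_changebit_clear(&pg, SK_PG_FOE);      /* cpu0 */
    }
    return !b_ok;
}
```

In this model, `sk_b_check_fires(true)` reports that cpu1's check trips even though nothing is actually wrong with the page tables, while `sk_b_check_fires(false)` does not; the outcome depends only on ordering.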
Anyway, I think the DEBUG panic you're seeing is a red herring, and not
related to the real problem -- without that DEBUG check, process B on
cpu1 would simply do some redundant work under the correct locking
conditions. It's only the DEBUG check that's wrong. I'm not sure it's
possible to actually make the DEBUG check really MP-safe; once you've
taken the fault-on-whatever on cpu1, you're doomed if you do the check.
That DEBUG block was last touched:
1.22  (thorpej 26-Mar-98): #ifdef DEBUG /* These checks are more expensive */
1.22  (thorpej 26-Mar-98):     if (!pmap_pte_v(pte))
1.22  (thorpej 26-Mar-98):         panic("pmap_emulate_reference: invalid pte");
1.203 (chs     24-Aug-03):     if (type == ALPHA_MMCSR_FOW) {
1.22  (thorpej 26-Mar-98):         if (!(*pte & (user ? PG_UWE : PG_UWE | PG_KWE)))
1.22  (thorpej 26-Mar-98):             panic("pmap_emulate_reference: write but unwritable");
1.22  (thorpej 26-Mar-98):         if (!(*pte & PG_FOW))
1.22  (thorpej 26-Mar-98):             panic("pmap_emulate_reference: write but not FOW");
1.22  (thorpej 26-Mar-98):     } else {
1.22  (thorpej 26-Mar-98):         if (!(*pte & (user ? PG_URE : PG_URE | PG_KRE)))
1.22  (thorpej 26-Mar-98):             panic("pmap_emulate_reference: !write but unreadable");
1.22  (thorpej 26-Mar-98):         if (!(*pte & (PG_FOR | PG_FOE)))
1.22  (thorpej 26-Mar-98):             panic("pmap_emulate_reference: !write but not FOR|FOE");
1.22  (thorpej 26-Mar-98):     }
1.22  (thorpej 26-Mar-98):     /* Other diagnostics? */
1.22  (thorpej 26-Mar-98): #endif
----------------------------
revision 1.22
date: 1998-03-26 02:18:03 +0000; author: thorpej; state: Exp; lines: +2784 -2684;
Remove the Mach 3 pmap from the tree, replacing it with the contents of
pmap.old.<whatever>. To see the history, look at the corresponding
pmap.old.<whatever> file.
----------------------------
(Chuq's change in rev 1.203 doesn't affect the logic of the DEBUG
check...)
...which definitely predates adding multiprocessor support to the Alpha
pmap, so I'm not surprised that it's buggy and no one noticed before now
because how many people run DEBUG kernels really?
Unfortunately, I don't think this helps narrow down the real problem
you're seeing :-(
-- thorpej
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-alpha/53809: kernel locks up
Date: Wed, 2 Jan 2019 10:29:30 +0100
Another DIAGNOSTIC hit:
[ 2741.2335401] Using reserved ASN! (line 2135)
[ 2741.2335401] panic: PMAP_ACTIVATE_ASN_SANITY
[ 2741.2335401] cpu0: Begin traceback...
[ 2741.2335401] alpha trace requires known PC =eject=
[ 2741.2335401] cpu0: End traceback...
[ 2741.2335401] cpu1: shutting down...
Guess I should stop trying DEBUG :-/
Martin
From: Jason Thorpe <thorpej@me.com>
To: "gnats-bugs@netbsd.org" <gnats-bugs@NetBSD.org>
Cc: port-alpha-maintainer@netbsd.org,
gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org,
"martin@netbsd.org" <martin@NetBSD.org>
Subject: Re: port-alpha/53809: kernel locks up
Date: Wed, 2 Jan 2019 09:31:49 -0800
> On Jan 2, 2019, at 1:30 AM, Martin Husemann <martin@duskware.de> wrote:
>
> The following reply was made to PR port-alpha/53809; it has been noted by GNATS.
>
> From: Martin Husemann <martin@duskware.de>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: port-alpha/53809: kernel locks up
> Date: Wed, 2 Jan 2019 10:29:30 +0100
>
> Another DIAGNOSTIC hit:
>
> [ 2741.2335401] Using reserved ASN! (line 2135)
> [ 2741.2335401] panic: PMAP_ACTIVATE_ASN_SANITY
> [ 2741.2335401] cpu0: Begin traceback...
> [ 2741.2335401] alpha trace requires known PC =eject=
> [ 2741.2335401] cpu0: End traceback...
> [ 2741.2335401] cpu1: shutting down...
Hm... A call to pmap_asn_alloc() immediately precedes the call to
PMAP_ACTIVATE() on line 2135, and one should never come out of there
with PMAP_ASN_RESERVED, unless the pmap is referencing the
kernel_lev1map. The sanity check fails because the pmap is NOT
referencing kernel_lev1map, but the pmap's ASN info for the current CPU
has PMAP_ASN_RESERVED assigned.
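The invariant described above can be written down as a small predicate. This is a sketch under stated assumptions, not the kernel's PMAP_ACTIVATE_ASN_SANITY code: the `sk_` structure, the two placeholder level-1 maps, and the numeric value of the reserved ASN are all hypothetical and exist only to illustrate the condition.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_NCPU 2
#define SK_ASN_RESERVED 0u /* hypothetical value for this sketch */

/* Placeholder objects; only their distinct addresses matter here. */
const char sk_kernel_lev1map[1] = { 0 };
const char sk_user_lev1map[1] = { 0 };

/* Simplified pmap: which level-1 map it uses, plus one ASN per CPU. */
struct sk_pmap {
    const void *lev1map;
    unsigned int asn[SK_NCPU];
};

/*
 * Returns false exactly in the situation described in the thread:
 * the pmap is NOT referencing the kernel level-1 map, yet its ASN
 * for the given CPU is the reserved one.
 */
bool
sk_activate_asn_sane(const struct sk_pmap *pm, size_t cpu)
{
    bool uses_kernel_map = (pm->lev1map == sk_kernel_lev1map);
    bool reserved = (pm->asn[cpu] == SK_ASN_RESERVED);

    return uses_kernel_map || !reserved;
}
```

A pmap built with `sk_user_lev1map` and `SK_ASN_RESERVED` in its slot for cpu0 fails the predicate for cpu0, which corresponds to the "Using reserved ASN!" report; the same ASN on a pmap using `sk_kernel_lev1map` passes.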
Any process that's running user code should never be referencing
kernel_lev1map, so I'm having a hard time seeing how two threads could
be racing here (the pmap is not locked in pmap_activate()). Though,
maybe there's some case I'm not thinking of. I'll build you a GENERIC
kernel with a change to lock the pmap across pmap_activate(), and also
to disable the problematic test in pmap_emulate_reference().
Stay tuned
>
> Guess I should stop trying DEBUG :-/
>
> Martin
>
-- thorpej
Responsible-Changed-From-To: port-alpha-maintainer->thorpej
Responsible-Changed-By: thorpej@NetBSD.org
Responsible-Changed-When: Sat, 29 Aug 2020 22:31:43 +0000
Responsible-Changed-Why:
Take.
State-Changed-From-To: open->feedback
State-Changed-By: thorpej@NetBSD.org
State-Changed-When: Sat, 29 Aug 2020 22:42:32 +0000
State-Changed-Why:
Please update to -current and ensure you have at least:
$NetBSD: pmap.c,v 1.269 2020/08/29 20:06:59 thorpej Exp $
...and try to reproduce the failure.
State-Changed-From-To: feedback->closed
State-Changed-By: thorpej@NetBSD.org
State-Changed-When: Sun, 25 Jul 2021 21:53:30 +0000
State-Changed-Why:
Feedback timeout. NetBSD is very stable in NetBSD 9.99.87 on a DS25,
in Qemu, and on an AXPpci33.
>Unformatted: