NetBSD Problem Report #53809

From martin@duskware.de  Tue Dec 25 11:00:25 2018
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 9C9707A16A
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 25 Dec 2018 11:00:25 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: kernel locks up
X-Send-Pr-Version: 3.95

>Number:         53809
>Category:       port-alpha
>Synopsis:       kernel locks up
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    thorpej
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Dec 25 11:05:00 +0000 2018
>Closed-Date:    Sun Jul 25 21:53:30 +0000 2021
>Last-Modified:  Sun Jul 25 21:53:30 +0000 2021
>Originator:     Martin Husemann
>Release:        NetBSD 8.99.29
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD gemini.duskware.de 8.99.29 NetBSD 8.99.29 (GENERIC-$Revision: 1.386 $) #114: Tue Dec 25 10:34:46 CET 2018 martin@seven-days-to-the-wolves.aprisoft.de:/work/src/sys/arch/alpha/compile/GENERIC.MP alpha
Architecture: alpha
Machine: alpha
>Description:

Sometime in the last three weeks my test machine started locking up randomly
and very frequently. Often I can just boot it, let it sit at the login prompt
for a few minutes, and it hangs. I cannot get into ddb, so useful information
is hard to extract.

I have no idea what change triggered it. Booting a netbsd-8 kernel always
works and does not lock up, so I am pretty sure it is not my hardware
failing.

Full dmesg of this machine at https://netbsd.org/~martin/alpha-atf/dmesg.txt

>How-To-Repeat:

See above.

>Fix:

n/a

>Release-Note:

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-alpha/53809: kernel locks up
Date: Wed, 2 Jan 2019 07:10:33 +0100

 With a DEBUG kernel I get:

 [ 2334.1883937] panic: pmap_emulate_reference: !write but not FOR|FOE
 [ 2334.2606609] cpu1: Begin traceback...
 [ 2334.3046063] alpha trace requires known PC =eject=
 [ 2334.3612473] cpu1: End traceback...


 Martin

From: Jason Thorpe <thorpej@me.com>
To: "gnats-bugs@netbsd.org" <gnats-bugs@NetBSD.org>
Cc: port-alpha-maintainer@netbsd.org,
 gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org,
 "martin@netbsd.org" <martin@NetBSD.org>
Subject: Re: port-alpha/53809: kernel locks up
Date: Tue, 1 Jan 2019 23:34:10 -0800

 > On Jan 1, 2019, at 10:15 PM, Martin Husemann <martin@duskware.de> wrote:
 >
 > The following reply was made to PR port-alpha/53809; it has been noted by GNATS.
 >
 > From: Martin Husemann <martin@duskware.de>
 > To: gnats-bugs@NetBSD.org
 > Cc:
 > Subject: Re: port-alpha/53809: kernel locks up
 > Date: Wed, 2 Jan 2019 07:10:33 +0100
 >
 > With a DEBUG kernel I get:
 >
 > [ 2334.1883937] panic: pmap_emulate_reference: !write but not FOR|FOE

 What this panic indicates is that pmap_emulate_reference() was called
 with either ALPHA_MMCSR_FOR ("fault on read") or ALPHA_MMCSR_FOE
 ("fault on execute"), but that the PTE for the faulting address does
 not have the FOR or FOE bits set.  This is, of course, an
 inconsistency... but looking more closely, I think that this
 particular DEBUG check is racy on an MP system and thus probably
 tripping unnecessarily.  Consider:

 Process A (cpu0)                        Process B (cpu1)
 Exec libc page with printf (FOE)
 Performs FOE DEBUG check                Exec libc page with printf (FOE)
 pmap_changebit()'s FOE to "off"         Performs FOE DEBUG check
                                         BOOM

 If the pmap_changebit() call happens to clear the FOE bit in process
 B's PTE before cpu1 performs the DEBUG check, then it will fire
 needlessly.
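
The interleaving in the timeline above can be replayed in a fixed order
as a small user-space model.  This is only a sketch, not the kernel
code: every `sk_` name, the bit values, and the two-entry PTE array are
assumptions made for illustration, and sk_clear_fault_bits() merely
stands in for what pmap_changebit() is described as doing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PTE fault-on-read / fault-on-execute bits. */
#define SK_PG_FOR 0x2u
#define SK_PG_FOE 0x8u

#define SK_NMAPPINGS 2   /* one PTE per process mapping the same libc page */

/* A physical page and the PTEs of every process that maps it. */
struct sk_page {
    uint32_t pte[SK_NMAPPINGS];
};

/* The DEBUG-style test: true means "would panic: !write but not FOR|FOE". */
bool sk_debug_check_fires(const struct sk_page *pg, int who)
{
    assert(who >= 0 && who < SK_NMAPPINGS);
    return (pg->pte[who] & (SK_PG_FOR | SK_PG_FOE)) == 0;
}

/* Stand-in for pmap_changebit(): clears the bits in every mapping. */
void sk_clear_fault_bits(struct sk_page *pg)
{
    for (int i = 0; i < SK_NMAPPINGS; i++)
        pg->pte[i] &= ~(SK_PG_FOR | SK_PG_FOE);
}

/*
 * Replays the timeline in a fixed order.  Both CPUs are assumed to have
 * already taken the FOE fault before the first step runs.
 * Returns true when only cpu1's check fires and both PTEs end up clear.
 */
bool sk_replay_race(void)
{
    struct sk_page pg = { { SK_PG_FOE, SK_PG_FOE } };

    bool cpu0_fires = sk_debug_check_fires(&pg, 0); /* bit still set */
    sk_clear_fault_bits(&pg);                       /* cpu0 clears A and B */
    bool cpu1_fires = sk_debug_check_fires(&pg, 1); /* bit already gone */
    sk_clear_fault_bits(&pg);                       /* cpu1: redundant, harmless */

    return !cpu0_fires && cpu1_fires && pg.pte[0] == 0 && pg.pte[1] == 0;
}
```

In this model the second clear changes nothing, so the only observable
difference between the two CPUs is the check itself.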

 Anyway, I think the DEBUG panic you're seeing is a red herring, and not
 related to the real problem -- without that DEBUG check, process B on
 cpu1 would simply do some redundant work under the correct locking
 conditions.  It's only the DEBUG check that's wrong.  I'm not sure it's
 possible to actually make the DEBUG check really MP-safe; once you've
 taken the fault-on-whatever on cpu1, you're doomed if you do the check.
 That DEBUG block was last touched:

 1.22         (thorpej  26-Mar-98): #ifdef DEBUG                         /* These checks are more expensive */
 1.22         (thorpej  26-Mar-98):      if (!pmap_pte_v(pte))
 1.22         (thorpej  26-Mar-98):              panic("pmap_emulate_reference: invalid pte");
 1.203        (chs      24-Aug-03):      if (type == ALPHA_MMCSR_FOW) {
 1.22         (thorpej  26-Mar-98):              if (!(*pte & (user ? PG_UWE : PG_UWE | PG_KWE)))
 1.22         (thorpej  26-Mar-98):                      panic("pmap_emulate_reference: write but unwritable");
 1.22         (thorpej  26-Mar-98):              if (!(*pte & PG_FOW))
 1.22         (thorpej  26-Mar-98):                      panic("pmap_emulate_reference: write but not FOW");
 1.22         (thorpej  26-Mar-98):      } else {
 1.22         (thorpej  26-Mar-98):              if (!(*pte & (user ? PG_URE : PG_URE | PG_KRE)))
 1.22         (thorpej  26-Mar-98):                      panic("pmap_emulate_reference: !write but unreadable");
 1.22         (thorpej  26-Mar-98):              if (!(*pte & (PG_FOR | PG_FOE)))
 1.22         (thorpej  26-Mar-98):                      panic("pmap_emulate_reference: !write but not FOR|FOE");
 1.22         (thorpej  26-Mar-98):      }
 1.22         (thorpej  26-Mar-98):      /* Other diagnostics? */
 1.22         (thorpej  26-Mar-98): #endif

 ----------------------------
 revision 1.22
 date: 1998-03-26 02:18:03 +0000;  author: thorpej;  state: Exp;  lines: +2784 -2684;
 Remove the Mach 3 pmap from the tree, replacing it with the contents of
 pmap.old.<whatever>.  To see the history, look at the corresponding
 pmap.old.<whatever> file.
 ----------------------------

 (Chuq's change in rev 1.203 doesn't affect the logic of the DEBUG
 check...)

 ...which definitely predates adding multiprocessor support to the Alpha
 pmap, so I'm not surprised that it's buggy and no one noticed before now
 because how many people run DEBUG kernels really?

 Unfortunately, I don't think this helps narrow down the real problem
 you're seeing :-(

 -- thorpej

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-alpha/53809: kernel locks up
Date: Wed, 2 Jan 2019 10:29:30 +0100

 Another DIAGNOSTIC hit:

 [ 2741.2335401] Using reserved ASN! (line 2135)
 [ 2741.2335401] panic: PMAP_ACTIVATE_ASN_SANITY
 [ 2741.2335401] cpu0: Begin traceback...
 [ 2741.2335401] alpha trace requires known PC =eject=
 [ 2741.2335401] cpu0: End traceback...
 [ 2741.2335401] cpu1: shutting down...

 Guess I should stop trying DEBUG :-/

 Martin

From: Jason Thorpe <thorpej@me.com>
To: "gnats-bugs@netbsd.org" <gnats-bugs@NetBSD.org>
Cc: port-alpha-maintainer@netbsd.org,
 gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org,
 "martin@netbsd.org" <martin@NetBSD.org>
Subject: Re: port-alpha/53809: kernel locks up
Date: Wed, 2 Jan 2019 09:31:49 -0800

 > On Jan 2, 2019, at 1:30 AM, Martin Husemann <martin@duskware.de> wrote:
 >
 > The following reply was made to PR port-alpha/53809; it has been noted by GNATS.
 >
 > From: Martin Husemann <martin@duskware.de>
 > To: gnats-bugs@NetBSD.org
 > Cc:
 > Subject: Re: port-alpha/53809: kernel locks up
 > Date: Wed, 2 Jan 2019 10:29:30 +0100
 >
 > Another DIAGNOSTIC hit:
 >
 > [ 2741.2335401] Using reserved ASN! (line 2135)
 > [ 2741.2335401] panic: PMAP_ACTIVATE_ASN_SANITY
 > [ 2741.2335401] cpu0: Begin traceback...
 > [ 2741.2335401] alpha trace requires known PC =eject=
 > [ 2741.2335401] cpu0: End traceback...
 > [ 2741.2335401] cpu1: shutting down...

 Hm... A call to pmap_asn_alloc() immediately precedes the call to
 PMAP_ACTIVATE() on line 2135, and one should never come out of there
 with PMAP_ASN_RESERVED, unless the pmap is referencing the
 kernel_lev1map.  The sanity check fails because the pmap is NOT
 referencing kernel_lev1map, but the pmap's ASN info for the current CPU
 has PMAP_ASN_RESERVED assigned.
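
The invariant described in the paragraph above can be written down as a
small predicate.  This is a hedged sketch rather than the real
PMAP_ACTIVATE_ASN_SANITY() macro: the `sk_` names, the struct layout,
the CPU count, and the value chosen for the reserved ASN are all
assumptions for illustration.  Only the direction stated above is
checked (a pmap not using kernel_lev1map must not hold the reserved
ASN); the sketch deliberately checks nothing for the kernel_lev1map
case.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_ASN_RESERVED 0u   /* hypothetical value standing in for PMAP_ASN_RESERVED */
#define SK_NCPU 2            /* hypothetical CPU count for the model */

/* Minimal model of the per-pmap state the sanity check looks at. */
struct sk_pmap {
    bool uses_kernel_lev1map;   /* true if the pmap references kernel_lev1map */
    unsigned asn[SK_NCPU];      /* ASN assigned to this pmap on each CPU */
};

/*
 * Returns false in the situation the panic reports: the pmap is NOT
 * referencing kernel_lev1map, yet its ASN for this CPU is the reserved one.
 */
bool sk_asn_sane(const struct sk_pmap *pm, unsigned cpu)
{
    assert(cpu < SK_NCPU);
    if (pm->uses_kernel_lev1map)
        return true;   /* not examined in this sketch */
    return pm->asn[cpu] != SK_ASN_RESERVED;
}
```

Because the state is kept per CPU, the same pmap can pass the check on
one CPU and fail it on another.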

 Any process that's running user code should never be referencing
 kernel_lev1map, so I'm having a hard time seeing how two threads could
 be racing here (the pmap is not locked in pmap_activate()).  Though,
 maybe there's some case I'm not thinking of.  I'll build you a GENERIC
 kernel with a change to lock the pmap across pmap_activate(), and also
 to disable the problematic test in pmap_emulate_reference().

 Stay tuned

 >
 > Guess I should stop trying DEBUG :-/
 >
 > Martin
 >

 -- thorpej

Responsible-Changed-From-To: port-alpha-maintainer->thorpej
Responsible-Changed-By: thorpej@NetBSD.org
Responsible-Changed-When: Sat, 29 Aug 2020 22:31:43 +0000
Responsible-Changed-Why:
Take.


State-Changed-From-To: open->feedback
State-Changed-By: thorpej@NetBSD.org
State-Changed-When: Sat, 29 Aug 2020 22:42:32 +0000
State-Changed-Why:
Please update to -current and ensure you have at least:
$NetBSD: pmap.c,v 1.269 2020/08/29 20:06:59 thorpej Exp $

...and try to reproduce the failure.


State-Changed-From-To: feedback->closed
State-Changed-By: thorpej@NetBSD.org
State-Changed-When: Sun, 25 Jul 2021 21:53:30 +0000
State-Changed-Why:
Feedback timeout.  NetBSD 9.99.87 is very stable on a DS25, in Qemu,
and on an AXPpci33.


>Unformatted:
