NetBSD Problem Report #55639

From gson@gson.org  Thu Sep  3 06:51:15 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 72ABC1A9217
	for <gnats-bugs@gnats.NetBSD.org>; Thu,  3 Sep 2020 06:51:15 +0000 (UTC)
Message-Id: <20200903065110.7E129253F75@guava.gson.org>
Date: Thu,  3 Sep 2020 09:51:10 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Assertion "anon != NULL && anon->an_ref != 0" fails on evbarm-earmv7hf
X-Send-Pr-Version: 3.95

>Number:         55639
>Category:       port-evbarm
>Synopsis:       Assertion "anon != NULL && anon->an_ref != 0" fails on evbarm-earmv7hf
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    port-evbarm-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Sep 03 06:55:00 +0000 2020
>Closed-Date:    Fri Nov 27 09:00:40 +0000 2020
>Last-Modified:  Fri Nov 27 09:00:40 +0000 2020
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2020.08.14.09.06.15
>Organization:
>Environment:
System: NetBSD
Architecture: arm
Machine: evbarm
>Description:

On the TNF evbarm-earmv7hf testbed, most test runs have failed to
complete since the middle of August:

  http://releng.netbsd.org/b5reports/evbarm-earmv7hf/commits-2020.08.html#2020.08.14.09.06.15

The system panics with the message:

  panic: kernel diagnostic assertion "uvm_pagelookup(uobj, offset) == NULL || ((a->ar_flags & UVM_PAGE_ARRAY_FILL_DIRTY) != 0 && !uvm_obj_page_dirty_p(pg))" failed: file "/tmp/bracket/build/2020.08.14.09.06.15-evbarm-earmv7hf/src/sys/uvm/uvm_vnode.c", line 321

The first failure happened after this commit:

  2020.08.14.09.06.14 chs src/sys/miscfs/genfs/genfs_io.c 1.100
  2020.08.14.09.06.15 chs src/sys/uvm/uvm_extern.h 1.231
  2020.08.14.09.06.15 chs src/sys/uvm/uvm_object.c 1.24
  2020.08.14.09.06.15 chs src/sys/uvm/uvm_object.h 1.39
  2020.08.14.09.06.15 chs src/sys/uvm/uvm_page.c 1.245
  2020.08.14.09.06.15 chs src/sys/uvm/uvm_page_status.c 1.6
  2020.08.14.09.06.15 chs src/sys/uvm/uvm_pager.c 1.129
  2020.08.14.09.06.15 chs src/sys/uvm/uvm_vnode.c 1.116

Since there have been some random successful runs, it is not 100%
certain that this commit is the cause, but given that both the commit
and the panic are uvm related, it does seem likely.

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-evbarm/55639: Assertion "anon != NULL && anon->an_ref != 0"
 fails on evbarm-earmv7hf
Date: Thu, 3 Sep 2020 09:00:46 +0200

 On Thu, Sep 03, 2020 at 06:55:00AM +0000, Andreas Gustafsson wrote:
 > Since there have been some random successful runs, it is not 100%
 > certain that this commit is the cause, but given that both the commit
 > and the panic are uvm related, it does seem likely.

 Several of my ARM boards suffer some kind of "UVM amap corruption" and
 the cubietruck (evbarmv7) is the one that shows it most prominently.
 Symptoms (with an unpatched kernel) vary, from a silent hang to assertions
 like this.

 With a debug patch from Chuck they all show amap corruption (but the patch
 slows things down and the races are less often won/lost).

 I marked these machines/test results in the releng status page for -10.
 Interesting that it now also shows up in emulation.

 Martin

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@netbsd.org
Cc: Martin Husemann <martin@duskware.de>, chs@NetBSD.org
Subject: Re: port-evbarm/55639: Assertion "anon != NULL && anon->an_ref != 0" fails on evbarm-earmv7hf
Date: Thu, 3 Sep 2020 16:46:52 +0300

 Earlier, I wrote:
 > The system panics with the message:
 >
 >   panic: kernel diagnostic assertion "uvm_pagelookup(uobj, offset) == NULL || ((a->ar_flags & UVM_PAGE_ARRAY_FILL_DIRTY) != 0 && !uvm_obj_page_dirty_p(pg))" failed: file "/tmp/bracket/build/2020.08.14.09.06.15-evbarm-earmv7hf/src/sys/uvm/uvm_vnode.c", line 321

 To clarify, the system panicked with that message when testing sources from
 Aug 14 (2020.08.14.09.06.15, 2020.08.14.14.42.44, 2020.08.14.16.53.06).

 Testing sources from Aug 16 (2020.08.16.10.31.40) onwards, it's
 panicking with a different message, the one shown in the PR synopsis:

   panic: kernel diagnostic assertion "anon != NULL && anon->an_ref != 0" failed: file "/tmp/bracket/build/2020.08.16.10.31.40-evbarm-earmv7hf/src/sys/uvm/uvm_amap.c", line 747

 I will run some further tests to determine exactly when the panic
 message changed, but they will take several days to run.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Chuck Silvers <chuq@chuq.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-evbarm/55639: Assertion "anon != NULL && anon->an_ref != 0"
 fails on evbarm-earmv7hf
Date: Thu, 3 Sep 2020 21:10:09 -0700

 On Thu, Sep 03, 2020 at 06:55:00AM +0000, Andreas Gustafsson wrote:
 > >Synopsis:       Assertion "anon != NULL && anon->an_ref != 0" fails on evbarm-earmv7hf
 ...
 >   panic: kernel diagnostic assertion "uvm_pagelookup(uobj, offset) == NULL || ((a->ar_flags & UVM_PAGE_ARRAY_FILL_DIRTY) != 0 && !uvm_obj_page_dirty_p(pg))" failed: file "/tmp/bracket/build/2020.08.14.09.06.15-evbarm-earmv7hf/src/sys/uvm/uvm_vnode.c", line 321

 you're talking about two different assertions here.
 the one about "uvm_pagelookup ..." was fixed by rev 1.117 of uvm_vnode.c.
 the one about "anon != NULL ..." is completely different.

 I can reproduce the latter amap corruption problem, but only on certain
 arm boards.  a jetson tk1 does not hit it, but a cubietruck hits it quite easily.
 it's good to know that the emulated system in qemu can also hit it.
 it looks like the qemu configuration used by anita is trying to have
 two CPUs, but the second one isn't actually there:

 [   1.0000000] cpu1 at cpus0: disabled (unresponsive)

 that's helpful in that it tells us the bug is not an MP race.

 the nature of the amap corruption that I've seen on cubietruck is
 a bit-flip in one of the entries in the amap's am_slots[] array,
 which causes different symptoms depending on exactly what is in the amap.

 I wrote some debug code to fully validate an amap immediately after
 locking it and immediately before unlocking it, and this problem is
 detected by the check immediately after locking the amap,
 ie. the bit is being flipped while the amap is not locked,
 so it's very unlikely that the code that operates on amaps
 is causing the corruption.
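
The check-on-lock, check-on-unlock idea described above can be sketched
in userland C. This is only an illustration, not the debug patch itself
(which is not shown in this PR): the struct, the invariants, and every
toy_* name below are assumptions made for the example, loosely modelled
on an amap that keeps a slots[] array with matching back pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NSLOT 8

/* Toy stand-in for an amap; NOT the kernel's struct vm_amap. */
struct toy_amap {
	bool locked;
	int nused;              /* number of active slots */
	int slots[TOY_NSLOT];   /* slots[i] = index of the i-th active slot */
	int bckptr[TOY_NSLOT];  /* bckptr[s] = position of slot s in slots[] */
	void *anon[TOY_NSLOT];  /* anon[s] = payload of slot s, NULL if unused */
};

/* Check every invariant the toy structure is supposed to maintain. */
static bool
toy_amap_valid(const struct toy_amap *am)
{
	if (am->nused < 0 || am->nused > TOY_NSLOT)
		return false;
	for (int i = 0; i < am->nused; i++) {
		int s = am->slots[i];

		if (s < 0 || s >= TOY_NSLOT)
			return false;   /* slot index out of range */
		if (am->bckptr[s] != i)
			return false;   /* back pointer disagrees */
		if (am->anon[s] == NULL)
			return false;   /* active slot without payload */
	}
	return true;
}

/* Lock, then validate: a failure here means damage happened while unlocked. */
static bool
toy_amap_lock(struct toy_amap *am)
{
	assert(!am->locked);
	am->locked = true;
	return toy_amap_valid(am);
}

/* Validate, then unlock: a failure here points at the code holding the lock. */
static bool
toy_amap_unlock(struct toy_amap *am)
{
	assert(am->locked);
	bool ok = toy_amap_valid(am);
	am->locked = false;
	return ok;
}

/* Add one active slot; only legal while the lock is held. */
static void
toy_amap_add(struct toy_amap *am, int slot, void *anon)
{
	assert(am->locked);
	am->slots[am->nused] = slot;
	am->bckptr[slot] = am->nused;
	am->anon[slot] = anon;
	am->nused++;
}
```

If the unlock-time check always passes and the next lock-time check
fails, the structure was modified by something that did not hold the
lock, which is the reasoning applied to the cubietruck corruption above.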

 I wrote some more debug code to make the mappings of all of the
 amap arrays read-only while the amap is not locked, but then
 I don't hit the problem.

 today I tried running the atf tests on cubietruck again with
 the uvm/radixtree commit that you reference reverted, and I still hit
 the same assertion in amap_wipeout() that the anita harness did.
 so it appears that this is an old bug, which is perhaps made
 more likely to trigger an assertion by recent changes.

 -Chuck

From: Rin Okuyama <rokuyama.rk@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>
Cc: Nick Hudson <nick.hudson@gmx.co.uk>
Subject: Re: port-evbarm/55639 (Assertion "anon != NULL && anon->an_ref != 0"
 fails on evbarm-earmv7hf)
Date: Sun, 22 Nov 2020 10:42:17 +0900

 -------- Forwarded Message --------
 Subject: CVS commit: src/sys/arch/arm/arm32
 Date: Sat, 21 Nov 2020 19:44:52 +0000
 From: Nick Hudson <skrll@netbsd.org>
 Reply-To: source-changes-d@NetBSD.org
 To: source-changes-full@NetBSD.org

 Module Name:	src
 Committed By:	skrll
 Date:		Sat Nov 21 19:44:52 UTC 2020

 Modified Files:
 	src/sys/arch/arm/arm32: cpuswitch.S

 Log Message:
 Ensure that r5 contains curlwp before DO_AST_AND_RESTORE_ALIGNMENT_FAULTS
 in lwp_trampoline as required by the move to make ASTs operate per-LWP
 rather than per-CPU.

 Thanks to martin@ for bisecting the amap corruption he was seeing and
 testing this fix.


 To generate a diff of this commit:
 cvs rdiff -u -r1.103 -r1.104 src/sys/arch/arm/arm32/cpuswitch.S

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.


State-Changed-From-To: open->feedback
State-Changed-By: skrll@NetBSD.org
State-Changed-When: Sun, 22 Nov 2020 07:31:40 +0000
State-Changed-Why:
Hopefully I fixed my bug that caused this.
OK to close?


From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-evbarm/55639 (Assertion "anon != NULL && anon->an_ref != 0"
 fails on evbarm-earmv7hf)
Date: Sun, 22 Nov 2020 20:07:15 +0900

 FYI, the full ATF run completed successfully on real hardware for me:
 Cubietruck, Raspberry Pi 2b Rev2, and Zero W.

 I look forward to seeing the results of the next official test run :).

 Thanks,
 rin

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@netbsd.org
Cc: Rin Okuyama <rokuyama.rk@gmail.com>, Nick Hudson <skrll@netbsd.org>
Subject: Re: port-evbarm/55639 (Assertion "anon != NULL && anon->an_ref != 0" fails on evbarm-earmv7hf)
Date: Fri, 27 Nov 2020 10:53:29 +0200

 Rin Okuyama wrote:
 >  I look forward to seeing the results of next official test run :).

 There have now been five runs since the commit of
 src/sys/arch/arm/arm32/cpuswitch.S 1.104:

   http://releng.netbsd.org/b5reports/evbarm-earmv7hf/commits-2020.11.html#2020.11.21.19.44.52

 Four of them ran to completion, one hung, and none panicked as reported
 in this PR, so it looks like this bug is fixed (though others remain).
 -- 
 Andreas Gustafsson, gson@gson.org

State-Changed-From-To: feedback->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Fri, 27 Nov 2020 09:00:40 +0000
State-Changed-Why:
No such panic in the last five runs.


>Unformatted:
