NetBSD Problem Report #56932

From www@netbsd.org  Mon Jul 18 04:52:07 2022
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 939F91A9239
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 18 Jul 2022 04:52:07 +0000 (UTC)
Message-Id: <20220718045205.EB0DD1A923C@mollari.NetBSD.org>
Date: Mon, 18 Jul 2022 04:52:05 +0000 (UTC)
From: isaki@pastel-flower.jp
Reply-To: isaki@pastel-flower.jp
To: gnats-bugs@NetBSD.org
Subject: x68k frequently hangs up after uvm change in 9.99.75
X-Send-Pr-Version: www-1.0

>Number:         56932
>Category:       port-m68k
>Synopsis:       x68k frequently hangs up after uvm change in 9.99.75
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    chs
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Jul 18 04:55:00 +0000 2022
>Closed-Date:    Sun Jul 31 17:17:37 +0000 2022
>Last-Modified:  Mon Aug 08 23:01:45 +0000 2022
>Originator:     Tetsuya Isaki
>Release:        NetBSD-current (9.99.75)
>Organization:
>Environment:
NetBSD 9.99.75 x68k
>Description:
x68k frequently hangs up after uvm change in 9.99.75.
According to bisect (between 9.99.1 and 9.99.98), the following commit
seems trigger.

> Module Name:  src
> Committed By: chs
> Date:     Wed Nov  4 01:30:19 UTC 2020
>
> Modified Files:change, that is, one previous commit in git doesn't hang up.
>   src/sys/kern: init_main.c
>   src/sys/uvm: uvm_aobj.c uvm_init.c uvm_pdaemon.c
>
> Log Message:
> In uvmpd_tryownerlock(), if the initial try-lock of the owner lock fails
> then rather than do more try-locks and eventually sleep for a tick,
> take a hold on the current owner's lock, drop the page interlock,
> and acquire the lock that we took the hold on in a blocking fashion.
> After we get the lock, check if the lock that we acquired is still
> the lock for the owner of the page that we're interested in.
> If the owner hasn't changed then can proceed with this page,
> otherwise we will skip this page and move on to a different page.
> This dramatically reduces the amount of time that the pagedaemon
> sleeps trying to get locks, since even 1 tick is an eternity to sleep
> in this context and it was easy to trigger that case in practice,
> and with this new method the pagedaemon only very rarely actually blocks
> to acquire the lock that it wants since the object locks are adaptive,
> and when the pagedaemon does block then the amount of time it spends
> sleeping will be generally be much less than 1 tick.
>
>
> To generate a diff of this commit:
> cvs rdiff -u -r1.531 -r1.532 src/sys/kern/init_main.c
> cvs rdiff -u -r1.151 -r1.152 src/sys/uvm/uvm_aobj.c
> cvs rdiff -u -r1.54 -r1.55 src/sys/uvm/uvm_init.c
> cvs rdiff -u -r1.130 -r1.131 src/sys/uvm/uvm_pdaemon.c

The hung-up happens randomly but highly reproduible.  Three times runs
are sufficient to reproduce it.  In other words, I cannot complete a
sequence of power-on, login, and shutdown successfully 3 times in a row.
The kernel built without this change, that is, one previous commit in git
doesn't hang up.

When this hung-up occurs, sometimes echo back works but sometimes it
does not.  NMI (entering DDB) works.

Here is a dmesg and output of swapctl -a.  I got it on the kernel
without this uvm change (one previous commit in git), because it's hard
to work on that kernel...

Dmesg:
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
    2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
    2018, 2019, 2020 The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 9.99.75 (GENERIC) #15: Mon Jul 18 00:25:57 JST 2022
    isaki@XXX:XXX/obj/iij/20201024/x68k/obj/sys/arch/x68k/compile/GENERIC
X68030 (m68030 CPU/MMU, m68882 FPU, 30MHz clock)
total memory = 12288 KB
avail memory = 7744 KB
entropy: no seed from bootloader
timecounter: Timecounters tick every 10.000 msec
Kernelized RAIDframe activated
mainbus0 (root)
intio0 at mainbus0 mapped at 0x38f000
ne0 at intio0 addr 0xece300 intr 0xf9: Nereid Ethernet
ne0: NE2000 (RTL8019) Ethernet
ne0: Ethernet address XX:XX:XX:XX:XX:XX
ne0: 10base2, 10baseT, 10baseT-FDX, auto, default [0x00 0x10] auto
mfp0 at intio0 addr 0xe88000 intr 0x40
clock0 at mfp0: MFP timer C
kbd0 at mfp0
powsw0 at mfp0: Front Switch
rtc0 at intio0 addr 0xe8a000: RP5C15
dmac0 at intio0 addr 0xe84000: HD63450 DMAC
dmac0: 4 channels available.
zsc0 at intio0 addr 0xe98000 intr 0x70
zstty0 at zsc0 channel 0
ms0 at zsc0 channel 1
neptune0 at intio0 addr 0xece000 intr 0xf9: no device found.
neptune1 at intio0 addr 0xece400 intr 0xf9: no device found.
opm0 at intio0 addr 0xe90000
vs0 at intio0 addr 0xe92000 using DMA ch3 intr 0x6a and 0x6b
dmac0: allocating ch 3 for vs.
vs0: MSM6258V ADPCM voice synthesizer
audio0 at vs0: playback, capture, half duplex
audio0: slinear_be:16 -> adpcm:4 1ch 15625Hz, blk 625 bytes (80ms) for playback
audio0: slinear_be:16 <- adpcm:4 1ch 15625Hz, blk 625 bytes (80ms) for recording
spkr0 at audio0: PC Speaker (synthesized)
fdc0 at intio0 addr 0xe94000 intr 0x60 using DMA ch0 intr 0x64 and 0x65
dmac0: allocating ch 0 for fdc.
fdc0: uPD72065 FDC
fd0 at fdc0 drive 0: 1.2MB/[1024bytes/sector], 77 cyl, 2 head, 8 sec
fd1 at fdc0 drive 1: 1.2MB/[1024bytes/sector], 77 cyl, 2 head, 8 sec
par0 at intio0 addr 0xe8c000 intr 0x63: parallel port (write only, interrupt)
scsirom0 at intio0 addr 0xfc0000: On-board at 0xe96020
spc0 at scsirom0
scsibus0 at spc0: 8 targets, 8 luns per target
sram0 at intio0 addr 0xed0000: 16k bytes accessible
bmd0 at intio0 addr 0xece3f0: Nereid Bank Memory Disk
bmd0: 16 MB, 0xee0000(64KB) x 256 pages
grfbus0 at mainbus0
grf0 at grfbus0: 768 x 512 16 colors builtin display
ite0 at grf0: rows 32 cols 96
grf1 at grfbus0: 768 x 512 16 colors graphic display
ite at grf1 not configured
enabling interrupts
entropy: WARNING: extracting entropy too early
timecounter: Timecounter "mfp" frequency 20000 Hz quality 100
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
scsibus0: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 0 lun 0: <CFSxSI, NAGATO, 0101> disk fixed
sd0: 955 MB, 1942 cyl, 16 head, 63 sec, 512 bytes/sect x 1957536 sectors
sd0: async, 8-bit transfers
aes: BearSSL aes_ct
aes_ccm: self-test passed
chacha: Portable C ChaCha
blake2s: self-test passed
bell0: YM2151 OPM bell emulation.
boot device: sd0
root on sd0a dumps on sd0b
root file system type: ffs
kern.module.path=/stand/x68k/9.99.75/modules
entropy: ready

# swapctl -a
Device      1K-blocks     Used    Avail Capacity  Priority
/dev/bmd0c      16384     2908    13476    18%    0
/dev/sd0b      132016        0   132016     0%    2
Total          148400     2908   145492     2%
>How-To-Repeat:
As described above, it happens randomly but I used the following
procedure: power-on, login as root, "shutdown -r now", and
boot again...
>Fix:
N/A

>Release-Note:

>Audit-Trail:
From: Chuck Silvers <chuq@chuq.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56932: x68k frequently hangs up after uvm change in 9.99.75
Date: Mon, 18 Jul 2022 03:30:38 -0700

 this system has very very little RAM, so I'll guess that the hang
 is because the system can't allocate memory.  in ddb, please do
 "show uvmexp" and look at the number of free pages, eg.:

 db{0}> show uvmexp
 Current UVM status:
   pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12, ncolors=8
   4042465 VM pages: 25244 active, 0 inactive, 547 wired, 3967290 free


 is the pagedaemon thread running after you notice the hang?
 from ddb you can "show uvmexp", then "c", wait a few seconds,
 then enter ddb again and "show uvmexp" again, and check if these
 pagedaemon counters changing between the two samples:

   daemon and swap counts:
     woke=0, revs=0, scans=0, obscans=0, anscans=0


 if these counters are not changing then the pagedaemon is stuck too,
 most likely waiting for the lock that I changed in the commit where
 this started happening.  please show a ddb stack trace from the
 pagedaemon thread to confirm this.

 if that is what is happening, then it sounds like some thread is
 sleeping forever with a page owner lock held, which could lead to
 the kind of hang that you are seeing.  in this case, please collect
 ddb stack traces from every other thread in the system, and hopefully
 it will be clear from that which thread is blocking the pagedaemon.

 -Chuck

From: Chuck Silvers <chuq@chuq.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56932: x68k frequently hangs up after uvm change in 9.99.75
Date: Thu, 21 Jul 2022 02:21:53 -0700

 On Tue, Jul 19, 2022 at 11:54:37PM +0900, Tetsuya Isaki wrote:
 > At Mon, 18 Jul 2022 10:35:01 +0000 (UTC),
 > Chuck Silvers wrote:
 > >  this system has very very little RAM, so I'll guess that the hang
 > >  is because the system can't allocate memory.  in ddb, please do
 > >  "show uvmexp" and look at the number of free pages, eg.:
 > :
 > >  is the pagedaemon thread running after you notice the hang?
 > 
 > 5 free pages, and pagedaemon seems stuck.
 ...
 > 2nd "show uvmexp" differs only about interrupts, all others are the same.
 > -  faults=26552, traps=30585, intrs=77868, ctxswitch=29271
 > -   softint=13661, syscalls=115885
 > +  faults=26552, traps=30585, intrs=79782, ctxswitch=29854
 > +   softint=13913, syscalls=115885

 ok, the pagedaemon is definitely stuck since the pagedaemon counters aren't changing.


 > >  please show a ddb stack trace from the
 > >  pagedaemon thread to confirm this.
 > 
 > Is this correct stack trace of pagedaemon?
 > |db> ps
 > |PID   LID S CPU   FLAGS    STRUCT LWP *          NAME WAIT
 > |:
 > |0      54 3   0     200          8ab440      pgdaemon tstile
 > |:
 > |db> trace/a 8ab440
 > |trace: pid 0 lid 54 at 0x39abdc8
 > |mi_switch(8ab440,7f7fbc,2d2700,8ab440,15c7be) + 2aa
 > |db>

 the pagedaemon thread having a WAIT string of "tstile" is also
 consistent with it being stuck waiting for a lock.

 mi_switch() being the innermost stack frame is probably correct,
 but the trace is incomplete... there should be other functions
 in the trace after that, with uvm_pageout() being at the end.
 it will be very hard to figure out what's going on if ddb can't
 report complete stack traces.

 try getting a ddb stack trace for a few other threads and see if
 the same problem shows up with every thread.  (it probably will.)

 do you know how to figure out what's wrong with the m68k ddb
 stack trace code, or will you need some help with that?

 -Chuck

From: Chuck Silvers <chuq@chuq.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56932: x68k frequently hangs up after uvm change in 9.99.75
Date: Sun, 24 Jul 2022 19:34:28 -0700

 On Sat, Jul 23, 2022 at 03:10:53PM +0900, Tetsuya Isaki wrote:
 > >  do you know how to figure out what's wrong with the m68k ddb
 > >  stack trace code, or will you need some help with that?
 > 
 > I don't know well about uvm and ddb area.
 > I'd appriciate your help.

 I checked in a fix for that, please try again to collect stack traces with ddb.

 -Chuck

From: Chuck Silvers <chuq@chuq.com>
To: Tetsuya Isaki <isaki@pastel-flower.jp>, gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56932: x68k frequently hangs up after uvm change in 9.99.75
Date: Tue, 26 Jul 2022 13:17:00 -0700

 --LcQpLW3uv3hruG60
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 On Tue, Jul 26, 2022 at 05:22:43PM +0900, Tetsuya Isaki wrote:
 > db> trace/a 84a440
 > mi_switch(84a440,7c6fbc,7e,15e836,0) + 2aa
 > sleepq_block(0,0,29a864,0,ffffffe4) + 1cc
 > turnstile_block(0,1,797e9c,29a864) + 2ce
 > rw_enter(797e9c,1,774700,797ea4,0) + e2
 > uvmpd_trylockowner(7746d0,2954c0,84a500,3979f54,87f) + 6e
 > uvmpdpol_selectvictim(3979f54,84a51c,1fcb4,40,84a440) + 1c4
 > uvm_pageout(84a440) + 16e
 > lwp_trampoline() + e

 the pagedaemon is getting stuck in the way that I expected it was.


 > db> trace/a 8bba40
 > trace: pid 726 lid 726 at 0x39ddbe4
 > mi_switch(8bba40,7c6fbc,29c300,204,8bba40) + 2aa
 > sleepq_block(0,0,29cc78,295440) + 1cc
 > uvm_wait(2564a5,797fec,41be000,1,0) + 80
 > pmap_enter_ptpage(83ce30,41be000,1) + 39a
 > pmap_enter(83ce30,41be000,6e8000,5,20) + cc
 > uvm_fault_internal(8eb910,41c1000,1,0,ffffffff) + baa
 > trap(39ddf60,8,4015162,41c1574) + 666
 > faultstkadj() + 4

 this thread is the problem.  it's calling uvm_wait() to wait
 for the pagedaemon to free memory while holding a page-owner lock
 (either a vm_anon or uvm_object lock), and this is no doubt
 the same lock that the pagedaemon is stuck on.

 the attached patch should fix this, please give it a try.

 -Chuck

 --LcQpLW3uv3hruG60
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename="diff.m68k-pmap-ptpage.1"

 Index: src/sys/arch/m68k/m68k/pmap_motorola.c
 ===================================================================
 RCS file: /home/chs/netbsd/cvs/src/sys/arch/m68k/m68k/pmap_motorola.c,v
 retrieving revision 1.76
 diff -u -p -r1.76 pmap_motorola.c
 --- src/sys/arch/m68k/m68k/pmap_motorola.c	16 Apr 2022 18:15:21 -0000	1.76
 +++ src/sys/arch/m68k/m68k/pmap_motorola.c	26 Jul 2022 18:31:40 -0000
 @@ -2585,6 +2585,10 @@ pmap_enter_ptpage(pmap_t pmap, vaddr_t v
  					   va - vm_map_min(kernel_map),
  					   NULL, UVM_PGA_ZERO)) == NULL) {
  			rw_exit(uvm_kernel_object->vmobjlock);
 +			if (can_fail) {
 +				pmap->pm_sref--;
 +				return ENOMEM;
 +			}
  			uvm_wait("ptpage");
  			rw_enter(uvm_kernel_object->vmobjlock, RW_WRITER);
  		}

 --LcQpLW3uv3hruG60--

From: "Chuck Silvers" <chs@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56932 CVS commit: src/sys/arch/m68k/m68k
Date: Sun, 31 Jul 2022 17:11:41 +0000

 Module Name:	src
 Committed By:	chs
 Date:		Sun Jul 31 17:11:41 UTC 2022

 Modified Files:
 	src/sys/arch/m68k/m68k: pmap_motorola.c

 Log Message:
 in pmap_enter_ptpage(), if we are allowed to fail then fail rather than
 waiting for memory to be available.  when we are mapping an anon or uobj page
 then we will be holding the lock for that page owner, and sleeping to wait
 for memory with a page owner lock held is illegal because the pagedaemon
 can wait for that lock, which will lead to deadlock.  fixes PR 56932.


 To generate a diff of this commit:
 cvs rdiff -u -r1.76 -r1.77 src/sys/arch/m68k/m68k/pmap_motorola.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: kern-bug-people->chs
Responsible-Changed-By: chs@NetBSD.org
Responsible-Changed-When: Sun, 31 Jul 2022 17:17:37 +0000
Responsible-Changed-Why:
I fixed this.


State-Changed-From-To: open->closed
State-Changed-By: chs@NetBSD.org
State-Changed-When: Sun, 31 Jul 2022 17:17:37 +0000
State-Changed-Why:
confirmed fixed by the reporter.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.