NetBSD Problem Report #57198

From www@netbsd.org  Wed Jan 25 03:08:21 2023
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 084FB1A9239
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 25 Jan 2023 03:08:21 +0000 (UTC)
Message-Id: <20230125030819.5DD531A923A@mollari.NetBSD.org>
Date: Wed, 25 Jan 2023 03:08:19 +0000 (UTC)
From: fission@ldx.ca
Reply-To: fission@ldx.ca
To: gnats-bugs@NetBSD.org
Subject: after heavy i/o, system stalls with many processes stuck in tstile state
X-Send-Pr-Version: www-1.0

>Number:         57198
>Category:       kern
>Synopsis:       after heavy i/o, system stalls with many processes stuck in tstile state
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jan 25 03:10:01 +0000 2023
>Last-Modified:  Wed Jan 25 06:40:01 +0000 2023
>Originator:     Tyler Mitchell
>Release:        NetBSD 10.99.2 (kernel) / 9.3 (userland)
>Organization:
>Environment:
NetBSD aldea 10.99.2 NetBSD 10.99.2 (GENERIC) #0: Mon Jan 23 13:40:04 UTC 2023  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/sparc64/compile/GENERIC sparc64
>Description:
I am running NetBSD/sparc64 as a VM using QEMU 7.2.0. Originally I installed NetBSD 9.3, but replaced the kernel with a daily HEAD build from 23-jan-2023.

The host system is macOS Ventura 13.1 (22C65) on a MacBook Pro M1 (arm64). QEMU was installed via homebrew package.

The system is mostly stable, however, after periods of heavy I/O, the system will become unresponsive (eg launching a new command will just hang forever). I have been able to reliably trigger this condition by compiling a pkgsrc package while running /etc/daily.

I left top(1) running and noticed some processes were in the "tstile" state. Here is a ddb ps and backtrace of one of those processes:

db{0}> bt
sparc_interrupt(40250320, 404150c0, 40415530, 40250320, 0, 0) at netbsd:sparc_interrupt+0x294
db{0}> ps/w
PID   LID          COMMAND     EMUL  PRI WAIT-MSG    WAIT-CHANNEL
1904 >1904             curl   netbsd   43              0
1641  1641      utmp_update   netbsd   43 tstile       104141678
424    424               sh   netbsd   43 ttyout       105e7bc38
423    423             find   netbsd   33 biowait      106fad7d0
1824  1824               sh   netbsd   40 wait         104e58758
479    479               sh   netbsd   43 wait         104e591d8
1116  1116               sh   netbsd   42 wait         103f33558
463    463             tmux   netbsd   43 kqueue       10427ea60
457    457            ninja   netbsd   42 biowait      106fad6b0
305    305             make   netbsd   38 wait         10430e418
493    493               sh   netbsd   39 wait         103f331d8
195    195             make   netbsd   39 wait         10430fc98
172    172              top   netbsd   43 select       1033da840
1331  1331               sh   netbsd   43 wait         104e58ad8
1341  1341             tmux   netbsd   43 wait         103f32758
1259  1259            login   netbsd   39 wait         103f323d8
1504  1504          chronyd   netbsd   43 select       1033da840
1173  1173             cron   netbsd   43 nanoslp      1046d2ec0
1394  1394             qmgr   netbsd   41 kqueue       10427f2a0
1148  1148           pickup   netbsd   43 kqueue       1047241a0
1322  1322           master   netbsd   43 kqueue       1047250a0
868    868             sshd   netbsd   43 select       1033da840
408    408          syslogd   netbsd   43 kqueue       104259e20
1        1             init   netbsd   38 wait         103f32058
0      475           system   netbsd   96 nfsiod       104340808
0      511           system   netbsd   96 nfsiod       104340748
0      474           system   netbsd   96 nfsiod       104340688
0      470           system   netbsd   96 nfsiod       1043405c8
0       89           system   netbsd  123 physiod      103f31108
0      103           system   netbsd  125 pooldrain    1cbc938
0      102           system   netbsd  124 biowait      1042e05d0
0      101           system   netbsd  126 pgdaemon     1cba580
0       98           system   netbsd   43 swwreboot    103f30c08
0       59           system   netbsd   96 npfgcw       103f48688
0       58           system   netbsd  222 rt_free      1033e1ac8
0       57           system   netbsd   96 unpgc        1ccafc0
0       56           system   netbsd  222 icmp6_wqinput 1033e1988
0       55           system   netbsd  222 nd6_timer    1033e1848
0       54           system   netbsd  222 carp6_wqinput 1033e1708
0       53           system   netbsd  222 carp_wqinput 1033e15c8
0       31           system   netbsd  222 icmp_wqinput 1033e1488
0       30           system   netbsd  222 rt_timer     1033e1348
0       29           system   netbsd  125 vmem_rehash  1033e1208
0       20           system   netbsd   43 entropy      1c71f68
0       19           system   netbsd   96 atath        103efc9c0
0       18           system   netbsd   96 atath        103efc3d0
0       17           system   netbsd   96 pmsreset     103ee3b94
0       16           system   netbsd   96 smtaskq      1c9ce48
0       15           system   netbsd   43 pmfsuspend   1033e0808
0       14           system   netbsd   43 pmfevent     1033e06c8
0       13           system   netbsd   96 sopendfr     1ccaf48
0       12           system   netbsd  222 ifwdog       1033e0588
0       11           system   netbsd  222 iflnkst      1033e0448
0       10           system   netbsd   43 nfssilly     1033e0308
0        9           system   netbsd  125 vdrain       1ccc0e0
0        8           system   netbsd  125 mod_unld     1cbf158
0        7           system   netbsd  127 xcall        2014010
0        6           system   netbsd  223              0
0        5           system   netbsd  220              0
0        4           system   netbsd  221              0
0        3           system   netbsd  222              0
0        2           system   netbsd    0              0
0        0           system   netbsd  125 uvm          1c5ff40
db{0}> x/Lx 104141678
104141678:      103f41ac5
db{0}> bt/a 103f41ac0
trace: pid 0 lid 102 at 0x22451ee51
sleepq_block(0, 0, 1c5ef88, 0, 1cbec00, 103f41ac0) at netbsd:sleepq_block+0xa4
cv_wait(1042e05d0, 1033ca640, 1b13c00, 1b10400, 0, 103f41ac0) at netbsd:cv_wait+0x3c
biowait(1042e04e0, 1042e0600, 1b13c00, 1b13c00, fffffffffffffffe, 1042e05d0) at netbsd:biowait+0x34
wapbl_buffered_flush(104141600, 1, 0, 1042e05a8, 0, 1042e04e0) at netbsd:wapbl_buffered_flush+0x168
wapbl_write_commit(104141600, 37ea00, 1cc000, 1c61c00, 0, 1042ea8c0) at netbsd:wapbl_write_commit+0xc0
wapbl_flush(104141600, 0, 104141678, 104141670, 1, 37ea00) at netbsd:wapbl_flush+0x710
ffs_sync(103f2d000, 2, 1033e8040, 0, 0, 0) at netbsd:ffs_sync+0x174
VFS_SYNC(103f2d000, 3, 1033e8040, 1c7c000, 1, 103f2d000) at netbsd:VFS_SYNC+0x24
sched_sync(1ccc018, e0048000, 1c61eec, 1033e41b0, 1ccc020, 0) at netbsd:sched_sync+0x120
lwp_trampoline(ffd0d92c, 1140d0, 1173b8, 1, ffe91c98, 0) at netbsd:lwp_trampoline+0x8
db{0}>

Here is the dmesg of the affected machine:

db{0}> dmesg
[   1.0000000] Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
[   1.0000000]     2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013,
[   1.0000000]     2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023
[   1.0000000]     The NetBSD Foundation, Inc.  All rights reserved.
[   1.0000000] Copyright (c) 1982, 1986, 1989, 1991, 1993
[   1.0000000]     The Regents of the University of California.  All rights reserved.

[   1.0000000] NetBSD 10.99.2 (GENERIC) #0: Mon Jan 23 13:40:04 UTC 2023
[   1.0000000]  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/sparc64/compile/GENERIC
[   1.0000000] total memory = 3840 MB
[   1.0000000] avail memory = 3750 MB
[   1.0000000] timecounter: Timecounters tick every 10.000 msec
[   1.0000000] Kernelized RAIDframe activated
[   1.0000000] mainbus0 (root): OpenBiosTeam,OpenBIOS: hostid 80800400
[   1.0000000] cpu0 at mainbus0: SUNW,UltraSPARC-IIi @ 100 MHz, CPU id 0
[   1.0000000] cpu0: manuf 17, impl 12, mask 91
[   1.0000000] cpu0: 256K instruction (64 b/l), 16K data (32 b/l), 256K external (64 b/l)
[   1.0000000] psycho0 at mainbus0
[   1.0000000] psycho0: SUNW,sabre: impl 0, version 0: ign 7c0 bus range 0 to 2; PCI bus 0
[   1.0000000] pci0 at psycho0
[   1.0000000] pci0: i/o space, memory space enabled
[   1.0000000] ppb0 at pci0 dev 1 function 1: Sun Microsystems Simba PCI Bridge (rev. 0x11)
[   1.0000000] pci1 at ppb0 bus 1
[   1.0000000] pci1: i/o space, memory space enabled
[   1.0000000] ebus0 at pci1 dev 1 function 0: Sun Microsystems PCIO Ebus2, revision 0x01
[   1.0000000] clock0 at ebus0 addr 2000-3fff: mk48t59
[   1.0000000] power at ebus0 addr 7240-7243 ipl 1 not configured
[   1.0000000] fdthree at ebus0 addr 0-ffffffff not configured
[   1.0000000] com0 at ebus0 addr 3f8-3ff ipl 2b: ns16550a, 16-byte FIFO
[   1.0000000] com0: console
[   1.0000000] pckbc0 at ebus0 addr 60-67 ipl 29autoconfiguration error: : no intr 1: selftest ok
[   1.0000000] pckbd0 at pckbc0 (kbd slot)
[   1.0000000] wskbd1 at pckbd0 mux 1
[   1.0000000] pms0 at pckbc0 (aux slot)
[   1.0000000] wsmouse0 at pms0 mux 0
[   1.0000000] hme0 at pci1 dev 1 function 1: Sun Happy Meal Ethernet, rev. 0
[   1.0000000] hme0: interrupting at ivec 3021
[   1.0000000] hme0: Ethernet address 52:54:00:80:04:00
[   1.0000000] nsphy0 at hme0 phy 1: DP83840 10/100 media interface, rev. 1
[   1.0000000] nsphy0: 100baseTX-FDX, auto
[   1.0000000] genfb0 at pci1 dev 2 function 0: vendor 1234 product 1111 (rev. 0x02)
[   1.0000000] cmdide0 at pci1 dev 3 function 0: CMD Technology PCI0646 (rev. 0x07)
[   1.0000000] cmdide0: bus-master DMA support present
[   1.0000000] cmdide0: primary channel configured to native-PCI mode, channel non-independent
[   1.0000000] cmdide0: using ivec 1820 for native-PCI interrupt
[   1.0000000] atabus0 at cmdide0 channel 0
[   1.0000000] cmdide0: secondary channel configured to native-PCI mode, channel non-independent
[   1.0000000] atabus1 at cmdide0 channel 1
[   1.0000000] ppb1 at pci0 dev 1 function 0: Sun Microsystems Simba PCI Bridge (rev. 0x11)
[   1.0000000] pci2 at ppb1 bus 2
[   1.0000000] pci2: i/o space, memory space enabled
[   1.0000000] pcons at mainbus0 not configured
[   1.0000000] WARNING: system needs entropy for security; see entropy(7)
[   1.0000000] timecounter: Timecounter "tick-counter" frequency 100000000 Hz quality 100
[   1.0000030] No counter-timer -- using %tick at 100MHz as system clock.
[   1.0000030] timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
[   4.0733830] wd0 at atabus0 drive 0
[   4.0835200] wd0: <QEMU HARDDISK>
[   4.0835200] wd0: drive supports 16-sector PIO transfers, LBA48 addressing
[   4.0835200] wd0: 4096 MB, 8322 cyl, 16 head, 63 sec, 512 bytes/sect x 8388608 sectors
[   4.1029250] wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
[   4.1143940] wd0(cmdide0:0:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA)
[   4.1143940] wd1 at atabus1 drive 0
[   4.1143940] wd1: <QEMU HARDDISK>
[   4.1143940] wd1: drive supports 16-sector PIO transfers, LBA48 addressing
[   4.1235730] wd1: 8192 MB, 16644 cyl, 16 head, 63 sec, 512 bytes/sect x 16777216 sectors
[   4.1235730] wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
[   4.1235730] wd1(cmdide0:1:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA)
[   4.1325440] swwdog0: software watchdog initialized
[   4.1414680] WARNING: 1 error while detecting hardware; check system log.
[   4.1414680] root on wd0a dumps on wd0b
[   4.1518890] /: replaying log to memory
[   4.1518890] root file system type: ffs
[   4.1518890] kern.module.path=/stand/sparc64/10.99.2/modules
[   5.2935370] /: replaying log to disk
[  16.2327440] nfs server 192.168.93.81:/a/pkgsrc: not responding
[  22.1324590] nfs server 192.168.93.81:/a/pkgsrc: is alive again
db{0}>

This is the command I use to launch the guest:

qemu-system-sparc64 -m 3840 \
        -drive file=aldea.img,format=raw \
        -drive file=aldea2.qcow2,bus=1 \
        -net nic,netdev=n0,macaddr=52:54:00:80:04:00 -netdev vde,id=n0,sock=vde0 \
        -nographic
>How-To-Repeat:
- Install NetBSD/sparc64 in QEMU.
- Generate lots of disk I/O.
- Notice that the system becomes unresponsive to new I/O requests, and does not recover from this condition.
>Fix:

>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57198: after heavy i/o, system stalls with many processes stuck in tstile state
Date: Wed, 25 Jan 2023 06:36:39 -0000 (UTC)

 fission@ldx.ca writes:

 >biowait(1042e04e0, 1042e0600, 1b13c00, 1b13c00, fffffffffffffffe, 1042e05d0) at netbsd:biowait+0x34
 >wapbl_buffered_flush(104141600, 1, 0, 1042e05a8, 0, 1042e04e0) at netbsd:wapbl_buffered_flush+0x168
 >wapbl_write_commit(104141600, 37ea00, 1cc000, 1c61c00, 0, 1042ea8c0) at netbsd:wapbl_write_commit+0xc0
 >wapbl_flush(104141600, 0, 104141678, 104141670, 1, 37ea00) at netbsd:wapbl_flush+0x710
 >ffs_sync(103f2d000, 2, 1033e8040, 0, 0, 0) at netbsd:ffs_sync+0x174
 >VFS_SYNC(103f2d000, 3, 1033e8040, 1c7c000, 1, 103f2d000) at netbsd:VFS_SYNC+0x24
 >sched_sync(1ccc018, e0048000, 1c61eec, 1033e41b0, 1ccc020, 0) at netbsd:sched_sync+0x120
 >lwp_trampoline(ffd0d92c, 1140d0, 1173b8, 1, ffe91c98, 0) at netbsd:lwp_trampoline+0x8


 This means your disk device isn't finishing an I/O request.

 All the tstile'd processes are just victims of this.



 >[   1.0000000] cmdide0 at pci1 dev 3 function 0: CMD Technology PCI0646 (rev. 0x07)

 I'd suspect a bug in the cmdide driver, or the emulation of that hardware.

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2023 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.