NetBSD Problem Report #55272

From martin@duskware.de  Mon May 18 15:53:49 2020
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 374D81A9217
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 18 May 2020 15:53:49 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: userland watchdog process may be outstalled
X-Send-Pr-Version: 3.95

>Number:         55272
>Category:       kern
>Synopsis:       userland watchdog process may be outstalled
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon May 18 15:55:00 +0000 2020
>Last-Modified:  Fri Feb 19 18:40:01 +0000 2021
>Originator:     Martin Husemann
>Release:        NetBSD 9.99.63
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD unpluged.duskware.de 9.99.63 NetBSD 9.99.63 (UNPLUGED) #322: Mon May 18 14:20:23 CEST 2020 martin@seven-days-to-the-wolves.aprisoft.de:/work/src/sys/arch/evbarm/compile/UNPLUGED evbarm
Architecture: earm
Machine: evbarm
>Description:
For the past few weeks, a userland watchdog configured with something like:

	wdogctl=YES wdogctl_flags="-u mvsoctmr0"

in /etc/rc.conf can be starved of CPU time by busy tests, such as the
libarchive test program, until the timer expires. The timer has a 21-second
period.
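
To put the 21-second period in perspective, the following is a rough sketch
of how long the tickling process can be kept off the CPU before the timer
fires. It assumes the userland tickler re-arms the timer every period/2
seconds (integer division, at least one second); that interval is an
assumption and is not stated in this report. The function names are
hypothetical.

```sh
#!/bin/sh
# Sketch: starvation margin of a userland watchdog tickler.
# ASSUMPTION: the tickler sleeps period/2 seconds between tickles
# (integer division, minimum 1 second). Not confirmed by this PR.

tickle_interval() {
	# $1 = watchdog period in seconds
	interval=$(( $1 / 2 ))
	[ "$interval" -ge 1 ] || interval=1
	echo "$interval"
}

starvation_margin() {
	# Extra delay, beyond the planned sleep, that one tickle can
	# tolerate before the full period has elapsed.
	echo $(( $1 - $(tickle_interval "$1") ))
}

# 21 s is the mvsoctmr0 period from this report; 60 s is the swwdog
# period seen in the macppc panic in the audit trail.
for period in 21 60; do
	echo "period=${period}s interval=$(tickle_interval "$period")s margin=$(starvation_margin "$period")s"
done
```

Under that assumption, the tickling process has to miss its wakeup by roughly
11 seconds (21-second timer) or 30 seconds (60-second timer) before the
watchdog expires.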

>How-To-Repeat:
See above, just run atf tests (cd /usr/tests && atf-run)
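
To observe the starvation without risking a watchdog reboot, a small probe
that behaves like the tickler (sleep, wake up, repeat) could be run alongside
the tests with the watchdog disabled. The following is only a sketch: the
function names and the thresholds are hypothetical, and it relies on
"date +%s", which gives one-second resolution.

```sh
#!/bin/sh
# Sketch of a stall probe: sleep a fixed interval in a loop and report
# every wakeup that arrives later than planned. Names are hypothetical.

now() {
	# Wall-clock seconds; coarse, but enough for multi-second stalls.
	date +%s
}

stall_probe() {
	# $1 = sleep interval (s), $2 = lateness threshold (s), $3 = rounds
	interval=$1 threshold=$2 rounds=$3
	while [ "$rounds" -gt 0 ]; do
		before=$(now)
		sleep "$interval"
		# Seconds by which this wakeup overshot the planned sleep.
		late=$(( $(now) - before - interval ))
		if [ "$late" -ge "$threshold" ]; then
			echo "woke ${late}s late at $(date)"
		fi
		rounds=$(( rounds - 1 ))
	done
}
```

For example, calling "stall_probe 1 5 3600" in a second shell before starting
atf-run would log every wakeup that is at least 5 seconds late during the
next hour (3600 one-second rounds).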

>Fix:
n/a

>Audit-Trail:
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55272 CVS commit: src/tests/lib/libarchive
Date: Tue, 16 Jun 2020 07:59:07 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue Jun 16 07:59:07 UTC 2020

 Modified Files:
 	src/tests/lib/libarchive: t_libarchive.sh

 Log Message:
 PR kern/55272: skip this test on uniprocessor machines, it is too dangerous
 and can kill the host kernel if a userland watchdog is running


 To generate a diff of this commit:
 cvs rdiff -u -r1.4 -r1.5 src/tests/lib/libarchive/t_libarchive.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/55272: userland watchdog process may be outstalled
Date: Sat, 20 Jun 2020 13:39:09 +0200

 I have seen this now also on other machines, even SMP ones.

 I have a macppc, dual 800 MHz G4 machine, 1.5 GB RAM, it uses a software
 watchdog too.

 With -current I can reproducibly kill it by doing a full ATF test run. It
 dies here:

 fs/msdosfs/t_snapshot (687/846): 2 test cases
     snapshot: [1.077320s] Passed.
     snapshotstress: [4.312312s] Passed.
 [5.391946s]

 fs/nfs/t_mountd (688/846): 1 test cases
     mountdhup: [1.079431s] Expected failure: PR kern/5844: op failed with EACCES
 [1.085206s]

 fs/nfs/t_rquotad (689/846): 6 test cases
     get_nfs_be_1_both: 

 [ 9641.0182210] swwdog: 60 second timer expired
 [ 9641.0182210] panic: watchdog timer expired
 [ 9641.0182210] cpu0: Begin traceback...
 [ 9641.0182210] 0x1000fdf0: at vpanic+0x12c
 [ 9641.0482453] 0x1000fe20: at panic+0x50
 [ 9641.0582418] 0x1000fe60: at swwdog_panic+0x90
 [ 9641.0682466] 0x1000fe70: at callout_softclock+0x418
 [ 9641.0782523] 0x1000feb0: at softint_dispatch+0x140
 [ 9641.0882560] 0x1000ff20: at softint_fast_dispatch+0xdc
 [ 9641.0882560] saved LR(0xfb3ffb79) is invalid.cpu0: End traceback...
 [ 9641.0882560] halting CPU 1
 [ 9641.2083150] dumpsys: TBD
 [ 9641.2083150] rebooting


 However, the test itself succeeds when run in isolation:

 # cd /usr/tests/fs/nfs && atf-run t_rquotad|atf-report
 Tests root: /usr/tests/fs/nfs

 t_rquotad (1/1): 6 test cases
     get_nfs_be_1_both: [3.183507s] Passed.
     get_nfs_be_1_group: [2.943784s] Passed.
     get_nfs_be_1_user: [2.711919s] Passed.
     get_nfs_le_1_both: [3.143645s] Passed.
     get_nfs_le_1_group: [2.876600s] Passed.
     get_nfs_le_1_user: [2.667976s] Passed.
 [17.542653s]

 Summary for 1 test programs:
     6 passed test cases.
     0 failed test cases.
     0 expected failed test cases.
     0 skipped test cases.


 The lockup (and the starvation of the userland watchdog tickle) only happens
 when leftover or locked-up rump_server processes from previous tests exist.

 This is a showstopper for the netbsd-10 branch.

 Martin

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/55272: userland watchdog process may be outstalled
Date: Fri, 19 Feb 2021 19:33:15 +0100

 This still happens in regular test runs for me and it seems to be a regression
 from netbsd-9. I consider it a netbsd-10 showstopper.

 Martin

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55272 CVS commit: src/tests/lib/libarchive
Date: Fri, 19 Feb 2021 18:36:50 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Fri Feb 19 18:36:50 UTC 2021

 Modified Files:
 	src/tests/lib/libarchive: t_libarchive.sh

 Log Message:
 PR kern/55272: do not skip this test on single cpu machines - it is not
 the only test causing the watchdog starvation and we better find and
 fix the real issue.


 To generate a diff of this commit:
 cvs rdiff -u -r1.7 -r1.8 src/tests/lib/libarchive/t_libarchive.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
