NetBSD Problem Report #55466
From martin@duskware.de Mon Jul 6 15:36:33 2020
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id EAD331A9213
for <gnats-bugs@gnats.NetBSD.org>; Mon, 6 Jul 2020 15:36:32 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: can not complete a full test run
X-Send-Pr-Version: 3.95
>Number: 55466
>Category: kern
>Synopsis: can not complete a full test run
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Jul 06 15:40:00 +0000 2020
>Closed-Date: Wed May 17 11:26:54 +0000 2023
>Last-Modified: Wed May 17 11:26:54 +0000 2023
>Originator: Martin Husemann
>Release: NetBSD 9.99.69
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD space-truckin.duskware.de 9.99.69 NetBSD 9.99.69 (GENERIC) #90: Mon Jul 6 11:46:24 CEST 2020 martin@seven-days-to-the-wolves.aprisoft.de:/work/src/sys/arch/evbarm/compile/GENERIC evbarm
Architecture: earmv7hfeb
Machine: evbarm
>Description:
When doing a full test run there is a high chance it locks up here:
sbin/ifconfig/t_bridge (542/873): 1 test cases
manybridges:
top shows a rump_server process busy-looping and the console is totally
dead.
Killing the rump_server process (with -9) unlocks the console again.
If not doing a full test run, but just:
cd /usr/tests/sbin/ifconfig && atf-run t_bridge | atf-report
everything works as expected:
Tests root: /usr/tests/sbin/ifconfig
t_bridge (1/1): 1 test cases
manybridges: [38.488706s] Passed.
[38.490496s]
Summary for 1 test programs:
1 passed test cases.
0 failed test cases.
0 expected failed test cases.
0 skipped test cases.
Note that this test does not use RUMP at all, so the leftover rump_server
process must be from some earlier test program.
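One way to narrow down which earlier test program leaves the rump_server
behind is to check the process table after each test program finishes. The
following is only a sketch under stated assumptions: the helper name
leftover_rump_servers is hypothetical and not part of ATF or the test suite,
and it expects "pid command" lines on stdin.

```shell
#!/bin/sh
# Hypothetical helper (not part of ATF or the NetBSD test suite):
# reads "pid command" lines on stdin and prints the pid of every
# process whose command basename is exactly "rump_server".
leftover_rump_servers() {
    awk '{
        n = split($2, parts, "/")       # drop any leading path
        if (parts[n] == "rump_server")
            print $1
    }'
}
```

Assuming a POSIX ps, the input could come from something like
"ps -A -o pid= -o comm= | leftover_rump_servers", run between test programs.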
>How-To-Repeat:
s/a
>Fix:
n/a
>Release-Note:
>Audit-Trail:
From: Jukka Ruohonen <jruohonen@iki.fi>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/55466: can not complete a full test run
Date: Mon, 6 Jul 2020 19:42:49 +0300
On Mon, Jul 06, 2020 at 03:40:01PM +0000, martin@NetBSD.org wrote:
> Note that this test does not use RUMP at all, so the leftover rump_server
> process must be from some earlier test program.
With respect to the recent Qemu run:
http://releng.netbsd.org/b5reports/evbarm-aarch64/2020/2020.07.05.19.40.27/test.tps
While the test passed, the clean-up routine contained this:
tc-so:Burnt down bridge65274
tc-se:[1] Killed ifconfig "bridge${bridge}" destroy >/dev/null ...
tc-so:Burnt down bridge65273
Later on, when the subsequent t_repeated_link_addr is executed, the system
hangs with:
[...]
tc-so:Restored state of vioif0 to up
tc-so:Skipping lo0
tc-so:Skipping bridge65273
tc-se:t_repeated_link_addr: ERROR: Unreachable
Where 'bridge65273' is a left-over from the previous test.
- Jukka
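The left-over interface described above could be checked for before the next
test program starts. This is a rough sketch, not something the test suite
does: the helper name leftover_bridges is hypothetical, and it only filters
a list of interface names supplied on stdin.

```shell
#!/bin/sh
# Hypothetical helper: reads interface names (separated by spaces,
# tabs or newlines) on stdin and prints those named bridge<number>.
# Exits non-zero when no such interface is present (grep semantics).
leftover_bridges() {
    tr -s ' \t' '\n\n' | grep -E '^bridge[0-9]+$'
}
```

Assuming "ifconfig -l" prints the interface names on one line, the input could
be "ifconfig -l | leftover_bridges"; each name printed could then be passed to
the same "ifconfig <name> destroy" command that the clean-up routine uses.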
From: Martin Husemann <martin@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55466: can not complete a full test run
Date: Mon, 6 Jul 2020 17:13:43 +0000
On another machine the full test run completed, but it also had a
leftover rump_server process.
Most threads were in pthread_cond_timedwait(), this one was special:
[Switching to thread 95 (LWP 1947 of process 14722)]
#0 0x00000000407e32d8 in lwproc_proc_free (p=0x411be040)
at /usr/src/lib/librump/../../sys/rump/librump/rumpkern/lwproc.c:173
173 LIST_INSERT_HEAD(&initproc->p_children, child, p_sibling);
(gdb) bt
#0 0x00000000407e32d8 in lwproc_proc_free (p=0x411be040)
at /usr/src/lib/librump/../../sys/rump/librump/rumpkern/lwproc.c:173
#1 lwproc_freelwp (l=<optimized out>)
at /usr/src/lib/librump/../../sys/rump/librump/rumpkern/lwproc.c:341
#2 rump_lwproc_switch (newlwp=<optimized out>)
at /usr/src/lib/librump/../../sys/rump/librump/rumpkern/lwproc.c:521
#3 0x00000000407e3700 in lwproc_makelwp (p=0x41feab00,
doswitch=<optimized out>, procmake=<optimized out>)
at /usr/src/lib/librump/../../sys/rump/librump/rumpkern/lwproc.c:385
#4 0x00000000407e3860 in rump_lwproc_newlwp (pid=<optimized out>)
at /usr/src/lib/librump/../../sys/rump/librump/rumpkern/lwproc.c:439
#5 0x0000000040e07c2c in lwproc_newlwp (pid=427)
at /usr/src/lib/librumpuser/rumpuser_sp.c:212
#6 serv_handlesyscall (rhdr=0x638b6fc8, rhdr=0x638b6fc8, data=0x638b8460 "",
spc=0x40f0f608) at /usr/src/lib/librumpuser/rumpuser_sp.c:684
#7 serv_workbouncer (arg=<optimized out>)
at /usr/src/lib/librumpuser/rumpuser_sp.c:767
#8 0x000000004100f328 in pthread__create_tramp (cookie=0x59003400)
at /usr/src/lib/libpthread/pthread.c:560
#9 0x0000000041266698 in _lwp_kill () from /usr/lib/libc.so.12
Martin
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/55466: can not complete a full test run
Date: Thu, 14 Oct 2021 17:49:27 +0200
While I have not seen this on the original dual core evbearmv7 machine
in a while, it now hits me quite reproducibly on a dual core macppc
machine.
This is a showstopper for netbsd-10.
Martin
State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 17 May 2023 11:20:34 +0000
State-Changed-Why:
Is this still bust?
State-Changed-From-To: feedback->closed
State-Changed-By: martin@NetBSD.org
State-Changed-When: Wed, 17 May 2023 11:26:54 +0000
State-Changed-Why:
Left over busy looping rump_server processes show up every now and then,
but this PR does not help diagnose the individual test failures or
bogus tests causing this (or: rump bugs?)
>Unformatted: