NetBSD Problem Report #50350
From gson@gson.org Wed Oct 21 08:12:13 2015
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id 0A85EA6531
for <gnats-bugs@gnats.NetBSD.org>; Wed, 21 Oct 2015 08:12:13 +0000 (UTC)
Message-Id: <20151021081207.77781744628@guava.gson.org>
Date: Wed, 21 Oct 2015 11:12:07 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
X-Send-Pr-Version: 3.95
>Number: 50350
>Category: kern
>Synopsis: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Oct 21 08:15:01 +0000 2015
>Last-Modified: Fri Aug 28 19:15:31 +0000 2020
>Originator: Andreas Gustafsson
>Release: NetBSD-current > 2013-08-14
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
I'm running the ATF tests on a physical amd64 host which I recently
upgraded from an old single-core AMD Athlon64 to a slightly less
old Intel Core 2 Quad Q6600 quad-core machine.
When I did that, the stress_short and stress_long test cases of the
rump/rumpkern/t_sp test, which had reliably succeeded on the
single-core machine, started failing in every test run.
I also tried running the t_sp test on two other PCs, both of the same
model, Compaq DC7900, but with different CPUs, one being a dual-core
E8400 and the other a quad-core Q6600, and it only failed on the
quad-core one.
Bisection shows that the tests pass on the Q6600 for source dates
older than 2013-08-14, when the following change was committed:
src/lib/librumpuser/rumpuser.c 1.54
Change the default value of rump kernels CPUs to 2. It used to be
the number of host cores, but that value is overkill for most uses,
especially with massively multicore hosts. Dozens of useless virtual
CPUs are relatively speaking expensive in terms of bootstrap time and
memory footprint. On the other end of the spectrum, defaulting to 2
might shake out some bugs from the qemu test runs.
It looks like defaulting to 2 managed to shake out some bugs from
bare-metal test runs instead of qemu ones.
Log output from one failing test is here:
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2015/2015.10.09.17.21.45/test.html#rump_rumpkern_t_sp_stress_long
>How-To-Repeat:
On a quad-core amd64 machine, run
cd /usr/tests/rump/rumpkern
atf-run t_sp | atf-report
>Fix:
>Release-Note:
>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Fri, 12 Oct 2018 15:29:36 +0300
This is strange.
Some time ago, I replaced the Core 2 Quad system of the original bug
report by a HP DL360 G7 with dual Xeon L5630 CPUs, and the behavior of
the rump/rumpkern/t_sp/stress_{long,short} test cases did not change:
they still consistently failed with a timeout as reported.
But recently, I noticed that they had started to consistently pass on
the same hardware, and they passed even when I tested an old version
of -current where they had previously failed.
The only thing that has changed, that I can think of, is that I
updated the system firmware from the version dated 2011-01-28 to
2018-05-21, which includes fixes for various speculative execution
vulnerabilities as well as other bug fixes.
Has anyone else run the ATF tests on a machine with 4 or more cores
recently? If so, did the rump/rumpkern/t_sp/stress_{long,short} test
cases pass or fail?
--
Andreas Gustafsson, gson@gson.org
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Sun, 21 Oct 2018 10:26:17 +0300
Earlier, I wrote:
> But recently, I noticed that they had started to consistently pass on
> the same hardware, and they passed even when I tested an old version
> of -current where they had previously failed.
To be precise, that was running the tests under NetBSD/amd64.
But if I run them under NetBSD/i386, they still fail.
--
Andreas Gustafsson, gson@gson.org
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Wed, 6 Mar 2019 18:50:24 +0200
The rump/rumpkern/t_sp/stress_{long,short} tests are now failing for
me again on real amd64 hardware (dual Xeon L5630). Bisection shows
the new failures began during the period of build breakage between
source dates 2019.03.01.08.15.23 and 2019.03.01.11.06.57:
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2019.03.html#2019.03.01.11.06.57
--
Andreas Gustafsson, gson@gson.org
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: hannken@NetBSD.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Thu, 7 Mar 2019 10:12:46 +0200
I did another bisection in an alternate reality where the build was
not broken on 2019.03.01.08.15.23, by patching the affected versions
so that they built. This narrowed down the point where the tests
started failing to the following commit:
2019.03.01.09.02.03 hannken src/sys/kern/kern_exit.c 1.274
2019.03.01.09.02.03 hannken src/sys/kern/kern_lwp.c 1.196
2019.03.01.09.02.03 hannken src/sys/kern/vfs_trans.c 1.57
2019.03.01.09.02.03 hannken src/sys/sys/fstrans.h 1.13
2019.03.01.09.02.03 hannken src/sys/sys/lwp.h 1.181
Given the history of these test failures (that they have been failing
on i386 since 2013, are not failing in qemu, and mysteriously stopped
failing on amd64 for me at one point), I figure this is probably not a
case of the above commit introducing a new bug or being incorrect as
such, but rather somehow retriggering the existing bug. Perhaps the
nature of the changes in the commit might offer some clue as to what
that bug is.
--
Andreas Gustafsson, gson@gson.org
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: Andreas Gustafsson <gson@gson.org>
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
Quad
Date: Thu, 7 Mar 2019 11:53:15 +0100
I'm quite sure the fstrans commit only triggers a hidden bug here.
Running these tests on my test machine (guest under CentOS7/kvm with
16 virtual cpus and 16 GB ram) I get timeouts after 300 seconds.
After "cpuctl offline 4 5 6 7 8 9 10 11 12 13 14 15" the tests
pass so I suppose there is still some hidden dependency on the
number of host cpus.
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
From: Andreas Gustafsson <gson@gson.org>
To: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Fri, 8 Mar 2019 10:24:36 +0200
J. Hannken-Illjes wrote:
> I'm quite sure the fstrans commit only triggers a hidden bug here.
That's what I thought, too, but now that I've looked at the rumpkern
process with gdb, I'm not so sure anymore.
Here's what I did:
Build a release from 2019.03.07.11.09.48 sources with -V MKDEBUG=yes COPTS=-g
Boot it and log in
# cd /usr/tests/rump/rumpkern/
# ./t_sp stress_short
(let it run for a couple of minutes)
^Z
# bg
# gdb rump_server
(gdb) attach <pid of rump_server_process>
(gdb) info threads
This showed a total of 162 LWPs, most of which were
sleeping in ___lwp_park60, but one was in rumpns_membar_sync()
called from fstrans_alloc_lwp_info():
(gdb) thread 108
#0 0x000076b82d6aaa17 in rumpns_membar_sync () from /usr/lib/librump.so.0
#1 0x000076b82c634b13 in fstrans_alloc_lwp_info (mp=mp@entry=0x76b82d5ac000)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_trans.c:341
#2 0x000076b82c6352e9 in fstrans_get_lwp_info (do_alloc=true,
mp=0x76b82d5ac000)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_trans.c:410
#3 _fstrans_start (wait=1, lock_type=FSTRANS_SHARED, mp=0x76b82d5ac000)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_trans.c:454
#4 fstrans_start (mp=0x76b82d5ac000)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_trans.c:492
#5 0x000076b82d65f3ff in vop_pre (vp=0x76b82c501d20, vp=0x76b82c501d20,
op=FST_YES, mpsafe=<synthetic pointer>, mp=<synthetic pointer>)
at /usr/src/lib/librump/../../sys/rump/../kern/vnode_if.c:77
#6 VOP_LOCK (vp=0x76b82c501d20, flags=<optimized out>)
at /usr/src/lib/librump/../../sys/rump/../kern/vnode_if.c:1281
#7 0x000076b82c62f442 in vn_lock (vp=0x76b82c501d20, flags=131074)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_vnops.c:1043
#8 0x000076b82c642215 in namei_start (startdir_ret=0x76b819ccfba8, isnfsd=0,
state=0x76b819ccfc50)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_lookup.c:681
#9 namei_oneroot (isnfsd=0, inhibitmagic=0, neverfollow=0,
state=0x76b819ccfc50)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_lookup.c:1156
#10 namei_tryemulroot (state=state@entry=0x76b819ccfc50,
neverfollow=neverfollow@entry=0, inhibitmagic=inhibitmagic@entry=0,
isnfsd=isnfsd@entry=0)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_lookup.c:1510
#11 0x000076b82c6446f2 in namei (ndp=0x76b819ccfdc8)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_lookup.c:1546
#12 0x000076b82c62fcf3 in vn_open (ndp=0x76b819ccfdc8, fmode=3, cmode=0)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_vnops.c:176
#13 0x000076b82c63995a in do_open (l=l@entry=0x76b82d105000, dvp=0x0,
pb=0x76b82d591240, open_flags=open_flags@entry=2,
open_mode=open_mode@entry=0, fd=fd@entry=0x76b819ccfeec)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_syscalls.c:1590
#14 0x000076b82c639aa9 in do_sys_openat (l=0x76b82d105000,
fdat=fdat@entry=-100, path=<optimized out>, flags=2, mode=0,
fd=fd@entry=0x76b819ccfeec)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_syscalls.c:1674
#15 0x000076b82c639bfc in sys_open (l=<optimized out>, uap=<optimized out>,
retval=0x76b819ccff10)
at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_syscalls.c:1695
#16 0x000076b82da01359 in sy_call (rval=0x76b819ccff10, uap=0x76b82b99f0c0,
l=0x76b82d105000, sy=0x76b82d8eec38 <rumpns_sysent+120>)
at /usr/src/sys/rump/kern/lib/libsysproxy/../../../../sys/syscallvar.h:65
#17 sy_invoke (code=5, rval=0x76b819ccff10, uap=0x76b82b99f0c0,
l=0x76b82d105000, sy=0x76b82d8eec38 <rumpns_sysent+120>)
at /usr/src/sys/rump/kern/lib/libsysproxy/../../../../sys/syscallvar.h:94
#18 hyp_syscall (num=5, arg=0x76b82b99f0c0, retval=0x76b819ccff90)
at /usr/src/sys/rump/kern/lib/libsysproxy/sysproxy.c:72
#19 0x000076b82d206bd0 in rumpsyscall (regrv=0x76b819ccff80,
data=0x76b82b99f0c0, sysnum=5)
at /usr/src/lib/librumpuser/rumpuser_sp.c:267
#20 serv_handlesyscall (rhdr=0x76b82b9a3308, rhdr=0x76b82b9a3308,
data=0x76b82b99f0c0 "\265\031\340\322", spc=0x76b82d40c3b0)
at /usr/src/lib/librumpuser/rumpuser_sp.c:690
#21 serv_workbouncer (arg=<optimized out>)
at /usr/src/lib/librumpuser/rumpuser_sp.c:773
#22 0x000076b82ce0b7d8 in pthread__create_tramp (cookie=0x76b81db2c000)
at /usr/src/lib/libpthread/pthread.c:593
#23 0x000076b82ca8b350 in ?? () from /usr/lib/libc.so.12
fstrans_alloc_lwp_info() was looping over the fstrans_fli_head list,
and tracing it manually, I was unable to get to the end of the list,
so I wrote a gdb loop to print the entire list:
set pager off
set var $n = fstrans_fli_head.lh_first
while $n
set var $n = $n->fli_list.le_next
print $n
end
This printed:
$608 = (struct fstrans_lwp_info *) 0x76b8193b0740
$609 = (struct fstrans_lwp_info *) 0x76b8193b0790
$610 = (struct fstrans_lwp_info *) 0x76b8193b07e0
[...]
$218906 = (struct fstrans_lwp_info *) 0x76b82d53cd20
$218907 = (struct fstrans_lwp_info *) 0x76b82d53ce10
$218908 = (struct fstrans_lwp_info *) 0x0
That's 218,300 elements. Is the list really supposed to get that large?
--
Andreas Gustafsson, gson@gson.org
From: "Juergen Hannken-Illjes" <hannken@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/50350 CVS commit: src/sys/rump/librump/rumpkern
Date: Sat, 9 Mar 2019 09:02:38 +0000
Module Name: src
Committed By: hannken
Date: Sat Mar 9 09:02:38 UTC 2019
Modified Files:
src/sys/rump/librump/rumpkern: emul.c lwproc.c
Log Message:
Rumpkernel has its own thread deallocation. Add missing fstrans_lwp_dtor()
to lwproc_freelwp().
PR bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
To generate a diff of this commit:
cvs rdiff -u -r1.189 -r1.190 src/sys/rump/librump/rumpkern/emul.c
cvs rdiff -u -r1.40 -r1.41 src/sys/rump/librump/rumpkern/lwproc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: gson@gson.org (Andreas Gustafsson)
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Sat, 25 Jan 2020 10:05:58 +0200
The rump/rumpkern/t_sp/stress_{long,short} test cases are once again
failing on my bare metal amd64 testbed (currently a dual Intel Xeon L5630).
The problem reappeared with these commits:
commit 2020.01.12.22.03.22 ad src/sys/kern/kern_exec.c 1.488
commit 2020.01.12.22.03.22 ad src/sys/kern/kern_runq.c 1.58
commit 2020.01.12.22.03.23 ad src/sys/sys/lwp.h 1.196
commit 2020.01.12.22.03.23 ad src/sys/sys/sched.h 1.87
For logs, see:
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.01.html#2020.01.12.22.03.23
--
Andreas Gustafsson, gson@gson.org
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: Andreas Gustafsson <gson@gson.org>, joerg@netbsd.org, martin@netbsd.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
Quad
Date: Sat, 25 Jan 2020 13:39:56 +0000
On Sat, Jan 25, 2020 at 08:10:01AM +0000, Andreas Gustafsson wrote:
> The rump/rumpkern/t_sp/stress_{long,short} are once again failing on
> my bare metal amd64 testbed (currently a dual Intel Xeon L5630).
> The problem reappeared with these commits:
>
> commit 2020.01.12.22.03.22 ad src/sys/kern/kern_exec.c 1.488
> commit 2020.01.12.22.03.22 ad src/sys/kern/kern_runq.c 1.58
> commit 2020.01.12.22.03.23 ad src/sys/sys/lwp.h 1.196
> commit 2020.01.12.22.03.23 ad src/sys/sys/sched.h 1.87
The distribution of new LWPs has changed, and it's exposing very old bugs in
libpthread and/or rump. We have identified and fixed some but it looks like
there are still more. I first ran into this with libmicro back in November
and Joerg has been seeing it on pbulk runs. I'll dig in further when I get
a chance.
Andrew
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
Quad
Date: Sat, 25 Jan 2020 17:31:32 +0000
This may help. I'm trying a run now.
----- Forwarded message from Andrew Doran <ad@netbsd.org> -----
Date: Sat, 25 Jan 2020 15:41:52 +0000
From: Andrew Doran <ad@netbsd.org>
To: source-changes@NetBSD.org
Subject: CVS commit: src
X-Mailer: log_accum
Module Name: src
Committed By: ad
Date: Sat Jan 25 15:41:52 UTC 2020
Modified Files:
src/lib/libpthread: pthread.c
src/sys/compat/netbsd32: netbsd32_lwp.c
src/sys/kern: sys_lwp.c
src/sys/sys: lwp.h
Log Message:
- Fix a race between the kernel and libpthread, where a new thread can start
life without its self->pt_lid being filled in.
- Fix an error path in _lwp_create(). If the new LID can't be copied out,
then get rid of the new LWP (i.e. either succeed or fail, not both).
- Mark l_dopreempt and l_nopreempt volatile in struct lwp.
To generate a diff of this commit:
cvs rdiff -u -r1.154 -r1.155 src/lib/libpthread/pthread.c
cvs rdiff -u -r1.19 -r1.20 src/sys/compat/netbsd32/netbsd32_lwp.c
cvs rdiff -u -r1.71 -r1.72 src/sys/kern/sys_lwp.c
cvs rdiff -u -r1.197 -r1.198 src/sys/sys/lwp.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
----- End forwarded message -----
From: "Andrew Doran" <ad@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/50350 CVS commit: src/lib/libpthread
Date: Sat, 25 Jan 2020 17:58:28 +0000
Module Name: src
Committed By: ad
Date: Sat Jan 25 17:58:28 UTC 2020
Modified Files:
src/lib/libpthread: pthread_mutex.c
Log Message:
pthread__mutex_unlock_slow(): ignore the DEFERRED bit. Its only purpose
is to get the thread to go through the slow path. If there are waiters,
process them there and then. Should not affect well behaved apps. Maybe
of help for:
PR bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
To generate a diff of this commit:
cvs rdiff -u -r1.66 -r1.67 src/lib/libpthread/pthread_mutex.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Andreas Gustafsson <gson@gson.org>
To: Andrew Doran <ad@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Sun, 26 Jan 2020 11:10:05 +0200
Andrew Doran wrote:
> This may help. I'm trying a run now.
[...]
> cvs rdiff -u -r1.154 -r1.155 src/lib/libpthread/pthread.c
> cvs rdiff -u -r1.19 -r1.20 src/sys/compat/netbsd32/netbsd32_lwp.c
> cvs rdiff -u -r1.71 -r1.72 src/sys/kern/sys_lwp.c
> cvs rdiff -u -r1.197 -r1.198 src/sys/sys/lwp.h
Alas, it did not help; the stress_long and stress_short test cases are
still failing.
--
Andreas Gustafsson, gson@gson.org
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
Quad
Date: Mon, 27 Jan 2020 12:21:04 +0000
Away from the insane asylum that is rump, I found a case where a thread has
called exit() but one of the other threads in the process hasn't gotten the
message:
PID LID USERNAME PRI STATE TIME WCPU CPU NAME COMMAND
919 11 root 27 CPU/2 0:06 99.11% 99.02% - rwlock
919 38 root 69 lwpwai/1 0:06 0.00% 0.00% - rwlock
This is likely connected. Fixing this may not solve all the problems, but
can only make the situation better.
Andrew
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
Quad
Date: Mon, 27 Jan 2020 22:50:47 +0000
Progress of a sort. With today's changes the rump stress tests survive more
often for me, but I do see other intermittent failures. I will try to catch
and examine one.
Failed test cases:
lib/libc/sys/t_ptrace_wait6:resume1, lib/librumpclient/t_exec:threxec, net/sys/t_rfc6056:inet6
Summary for 850 test programs:
8505 passed test cases.
3 failed test cases.
46 expected failed test cases.
325 skipped test cases.
From: Andreas Gustafsson <gson@gson.org>
To: Andrew Doran <ad@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Tue, 28 Jan 2020 10:19:22 +0200
Andrew Doran wrote:
> Progress of a sort. With today's changes the rump stress tests survive more
> often for me, but I do see other intermittent failures. I will try to catch
> and examine one.
>
> Failed test cases:
> lib/libc/sys/t_ptrace_wait6:resume1, lib/librumpclient/t_exec:threxec, net/sys/t_rfc6056:inet6
The resume1 failures are also known as PR 54893. The threxec test
triggered a panic in the most recent run on my testbed:
threxec: [ 3069.7081653] panic: kernel diagnostic assertion "prio < PRI_COUNT" failed: file "/tmp/bracket/build/2020.01.27.23.26.15-amd64-baremetal/src/sys/kern/kern_runq.c", line 177
The full console log is here:
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.01.27.23.26.15/test.log
I have not seen t_rfc6056:inet6 fail, but maybe it would have if the
system hadn't panicked first.
--
Andreas Gustafsson, gson@gson.org
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
Quad
Date: Tue, 28 Jan 2020 12:54:32 +0000
On Tue, Jan 28, 2020 at 08:20:02AM +0000, Andreas Gustafsson wrote:
> The resume1 failures are also known as PR 54893. The threxec test
> triggered a panic in the most recent run on my testbed:
>
> threxec: [ 3069.7081653] panic: kernel diagnostic assertion "prio < PRI_COUNT" failed: file "/tmp/bracket/build/2020.01.27.23.26.15-amd64-baremetal/src/sys/kern/kern_runq.c", line 177
Wow that's amazing. I also ran into that one today and was able to poke
about with ddb. Something is corrupting l_priority by the look of it. I'll
see what can be done.
Andrew
From: Andrew Doran <ad@netbsd.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
Quad
Date: Tue, 28 Jan 2020 13:39:03 +0000
Note to self. The child processes forked by the stressful rump program take
a SIGUSR1, notice it, and start to exit. It looks like one or two of them
get stuck in the rump client library. Here's one that's on the way out but
is stuck:
0 1966 1992 0 8 4 43 0 57572 1852 parked I pts/1 0:00.38 | | |-- /usr/tests/rump/rumpkern/h_client/h_stresscli 1
0 1966 1992 0 7 4 43 0 57572 1852 - Z pts/1 0:01.49 | | |-- /usr/tests/rump/rumpkern/h_client/h_stresscli 1
0 1966 1992 0 6 4 85 0 57572 1852 kqueue I pts/1 0:00.00 | | |-- /usr/tests/rump/rumpkern/h_client/h_stresscli 1
0 1966 1992 0 1 4 85 0 57572 1852 lwpwait I pts/1 0:00.00 | | |-- /usr/tests/rump/rumpkern/h_client/h_stresscli 1
Responsible-Changed-From-To: bin-bug-people->kern-bug-people
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Fri, 28 Aug 2020 19:15:31 +0000
Responsible-Changed-Why:
Seems to be a kernel/scheduler bug
>Unformatted: