NetBSD Problem Report #50350

From gson@gson.org  Wed Oct 21 08:12:13 2015
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 0A85EA6531
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 21 Oct 2015 08:12:13 +0000 (UTC)
Message-Id: <20151021081207.77781744628@guava.gson.org>
Date: Wed, 21 Oct 2015 11:12:07 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
X-Send-Pr-Version: 3.95

>Number:         50350
>Category:       kern
>Synopsis:       rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Oct 21 08:15:01 +0000 2015
>Last-Modified:  Fri Aug 28 19:15:31 +0000 2020
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current > 2013-08-14
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

I'm running the ATF tests on a physical amd64 host which I recently
upgraded from an old single-core AMD Athlon64 to a slightly less
old Intel Core 2 Quad Q6600 machine.

When I did that, the stress_short and stress_long test cases of the
rump/rumpkern/t_sp test, which had reliably succeeded on the
single-core machine, started failing in every test run.

I also tried running the t_sp test on two other PCs, both of the same
model, Compaq DC7900, but with different CPUs, one being a dual-core
E8400 and the other a quad-core Q6600, and it only failed on the
quad-core one.

Bisection shows that the tests pass on the Q6600 for source dates
older than 2013-08-14, when the following change was committed:

  src/lib/librumpuser/rumpuser.c 1.54

    Change the default value of rump kernel CPUs to 2.  It used to be
    the number of host cores, but that value is overkill for most uses,
    especially with massively multicore hosts.  Dozens of useless virtual
    CPUs are relatively speaking expensive in terms of bootstrap time and
    memory footprint.  On the other end of the spectrum, defaulting to 2
    might shake out some bugs from the qemu test runs.

It looks like defaulting to 2 managed to shake out some bugs from
bare-metal test runs instead of qemu ones.
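
The number of virtual CPUs can still be set explicitly at run time for
comparison.  A minimal sketch, not part of this PR, assuming librumpuser
still honors the RUMP_NCPU environment variable it consulted at the time
of the above commit (link with -lrump):

  #include <stdio.h>
  #include <stdlib.h>
  #include <rump/rump.h>

  int
  main(void)
  {
          setenv("RUMP_NCPU", "4", 1);    /* request 4 virtual CPUs */
          if (rump_init() != 0) {         /* bootstrap the rump kernel */
                  fprintf(stderr, "rump_init failed\n");
                  return 1;
          }
          return 0;
  }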

Log output from one failing test is here:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2015/2015.10.09.17.21.45/test.html#rump_rumpkern_t_sp_stress_long

>How-To-Repeat:

On a quad-core amd64 machine, run

  cd /usr/tests/rump/rumpkern
  atf-run t_sp | atf-report

>Fix:

>Release-Note:

>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Fri, 12 Oct 2018 15:29:36 +0300

 This is strange.

 Some time ago, I replaced the Core 2 Quad system of the original bug
 report with an HP DL360 G7 with dual Xeon L5630 CPUs, and the behavior of
 the rump/rumpkern/t_sp/stress_{long,short} test cases did not change:
 they still consistently failed with a timeout as reported.

 But recently, I noticed that they had started to consistently pass on
 the same hardware, and they passed even when I tested an old version
 of -current where they had previously failed.

 The only change I can think of is that I updated the system firmware
 from the version dated 2011-01-28 to the one dated 2018-05-21, which
 includes fixes for various speculative execution vulnerabilities as
 well as other bug fixes.

 Has anyone else run the ATF tests on a machine with 4 or more cores
 recently?  If so, did the rump/rumpkern/t_sp/stress_{long,short} test
 cases pass or fail?
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Sun, 21 Oct 2018 10:26:17 +0300

 Earlier, I wrote:
 > But recently, I noticed that they had started to consistently pass on
 > the same hardware, and they passed even when I tested an old version
 > of -current where they had previously failed.

 To be precise, that was running the tests under NetBSD/amd64.
 But if I run them under NetBSD/i386, they still fail.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Wed, 6 Mar 2019 18:50:24 +0200

 The rump/rumpkern/t_sp/stress_{long,short} tests are now failing for
 me again on real amd64 hardware (dual Xeon L5630).  Bisection shows
 the new failures began during the period of build breakage between
 source dates 2019.03.01.08.15.23 and 2019.03.01.11.06.57:

   http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2019.03.html#2019.03.01.11.06.57

 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: hannken@NetBSD.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Thu, 7 Mar 2019 10:12:46 +0200

 I did another bisection in an alternate reality where the build was
 not broken on 2019.03.01.08.15.23, by patching the affected versions
 so that they built.  This narrowed down the point where the tests
 started failing to the following commit:

   2019.03.01.09.02.03 hannken src/sys/kern/kern_exit.c 1.274
   2019.03.01.09.02.03 hannken src/sys/kern/kern_lwp.c 1.196
   2019.03.01.09.02.03 hannken src/sys/kern/vfs_trans.c 1.57
   2019.03.01.09.02.03 hannken src/sys/sys/fstrans.h 1.13
   2019.03.01.09.02.03 hannken src/sys/sys/lwp.h 1.181

 Given the history of these test failures (that they have been failing
 on i386 since 2013, are not failing in qemu, and mysteriously stopped
 failing on amd64 for me at one point), I figure this is probably not a
 case of the above commit introducing a new bug or being incorrect as
 such, but rather somehow retriggering the existing bug.  Perhaps the
 nature of the changes in the commit might offer some clue as to what
 that bug is.
 -- 
 Andreas Gustafsson, gson@gson.org

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: Andreas Gustafsson <gson@gson.org>
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
 Quad
Date: Thu, 7 Mar 2019 11:53:15 +0100

 I'm quite sure the fstrans commit only triggers a hidden bug here.

 Running these tests on my test machine (a guest under CentOS7/kvm with
 16 virtual CPUs and 16 GB RAM) I get timeouts after 300 seconds.

 After "cpuctl offline 4 5 6 7 8 9 10 11 12 13 14 15" the tests
 pass so I suppose there is still some hidden dependency on the
 number of host cpus.

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

From: Andreas Gustafsson <gson@gson.org>
To: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Fri, 8 Mar 2019 10:24:36 +0200

 J. Hannken-Illjes wrote:
 > I'm quite sure the fstrans commit only triggers a hidden bug here.

 That's what I thought, too, but now that I've looked at the rump_server
 process with gdb, I'm not so sure anymore.

 Here's what I did:

   Build a release from 2019.03.07.11.09.48 sources with -V MKDEBUG=yes COPTS=-g
   Boot it and log in
   # cd /usr/tests/rump/rumpkern/
   # ./t_sp stress_short
   (let it run for a couple of minutes)
   ^Z
   # bg
   # gdb rump_server
   (gdb) attach <pid of rump_server_process>
   (gdb) info threads

 This showed a total of 162 LWPs, most of which were
 sleeping in ___lwp_park60, but one was in rumpns_membar_sync()
 called from fstrans_alloc_lwp_info():

   (gdb) thread 108
   #0  0x000076b82d6aaa17 in rumpns_membar_sync () from /usr/lib/librump.so.0
   #1  0x000076b82c634b13 in fstrans_alloc_lwp_info (mp=mp@entry=0x76b82d5ac000)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_trans.c:341
   #2  0x000076b82c6352e9 in fstrans_get_lwp_info (do_alloc=true,
       mp=0x76b82d5ac000)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_trans.c:410
   #3  _fstrans_start (wait=1, lock_type=FSTRANS_SHARED, mp=0x76b82d5ac000)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_trans.c:454
   #4  fstrans_start (mp=0x76b82d5ac000)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_trans.c:492
   #5  0x000076b82d65f3ff in vop_pre (vp=0x76b82c501d20, vp=0x76b82c501d20,
       op=FST_YES, mpsafe=<synthetic pointer>, mp=<synthetic pointer>)
       at /usr/src/lib/librump/../../sys/rump/../kern/vnode_if.c:77
   #6  VOP_LOCK (vp=0x76b82c501d20, flags=<optimized out>)
       at /usr/src/lib/librump/../../sys/rump/../kern/vnode_if.c:1281
   #7  0x000076b82c62f442 in vn_lock (vp=0x76b82c501d20, flags=131074)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_vnops.c:1043
   #8  0x000076b82c642215 in namei_start (startdir_ret=0x76b819ccfba8, isnfsd=0,
       state=0x76b819ccfc50)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_lookup.c:681
   #9  namei_oneroot (isnfsd=0, inhibitmagic=0, neverfollow=0,
       state=0x76b819ccfc50)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_lookup.c:1156
   #10 namei_tryemulroot (state=state@entry=0x76b819ccfc50,
       neverfollow=neverfollow@entry=0, inhibitmagic=inhibitmagic@entry=0,
       isnfsd=isnfsd@entry=0)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_lookup.c:1510
   #11 0x000076b82c6446f2 in namei (ndp=0x76b819ccfdc8)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_lookup.c:1546
   #12 0x000076b82c62fcf3 in vn_open (ndp=0x76b819ccfdc8, fmode=3, cmode=0)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_vnops.c:176
   #13 0x000076b82c63995a in do_open (l=l@entry=0x76b82d105000, dvp=0x0,
       pb=0x76b82d591240, open_flags=open_flags@entry=2,
       open_mode=open_mode@entry=0, fd=fd@entry=0x76b819ccfeec)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_syscalls.c:1590
   #14 0x000076b82c639aa9 in do_sys_openat (l=0x76b82d105000,
       fdat=fdat@entry=-100, path=<optimized out>, flags=2, mode=0,
       fd=fd@entry=0x76b819ccfeec)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_syscalls.c:1674
   #15 0x000076b82c639bfc in sys_open (l=<optimized out>, uap=<optimized out>,
       retval=0x76b819ccff10)
       at /usr/src/lib/librumpvfs/../../sys/rump/../kern/vfs_syscalls.c:1695
   #16 0x000076b82da01359 in sy_call (rval=0x76b819ccff10, uap=0x76b82b99f0c0,
       l=0x76b82d105000, sy=0x76b82d8eec38 <rumpns_sysent+120>)
       at /usr/src/sys/rump/kern/lib/libsysproxy/../../../../sys/syscallvar.h:65
   #17 sy_invoke (code=5, rval=0x76b819ccff10, uap=0x76b82b99f0c0,
       l=0x76b82d105000, sy=0x76b82d8eec38 <rumpns_sysent+120>)
       at /usr/src/sys/rump/kern/lib/libsysproxy/../../../../sys/syscallvar.h:94
   #18 hyp_syscall (num=5, arg=0x76b82b99f0c0, retval=0x76b819ccff90)
       at /usr/src/sys/rump/kern/lib/libsysproxy/sysproxy.c:72
   #19 0x000076b82d206bd0 in rumpsyscall (regrv=0x76b819ccff80,
       data=0x76b82b99f0c0, sysnum=5)
       at /usr/src/lib/librumpuser/rumpuser_sp.c:267
   #20 serv_handlesyscall (rhdr=0x76b82b9a3308, rhdr=0x76b82b9a3308,
       data=0x76b82b99f0c0 "\265\031\340\322", spc=0x76b82d40c3b0)
       at /usr/src/lib/librumpuser/rumpuser_sp.c:690
   #21 serv_workbouncer (arg=<optimized out>)
       at /usr/src/lib/librumpuser/rumpuser_sp.c:773
   #22 0x000076b82ce0b7d8 in pthread__create_tramp (cookie=0x76b81db2c000)
       at /usr/src/lib/libpthread/pthread.c:593
   #23 0x000076b82ca8b350 in ?? () from /usr/lib/libc.so.12

 fstrans_alloc_lwp_info() was looping over the fstrans_fli_head list,
 and when I traced it manually I was unable to reach the end of the
 list, so I wrote a gdb loop to print the entire list:

   set pager off
   # the pointer is advanced before each print, so the first element
   # is never printed and the terminating NULL is printed last
   set var $n = fstrans_fli_head.lh_first
   while $n
       set var $n = $n->fli_list.le_next
       print $n
   end

 This printed:

   $608 = (struct fstrans_lwp_info *) 0x76b8193b0740
   $609 = (struct fstrans_lwp_info *) 0x76b8193b0790
   $610 = (struct fstrans_lwp_info *) 0x76b8193b07e0
   [...]
   $218906 = (struct fstrans_lwp_info *) 0x76b82d53cd20
   $218907 = (struct fstrans_lwp_info *) 0x76b82d53ce10
   $218908 = (struct fstrans_lwp_info *) 0x0

 That's 218,300 elements.  Is the list really supposed to get that large?
 -- 
 Andreas Gustafsson, gson@gson.org
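
 As the commit that follows confirms, entries are only reclaimed by
 fstrans_lwp_dtor() when an LWP is torn down.  A self-contained sketch
 of the leak mechanism, using the same <sys/queue.h> macros as
 vfs_trans.c but with simplified, hypothetical types (an illustration,
 not the kernel code):

   #include <sys/queue.h>
   #include <stdio.h>
   #include <stdlib.h>

   struct lwp_info {
           LIST_ENTRY(lwp_info) fli_list;
   };
   static LIST_HEAD(, lwp_info) fli_head = LIST_HEAD_INITIALIZER(fli_head);

   /* cf. fstrans_alloc_lwp_info(): one entry per LWP entering fstrans */
   static struct lwp_info *
   lwp_info_alloc(void)
   {
           struct lwp_info *fli = calloc(1, sizeof(*fli));
           LIST_INSERT_HEAD(&fli_head, fli, fli_list);
           return fli;
   }

   /* cf. fstrans_lwp_dtor(): the cleanup the rump LWP teardown skipped */
   static void
   lwp_info_dtor(struct lwp_info *fli)
   {
           LIST_REMOVE(fli, fli_list);
           free(fli);
   }

   int
   main(void)
   {
           /* many short-lived LWPs whose entries are never reclaimed */
           for (int i = 0; i < 218300; i++)
                   (void)lwp_info_alloc();

           /* every later allocation walks this ever-growing chain */
           size_t n = 0;
           struct lwp_info *fli;
           LIST_FOREACH(fli, &fli_head, fli_list)
                   n++;
           printf("%zu stale entries\n", n);

           lwp_info_dtor(LIST_FIRST(&fli_head));   /* the missing step */
           return 0;
   }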

From: "Juergen Hannken-Illjes" <hannken@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/50350 CVS commit: src/sys/rump/librump/rumpkern
Date: Sat, 9 Mar 2019 09:02:38 +0000

 Module Name:	src
 Committed By:	hannken
 Date:		Sat Mar  9 09:02:38 UTC 2019

 Modified Files:
 	src/sys/rump/librump/rumpkern: emul.c lwproc.c

 Log Message:
 Rumpkernel has its own thread deallocation.  Add missing fstrans_lwp_dtor()
 to lwproc_freelwp().

 PR bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad


 To generate a diff of this commit:
 cvs rdiff -u -r1.189 -r1.190 src/sys/rump/librump/rumpkern/emul.c
 cvs rdiff -u -r1.40 -r1.41 src/sys/rump/librump/rumpkern/lwproc.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: gson@gson.org (Andreas Gustafsson)
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Sat, 25 Jan 2020 10:05:58 +0200

 The rump/rumpkern/t_sp/stress_{long,short} tests are once again failing
 on my bare metal amd64 testbed (currently a dual Intel Xeon L5630).
 The problem reappeared with these commits:

   commit 2020.01.12.22.03.22 ad src/sys/kern/kern_exec.c 1.488
   commit 2020.01.12.22.03.22 ad src/sys/kern/kern_runq.c 1.58
   commit 2020.01.12.22.03.23 ad src/sys/sys/lwp.h 1.196
   commit 2020.01.12.22.03.23 ad src/sys/sys/sched.h 1.87

 For logs, see:

   http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.01.html#2020.01.12.22.03.23

 -- 
 Andreas Gustafsson, gson@gson.org

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: Andreas Gustafsson <gson@gson.org>, joerg@netbsd.org, martin@netbsd.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
 Quad
Date: Sat, 25 Jan 2020 13:39:56 +0000

 On Sat, Jan 25, 2020 at 08:10:01AM +0000, Andreas Gustafsson wrote:

 >  The rump/rumpkern/t_sp/stress_{long,short} tests are once again failing
 >  on my bare metal amd64 testbed (currently a dual Intel Xeon L5630).
 >  The problem reappeared with these commits:
 >  
 >    commit 2020.01.12.22.03.22 ad src/sys/kern/kern_exec.c 1.488
 >    commit 2020.01.12.22.03.22 ad src/sys/kern/kern_runq.c 1.58
 >    commit 2020.01.12.22.03.23 ad src/sys/sys/lwp.h 1.196
 >    commit 2020.01.12.22.03.23 ad src/sys/sys/sched.h 1.87

 The distribution of new LWPs has changed, and it's exposing very old bugs in
 libpthread and/or rump.  We have identified and fixed some but it looks like
 there are still more.  I first ran into this with libmicro back in November
 and Joerg has been seeing it on pbulk runs.  I'll dig in further when I get
 a chance.

 Andrew

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
 Quad
Date: Sat, 25 Jan 2020 17:31:32 +0000

 This may help.  I'm trying a run now.

 ----- Forwarded message from Andrew Doran <ad@netbsd.org> -----

 Date: Sat, 25 Jan 2020 15:41:52 +0000
 From: Andrew Doran <ad@netbsd.org>
 To: source-changes@NetBSD.org
 Subject: CVS commit: src
 X-Mailer: log_accum

 Module Name:	src
 Committed By:	ad
 Date:		Sat Jan 25 15:41:52 UTC 2020

 Modified Files:
 	src/lib/libpthread: pthread.c
 	src/sys/compat/netbsd32: netbsd32_lwp.c
 	src/sys/kern: sys_lwp.c
 	src/sys/sys: lwp.h

 Log Message:
 - Fix a race between the kernel and libpthread, where a new thread can start
   life without its self->pt_lid being filled in.

 - Fix an error path in _lwp_create().  If the new LID can't be copied out,
   then get rid of the new LWP (i.e. either succeed or fail, not both).

 - Mark l_dopreempt and l_nopreempt volatile in struct lwp.


 To generate a diff of this commit:
 cvs rdiff -u -r1.154 -r1.155 src/lib/libpthread/pthread.c
 cvs rdiff -u -r1.19 -r1.20 src/sys/compat/netbsd32/netbsd32_lwp.c
 cvs rdiff -u -r1.71 -r1.72 src/sys/kern/sys_lwp.c
 cvs rdiff -u -r1.197 -r1.198 src/sys/sys/lwp.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.


 ----- End forwarded message -----
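
 For context, the first item in that log message describes a window like
 the following (a schematic sketch with illustrative names, not the real
 code in src/lib/libpthread/pthread.c):

   #include <sys/types.h>
   #include <lwp.h>
   #include <ucontext.h>

   struct pthread_sim {            /* illustrative, not the real struct */
           lwpid_t pt_lid;         /* filled in via _lwp_create() */
   };

   static void
   create_thread(struct pthread_sim *new, ucontext_t *uc)
   {
           /*
            * _lwp_create() makes the new LWP runnable and stores its
            * LID through the third argument.  Before the fix, the new
            * thread could be scheduled and read new->pt_lid (its own
            * self->pt_lid) while the field still held 0.
            */
           (void)_lwp_create(uc, 0, &new->pt_lid);
   }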

From: "Andrew Doran" <ad@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/50350 CVS commit: src/lib/libpthread
Date: Sat, 25 Jan 2020 17:58:28 +0000

 Module Name:	src
 Committed By:	ad
 Date:		Sat Jan 25 17:58:28 UTC 2020

 Modified Files:
 	src/lib/libpthread: pthread_mutex.c

 Log Message:
 pthread__mutex_unlock_slow(): ignore the DEFERRED bit.  Its only purpose
 is to get the thread to go through the slow path.  If there are waiters,
 process them there and then.  Should not affect well behaved apps.  Maybe
 of help for:

 PR bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad


 To generate a diff of this commit:
 cvs rdiff -u -r1.66 -r1.67 src/lib/libpthread/pthread_mutex.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andreas Gustafsson <gson@gson.org>
To: Andrew Doran <ad@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Sun, 26 Jan 2020 11:10:05 +0200

 Andrew Doran wrote:
 >  This may help.  I'm trying a run now.
 [...]
 >  cvs rdiff -u -r1.154 -r1.155 src/lib/libpthread/pthread.c
 >  cvs rdiff -u -r1.19 -r1.20 src/sys/compat/netbsd32/netbsd32_lwp.c
 >  cvs rdiff -u -r1.71 -r1.72 src/sys/kern/sys_lwp.c
 >  cvs rdiff -u -r1.197 -r1.198 src/sys/sys/lwp.h

 Alas, it did not help; the stress_long and stress_short test cases are
 still failing.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
 Quad
Date: Mon, 27 Jan 2020 12:21:04 +0000

 Away from the insane asylum that is rump, I found a case where a thread has
 called exit() but one of the other threads in the process hasn't gotten the
 message:

   PID   LID USERNAME PRI STATE      TIME   WCPU    CPU NAME      COMMAND
   919    11 root      27 CPU/2      0:06 99.11% 99.02% -         rwlock
   919    38 root      69 lwpwai/1   0:06  0.00%  0.00% -         rwlock

 This is likely connected.  Fixing this may not solve all the problems, but
 can only make the situation better.

 Andrew

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
 Quad
Date: Mon, 27 Jan 2020 22:50:47 +0000

 Progress of a sort.  With today's changes the rump stress tests survive more
 often for me, but I do see other intermittent failures.  I will try to catch
 and examine one.

 Failed test cases:
     lib/libc/sys/t_ptrace_wait6:resume1, lib/librumpclient/t_exec:threxec, net/sys/t_rfc6056:inet6

 Summary for 850 test programs:
     8505 passed test cases.
     3 failed test cases.
     46 expected failed test cases.
     325 skipped test cases.

From: Andreas Gustafsson <gson@gson.org>
To: Andrew Doran <ad@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad
Date: Tue, 28 Jan 2020 10:19:22 +0200

 Andrew Doran wrote:
 >  Progress of a sort.  With today's changes the rump stress tests survive more
 >  often for me, but I do see other intermittent failures.  I will try to catch
 >  and examine one.
 >  
 >  Failed test cases:
 >      lib/libc/sys/t_ptrace_wait6:resume1, lib/librumpclient/t_exec:threxec, net/sys/t_rfc6056:inet6

 The resume1 failures are also known as PR 54893.  The threxec test
 triggered a panic in the most recent run on my testbed:

   threxec: [ 3069.7081653] panic: kernel diagnostic assertion "prio < PRI_COUNT" failed: file "/tmp/bracket/build/2020.01.27.23.26.15-amd64-baremetal/src/sys/kern/kern_runq.c", line 177

 The full console log is here:

   http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.01.27.23.26.15/test.log

 I have not seen t_rfc6056:inet6 fail, but maybe it would have if the
 system hadn't panicked first.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
 Quad
Date: Tue, 28 Jan 2020 12:54:32 +0000

 On Tue, Jan 28, 2020 at 08:20:02AM +0000, Andreas Gustafsson wrote:

 >  The resume1 failures are also known as PR 54893.  The threxec test
 >  triggered a panic in the most recent run on my testbed:
 >  
 >    threxec: [ 3069.7081653] panic: kernel diagnostic assertion "prio < PRI_COUNT" failed: file "/tmp/bracket/build/2020.01.27.23.26.15-amd64-baremetal/src/sys/kern/kern_runq.c", line 177

 Wow that's amazing.  I also ran into that one today and was able to poke
 about with ddb.  Something is corrupting l_priority by the look of it.  I'll
 see what can be done.

 Andrew

From: Andrew Doran <ad@netbsd.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: bin/50350: rump/rumpkern/t_sp/stress_{long,short} fail on Core 2
 Quad
Date: Tue, 28 Jan 2020 13:39:03 +0000

 Note to self.  The child processes forked by the stressful rump program take
 a SIGUSR1, notice it, and start to exit.  It looks like one or two of them
 get stuck in the rump client library.  Here's one that's on the way out but
 is stuck:

    0 1966 1992       0   8    4  43  0   57572  1852 parked   I    pts/1 0:00.38 |           |             |-- /usr/tests/rump/rumpkern/h_client/h_stresscli 1
    0 1966 1992       0   7    4  43  0   57572  1852 -        Z    pts/1 0:01.49 |           |             |-- /usr/tests/rump/rumpkern/h_client/h_stresscli 1
    0 1966 1992       0   6    4  85  0   57572  1852 kqueue   I    pts/1 0:00.00 |           |             |-- /usr/tests/rump/rumpkern/h_client/h_stresscli 1
    0 1966 1992       0   1    4  85  0   57572  1852 lwpwait  I    pts/1 0:00.00 |           |             |-- /usr/tests/rump/rumpkern/h_client/h_stresscli 1

Responsible-Changed-From-To: bin-bug-people->kern-bug-people
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Fri, 28 Aug 2020 19:15:31 +0000
Responsible-Changed-Why:
Seems to be a kernel/scheduler bug


>Unformatted:
