NetBSD Problem Report #53884

From gson@gson.org  Wed Jan 16 11:39:42 2019
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id D56607A1C1
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 16 Jan 2019 11:39:42 +0000 (UTC)
Message-Id: <20190116113936.A98439892D7@guava.gson.org>
Date: Wed, 16 Jan 2019 13:39:36 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: fs/vfs/t_rmdirrace:lfs_race test case fails randomly on real hardware
X-Send-Pr-Version: 3.95

>Number:         53884
>Category:       kern
>Synopsis:       fs/vfs/t_rmdirrace:lfs_race test case fails randomly on real hardware
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    dholland
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jan 16 11:40:00 +0000 2019
>Closed-Date:    Fri Jan 25 09:19:25 +0000 2019
>Last-Modified:  Fri Jan 25 09:19:25 +0000 2019
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2015.10.13.20.49.39
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

When the ATF tests are run on real amd64 hardware (an 8-core HP DL360 G7),
the lfs_race test case of the fs/vfs/t_rmdirrace test program fails
randomly, in about half the runs, with

  panic: kernel diagnostic assertion "fs->lfs_cleaner_thread == curlwp" failed

Log output from the latest failure:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2019/2019.01.15.14.23.56/test.html#fs_vfs_t_rmdirrace_lfs_race

The first recorded failure was on 2015.10.13.20.49.39:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2015/2015.10.13.20.49.39/test.html#fs_vfs_t_rmdirrace_lfs_race

The test case is not failing on the qemu-based TNF testbed, which
emulates a uniprocessor system.

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->dholland
Responsible-Changed-By: dholland@NetBSD.org
Responsible-Changed-When: Fri, 18 Jan 2019 10:12:42 +0000
Responsible-Changed-Why:
lfs


From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53884: fs/vfs/t_rmdirrace:lfs_race test case fails randomly
 on real hardware
Date: Fri, 18 Jan 2019 10:23:36 +0000

 On Wed, Jan 16, 2019 at 11:40:00AM +0000, Andreas Gustafsson wrote:
  >   panic: kernel diagnostic assertion "fs->lfs_cleaner_thread == curlwp" failed

 The only way this should be possible is to have two cleaners running
 at once, which... I wouldn't put past the rump test framework, I
 guess, but it doesn't seem that likely, and it also shouldn't vary with
 the number of cpus on the system. Or so one would think.

 Is it possible to check for extra lfs_cleanerd processes the next time
 this happens?

 -- 
 David A. Holland
 dholland@netbsd.org

From: Andreas Gustafsson <gson@gson.org>
To: David Holland <dholland-bugs@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53884: fs/vfs/t_rmdirrace:lfs_race test case fails randomly
 on real hardware
Date: Fri, 18 Jan 2019 12:58:51 +0200

 David,

 You wrote:
 >   >   panic: kernel diagnostic assertion "fs->lfs_cleaner_thread == curlwp" failed
 >  
 >  The only way this should be possible is to have two cleaners running
 >  at once, which... I wouldn't put past the rump test framework, I
 >  guess, but it doesn't seem that likely and also shouldn't vary under
 >  the number of cpus on the system. Or so one would think.
 >  
 >  Is it possible to check for extra lfs_cleanerd processes the next time
 >  this happens?

 There are no lfs_cleanerd processes running after atf-run has exited,
 if that's what you mean.  The "anita test" command runs "ps -glaxw"
 after the tests, prefixing each line with "ps-post-test", and there
 were no unexpected processes shown in the console log from the last
 run where the lfs_race test case failed:

   ps-post-test UID   PID PPID  CPU PRI NI	  VSZ	 RSS WCHAN   STAT TTY	   TIME COMMAND
   ps-post-test   0     0	  0    0   0  0	    0 126868 -	     OKl  ?	0:33.26 [system]
   ps-post-test   0     1	  0   76  85  0 12104	1316 wait    Ss	  ?	0:00.09 init
   ps-post-test   0   167	  1    0  85  0 24188	2136 kqueue  Ss	  ?	0:00.11 /usr/sbin/syslogd -s
   ps-post-test   0   306	  1    0  85  0 12124	1136 kqueue  Is	  ?	0:00.00 /usr/sbin/powerd
   ps-post-test   0   586	  1    0  85  0 51068	2492 kqueue  Is	  ?	0:00.02 /usr/libexec/postfix/master -w
   ps-post-test  12   588	586    0  85  0 51584	3780 kqueue  I	  ?	0:00.02 pickup -l -t unix -u
   ps-post-test  12   595	586    0  85  0 49828	3808 kqueue  I	  ?	0:00.01 qmgr -l -t unix -u
   ps-post-test   0   596	  1    0  85  0 14232	1200 kqueue  Is	  ?	0:00.00 /usr/sbin/inetd -l
   ps-post-test   0   607	  1    0  85  0 12132	1376 nanoslp Is	  ?	0:00.01 /usr/sbin/cron
   ps-post-test   0    77	536 2728  85  0	 7876	1180 pipe_rd S+	  tty00 0:00.00 sed s/^/ps-post-test /
   ps-post-test   0   536	633 2728  85  0 12380	1732 wait    S	  tty00 0:00.03 /bin/sh
   ps-post-test   0   633	  1   60  85  0 66104	4164 wait    Is	  tty00 0:00.02 login
   ps-post-test   0 23032	536 2728  43  0 12188	1244 -	     O+	  tty00 0:00.00 ps -glaxw
   ps-post-test   0   465	  1   60  85  0 12128	1296 ttyraw  Is+  ttyE1 0:00.00 /usr/libexec/getty Pc ttyE1
   ps-post-test   0   634	  1   60  85  0 12260	1296 ttyraw  Is+  ttyE2 0:00.00 /usr/libexec/getty Pc ttyE2
   ps-post-test   0   527	  1   60  85  0 12128	1292 ttyraw  Is+  ttyE3 0:00.00 /usr/libexec/getty Pc ttyE3

 If you would like to modify the kernel or test to provide more
 debugging information and don't want to commit the changes, you can
 send me a patch.

 This may be unrelated, but I see that on the qemu-based TNF testbed
 (where the test passes), lfs_cleanerd sometimes logs the error message
 "couldn't open device rrumpfs for reading":

   babylon5.netbsd.org$ cd /bracket/amd64/results/
   babylon5.netbsd.org$ zgrep 'lfs_race:' 2019/*/test.log.gz
   2019/2019.01.01.10.09.26/test.log.gz:    lfs_race: [25.983353s] Passed.
   2019/2019.01.01.14.01.46/test.log.gz:    lfs_race: [25.433668s] Passed.
   2019/2019.01.02.03.04.26/test.log.gz:    lfs_race: [25.344210s] Passed.
   2019/2019.01.02.09.04.09/test.log.gz:    lfs_race: [25.431315s] Passed.
   2019/2019.01.02.22.58.44/test.log.gz:    lfs_race: Jan  3 22:27:14  lfs_cleanerd[27084]: couldn't open device rrumpfs for reading
   2019/2019.01.03.15.33.06/test.log.gz:    lfs_race: [25.992201s] Passed.
   2019/2019.01.04.10.25.39/test.log.gz:    lfs_race: [24.285902s] Passed.
   2019/2019.01.05.05.40.00/test.log.gz:    lfs_race: [25.598234s] Passed.
   2019/2019.01.05.10.51.06/test.log.gz:    lfs_race: Jan  6 05:42:53  lfs_cleanerd[27176]: couldn't open device rrumpfs for reading
   2019/2019.01.06.11.20.53/test.log.gz:    lfs_race: [26.268811s] Passed.
   2019/2019.01.06.18.56.52/test.log.gz:    lfs_race: Jan  9 14:56:42  lfs_cleanerd[26388]: couldn't open device rrumpfs for reading
   2019/2019.01.07.05.01.10/test.log.gz:    lfs_race: [25.582419s] Passed.
   2019/2019.01.08.06.29.35/test.log.gz:    lfs_race: [23.349863s] Passed.
   2019/2019.01.09.04.02.26/test.log.gz:    lfs_race: [23.915846s] Passed.
   2019/2019.01.10.19.00.17/test.log.gz:    lfs_race: [24.829028s] Passed.
   2019/2019.01.11.02.44.49/test.log.gz:    lfs_race: [25.661166s] Passed.
   2019/2019.01.11.08.30.19/test.log.gz:    lfs_race: [23.367517s] Passed.
   2019/2019.01.11.15.43.51/test.log.gz:    lfs_race: [25.536107s] Passed.
   2019/2019.01.11.23.10.41/test.log.gz:    lfs_race: [25.789131s] Passed.
   2019/2019.01.12.17.25.09/test.log.gz:    lfs_race: [25.472970s] Passed.
   2019/2019.01.13.06.59.15/test.log.gz:    lfs_race: [26.537625s] Passed.
   2019/2019.01.13.10.01.07/test.log.gz:    lfs_race: [25.320557s] Passed.
   2019/2019.01.13.10.43.22/test.log.gz:    lfs_race: Jan 15 05:31:26  lfs_cleanerd[5727]: couldn't open device rrumpfs for reading
   2019/2019.01.13.16.48.51/test.log.gz:    lfs_race: Jan 16 04:26:48  lfs_cleanerd[20726]: couldn't open device rrumpfs for reading
   2019/2019.01.14.03.30.25/test.log.gz:    lfs_race: [24.650734s] Passed.
   2019/2019.01.14.21.29.56/test.log.gz:    lfs_race: [23.333059s] Passed.
   2019/2019.01.15.14.23.56/test.log.gz:    lfs_race: [25.056744s] Passed.
   2019/2019.01.16.08.32.24/test.log.gz:    lfs_race: [26.300848s] Passed.
   2019/2019.01.16.13.54.17/test.log.gz:    lfs_race: [25.688629s] Passed.

 -- 
 Andreas Gustafsson, gson@gson.org

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53884: fs/vfs/t_rmdirrace:lfs_race test case fails randomly
 on real hardware
Date: Fri, 18 Jan 2019 12:12:33 +0100

 On Fri, Jan 18, 2019 at 11:00:02AM +0000, Andreas Gustafsson wrote:
 >    2019/2019.01.06.18.56.52/test.log.gz:    lfs_race: Jan  9 14:56:42  lfs_cleanerd[26388]: couldn't open device rrumpfs for reading

 I see those as well on the console on all my test runs.

 But I am confused: are we talking about a cleaner process inside rump,
 and would the host ps show that?

 Martin

From: Andreas Gustafsson <gson@gson.org>
To: Martin Husemann <martin@duskware.de>, dholland@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53884: fs/vfs/t_rmdirrace:lfs_race test case fails randomly
 on real hardware
Date: Fri, 18 Jan 2019 18:41:33 +0200

 Martin wrote:
 >  But I am confused: are we talking about a cleaner process inside rump,
 >  and would the host ps show that?

 Good question.  I had a look at the code, and apparently the test is running
 entirely within the rump kernel process and creating a cleaner thread rather
 than a cleaner process, so it would not show up in the host ps.
 See lfs_fstest_mount() and cleaner() in src/tests/fs/common/fstest_lfs.c.

 I also managed to reproduce the failure on 8.0 by manually running just the
 t_rmdirrace test program rather than the full ATF test suite, as in

  cd /usr/tests/fs/vfs
  atf-run t_rmdirrace | tee /tmp/log.tps | atf-report

 It failed on the 3rd or 4th run.

 I also tried

   cd /usr/tests/fs/vfs
   chown -R gson .
   ./t_rmdirrace lfs_race

 and it failed the first time, leaving a t_rmdirrace.core that I can
 examine using gdb, with symbols even.  What should I look for?
 -- 
 Andreas Gustafsson, gson@gson.org

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53884: fs/vfs/t_rmdirrace:lfs_race test case fails randomly
 on real hardware
Date: Fri, 18 Jan 2019 17:48:49 +0000

 On Fri, Jan 18, 2019 at 04:45:01PM +0000, Andreas Gustafsson wrote:
  >  and it failed the first time, leaving a t_rmdirrace.core that I can
  >  examine using gdb, with symbols even.  What should I look for?

 So, at one place deep inside, LFS wants to know whether it's running in
 the cleaner; in lfs_bmapv, which is one of the cleaner syscalls that
 nothing else should get to, it does the assign-and-assert that fails.

 So the first thing to check is: is fs->lfs_cleaner_thread some other
 lwp (if so, what, and what's that one doing, and how did the current
 one that asserted get to lfs_bmapv?) or is it trash?

 If it's trash it would be handy to know if it starts out trash (the
 structure it's in is kmem_zalloc'd, but maybe that's borked in rump)
 or if it gets overwritten later, or any other info that can be
 discovered...
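
 The assign-and-assert pattern in question looks roughly like the
 following stand-alone sketch (an illustration of the idea only, not the
 literal lfs_syscalls.c code): the first caller to reach the cleaner
 syscall records its thread identity, and every call then asserts that
 it sees the same identity that was recorded.

 	#include <assert.h>
 	#include <stddef.h>

 	struct fake_lfs {
 		const void *lfs_cleaner_thread;	/* identity of the cleaner */
 	};

 	static void
 	fake_bmapv(struct fake_lfs *fs, const void *cur)
 	{
 		if (fs->lfs_cleaner_thread == NULL)
 			fs->lfs_cleaner_thread = cur;
 		/* the diagnostic assertion that panics in this PR */
 		assert(fs->lfs_cleaner_thread == cur);
 	}

 	int
 	main(void)
 	{
 		struct fake_lfs fs = { NULL };
 		int id_a, id_b;

 		fake_bmapv(&fs, &id_a);	/* records &id_a as the cleaner */
 		fake_bmapv(&fs, &id_a);	/* same identity: passes */
 		fake_bmapv(&fs, &id_b);	/* different identity: assert fires */
 		return 0;
 	}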


 -- 
 David A. Holland
 dholland@netbsd.org

From: Andreas Gustafsson <gson@gson.org>
To: David Holland <dholland-bugs@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53884: fs/vfs/t_rmdirrace:lfs_race test case fails randomly
 on real hardware
Date: Fri, 18 Jan 2019 20:01:57 +0200

 David Holland wrote:
 >  So the first thing to check is: is fs->lfs_cleaner_thread some other
 >  lwp (if so, what, and what's that one doing, and how did the current
 >  one that asserted get to lfs_bmapv?) or is it trash?

 (gdb) print fs->lfs_cleaner_thread
 $7 = (struct lwp *) 0x78282b5b0000

 (gdb) print *fs->lfs_cleaner_thread
 $8 = {l_runq = {tqe_next = 0x0, tqe_prev = 0x0}, l_sched = {info = 0x0, timeslice = 0}, 
   l_cpu = 0x7828394fc9c0 <rump_bootcpu>, l_mutex = 0x7828394fc988 <rumpns_unruntime_lock>, 
   l_ctxswtch = 0, l_addr = 0x0, l_md = {md_regs = 0x0, md_gc_pmap = 0x0, md_gc_ptp = 0x0, 
     md_flags = 0, md_astpending = 0}, l_flag = 1048576, l_stat = 2, l_rtime = {sec = 0, frac = 0}, 
   l_stime = {sec = 0, frac = 0}, l_swtime = 0, l_rticks = 0, l_rticksum = 0, l_slpticks = 0, 
   l_slpticksum = 0, l_biglocks = 0, l_class = 0, l_kpriority = 0, l_kpribase = 0, l_priority = 0, 
   l_inheritedprio = 0, l_protectprio = 0, l_auxprio = 0, l_protectdepth = 0, l_pi_lenders = {
     slh_first = 0x0}, l_ncsw = 5, l_nivcsw = 0, l_cpticks = 0, l_pctcpu = 0, l_estcpu = 0, 
   l_psid = 0, l_target_cpu = 0x7828394fc9c0 <rump_bootcpu>, l_lwpctl = 0x0, l_lcpage = 0x0, 
   l_affinity = 0x0, l_ts = 0x0, l_syncobj = 0x0, l_sleepchain = {tqe_next = 0x0, tqe_prev = 0x0}, 
   l_wchan = 0x0, l_wmesg = 0x0, l_sleepq = 0x0, l_sleeperr = 0, l_slptime = 0, l_timeout_ch = {
     _c_store = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, l_emap_gen = 0, l_waitcv = {
     cv_opaque = {0x0, 0x0, 0x0}}, l_list = {le_next = 0x78283c505000, le_prev = 0x78282b5b09b0}, 
   l_ctxlink = 0x0, l_proc = 0x78283d59bb40, l_sibling = {le_next = 0x78283d598000, 
     le_prev = 0x78282b5b09d0}, l_waiter = 0, l_waitingfor = 0, l_prflag = 0, l_refcnt = 0, 
   l_lid = 83289, l_name = 0x0, l_selflag = 0, l_selwait = {slh_first = 0x0}, l_selret = 0, 
   l_selrec = 0, l_selcluster = 0x0, l_selbits = 0x0, l_selni = 0, l_sigrestore = 0, 
   l_sigwaitset = {__bits = {0, 0, 0, 0}}, l_sigcv = {cv_opaque = {0x0, 0x0, 0x0}}, 
   l_sigwaited = 0x0, l_sigpendset = 0x0, l_sigwaiter = {le_next = 0x0, le_prev = 0x0}, l_sigstk = {
     ss_sp = 0x0, ss_size = 0, ss_flags = 0}, l_sigmask = {__bits = {0, 0, 0, 0}}, l_sigpend = {
     sp_info = {tqh_first = 0x0, tqh_last = 0x0}, sp_set = {__bits = {0, 0, 0, 0}}}, 
   l_sigoldmask = {__bits = {0, 0, 0, 0}}, l_specdataref = {specdataref_container = 0x0, 
     specdataref_lock = {u = {mtxa_owner = 132113907130496}}}, l_ktrcsw = {tv_sec = 0, 
     tv_nsec = 0}, l_private = 0x0, l_switchto = 0x0, l_cred = 0x78283e948f00, 
   l_fd = 0x78283d59d9c0, l_emuldata = 0x0, l_cv_signalled = 0, l_shlocks = 0, l_exlocks = 0, 
   l_unused = 0, l_blcnt = 0, l_nopreempt = 0, l_dopreempt = 0, l_pflag = 0, l_dupfd = 0, 
   l_sysent = 0x0, l_ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, 
       tv_usec = 0}, ru_maxrss = 0, ru_ixrss = 0, ru_idrss = 0, ru_isrss = 0, ru_minflt = 0, 
     ru_majflt = 0, ru_nswap = 0, ru_inblock = 0, ru_oublock = 0, ru_msgsnd = 0, ru_msgrcv = 0, 
     ru_nsignals = 0, ru_nvcsw = 0, ru_nivcsw = 0}, l_pfailtime = 0, l_pfailaddr = 0, 
   l_pfaillock = 0, l_ld_locks = {tqh_first = 0x0, tqh_last = 0x78282b5b03f8}, l_tcgen = 0, 
   l_syscall_time = 0, l_syscall_counter = 0x0, l_dtrace = 0x0}

 I have so far been unable to figure out how to print the value that
 curlwp refers to, because curlwp is a macro that expands into a
 function call that disappears into a maze of #ifdefs.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: David Holland <dholland-bugs@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53884: fs/vfs/t_rmdirrace:lfs_race test case fails randomly
 on real hardware
Date: Fri, 18 Jan 2019 20:19:26 +0200

 > and how did the current one that asserted get to lfs_bmapv?

 (gdb) where
 #0  0x00007828372fe4aa in _lwp_kill () from /usr/lib/libc.so.12
 #1  0x00007828372fdd52 in abort () at /usr/src/lib/libc/stdlib/abort.c:74
 #2  0x0000782838e079dd in rumpuser_exit (rv=rv@entry=-1) at /usr/src/lib/librumpuser/rumpuser.c:236
 #3  0x00007828392a1245 in cpu_reboot (howto=<optimized out>, bootstr=<optimized out>)
     at /usr/src/lib/librump/../../sys/rump/librump/rumpkern/emul.c:385
 #4  0x0000782839264b00 in vpanic (fmt=0x78283ce445e8 "kernel %sassertion \"%s\" failed: file \"%s\", line %d ", ap=0x78282caef5c8)
     at /usr/src/lib/librump/../../sys/rump/../kern/subr_prf.c:342
 #5  0x000078283924ec4d in kern_assert (fmt=<optimized out>) at /usr/src/lib/librump/../../sys/rump/../lib/libkern/kern_assert.c:51
 #6  0x000078283ce33d23 in lfs_bmapv (l=<optimized out>, fsidp=<optimized out>, blkiov=0x78283e538148, blkcnt=6)
     at /usr/src/sys/rump/fs/lib/liblfs/../../../../ufs/lfs/lfs_syscalls.c:662
 #7  0x000078283ce21106 in lfs_fcntl (v=0x78282caef750) at /usr/src/sys/rump/fs/lib/liblfs/../../../../ufs/lfs/lfs_vnops.c:2000
 #8  0x00007828392582aa in VOP_FCNTL (vp=0x78283d5a6d28, command=<optimized out>, data=<optimized out>, fflag=<optimized out>, 
     cred=<optimized out>) at /usr/src/lib/librump/../../sys/rump/../kern/vnode_if.c:648
 #9  0x0000782839248ead in fcntl_forfs (fd=<optimized out>, arg=<optimized out>, cmd=-1341076464, fp=0x78283d59ed80)
     at /usr/src/lib/librump/../../sys/rump/../kern/sys_descrip.c:221
 #10 sys_fcntl (l=<optimized out>, uap=<optimized out>, retval=<optimized out>)
     at /usr/src/lib/librump/../../sys/rump/../kern/sys_descrip.c:376
 #11 0x00007828392a1f2f in sy_call (rval=0x78282caef8f0, uap=0x78282caef900, l=0x78282b5bb800, sy=<optimized out>)
     at /usr/src/lib/librump/../../sys/rump/../sys/syscallvar.h:65
 #12 sy_invoke (code=92, rval=0x78282caef8f0, uap=0x78282caef900, l=0x78282b5bb800, sy=<optimized out>)
     at /usr/src/lib/librump/../../sys/rump/../sys/syscallvar.h:94
 #13 rump_syscall (num=num@entry=92, data=data@entry=0x78282caef900, dlen=dlen@entry=24, retval=retval@entry=0x78282caef8f0)
     at /usr/src/lib/librump/../../sys/rump/librump/rumpkern/rump.c:762
 #14 0x0000782839296e81 in rump___sysimpl_fcntl (fd=<optimized out>, cmd=<optimized out>, arg=<optimized out>)
     at /usr/src/lib/librump/../../sys/rump/librump/rumpkern/rump_syscalls.c:1292
 #15 0x000000007720d6af in toss_old_blocks () at /usr/src/tests/fs/vfs/t_rmdirrace.c:106
 #16 0x000000007720e73b in lfs_cleaner_main () at /usr/src/tests/fs/vfs/t_rmdirrace.c:106
 #17 0x000000007720a001 in ?? () at /usr/src/tests/fs/vfs/t_rmdirrace.c:106
 #18 0x0000782838a0b8b5 in pthread__create_tramp (cookie=0x78283b9d4000) at /usr/src/lib/libpthread/pthread.c:593
 #19 0x0000782837281b60 in ?? () from /usr/lib/libc.so.12
 Backtrace stopped: Cannot access memory at address 0x78282caf0000

 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: dholland@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53884: fs/vfs/t_rmdirrace:lfs_race test case fails randomly on real hardware
Date: Sun, 20 Jan 2019 14:21:19 +0200

 I can make the test fail reliably by increasing the sleep(10) in
 src/tests/fs/vfs/t_rmdirrace.c to sleep(100).

 My current theory is that the test is falling victim to the
 following rump behavior documented in rump_lwproc(3):

      An implicit thread context is created every time the rump kernel
      is entered and disbanded upon exit.

 and this comment in rump_schedule():

 	/*
 	 * If there is no dedicated lwp, allocate a temp one and
 	 * set it to be free'd upon unschedule().  Use lwp0 context
 	 * for reserving the necessary resources.  Don't optimize
 	 * for this case -- anyone who cares about performance will
 	 * start a real thread.
 	 */

 That is, every call to lfs_bmapv() from the cleaner thread gets a new
 rump kernel lwp object (despite being made from the same host pthread),
 so the second call will always fail the assertion that the lwp is the
 same as in the first call.

 I'm now testing a patch that adds a call to rump_pub_lwproc_newlwp()
 from the cleaner thread, like those that already exist in the worker
 threads of the tmpfs test in tests/fs/tmpfs/t_renamerace.c.
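
 The patch wraps the cleaner thread roughly along the following lines
 (a sketch of the idea only, not the actual diff; the pid argument of 1
 is just an illustrative choice of a process that always exists in the
 rump kernel):

 	#include <rump/rump.h>

 	static void *
 	cleaner(void *arg)
 	{
 		/*
 		 * Give the cleaner its own rump lwp, so that all of its
 		 * rump system calls run in the same lwp context instead
 		 * of a fresh implicit one per call.
 		 */
 		rump_pub_lwproc_newlwp(1);

 		/* ... existing cleaner loop ... */

 		rump_pub_lwproc_releaselwp();
 		return NULL;
 	}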
 -- 
 Andreas Gustafsson, gson@gson.org

From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53884 CVS commit: src/tests/fs/common
Date: Sun, 20 Jan 2019 14:50:58 +0000

 Module Name:	src
 Committed By:	gson
 Date:		Sun Jan 20 14:50:58 UTC 2019

 Modified Files:
 	src/tests/fs/common: fstest_lfs.c

 Log Message:
 Call rump_pub_lwproc_newlwp() in the lfs cleaner thread to make its
 rump system calls all use the same lwp.  Hopefully this will fix
 PR kern/53884.  Also call rump_pub_lwproc_releaselwp() at the end.


 To generate a diff of this commit:
 cvs rdiff -u -r1.5 -r1.6 src/tests/fs/common/fstest_lfs.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53884: fs/vfs/t_rmdirrace:lfs_race test case fails randomly
 on real hardware
Date: Sun, 20 Jan 2019 19:15:30 +0000

 On Sun, Jan 20, 2019 at 12:25:01PM +0000, Andreas Gustafsson wrote:
  >  My current theory is that the test is falling victim to the
  >  following rump behavior documented in rump_lwproc(3):
  >   
  >       An implicit thread context is created every time the rump kernel
  >       is entered and disbanded upon exit.

 That's very likely.

 -- 
 David A. Holland
 dholland@netbsd.org

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Fri, 25 Jan 2019 09:19:25 +0000
State-Changed-Why:
No failures in the last 18 runs.  Thanks to dholland for the analysis.


>Unformatted:
