NetBSD Problem Report #53931

From gson@gson.org  Fri Feb  1 15:28:41 2019
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 1DADA7A1AC
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  1 Feb 2019 15:28:41 +0000 (UTC)
Message-Id: <20190201152835.7C8A89892C2@guava.gson.org>
Date: Fri,  1 Feb 2019 17:28:35 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: posix_fadvise_reg test case fails randomly on real hardware
X-Send-Pr-Version: 3.95

>Number:         53931
>Notify-List:    riastradh@NetBSD.org
>Category:       kern
>Synopsis:       posix_fadvise_reg test case fails randomly on real hardware
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    riastradh
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 01 15:30:00 +0000 2019
>Last-Modified:  Sun Apr 13 23:51:28 +0000 2025
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

The posix_fadvise_reg test case of the lib/libc/sys/t_posix_fadvise
test program is failing randomly on real amd64 hardware, with six
failures in the last 30 runs on my bare metal testbed.  It fails
with the message

  t_posix_fadvise.c:135: errno != 999: got: Operation already in progress

Log output from the latest failure:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2019/2019.01.31.00.27.52/test.html#lib_libc_sys_t_posix_fadvise_posix_fadvise_reg

The first recorded failure was with source date 2015.10.30.03.08.56:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2015/2015.10.30.03.08.56/test.html#lib_libc_sys_t_posix_fadvise_posix_fadvise_reg

It's passing reliably on the qemu-based TNF testbed, with no failures
in 2018 or so far in 2019.

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53931: posix_fadvise_reg test case fails randomly on real
 hardware
Date: Sat, 2 Feb 2019 18:45:56 +0000

 On Fri, Feb 01, 2019 at 03:30:00PM +0000, Andreas Gustafsson wrote:
  > The posix_fadvise_reg test case of the lib/libc/sys/t_posix_fadvise
  > test program is failing randomly on real amd64 hardware, with six
  > failures in the last 30 runs on my bare metal testbed.  It fails
  > with the message
  > 
  >   t_posix_fadvise.c:135: errno != 999: got: Operation already in progress

 The system call cannot generate EINPROGRESS, and furthermore, the
 system call does not touch errno (it is one of the broken POSIX
 innovations that returns an errno value instead) so something in the
 rump plumbing must be doing it.
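
 A minimal sketch of the calling convention described above, not taken
 from the test itself: posix_fadvise reports failure through its return
 value, so a caller compares the result against 0 and never looks at
 errno.  The helper name advise_sequential is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* ask for the posix_fadvise declaration */
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: advise sequential access on fd and report
 * failure using the return value only.
 */
static int
advise_sequential(int fd)
{
	int error;

	/* posix_fadvise returns 0 or an errno value; it never returns -1. */
	error = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
	if (error != 0)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(error));
	return error;
}
```

 Called with an invalid descriptor such as -1, the helper is expected to
 return EBADF directly, whatever errno holds afterwards.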

 Does rump actually have a means for handling these broken syscalls
 correctly, and if so, is posix_fadvise tagged appropriately?

 It is bizarre that the behavior would depend on the nature of the
 underlying hardware though.

 -- 
 David A. Holland
 dholland@netbsd.org

From: Andreas Gustafsson <gson@gson.org>
To: David Holland <dholland-bugs@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53931: posix_fadvise_reg test case fails randomly on real
 hardware
Date: Sun, 3 Feb 2019 00:09:47 +0200

 David Holland wrote:
 >   >   t_posix_fadvise.c:135: errno != 999: got: Operation already in progress
 >  
 >  The system call cannot generate EINPROGRESS, and furthermore, the
 >  system call does not touch errno (it is one of the broken POSIX
 >  innovations that returns an errno value instead) so something in the
 >  rump plumbing must be doing it.

 Quite possible.  Thanks for the analysis.

 >  Does rump actually have a means for handling these broken syscalls
 >  correctly, and if so, is posix_fadvise tagged appropriately?

 I don't know.

 >  It is bizarre that the behavior would depend on the nature of the
 >  underlying hardware though.

 If it's some kind of race condition, it's hardly surprising if it
 happens on multiprocessor but not on a (software emulation of a)
 uniprocessor.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53931: posix_fadvise_reg test case fails randomly on real hardware
Date: Sun, 3 Feb 2019 22:30:10 +0200

 Earlier, I wrote:
 > The first recorded failure was with source date 2015.10.30.03.08.56

 I have now reproduced the failure with sources from 2015.10.02.03.08.26
 on an 8-core machine, but only on the 94th run of the test.

 The failure is probably even older than that, but may not be showing
 up in the existing reports for older versions because the tests 
 were run on a uniprocessor until around 2015-10-09.

 There's probably no point in trying to bisect this, because it's
 likely to be old enough that the version where it first appeared no
 longer builds on a NetBSD-8 host.
 -- 
 Andreas Gustafsson, gson@gson.org

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53931 CVS commit: src/sys/sys
Date: Sun, 6 Apr 2025 19:13:06 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sun Apr  6 19:13:06 UTC 2025

 Modified Files:
 	src/sys/sys: ktrace.h

 Log Message:
 sys/ktrace.h: Need sys/param.h for MAXCOMLEN.

 Found while preparing to diagnose:

 PR kern/53931: posix_fadvise_reg test case fails randomly on real
 hardware


 To generate a diff of this commit:
 cvs rdiff -u -r1.70 -r1.71 src/sys/sys/ktrace.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53931 CVS commit: src/tests/lib/libc/sys
Date: Sun, 6 Apr 2025 19:18:01 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sun Apr  6 19:18:01 UTC 2025

 Modified Files:
 	src/tests/lib/libc/sys: t_posix_fadvise.c

 Log Message:
 t_posix_fadvise: Don't check whether errno is preserved.

 I can find no guarantee in POSIX about posix_fadvise preserving
 errno; until such language is found I'm going to assume there is no
 such guarantee.

 What is happening is that, sometimes, rump_sys_posix_fadvise waits on
 a mutex or condvar, which uses _lwp_park internally, which sometimes
 wakes up early with EALREADY because a wakeup was already pending for
 the thread by the time it entered _lwp_park.  And that EALREADY is
 delivered by _lwp_park via errno.

 PR kern/53931: posix_fadvise_reg test case fails randomly on real
 hardware
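
 The mechanism in the log message can be sketched without rump or
 _lwp_park.  Everything below is a hypothetical stand-in, not the real
 code: fake_park plays the role of _lwp_park waking early, and
 fake_fadvise plays the role of rump_sys_posix_fadvise.  The last
 function mirrors the kind of errno check that this commit removes.

```c
#include <errno.h>

/* Hypothetical stand-in for _lwp_park: reports an early wakeup via errno. */
static int
fake_park(void)
{
	errno = EALREADY;
	return -1;
}

/* Hypothetical stand-in for rump_sys_posix_fadvise. */
static int
fake_fadvise(void)
{
	/* The early wakeup is harmless to the caller, so it is ignored... */
	(void)fake_park();
	/* ...and the call succeeds, but errno still holds EALREADY. */
	return 0;
}

/* Returns nonzero if a sentinel errno value survives a successful call. */
static int
errno_survives_fadvise(void)
{
	errno = 999;
	if (fake_fadvise() != 0)
		return 0;
	return errno == 999;
}
```

 In this sketch the sentinel never survives, even though the call itself
 succeeds, which is the pattern behind the "errno != 999" failure.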


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.4 src/tests/lib/libc/sys/t_posix_fadvise.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: PR/53931 CVS commit: src/tests/lib/libc/sys
Date: Mon, 7 Apr 2025 07:38:57 +0200

 On Sun, Apr 06, 2025 at 07:20:01PM +0000, Taylor R Campbell wrote:
 >  I can find no guarantee in POSIX about posix_fadvise preserving
 >  errno; until such language is found I'm going to assume there is no
 >  such guarantee.

 The errno definition seems to say it:

 [...] and shall otherwise be defined only after a call to a function for
 which it is explicitly stated to be set and until it is changed by the
 next function call or if the application assigns it a value

 where I would read "next function call" not from a compiler's perspective
 but as meaning a call to a POSIX-defined system interface.  Maybe this
 should be clarified.

 https://pubs.opengroup.org/onlinepubs/9799919799/functions/errno.html

 Martin

Responsible-Changed-From-To: kern-bug-people->kre
Responsible-Changed-By: riastradh@NetBSD.org
Responsible-Changed-When: Sun, 13 Apr 2025 00:31:02 +0000
Responsible-Changed-Why:
Need a ruling from an Austin Group whisperer: is posix_fadvise (and
any other POSIX function that returns an error code instead of setting
errno and returning -1, like pthread_*) _allowed_ to set errno, or
_required_ to leave errno as it was on entry when it returns?

If it's allowed to set errno, we can close this -- test has been fixed.

If it's forbidden to set errno, we need to do a lot of work to make
libpthread and librumpuser save/restore errno on ~every public
function.
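
For scale, the save/restore that the second case would require might look
roughly like the sketch below.  It is an assumption about the shape of
such a fix, not existing libpthread or librumpuser code; noisy_impl and
call_preserving_errno are hypothetical names.

```c
#include <errno.h>

/* Hypothetical internal routine: succeeds, but clobbers errno on the way. */
static int
noisy_impl(void *arg)
{
	(void)arg;
	errno = EALREADY;
	return 0;
}

/*
 * Hypothetical wrapper: call fn(arg), then put the caller's errno back
 * before handing fn's error code to the caller.
 */
static int
call_preserving_errno(int (*fn)(void *), void *arg)
{
	int saved_errno = errno;
	int error = fn(arg);

	errno = saved_errno;
	return error;
}
```

Every public entry point that returns an error code would need an
equivalent of this wrapper, which is why the amount of work matters.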


From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/53931 (posix_fadvise_reg test case fails randomly on real hardware)
Date: Sun, 13 Apr 2025 08:57:41 +0700

     Date:        Sun, 13 Apr 2025 00:31:02 +0000 (UTC)
     From:        riastradh@NetBSD.org
     Message-ID:  <20250413003103.307A81A923F@mollari.NetBSD.org>

   | Need a ruling from an Austin Group whisperer:

 Not so much me any more (for now, anyway).  Last year I tested running
 munnari.oz.au without IPv4 connectivity for a while, with just IPv6.
 Most things I care about didn't mind at all (including the NetBSD lists
 and IETF stuff).  Some GNU lists more or less bounced me, but kept
 sending mail until after my little experiment was over, indicating why
 and how to get reinstated (well, one list; that's the only one I'm on).

 But the Austin Group simply removed me from the mailing list, and
 so far I haven't found the magic formula to get reinstated.  So I
 can't ask.

 But:

   | is posix_fadvise (and
   | any other POSIX function that returns an error code instead of setting
   | errno and returning -1, like pthread_*) _allowed_ to set errno, or
   | _required_ to leave errno as it was on entry when it returns?

 I believe the answer to that can be inferred from XSH 2.3 Error Numbers

 	Some functions provide the error number in a variable accessed
 	through the symbol errno, defined by including the <errno.h> header.
 	The value of errno should only be examined when it is indicated to
 	be valid by a function's return value.

 That is, effectively, unless a function is defined to report errors
 through errno, and it returns a value indicating that it has done so
 (typically -1 or NULL), the state of errno is undefined after a call to
 any of the defined functions.

 There are some functions which expressly indicate that errno must not
 be altered by the function, but not a lot.

 posix_fadvise() as best I can tell isn't such a function.  I haven't
 checked all the pthread_*() functions though - there are many!

 So:

   | If it's allowed to set errno, we can close this -- test has been fixed.

 I am fairly sure that's the case; almost anything is allowed to alter
 errno, usually by calling some other function which returns an error
 that is not fatal to the call in question.

 kre

Responsible-Changed-From-To: kre->riastradh
Responsible-Changed-By: kre@NetBSD.org
Responsible-Changed-When: Sun, 13 Apr 2025 23:51:28 +0000
Responsible-Changed-Why:
I have done my bit, nothing more I can do,
returning responsibility for this to riastradh


>Unformatted:

Copyright © 1994-2025 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.