NetBSD Problem Report #53931

From gson@gson.org  Fri Feb  1 15:28:41 2019
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 1DADA7A1AC
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  1 Feb 2019 15:28:41 +0000 (UTC)
Message-Id: <20190201152835.7C8A89892C2@guava.gson.org>
Date: Fri,  1 Feb 2019 17:28:35 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: posix_fadvise_reg test case fails randomly on real hardware
X-Send-Pr-Version: 3.95

>Number:         53931
>Category:       kern
>Synopsis:       posix_fadvise_reg test case fails randomly on real hardware
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 01 15:30:00 +0000 2019
>Last-Modified:  Sun Feb 03 20:35:01 +0000 2019
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

The posix_fadvise_reg test case of the lib/libc/sys/t_posix_fadvise
test program is failing randomly on real amd64 hardware, with six
failures in the last 30 runs on my bare metal testbed.  It fails
with the message

  t_posix_fadvise.c:135: errno != 999: got: Operation already in progress

Log output from the latest failure:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2019/2019.01.31.00.27.52/test.html#lib_libc_sys_t_posix_fadvise_posix_fadvise_reg

The first recorded failure was with source date 2015.10.30.03.08.56:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2015/2015.10.30.03.08.56/test.html#lib_libc_sys_t_posix_fadvise_posix_fadvise_reg

It's passing reliably on the qemu-based TNF testbed, with no failures
in 2018 or, so far, in 2019.

>How-To-Repeat:

>Fix:

>Audit-Trail:
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53931: posix_fadvise_reg test case fails randomly on real
 hardware
Date: Sat, 2 Feb 2019 18:45:56 +0000

 On Fri, Feb 01, 2019 at 03:30:00PM +0000, Andreas Gustafsson wrote:
  > The posix_fadvise_reg test case of the lib/libc/sys/t_posix_fadvise
  > test program is failing randomly on real amd64 hardware, with six
  > failures in the last 30 runs on my bare metal testbed.  It fails
  > with the message
  > 
  >   t_posix_fadvise.c:135: errno != 999: got: Operation already in progress

 The system call cannot generate EINPROGRESS, and furthermore, the
 system call does not touch errno (it is one of the broken POSIX
 innovations that returns an errno value instead) so something in the
 rump plumbing must be doing it.

 Does rump actually have a means for handling these broken syscalls
 correctly, and if so, is posix_fadvise tagged appropriately?

 It is bizarre that the behavior would depend on the nature of the
 underlying hardware though.

 -- 
 David A. Holland
 dholland@netbsd.org
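
The return convention described above can be sketched in C. This is a
minimal illustration only, not the code in t_posix_fadvise.c: the helper
name check_fadvise_leaves_errno, the scratch-file path, and the
ERRNO_SENTINEL macro are assumptions made for the example, and it calls
the host's posix_fadvise() directly rather than going through rump. It
presets errno to the sentinel 999 seen in the failure message, calls
posix_fadvise(), and then checks both the return value and that errno
still holds the sentinel. As a side note, on NetBSD the text "Operation
already in progress" is the strerror() string for EALREADY, while
EINPROGRESS reads "Operation now in progress".

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for posix_fadvise() and mkstemp() */
#endif

#include <assert.h>	/* for callers that assert on the result */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sentinel matching the value in the reported failure message. */
#define ERRNO_SENTINEL 999

/*
 * Hypothetical helper (an assumption for this sketch, not the ATF test).
 *
 * Returns  0 if posix_fadvise() returned 0 and left errno untouched,
 *          1 if posix_fadvise() returned a nonzero error number,
 *          2 if errno no longer holds the sentinel after the call,
 *         -1 if the scratch file could not be created.
 */
int
check_fadvise_leaves_errno(void)
{
	char path[] = "/tmp/fadvise_sketch.XXXXXX";	/* assumed location */
	int fd, rv, saved;

	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	(void)unlink(path);	/* the open descriptor keeps the file alive */

	errno = ERRNO_SENTINEL;
	/* The error, if any, comes back as the return value. */
	rv = posix_fadvise(fd, 0, 0, POSIX_FADV_NORMAL);
	saved = errno;		/* capture before close() can change it */

	(void)close(fd);

	if (saved != ERRNO_SENTINEL)
		return 2;
	if (rv != 0)
		return 1;
	return 0;
}
```

A caller would treat a result of 2 as the kind of failure reported here:
the call itself did not report an error, yet errno was changed by
something along the way.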

From: Andreas Gustafsson <gson@gson.org>
To: David Holland <dholland-bugs@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53931: posix_fadvise_reg test case fails randomly on real
 hardware
Date: Sun, 3 Feb 2019 00:09:47 +0200

 David Holland wrote:
 >   >   t_posix_fadvise.c:135: errno != 999: got: Operation already in progress
 >  
 >  The system call cannot generate EINPROGRESS, and furthermore, the
 >  system call does not touch errno (it is one of the broken POSIX
 >  innovations that returns an errno value instead) so something in the
 >  rump plumbing must be doing it.

 Quite possible.  Thanks for the analysis.

 >  Does rump actually have a means for handling these broken syscalls
 >  correctly, and if so, is posix_fadvise tagged appropriately?

 I don't know.

 >  It is bizarre that the behavior would depend on the nature of the
 >  underlying hardware though.

 If it's some kind of race condition, it's hardly surprising if it
 happens on a multiprocessor but not on a (software emulation of a)
 uniprocessor.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53931: posix_fadvise_reg test case fails randomly on real hardware
Date: Sun, 3 Feb 2019 22:30:10 +0200

 Earlier, I wrote:
 > The first recorded failure was with source date 2015.10.30.03.08.56

 I have now reproduced the failure using sources from 2015.10.02.03.08.26,
 using an 8-core machine, but only on the 94th run of the test.

 The failure is probably even older than that, but may not be showing
 up in the existing reports for older versions because the tests 
 were run on a uniprocessor until around 2015-10-09.

 There's probably no point in trying to bisect this, because it's
 likely to be old enough that the version where it first appeared no
 longer builds on a NetBSD-8 host.
 -- 
 Andreas Gustafsson, gson@gson.org
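
Because the failure is intermittent (here it showed up only on the 94th
run), reproducing it means repeating the check until it first fails. A
minimal sketch of such a loop follows. The names check_fn,
first_failing_run, and the fail_on_nth stand-in are hypothetical and
exist only for illustration; the actual reproduction reran the test
program, and a real caller would pass a function that runs the test once.

```c
#include <assert.h>	/* for callers that assert on the result */
#include <stddef.h>

/* Hypothetical check signature: returns 0 on pass, nonzero on failure. */
typedef int (*check_fn)(void *arg);

/*
 * Run `check` up to max_runs times.  Returns the 1-based number of the
 * first failing run, or 0 if every run passed.
 */
unsigned int
first_failing_run(check_fn check, void *arg, unsigned int max_runs)
{
	unsigned int i;

	for (i = 0; i < max_runs; i++) {
		if (check(arg) != 0)
			return i + 1;
	}
	return 0;
}

/*
 * Stand-in check used only to exercise the loop: it fails on exactly
 * the fail_at-th call and passes otherwise.
 */
struct fail_on_nth {
	unsigned int calls;	/* number of calls made so far */
	unsigned int fail_at;	/* call number that should fail */
};

int
fail_on_nth_call(void *arg)
{
	struct fail_on_nth *s = arg;

	s->calls++;
	return s->calls == s->fail_at;
}
```

With the stand-in configured to fail on call 94, first_failing_run()
returns 94 when allowed 200 runs and 0 when allowed only 50, which is why
a short series of passing runs says little about a rare race.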

Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.