NetBSD Problem Report #46464

From gson@gson.org  Fri May 18 15:38:32 2012
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id F00A663B882
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 18 May 2012 15:38:31 +0000 (UTC)
Message-Id: <20120518153738.1267D75E8C@guava.gson.org>
Date: Fri, 18 May 2012 18:37:38 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: lib/librumphijack/t_tcpip:ssh test case randomly fails
X-Send-Pr-Version: 3.95

>Number:         46464
>Category:       kern
>Synopsis:       lib/librumphijack/t_tcpip:ssh test case randomly fails
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri May 18 15:40:00 +0000 2012
>Closed-Date:    Wed Aug 05 13:03:37 +0000 2015
>Last-Modified:  Wed Aug 05 13:03:37 +0000 2015
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current as of 2012.05.16.19.12.59
>Organization:
>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:

The "ssh" test case of the lib/librumphijack/t_tcpip test has been
randomly failing for a long time, perhaps its entire existence (it
was committed on 2011.02.14.15.14.00, and the first recorded failure
on my test system was ten days later at 2011.02.24.18.33.06).

The output from a recent failure can be seen at:

  http://releng.netbsd.org/b5reports/i386/build/2012.05.16.11.45.08/test.html#lib_librumphijack_t_tcpip_ssh

The above log shows the test failing with the error message:

  Timeout, server 127.0.0.1 not responding.

Interestingly, this message is only printed in case of a keep-alive
timeout, and keep-alive timeouts can only happen if ssh is configured
with a non-zero ServerAliveInterval, but in this test, ssh is using a
configuration with the default ServerAliveInterval of zero,
so there should be no way this can happen.

To debug this, I added some printfs to print the arguments and return value of
the select() system call in src/crypto/external/bsd/openssh/dist/clientloop.c,
and found that when the error occurs, select() is returning zero (indicating
a timeout) even though its "timeout" argument is a NULL pointer.  That's
not supposed to happen, is it?  So to me this looks like a bug in select(),
or maybe just in its rump implementation.

My patch adding the printfs is at

  http://www.gson.org/netbsd/bugs/atf-ssh-test/ssh-select-debug.patch

and here is an excerpt from a /tmp/select.log file written by the patched
ssh, showing select() returning 0 even though the timeout argument is
NULL:

  select nfds=129 tvp=0x0
    read 5
    read 128
  select ret = 0

>How-To-Repeat:

  cd /usr/tests/lib/librumphijack/
  while atf-run t_tcpip; do true; done

This may fail in a test case other than the ssh one; if so, retry it.

>Fix:

>Release-Note:

>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: Antti Kantee <pooka@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly fails
Date: Sat, 19 Jul 2014 12:11:23 +0300

 Hi Antti,

 I have now tracked down the cause of PR 46464 to an assertion failure
 in librumphijack:

   assertion "(pfd=5Frump[nfds].revents & POLLIN) =3D=3D 0" failed: file=
  "/tmp/bracket/build/2014.07.16.07.41.43-i386-debug/src/lib/librumphija=
 ck/hijack.c", line 2073, function "=5Fsys=5F=5F=5Fpollts50"

 What is the intent of the assertion=3F  It seems to be asserting that
 the two threads can't simultaneously notify each other, but it looks
 like they can.

 To reproduce the failure, including the assertion failure message,
 paste the following into a terminal running a root shell:

 cd /usr/tests/lib/librumphijack
 ed t=5Ftcpip
 /sbin.sshd/
 s/sshd/sshd -d -o LogLevel=3DDEBUG3 /
 s/$/ 2>\/tmp\/sshd.log \&/
 a
 echo $! >sshd.pid
 =2E
 /save:ssh.out/
 s/out/out -e ignore/
 w
 q
 while atf-run t=5Ftcpip; do true; done
 cat /tmp/sshd.log

 --=20
 Andreas Gustafsson, gson@gson.org

From: Antti Kantee <pooka@NetBSD.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly
 fails
Date: Sat, 19 Jul 2014 10:13:18 +0000

 On 19/07/14 09:11, Andreas Gustafsson wrote:
 > Hi Antti,
 >
 > I have now tracked down the cause of PR 46464 to an assertion failure
 > in librumphijack:
 >
 >    assertion "(pfd_rump[nfds].revents & POLLIN) == 0" failed: file "/tmp/bracket/build/2014.07.16.07.41.43-i386-debug/src/lib/librumphijack/hijack.c", line 2073, function "_sys___pollts50"
 >
 > What is the intent of the assertion?  It seems to be asserting that
 > the two threads can't simultaneously notify each other, but it looks
 > like they can.

 Andreas,

 I am almost sure you are correct.  The rest of the code seems to handle 
 two polls finishing correctly, too.  That assert and the same one in 
 another branch a few lines lower should be removed.

 thanks!

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: Antti Kantee <pooka@NetBSD.org>
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly fails
Date: Sat, 19 Jul 2014 12:27:29 +0300

 The gnats database copy of my last message got trashed by MIME,
 rendering the recipe for reproducing the assertion failure unusable.
 I have now converted it into a shell script which can be downloaded
 from:

   http://www.gson.org/netbsd/bugs/46464/reproduce.sh

 -- 
 Andreas Gustafsson, gson@gson.org

From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46464 CVS commit: src/lib/librumphijack
Date: Sat, 19 Jul 2014 14:14:22 +0000

 Module Name:	src
 Committed By:	gson
 Date:		Sat Jul 19 14:14:21 UTC 2014

 Modified Files:
 	src/lib/librumphijack: hijack.c

 Log Message:
 Do not assert that the two threads do not simultanously notify each
 other, because sometimes they do.  Should fix PR kern/46464.  OK pooka.


 To generate a diff of this commit:
 cvs rdiff -u -r1.107 -r1.108 src/lib/librumphijack/hijack.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andreas Gustafsson <gson@gson.org>
To: Antti Kantee <pooka@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly fails
Date: Sun, 20 Jul 2014 11:41:00 +0300

 Antti Kantee wrote:
 > I am almost sure you are correct.  The rest of the code seems to handle 
 > two polls finishing correctly, too.  That assert and the same one in 
 > another branch a few lines lower should be removed.

 I removed the asserts, but the test is still failing; now the "rv_host > 0"
 assertion is triggering instead:

   assertion "rv_host > 0" failed: file "/tmp/bracket/build/2014.07.20.00.48.51-i386/src/lib/librumphijack/hijack.c", line 2073, function "_sys___pollts50"

 -- 
 Andreas Gustafsson, gson@gson.org

From: Antti Kantee <pooka@NetBSD.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly
 fails
Date: Sun, 20 Jul 2014 10:48:36 +0000

 On 20/07/14 08:41, Andreas Gustafsson wrote:
 > Antti Kantee wrote:
 >> I am almost sure you are correct.  The rest of the code seems to handle
 >> two polls finishing correctly, too.  That assert and the same one in
 >> another branch a few lines lower should be removed.
 >
 > I removed the asserts, but the test is still failing; now the "rv_host > 0"
 > assertion is triggering instead:
 >
 >    assertion "rv_host > 0" failed: file "/tmp/bracket/build/2014.07.20.00.48.51-i386/src/lib/librumphijack/hijack.c", line 2073, function "_sys___pollts50"

 Can NetBSD's poll() return with POLLIN set in revents, but not 
 reflecting that in the return value?  Is that legal behaviour?

 Or is there something else fishy going on?

From: Andreas Gustafsson <gson@gson.org>
To: Antti Kantee <pooka@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly
 fails
Date: Sun, 20 Jul 2014 16:30:40 +0300

 Antti Kantee wrote:
 > Can NetBSD's poll() return with POLLIN set in revents, but not 
 > reflecting that in the return value?  Is that legal behaviour?
 > 
 > Or is there something else fishy going on?

 I added some debug logging to hostpoll(), after the op_pollts() call:

 	if ((parg->pfds[parg->nfds-1].revents & POLLIN) && !(rv > 0)) {
 		fprintf(stderr, "this should not happen; rv = %d, errno = %d\n", (int)rv, errno);
 	}

 and got the output:

     this should not happen; rv = -1, errno = 4

 errno 4 is EINTR.  As far as I can see, POSIX is silent on what
 revents is allowed to contain on an error return.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Antti Kantee <pooka@NetBSD.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly
 fails
Date: Sun, 20 Jul 2014 23:17:33 +0000

 This is a multi-part message in MIME format.
 --------------010505010900080406060102
 Content-Type: text/plain; charset=ISO-8859-1; format=flowed
 Content-Transfer-Encoding: 7bit

 On 20/07/14 13:30, Andreas Gustafsson wrote:
 > Antti Kantee wrote:
 >> Can NetBSD's poll() return with POLLIN set in revents, but not
 >> reflecting that in the return value?  Is that legal behaviour?
 >>
 >> Or is there something else fishy going on?
 >
 > I added some debug logging to hostpoll(), after the op_pollts() call:
 >
 > 	if ((parg->pfds[parg->nfds-1].revents & POLLIN) && !(rv > 0)) {
 > 		fprintf(stderr, "this should not happen; rv = %d, errno = %d\n", (int)rv, errno);
 > 	}
 >
 > and got the output:
 >
 >      this should not happen; rv = -1, errno = 4
 >
 > errno 4 is EINTR.  As far as I can see, POSIX is silent on what
 > revents is allowed to contain on an error return.

 Ok, so how about the attached patch where we trust the return value over 
 the contents of the vector.

 --------------010505010900080406060102
 Content-Type: text/x-patch;
  name="hijack.diff"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
  filename="hijack.diff"

 ? hijack.diff
 Index: hijack.c
 ===================================================================
 RCS file: /cvsroot/src/lib/librumphijack/hijack.c,v
 retrieving revision 1.108
 diff -u -r1.108 hijack.c
 --- hijack.c	19 Jul 2014 14:14:21 -0000	1.108
 +++ hijack.c	20 Jul 2014 23:13:33 -0000
 @@ -2069,12 +2069,10 @@
  		errno_host = parg.errnum;

  		/* strip cross-thread notification from real results */
 -		if (pfd_host[nfds].revents & POLLIN) {
 -			assert(rv_host > 0);
 +		if (rv_host > 0 && pfd_host[nfds].revents & POLLIN) {
  			rv_host--;
  		}
 -		if (pfd_rump[nfds].revents & POLLIN) {
 -			assert(rv_rump > 0);
 +		if (rv_rump > 0 && pfd_rump[nfds].revents & POLLIN) {
  			rv_rump--;
  		}


 --------------010505010900080406060102--

From: Andreas Gustafsson <gson@gson.org>
To: Antti Kantee <pooka@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly fails
Date: Mon, 21 Jul 2014 12:20:44 +0300

 Antti Kantee wrote:
 > Ok, so how about the attached patch where we trust the return value over 
 > the contents of the vector.

 With this patch, I have now run the t_tcpip test more than 200 times
 in a row without failures.  I will commit it.  Thank you!
 -- 
 Andreas Gustafsson, gson@gson.org

From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46464 CVS commit: src/lib/librumphijack
Date: Mon, 21 Jul 2014 14:23:43 +0000

 Module Name:	src
 Committed By:	gson
 Date:		Mon Jul 21 14:23:43 UTC 2014

 Modified Files:
 	src/lib/librumphijack: hijack.c

 Log Message:
 Ignore the contents of revents when poll() returns failure.
 This is one more part to the fix for PR kern/46464.  Patch
 from pooka.


 To generate a diff of this commit:
 cvs rdiff -u -r1.108 -r1.109 src/lib/librumphijack/hijack.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Wed, 05 Aug 2015 13:03:37 +0000
State-Changed-Why:
Fixed by src/lib/librumphijack/hijack.c 1.109.
The current failures are unrelated.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.