NetBSD Problem Report #46464
From gson@gson.org Fri May 18 15:38:32 2012
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id F00A663B882
for <gnats-bugs@gnats.NetBSD.org>; Fri, 18 May 2012 15:38:31 +0000 (UTC)
Message-Id: <20120518153738.1267D75E8C@guava.gson.org>
Date: Fri, 18 May 2012 18:37:38 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: lib/librumphijack/t_tcpip:ssh test case randomly fails
X-Send-Pr-Version: 3.95
>Number: 46464
>Category: kern
>Synopsis: lib/librumphijack/t_tcpip:ssh test case randomly fails
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri May 18 15:40:00 +0000 2012
>Closed-Date: Wed Aug 05 13:03:37 +0000 2015
>Last-Modified: Wed Aug 05 13:03:37 +0000 2015
>Originator: Andreas Gustafsson
>Release: NetBSD-current as of 2012.05.16.19.12.59
>Organization:
>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:
The "ssh" test case of the lib/librumphijack/t_tcpip test has been
randomly failing for a long time, perhaps its entire existence (it
was committed on 2011.02.14.15.14.00, and the first recorded failure
on my test system was ten days later at 2011.02.24.18.33.06).
The output from a recent failure can be seen at:
http://releng.netbsd.org/b5reports/i386/build/2012.05.16.11.45.08/test.html#lib_librumphijack_t_tcpip_ssh
The above log shows the test failing with the error message:
Timeout, server 127.0.0.1 not responding.
Interestingly, this message is only printed in case of a keep-alive
timeout, and keep-alive timeouts can only happen if ssh is configured
with a non-zero ServerAliveInterval, but in this test, ssh is using a
configuration with the default ServerAliveInterval of zero,
so there should be no way this can happen.
To debug this, I added some printfs to print the arguments and return value of
the select() system call in src/crypto/external/bsd/openssh/dist/clientloop.c,
and found that when the error occurs, select() is returning zero (indicating
a timeout) even though its "timeout" argument is a NULL pointer. That's
not supposed to happen, is it? So to me this looks like a bug in select(),
or maybe just in its rump implementation.
My patch adding the printfs is at
http://www.gson.org/netbsd/bugs/atf-ssh-test/ssh-select-debug.patch
and here is an excerpt from a /tmp/select.log file written by the patched
ssh, showing select() returning 0 even though the timeout argument is
NULL:
select nfds=129 tvp=0x0
read 5
read 128
select ret = 0
>How-To-Repeat:
cd /usr/tests/lib/librumphijack/
while atf-run t_tcpip; do true; done
This may fail in a test case other than the ssh one; if so, retry it.
>Fix:
>Release-Note:
>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: Antti Kantee <pooka@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly fails
Date: Sat, 19 Jul 2014 12:11:23 +0300
Hi Antti,
I have now tracked down the cause of PR 46464 to an assertion failure
in librumphijack:
assertion "(pfd=5Frump[nfds].revents & POLLIN) =3D=3D 0" failed: file=
"/tmp/bracket/build/2014.07.16.07.41.43-i386-debug/src/lib/librumphija=
ck/hijack.c", line 2073, function "=5Fsys=5F=5F=5Fpollts50"
What is the intent of the assertion=3F It seems to be asserting that
the two threads can't simultaneously notify each other, but it looks
like they can.
To reproduce the failure, including the assertion failure message,
paste the following into a terminal running a root shell:
cd /usr/tests/lib/librumphijack
ed t=5Ftcpip
/sbin.sshd/
s/sshd/sshd -d -o LogLevel=3DDEBUG3 /
s/$/ 2>\/tmp\/sshd.log \&/
a
echo $! >sshd.pid
=2E
/save:ssh.out/
s/out/out -e ignore/
w
q
while atf-run t=5Ftcpip; do true; done
cat /tmp/sshd.log
--=20
Andreas Gustafsson, gson@gson.org
From: Antti Kantee <pooka@NetBSD.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly
fails
Date: Sat, 19 Jul 2014 10:13:18 +0000
On 19/07/14 09:11, Andreas Gustafsson wrote:
> Hi Antti,
>
> I have now tracked down the cause of PR 46464 to an assertion failure
> in librumphijack:
>
> assertion "(pfd_rump[nfds].revents & POLLIN) == 0" failed: file "/tmp/bracket/build/2014.07.16.07.41.43-i386-debug/src/lib/librumphijack/hijack.c", line 2073, function "_sys___pollts50"
>
> What is the intent of the assertion? It seems to be asserting that
> the two threads can't simultaneously notify each other, but it looks
> like they can.
Andreas,
I am almost sure you are correct. The rest of the code seems to handle
two polls finishing correctly, too. That assert and the same one in
another branch a few lines lower should be removed.
thanks!
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: Antti Kantee <pooka@NetBSD.org>
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly fails
Date: Sat, 19 Jul 2014 12:27:29 +0300
The gnats database copy of my last message got trashed by MIME,
rendering the recipe for reproducing the assertion failure unusable.
I have now converted it into a shell script which can be downloaded
from:
http://www.gson.org/netbsd/bugs/46464/reproduce.sh
--
Andreas Gustafsson, gson@gson.org
From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/46464 CVS commit: src/lib/librumphijack
Date: Sat, 19 Jul 2014 14:14:22 +0000
Module Name: src
Committed By: gson
Date: Sat Jul 19 14:14:21 UTC 2014
Modified Files:
src/lib/librumphijack: hijack.c
Log Message:
Do not assert that the two threads do not simultanously notify each
other, because sometimes they do. Should fix PR kern/46464. OK pooka.
To generate a diff of this commit:
cvs rdiff -u -r1.107 -r1.108 src/lib/librumphijack/hijack.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Andreas Gustafsson <gson@gson.org>
To: Antti Kantee <pooka@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly fails
Date: Sun, 20 Jul 2014 11:41:00 +0300
Antti Kantee wrote:
> I am almost sure you are correct. The rest of the code seems to handle
> two polls finishing correctly, too. That assert and the same one in
> another branch a few lines lower should be removed.
I removed the asserts, but the test is still failing; now the "rv_host > 0"
assertion is triggering instead:
assertion "rv_host > 0" failed: file "/tmp/bracket/build/2014.07.20.00.48.51-i386/src/lib/librumphijack/hijack.c", line 2073, function "_sys___pollts50"
--
Andreas Gustafsson, gson@gson.org
From: Antti Kantee <pooka@NetBSD.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly
fails
Date: Sun, 20 Jul 2014 10:48:36 +0000
On 20/07/14 08:41, Andreas Gustafsson wrote:
> Antti Kantee wrote:
>> I am almost sure you are correct. The rest of the code seems to handle
>> two polls finishing correctly, too. That assert and the same one in
>> another branch a few lines lower should be removed.
>
> I removed the asserts, but the test is still failing; now the "rv_host > 0"
> assertion is triggering instead:
>
> assertion "rv_host > 0" failed: file "/tmp/bracket/build/2014.07.20.00.48.51-i386/src/lib/librumphijack/hijack.c", line 2073, function "_sys___pollts50"
Can NetBSD's poll() return with POLLIN set in revents, but not
reflecting that in the return value? Is that legal behaviour?
Or is there something else fishy going on?
From: Andreas Gustafsson <gson@gson.org>
To: Antti Kantee <pooka@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly
fails
Date: Sun, 20 Jul 2014 16:30:40 +0300
Antti Kantee wrote:
> Can NetBSD's poll() return with POLLIN set in revents, but not
> reflecting that in the return value? Is that legal behaviour?
>
> Or is there something else fishy going on?
I added some debug logging to hostpoll(), after the op_pollts() call:
if ((parg->pfds[parg->nfds-1].revents & POLLIN) && !(rv > 0)) {
fprintf(stderr, "this should not happen; rv = %d, errno = %d\n", (int)rv, errno);
}
and got the output:
this should not happen; rv = -1, errno = 4
errno 4 is EINTR. As far as I can see, POSIX is silent on what
revents is allowed to contain on an error return.
--
Andreas Gustafsson, gson@gson.org
From: Antti Kantee <pooka@NetBSD.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly
fails
Date: Sun, 20 Jul 2014 23:17:33 +0000
This is a multi-part message in MIME format.
--------------010505010900080406060102
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
On 20/07/14 13:30, Andreas Gustafsson wrote:
> Antti Kantee wrote:
>> Can NetBSD's poll() return with POLLIN set in revents, but not
>> reflecting that in the return value? Is that legal behaviour?
>>
>> Or is there something else fishy going on?
>
> I added some debug logging to hostpoll(), after the op_pollts() call:
>
> if ((parg->pfds[parg->nfds-1].revents & POLLIN) && !(rv > 0)) {
> fprintf(stderr, "this should not happen; rv = %d, errno = %d\n", (int)rv, errno);
> }
>
> and got the output:
>
> this should not happen; rv = -1, errno = 4
>
> errno 4 is EINTR. As far as I can see, POSIX is silent on what
> revents is allowed to contain on an error return.
Ok, so how about the attached patch where we trust the return value over
the contents of the vector.
--------------010505010900080406060102
Content-Type: text/x-patch;
name="hijack.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="hijack.diff"
? hijack.diff
Index: hijack.c
===================================================================
RCS file: /cvsroot/src/lib/librumphijack/hijack.c,v
retrieving revision 1.108
diff -u -r1.108 hijack.c
--- hijack.c 19 Jul 2014 14:14:21 -0000 1.108
+++ hijack.c 20 Jul 2014 23:13:33 -0000
@@ -2069,12 +2069,10 @@
errno_host = parg.errnum;
/* strip cross-thread notification from real results */
- if (pfd_host[nfds].revents & POLLIN) {
- assert(rv_host > 0);
+ if (rv_host > 0 && pfd_host[nfds].revents & POLLIN) {
rv_host--;
}
- if (pfd_rump[nfds].revents & POLLIN) {
- assert(rv_rump > 0);
+ if (rv_rump > 0 && pfd_rump[nfds].revents & POLLIN) {
rv_rump--;
}
--------------010505010900080406060102--
From: Andreas Gustafsson <gson@gson.org>
To: Antti Kantee <pooka@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly fails
Date: Mon, 21 Jul 2014 12:20:44 +0300
Antti Kantee wrote:
> Ok, so how about the attached patch where we trust the return value over
> the contents of the vector.
With this patch, I have now run the t_tcpip test more than 200 times
in a row without failures. I will commit it. Thank you!
--
Andreas Gustafsson, gson@gson.org
From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/46464 CVS commit: src/lib/librumphijack
Date: Mon, 21 Jul 2014 14:23:43 +0000
Module Name: src
Committed By: gson
Date: Mon Jul 21 14:23:43 UTC 2014
Modified Files:
src/lib/librumphijack: hijack.c
Log Message:
Ignore the contents of revents when poll() returns failure.
This is one more part to the fix for PR kern/46464. Patch
from pooka.
To generate a diff of this commit:
cvs rdiff -u -r1.108 -r1.109 src/lib/librumphijack/hijack.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Wed, 05 Aug 2015 13:03:37 +0000
State-Changed-Why:
Fixed by src/lib/librumphijack/hijack.c 1.109.
The current failures are unrelated.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.