NetBSD Problem Report #47563
From martin@duskware.de Wed Feb 13 13:24:13 2013
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id 0BB6063E5A2
for <gnats-bugs@gnats.NetBSD.org>; Wed, 13 Feb 2013 13:24:13 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: test cases do not timeout and hang
X-Send-Pr-Version: 3.95
>Number: 47563
>Category: kern
>Synopsis: test cases do not timeout and hang
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Feb 13 13:25:01 +0000 2013
>Last-Modified: Sun Feb 17 09:15:02 +0000 2013
>Originator: Martin Husemann
>Release: NetBSD 6.99.16
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD thirdstage.duskware.de 6.99.16 NetBSD 6.99.16 (MODULAR) #43: Wed Feb 13 10:58:39 CET 2013 martin@night-porter.duskware.de:/usr/src/sys/arch/sparc64/compile/MODULAR sparc64
Architecture: sparc64
Machine: sparc64
>Description:
Automatic test runs fail, hang forever and the atf timeout does not work.
When I manually kill rump_allserver, atf-report says:
dev/sysmon/t_swsensor (93/558): 5 test cases
alarm_sensor: [6595.641156s] Failed: Test case timed out after 60 seconds
This is reproducable.
Last successfull test run (with -current mostly unbuildable in between) was
on 2013-02-05.
>How-To-Repeat:
cd /usr/tests/dev/sysmon
atf-run t_swsensor | atf-report
>Fix:
n/a
>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: Antti Kantee <pooka@NetBSD.org>, Julio Merino <jmmv@NetBSD.org>
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Fri, 15 Feb 2013 10:50:26 +0100
Just for the record: this still fails the same with -current and newer atf:
dev/sysmon/t_swsensor (93/558): 5 test cases
alarm_sensor: [583.955578s] Failed: Test case timed out after 60 seconds
This makes tests runs impossible - I'm urgently looking for ideas what could
be at fault. Any ideas?
Note there are two bugs in here, a hint for either the missing atf timeout
or the rump failure would be much appreciated!
Martin
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: Antti Kantee <pooka@NetBSD.org>, Julio Merino <jmmv@NetBSD.org>
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Fri, 15 Feb 2013 17:32:43 +0100
A bit more data: I can reproduce the failure on various sparc64 SMP
machines. It seems to not affect the uniprocessor automatic test runs
(including the sparc ones), and seems to not be reproducable on x86.
The test case, when run as "atf-test t_swsensor" alone, does not output
anything exciting:
...
tps-count: 1
tp-start: 1360944907.365325, t_swsensor, 5
tc-start: 1360944907.365395, alarm_sensor
... and there i hangs untill I manually kill rump_allserver. A ktrace shows
not a lot activity (and neither the rump process nor atf-run gather a lot
of cpu usage):
2765 12 rump_allserver CALL ___lwp_park50(0x49f158e8,0,0x4a504098,0x4a50$
2765 6 rump_allserver RET ___lwp_park50 -1 errno 60 Connection timed o$
2765 1 rump_allserver RET __nanosleep50 0, -87720/0xfffffffffffea958
2765 1 rump_allserver CALL _lwp_unpark(7,0x48830998)
2765 6 rump_allserver CALL ___lwp_park50(0x493096e8,0,0x4c204098,0x4c20$
2765 6 rump_allserver RET ___lwp_park50 -1 errno 60 Connection timed o$
2765 1 rump_allserver RET _lwp_unpark 0, 1216547224/0x48830998
2765 6 rump_allserver CALL ___lwp_park50(0x493096e8,0,0x4c204098,0x4c20$
2765 1 rump_allserver CALL _lwp_unpark_all(0x482109f8,1,0x48830698)
2765 1 rump_allserver RET _lwp_unpark_all 0, 1
2765 7 rump_allserver RET ___lwp_park50 0
2765 1 rump_allserver CALL ___lwp_park50(0,0,0x488309d8,0x488309d8)
2765 9 rump_allserver RET ___lwp_park50 0
2765 7 rump_allserver CALL _lwp_unpark_all(0x4885e0f8,1,0x48830618)
2765 7 rump_allserver RET _lwp_unpark_all 0, 1
2765 9 rump_allserver CALL ___lwp_park50(0,0,0x48830698,0x48830698)
2765 7 rump_allserver CALL ___lwp_park50(0,0,0x48830998,0x48830998)
2765 3 rump_allserver RET ___lwp_park50 0
2765 3 rump_allserver CALL _lwp_unpark(1,0x488309d8)
2765 3 rump_allserver RET _lwp_unpark 0, 1216547288/0x488309d8
and sometimes a timeout on the atf-run side:
2765 1 rump_allserver CALL __nanosleep50(0xfffffffffffea968,0xfffffffff$
2765 12 rump_allserver RET ___lwp_park50 -1 errno 60 Connection timed o$
2483 1 atf-run RET poll 0, 2
2765 12 rump_allserver CALL ___lwp_park50(0x49f158e8,0,0x4a504098,0x4a50$
2765 12 rump_allserver RET ___lwp_park50 -1 errno 60 Connection timed o$
2483 1 atf-run CALL poll(0x413120a0,2,0xfa)
2765 12 rump_allserver CALL ___lwp_park50(0x49f158e8,0,0x4a504098,0x4a50$
2765 1 rump_allserver RET __nanosleep50 0, -87720/0xfffffffffffea958
This makes it even more interesting that atf-run does not seem to timeout
the testcase after 60 seconds.
The atf-run process is in:
#0 0x000000000011de30 in atf::atf_run::muxer::read_one(unsigned long, int, std::string&, bool) ()
#1 0x000000000011e23c in atf::atf_run::muxer::flush() ()
#2 0x0000000000129a98 in atf::atf_run::run_test_case(atf::fs::path const&, std::string const&, std::string const&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, atf::fs::path const&, atf::fs::path const&, atf::atf_run::atf_tps_writer&) ()
#3 0x000000000010ba98 in atf_run::run_test_program(atf::fs::path const&, atf::atf_run::atf_tps_writer&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&) ()
#4 0x000000000010dab4 in atf_run::run_test(atf::fs::path const&, atf::atf_run::atf_tps_writer&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&) ()
#5 0x000000000010df0c in atf_run::main() ()
#6 0x000000004052c190 in atf::application::app::run(int, char* const*) ()
from /usr/lib/libatf-c++.so.0
#7 0x0000000000107ea0 in main ()
Unfortunately gdb is non-cooperative for the rump_allserver process - guess
I'll look at that first now ;-)
Martin
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: Antti Kantee <pooka@NetBSD.org>, Julio Merino <jmmv@NetBSD.org>,
rmind@NetBSD.org, gson@NetBSD.org
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sat, 16 Feb 2013 21:37:29 +0100
So it will only strike on SMP machines. The rump_allserver process does:
#0 0x00000000482094ec in __nanosleep50 () from /usr/lib/libpthread.so.1
#1 0x0000000048009db0 in rumpuser_nanosleep () from /usr/lib/librumpuser.so.0
#2 0x0000000047decc80 in rumpns_kpause () from /usr/lib/librump.so.0
#3 0x0000000047d92894 in rumpns_pserialize_perform ()
from /usr/lib/librump.so.0
#4 0x0000000046f20c8c in rumpns_npf_config_reload ()
from /usr/lib/librumpnet_npf.so.0
#5 0x0000000046f209ac in rumpns_npf_config_init ()
from /usr/lib/librumpnet_npf.so.0
#6 0x0000000046f215a8 in ?? () from /usr/lib/librumpnet_npf.so.0
#7 0x0000000046f215a8 in ?? () from /usr/lib/librumpnet_npf.so.0
Can we please urgently fix the pserialize vs. rump issue?
This leaves the open question why atf does not timeout - I have not idea
on that, ktrace shows it dropping out of poll() regularily.
Thanks,
Martin
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: Antti Kantee <pooka@NetBSD.org>, Julio Merino <jmmv@NetBSD.org>,
rmind@NetBSD.org, gson@NetBSD.org
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sat, 16 Feb 2013 21:53:51 +0100
On Sat, Feb 16, 2013 at 09:37:29PM +0100, Martin Husemann wrote:
> This leaves the open question why atf does not timeout - I have not idea
> on that, ktrace shows it dropping out of poll() regularily.
I verified that atf-run/timer.cpp marks the timer as fired and invokes
the callback.
Martin
From: Martin Husemann <martin@duskware.de>
To: julio@meroh.net
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sat, 16 Feb 2013 22:32:25 +0100
On Sat, Feb 16, 2013 at 04:29:12PM -0500, julio@meroh.net wrote:
> In all honesty, the code in atf-run that manages timeouts and muxes
> the output of the tests is quite retarded. I won't be surprised if
> there are bugs lingering around... but it's hard to explain why such a
> bug would only trigger in sparc64 if that's the case.
No, all is fine, timeout triggers but it does not manage to shut down the
test case.
> What happens to the rump_allserver process? Does it crash? Maybe
> attempting to reproduce its behavior under a different platform would
> trigger the same atf-run bug, if any?
It deadlocks in the rump kernel. It is killable from the host, but the
test case has no code to automatically do that.
Martin
From: julio@meroh.net
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sat, 16 Feb 2013 16:36:34 -0500
On Feb 16, 2013, at 16:32 , Martin Husemann <martin@duskware.de> wrote:
> On Sat, Feb 16, 2013 at 04:29:12PM -0500, julio@meroh.net wrote:
>=20
>> In all honesty, the code in atf-run that manages timeouts and muxes
>> the output of the tests is quite retarded. I won't be surprised if
>> there are bugs lingering around... but it's hard to explain why such =
a
>> bug would only trigger in sparc64 if that's the case.
>=20
> No, all is fine, timeout triggers but it does not manage to shut down =
the
> test case.
Is the test case sent the term signal? And the kill signal? Does it =
become a zombie?
Is rump_allserver in the same process group as the test case? Does it =
receive the signal and die too?
Lastly, maybe you can try changing timeout_callback() in timer.cpp so =
that it simply reads:
void
impl::child_timer::timeout_callback(void)
{
m_terminate =3D true;
::kill(-m_pid, SIGKILL);
}
That term/kill duality is... unnecessary.=
From: julio@meroh.net
To: martin@duskware.de
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sat, 16 Feb 2013 16:29:12 -0500
On Feb 16, 2013, at 15:55 , Martin Husemann <martin@duskware.de> wrote:
> The following reply was made to PR kern/47563; it has been noted by =
GNATS.
>=20
> From: Martin Husemann <martin@duskware.de>
> To: gnats-bugs@NetBSD.org
> Cc: Antti Kantee <pooka@NetBSD.org>, Julio Merino <jmmv@NetBSD.org>,
> rmind@NetBSD.org, gson@NetBSD.org
> Subject: Re: kern/47563: test cases do not timeout and hang
> Date: Sat, 16 Feb 2013 21:53:51 +0100
>=20
> On Sat, Feb 16, 2013 at 09:37:29PM +0100, Martin Husemann wrote:
>> This leaves the open question why atf does not timeout - I have not =
idea
>> on that, ktrace shows it dropping out of poll() regularily.
>=20
> I verified that atf-run/timer.cpp marks the timer as fired and invokes
> the callback.
In all honesty, the code in atf-run that manages timeouts and muxes the =
output of the tests is quite retarded. I won't be surprised if there =
are bugs lingering around... but it's hard to explain why such a bug =
would only trigger in sparc64 if that's the case.
What happens to the rump_allserver process? Does it crash? Maybe =
attempting to reproduce its behavior under a different platform would =
trigger the same atf-run bug, if any?=
From: Jukka Ruohonen <jruohonen@iki.fi>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sun, 17 Feb 2013 06:21:58 +0200
On Sat, Feb 16, 2013 at 08:40:06PM +0000, Martin Husemann wrote:
> Can we please urgently fix the pserialize vs. rump issue?
For urgent cases, the test case should be bluntly marked as broken.
- Jukka.
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sun, 17 Feb 2013 10:13:14 +0100
On Sun, Feb 17, 2013 at 04:25:03AM +0000, Jukka Ruohonen wrote:
> For urgent cases, the test case should be bluntly marked as broken.
The pserialize issue breaks all fss related tests and all others that
try to start a rump environment including the npf rump components (like
rump_allserver).
It only works on non SMP machines (pserialize is a NO-OP there) or by accident
on x86.
IMHO the proper fix for this should be discussed publically ASAP, instead
privately or not at all. According to our commit rules, if taken formally,
both the fss and npf changes which made those subsystems use pserialize
would need to be backed out untill a real solution is found.
Martin
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.