NetBSD Problem Report #47563

From martin@duskware.de  Wed Feb 13 13:24:13 2013
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id 0BB6063E5A2
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 13 Feb 2013 13:24:13 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: test cases do not timeout and hang
X-Send-Pr-Version: 3.95

>Number:         47563
>Category:       kern
>Synopsis:       test cases do not timeout and hang
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Feb 13 13:25:01 +0000 2013
>Last-Modified:  Sun Feb 17 09:15:02 +0000 2013
>Originator:     Martin Husemann
>Release:        NetBSD 6.99.16
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD thirdstage.duskware.de 6.99.16 NetBSD 6.99.16 (MODULAR) #43: Wed Feb 13 10:58:39 CET 2013 martin@night-porter.duskware.de:/usr/src/sys/arch/sparc64/compile/MODULAR sparc64
Architecture: sparc64
Machine: sparc64
>Description:

Automatic test runs fail, hang forever and the atf timeout does not work.
When I manually kill rump_allserver, atf-report says:

dev/sysmon/t_swsensor (93/558): 5 test cases
    alarm_sensor: [6595.641156s] Failed: Test case timed out after 60 seconds

This is reproducable.
Last successfull test run (with -current mostly unbuildable in between) was
on 2013-02-05.

>How-To-Repeat:

cd /usr/tests/dev/sysmon
atf-run t_swsensor | atf-report

>Fix:
n/a

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: Antti Kantee <pooka@NetBSD.org>, Julio Merino <jmmv@NetBSD.org>
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Fri, 15 Feb 2013 10:50:26 +0100

 Just for the record: this still fails the same with -current and newer atf:

 dev/sysmon/t_swsensor (93/558): 5 test cases
     alarm_sensor: [583.955578s] Failed: Test case timed out after 60 seconds

 This makes tests runs impossible - I'm urgently looking for ideas what could
 be at fault. Any ideas?

 Note there are two bugs in here, a hint for either the missing atf timeout
 or the rump failure would be much appreciated!

 Martin

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: Antti Kantee <pooka@NetBSD.org>, Julio Merino <jmmv@NetBSD.org>
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Fri, 15 Feb 2013 17:32:43 +0100

 A bit more data: I can reproduce the failure on various sparc64 SMP
 machines. It seems to not affect the uniprocessor automatic test runs
 (including the sparc ones), and seems to not be reproducable on x86.

 The test case, when run as "atf-test t_swsensor" alone, does not output
 anything exciting:

 ...
 tps-count: 1
 tp-start: 1360944907.365325, t_swsensor, 5
 tc-start: 1360944907.365395, alarm_sensor

 ... and there i hangs untill I manually kill rump_allserver. A ktrace shows
 not a lot activity (and neither the rump process nor atf-run gather a lot
 of cpu usage):

   2765     12 rump_allserver CALL  ___lwp_park50(0x49f158e8,0,0x4a504098,0x4a50$
   2765      6 rump_allserver RET   ___lwp_park50 -1 errno 60 Connection timed o$
   2765      1 rump_allserver RET   __nanosleep50 0, -87720/0xfffffffffffea958   
   2765      1 rump_allserver CALL  _lwp_unpark(7,0x48830998)
   2765      6 rump_allserver CALL  ___lwp_park50(0x493096e8,0,0x4c204098,0x4c20$
   2765      6 rump_allserver RET   ___lwp_park50 -1 errno 60 Connection timed o$
   2765      1 rump_allserver RET   _lwp_unpark 0, 1216547224/0x48830998
   2765      6 rump_allserver CALL  ___lwp_park50(0x493096e8,0,0x4c204098,0x4c20$
   2765      1 rump_allserver CALL  _lwp_unpark_all(0x482109f8,1,0x48830698)
   2765      1 rump_allserver RET   _lwp_unpark_all 0, 1
   2765      7 rump_allserver RET   ___lwp_park50 0
   2765      1 rump_allserver CALL  ___lwp_park50(0,0,0x488309d8,0x488309d8)
   2765      9 rump_allserver RET   ___lwp_park50 0
   2765      7 rump_allserver CALL  _lwp_unpark_all(0x4885e0f8,1,0x48830618)
   2765      7 rump_allserver RET   _lwp_unpark_all 0, 1
   2765      9 rump_allserver CALL  ___lwp_park50(0,0,0x48830698,0x48830698)
   2765      7 rump_allserver CALL  ___lwp_park50(0,0,0x48830998,0x48830998)
   2765      3 rump_allserver RET   ___lwp_park50 0
   2765      3 rump_allserver CALL  _lwp_unpark(1,0x488309d8)
   2765      3 rump_allserver RET   _lwp_unpark 0, 1216547288/0x488309d8

 and sometimes a timeout on the atf-run side:

   2765      1 rump_allserver CALL  __nanosleep50(0xfffffffffffea968,0xfffffffff$
   2765     12 rump_allserver RET   ___lwp_park50 -1 errno 60 Connection timed o$
   2483      1 atf-run  RET   poll 0, 2
   2765     12 rump_allserver CALL  ___lwp_park50(0x49f158e8,0,0x4a504098,0x4a50$
   2765     12 rump_allserver RET   ___lwp_park50 -1 errno 60 Connection timed o$
   2483      1 atf-run  CALL  poll(0x413120a0,2,0xfa)   
   2765     12 rump_allserver CALL  ___lwp_park50(0x49f158e8,0,0x4a504098,0x4a50$
   2765      1 rump_allserver RET   __nanosleep50 0, -87720/0xfffffffffffea958


 This makes it even more interesting that atf-run does not seem to timeout
 the testcase after 60 seconds.

 The atf-run process is in:

 #0  0x000000000011de30 in atf::atf_run::muxer::read_one(unsigned long, int, std::string&, bool) ()
 #1  0x000000000011e23c in atf::atf_run::muxer::flush() ()
 #2  0x0000000000129a98 in atf::atf_run::run_test_case(atf::fs::path const&, std::string const&, std::string const&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, atf::fs::path const&, atf::fs::path const&, atf::atf_run::atf_tps_writer&) ()
 #3  0x000000000010ba98 in atf_run::run_test_program(atf::fs::path const&, atf::atf_run::atf_tps_writer&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&) ()
 #4  0x000000000010dab4 in atf_run::run_test(atf::fs::path const&, atf::atf_run::atf_tps_writer&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&) ()
 #5  0x000000000010df0c in atf_run::main() ()
 #6  0x000000004052c190 in atf::application::app::run(int, char* const*) ()
    from /usr/lib/libatf-c++.so.0
 #7  0x0000000000107ea0 in main ()

 Unfortunately gdb is non-cooperative for the rump_allserver process - guess
 I'll look at that first now ;-)

 Martin

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: Antti Kantee <pooka@NetBSD.org>, Julio Merino <jmmv@NetBSD.org>,
	rmind@NetBSD.org, gson@NetBSD.org
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sat, 16 Feb 2013 21:37:29 +0100

 So it will only strike on SMP machines. The rump_allserver process does:

 #0  0x00000000482094ec in __nanosleep50 () from /usr/lib/libpthread.so.1
 #1  0x0000000048009db0 in rumpuser_nanosleep () from /usr/lib/librumpuser.so.0
 #2  0x0000000047decc80 in rumpns_kpause () from /usr/lib/librump.so.0
 #3  0x0000000047d92894 in rumpns_pserialize_perform ()
    from /usr/lib/librump.so.0
 #4  0x0000000046f20c8c in rumpns_npf_config_reload ()
    from /usr/lib/librumpnet_npf.so.0
 #5  0x0000000046f209ac in rumpns_npf_config_init ()
    from /usr/lib/librumpnet_npf.so.0
 #6  0x0000000046f215a8 in ?? () from /usr/lib/librumpnet_npf.so.0
 #7  0x0000000046f215a8 in ?? () from /usr/lib/librumpnet_npf.so.0

 Can we please urgently fix the pserialize vs. rump issue?

 This leaves the open question why atf does not timeout - I have not idea
 on that, ktrace shows it dropping out of poll() regularily.

 Thanks,

 Martin

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: Antti Kantee <pooka@NetBSD.org>, Julio Merino <jmmv@NetBSD.org>,
	rmind@NetBSD.org, gson@NetBSD.org
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sat, 16 Feb 2013 21:53:51 +0100

 On Sat, Feb 16, 2013 at 09:37:29PM +0100, Martin Husemann wrote:
 > This leaves the open question why atf does not timeout - I have not idea
 > on that, ktrace shows it dropping out of poll() regularily.

 I verified that atf-run/timer.cpp marks the timer as fired and invokes
 the callback.

 Martin

From: Martin Husemann <martin@duskware.de>
To: julio@meroh.net
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sat, 16 Feb 2013 22:32:25 +0100

 On Sat, Feb 16, 2013 at 04:29:12PM -0500, julio@meroh.net wrote:

 > In all honesty, the code in atf-run that manages timeouts and muxes
 > the output of the tests is quite retarded.  I won't be surprised if
 > there are bugs lingering around... but it's hard to explain why such a
 > bug would only trigger in sparc64 if that's the case.

 No, all is fine, timeout triggers but it does not manage to shut down the
 test case.

 > What happens to the rump_allserver process?  Does it crash?  Maybe
 > attempting to reproduce its behavior under a different platform would
 > trigger the same atf-run bug, if any?

 It deadlocks in the rump kernel. It is killable from the host, but the
 test case has no code to automatically do that.

 Martin

From: julio@meroh.net
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sat, 16 Feb 2013 16:36:34 -0500

 On Feb 16, 2013, at 16:32 , Martin Husemann <martin@duskware.de> wrote:

 > On Sat, Feb 16, 2013 at 04:29:12PM -0500, julio@meroh.net wrote:
 >=20
 >> In all honesty, the code in atf-run that manages timeouts and muxes
 >> the output of the tests is quite retarded.  I won't be surprised if
 >> there are bugs lingering around... but it's hard to explain why such =
 a
 >> bug would only trigger in sparc64 if that's the case.
 >=20
 > No, all is fine, timeout triggers but it does not manage to shut down =
 the
 > test case.

 Is the test case sent the term signal?  And the kill signal?  Does it =
 become a zombie?

 Is rump_allserver in the same process group as the test case?  Does it =
 receive the signal and die too?

 Lastly, maybe you can try changing timeout_callback() in timer.cpp so =
 that it simply reads:

 void
 impl::child_timer::timeout_callback(void)
 {
     m_terminate =3D true;
     ::kill(-m_pid, SIGKILL);
 }

 That term/kill duality is... unnecessary.=

From: julio@meroh.net
To: martin@duskware.de
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sat, 16 Feb 2013 16:29:12 -0500

 On Feb 16, 2013, at 15:55 , Martin Husemann <martin@duskware.de> wrote:

 > The following reply was made to PR kern/47563; it has been noted by =
 GNATS.
 >=20
 > From: Martin Husemann <martin@duskware.de>
 > To: gnats-bugs@NetBSD.org
 > Cc: Antti Kantee <pooka@NetBSD.org>, Julio Merino <jmmv@NetBSD.org>,
 > 	rmind@NetBSD.org, gson@NetBSD.org
 > Subject: Re: kern/47563: test cases do not timeout and hang
 > Date: Sat, 16 Feb 2013 21:53:51 +0100
 >=20
 > On Sat, Feb 16, 2013 at 09:37:29PM +0100, Martin Husemann wrote:
 >> This leaves the open question why atf does not timeout - I have not =
 idea
 >> on that, ktrace shows it dropping out of poll() regularily.
 >=20
 > I verified that atf-run/timer.cpp marks the timer as fired and invokes
 > the callback.

 In all honesty, the code in atf-run that manages timeouts and muxes the =
 output of the tests is quite retarded.  I won't be surprised if there =
 are bugs lingering around... but it's hard to explain why such a bug =
 would only trigger in sparc64 if that's the case.

 What happens to the rump_allserver process?  Does it crash?  Maybe =
 attempting to reproduce its behavior under a different platform would =
 trigger the same atf-run bug, if any?=

From: Jukka Ruohonen <jruohonen@iki.fi>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sun, 17 Feb 2013 06:21:58 +0200

 On Sat, Feb 16, 2013 at 08:40:06PM +0000, Martin Husemann wrote:
 >  Can we please urgently fix the pserialize vs. rump issue?

 For urgent cases, the test case should be bluntly marked as broken.

 - Jukka.

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/47563: test cases do not timeout and hang
Date: Sun, 17 Feb 2013 10:13:14 +0100

 On Sun, Feb 17, 2013 at 04:25:03AM +0000, Jukka Ruohonen wrote:
 >  For urgent cases, the test case should be bluntly marked as broken.

 The pserialize issue breaks all fss related tests and all others that
 try to start a rump environment including the npf rump components (like
 rump_allserver).

 It only works on non SMP machines (pserialize is a NO-OP there) or by accident
 on x86.

 IMHO the proper fix for this should be discussed publically ASAP, instead
 privately or not at all. According to our commit rules, if taken formally,
 both the fss and npf changes which made those subsystems use pserialize 
 would need to be backed out untill a real solution is found.

 Martin

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.