NetBSD Problem Report #41302

From martin@duskware.de  Wed Apr 29 08:59:53 2009
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 4D7C363BC62
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 29 Apr 2009 08:59:53 +0000 (UTC)
Message-Id: <20090429085949.A971133AAC@mail.duskware.de>
Date: Wed, 29 Apr 2009 10:59:44 +0200 (CEST)
From: martin@duskware.de
Reply-To: martin@duskware.de
To: gnats-bugs@gnats.NetBSD.org
Subject: cron dies at startup
X-Send-Pr-Version: 3.95

>Number:         41302
>Category:       port-sparc64
>Synopsis:       cron dies at startup
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    martin
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Apr 29 09:00:00 +0000 2009
>Closed-Date:    Thu May 21 13:26:17 +0000 2009
>Last-Modified:  Tue May 26 19:20:08 +0000 2009
>Originator:     Martin Husemann
>Release:        NetBSD 5.0
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD setting-sun.duskware.de 5.0 NetBSD 5.0 (SETTINGSUN) #1: Wed Apr 29 07:57:43 CEST 2009 martin@night-porter.duskware.de:/usr/src-5/sys/arch/sparc64/compile/SETTINGSUN sparc64
Architecture: sparc64
Machine: sparc64
>Description:

After upgrading my system to 5.0, cron sometimes goes missing. This is not
100% reproducible, but it often happens at system startup (i.e. init running
/etc/rc). When I log in as root and run "/etc/rc.d/cron start" manually,
it always seems to work and cron keeps running.

Maybe some resource limit problem preventing the initial fork?

>How-To-Repeat:
s/a
>Fix:
n/a

>Release-Note:

>Audit-Trail:
From: "Jeremy C. Reed" <reed@reedmedia.net>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/41302: cron dies at startup
Date: Wed, 29 Apr 2009 09:02:17 -0500 (CDT)

 I wonder if this is related to
 http://mail-index.netbsd.org/netbsd-users/2009/02/09/msg002977.html

From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, martin@duskware.de
Subject: re: bin/41302: cron dies at startup
Date: Thu, 30 Apr 2009 17:06:29 +1000


    From: "Jeremy C. Reed" <reed@reedmedia.net>
    To: gnats-bugs@NetBSD.org
    Cc: 
    Subject: Re: bin/41302: cron dies at startup
    Date: Wed, 29 Apr 2009 09:02:17 -0500 (CDT)

     I wonder if this is related to
     http://mail-index.netbsd.org/netbsd-users/2009/02/09/msg002977.html


 i think it's the same problem.

 that one is very strange.  clearly it is dying in daemon() right
 after fork returns... but it has something to do with SIGHUP occurring
 right at this moment:


    429      1 cron     CALL  fork
    429      1 cron     RET   fork 343/0x157
    429      1 cron     CALL  exit(0)
    343      1 cron     EMUL  "netbsd"
    343      1 cron     PSIG  SIGHUP caught handler=0x102880 mask=(): code=SI_NOINFO
    343      1 cron     RET   fork 0
    343      1 cron     CALL  setcontext(0xffffffffffffb660)
    343      1 cron     RET   setcontext JUSTRETURN
    343      1 cron     CALL  getpid
    343      1 cron     RET   getpid 343/0x157, 1
    343      1 cron     CALL  gettimeofday(0xffffffffffffab40,0)
    343      1 cron     RET   gettimeofday 0


 429 is the parent, and 343 is the child.  the parent fork()'s and exits
 just like in daemon() but the child doesn't really get to run any more.
 the first thing it should do is call setsid(), but we don't see that
 before we see the failure starting (getpid/gettimeofday both are used
 to generate the failure message.)

 cron has a SIGHUP handler that looks like:

 static void
 sighup_handler(int x __unused)
 {
         log_close();
 }

 void 
 log_close(void) { 
         if (LogFD != ERR) {
                 close(LogFD);
                 LogFD = ERR;
         }
 }

 in the above log, pid 343 starts in emul netbsd, gets a SIGHUP and
 has a handler (does it run here?  i'm not sure.) but then we get
 the RET into this child right after, and then a setcontext... i'm
 not sure what exactly is going on here, but this is clearly where
 it all goes wrong.  why is a SIGHUP happening, and why is it making
 the child fail?


 .mrg.

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/41302: cron dies at startup
Date: Sun, 3 May 2009 11:04:27 +0200

 This is a funny one:

 init (pid 1) runs /bin/sh (pid 2) to execute /etc/rc. Now in my startup, cron
 is the last daemon to start. While cron is doing the daemonize() dance,
 sh is done and exits, via exit1(), which contains this code:

         if (tp->t_session == sp) {
                 /* we can't guarantee the revoke will do this */
                 pgrp = tp->t_pgrp;
                 tp->t_pgrp = NULL;
                 tp->t_session = NULL;
                 mutex_spin_exit(&tty_lock);
                 if (pgrp != NULL) {
                         pgsignal(pgrp, SIGHUP, 1);
                 }

 pgrp is 2, and the cron parent process is still in this group.

 *booom*

 I wonder if daemonize() should sigignore SIGHUP (and undo that in the child)?

 Martin
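The point that a freshly forked process is still in the old process group until it calls setsid() can be illustrated with a small userland check. This is a sketch assuming a POSIX system; the function name pgrp_before_and_after_setsid is hypothetical. It only inspects process group IDs and sends no signals.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child and reports, as a bit mask in the return value:
 *   bit 0 (1): right after fork() the child was in the parent's group
 *   bit 1 (2): after setsid() the child led its own new group
 * Returns -1 if fork() or waitpid() failed.
 */
int pgrp_before_and_after_setsid(void)
{
    pid_t parent_pgrp = getpgrp();
    pid_t pid;
    int status;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Until setsid() runs, a SIGHUP sent to the parent's
         * process group would also be delivered here. */
        int inherited = (getpgrp() == parent_pgrp);
        int detached = (setsid() != (pid_t)-1) && (getpgrp() == getpid());
        _exit((inherited ? 1 : 0) | (detached ? 2 : 0));
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Both bits set means the child started out exposed to group-wide signals and only left the group once setsid() succeeded, which is the window the shell's exit falls into here.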

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/41302: cron dies at startup
Date: Sun, 3 May 2009 12:14:28 +0200

 --GvXjxJ+pjyke8COw
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 This is a patch similar to what FreeBSD did to fix this problem (the only
 difference is restoring the signal handler when fork() fails).

 OK to commit?

 Martin

 --GvXjxJ+pjyke8COw
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=patch

 Index: daemon.c
 ===================================================================
 RCS file: /cvsroot/src/lib/libc/gen/daemon.c,v
 retrieving revision 1.9
 diff -u -r1.9 daemon.c
 --- daemon.c	7 Aug 2003 16:42:46 -0000	1.9
 +++ daemon.c	3 May 2009 10:11:17 -0000
 @@ -39,9 +39,11 @@
  #endif /* LIBC_SCCS and not lint */

  #include "namespace.h"
 +#include <errno.h>
  #include <fcntl.h>
  #include <paths.h>
  #include <stdlib.h>
 +#include <signal.h>
  #include <unistd.h>

  #ifdef __weak_alias
 @@ -52,10 +54,25 @@
  daemon(nochdir, noclose)
  	int nochdir, noclose;
  {
 +	struct sigaction osa, sa;
  	int fd;
 +	pid_t newgrp;
 +	int oerrno;
 +	int osa_ok;
 +
 +	/* A SIGHUP may be thrown when the parent exits below. */
 +	sigemptyset(&sa.sa_mask);
 +	sa.sa_handler = SIG_IGN;
 +	sa.sa_flags = 0;
 +	osa_ok = sigaction(SIGHUP, &sa, &osa);

  	switch (fork()) {
  	case -1:
 +		if (osa_ok != -1) {
 +			oerrno = errno;
 +			sigaction(SIGHUP, &osa, NULL);
 +			errno = oerrno;
 +		}
  		return (-1);
  	case 0:
  		break;
 @@ -63,8 +80,14 @@
  		_exit(0);
  	}

 -	if (setsid() == -1)
 +	newgrp = setsid();
 +	oerrno = errno;
 +	if (osa_ok != -1)
 +		sigaction(SIGHUP, &osa, NULL);
 +	if (newgrp == -1) {
 +		errno = oerrno;
  		return (-1);
 +	}

  	if (!nochdir)
  		(void)chdir("/");

 --GvXjxJ+pjyke8COw--

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/41302: cron dies at startup
Date: Mon, 18 May 2009 11:25:37 +0200

 On Sun, May 03, 2009 at 11:04:27AM +0200, Martin Husemann wrote:
 > pgrp is 2, and the cron parent process is still in this group.

 I looked a bit further and it seems that the cron signal handler runs,
 returns, and then fork returns to the daemonize() call with child pid = -1
 and errno = 0.

 Looks like a sparc64 specific bug...

 Martin

From: Martin Husemann <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/41302 CVS commit: src/sys/arch/sparc64/sparc64
Date: Thu, 21 May 2009 13:24:38 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Thu May 21 13:24:38 UTC 2009

 Modified Files:
 	src/sys/arch/sparc64/sparc64: vm_machdep.c

 Log Message:
 Deja Vu: when preparing the initial trap frame for a new forked lwp,
 explicitly clear condition code. Otherwise we might catch a signal
 (handlers are inherited from the parent) before we ever return to
 userland. The current trapframe is converted into a ucontext and after
 the signal handler returns, the lwp stays in userland and directly
 uses the ucontext to return to the fork call.
 Fixes PR 41302.


 To generate a diff of this commit:
 cvs rdiff -u -r1.87 -r1.88 src/sys/arch/sparc64/sparc64/vm_machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
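Why a stale condition code would show up in userland as fork() returning -1 with errno 0 can be illustrated with a toy model. This is not the code from vm_machdep.c. It assumes the usual SPARC system call convention, in which the carry bit of the condition codes flags a failed call and the return register then holds the error number. It further assumes that the copied frame still had the carry bit set. All names (toy_trapframe, toy_child_frame, toy_stub_return) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for saved user registers. */
struct toy_trapframe {
    bool carry; /* models the carry bit: set means "syscall failed" */
    long o0;    /* models the syscall return register */
};

/* What the caller of the syscall wrapper ends up seeing. */
struct toy_result {
    long ret;
    int err;
};

/*
 * Models the userland syscall stub under the assumed convention:
 * carry set   -> return -1 and take the error number from o0,
 * carry clear -> return o0 unchanged.
 */
struct toy_result toy_stub_return(const struct toy_trapframe *tf)
{
    struct toy_result r;

    assert(tf != NULL);
    if (tf->carry) {
        r.ret = -1;
        r.err = (int)tf->o0;
    } else {
        r.ret = tf->o0;
        r.err = 0;
    }
    return r;
}

/*
 * Models preparing the child's initial frame from a copy of the
 * parent's. The child's return value is 0; clear_cc selects whether
 * the carry bit is reset as well (the fix) or left as copied.
 */
struct toy_trapframe toy_child_frame(struct toy_trapframe parent, bool clear_cc)
{
    struct toy_trapframe child = parent;

    child.o0 = 0;
    if (clear_cc)
        child.carry = false;
    return child;
}
```

In this model, a child frame built with clear_cc false from a parent frame whose carry bit is set yields ret = -1 and err = 0, matching the symptom reported earlier in the thread. With clear_cc true the child sees the expected return value of 0.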

Responsible-Changed-From-To: bin-bug-people->martin
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Thu, 21 May 2009 13:26:17 +0000
Responsible-Changed-Why:
I broke it (again)


State-Changed-From-To: open->closed
State-Changed-By: martin@NetBSD.org
State-Changed-When: Thu, 21 May 2009 13:26:17 +0000
State-Changed-Why:
I fixed it


From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/41302 CVS commit: [netbsd-5-0] src/sys/arch/sparc64/sparc64
Date: Tue, 26 May 2009 19:18:05 +0000

 Module Name:	src
 Committed By:	snj
 Date:		Tue May 26 19:18:05 UTC 2009

 Modified Files:
 	src/sys/arch/sparc64/sparc64 [netbsd-5-0]: vm_machdep.c

 Log Message:
 Pull up following revision(s) (requested by martin in ticket #774):
 	sys/arch/sparc64/sparc64/vm_machdep.c: revision 1.88
 Deja Vu: when preparing the initial trap frame for a new forked lwp,
 explicitly clear condition code. Otherwise we might catch a signal
 (handlers are inherited from the parent) before we ever return to
 userland. The current trapframe is converted into a ucontext and after
 the signal handler returns, the lwp stays in userland and directly
 uses the ucontext to return to the fork call.
 Fixes PR 41302.


 To generate a diff of this commit:
 cvs rdiff -u -r1.84 -r1.84.6.1 src/sys/arch/sparc64/sparc64/vm_machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/41302 CVS commit: [netbsd-5] src/sys/arch/sparc64/sparc64
Date: Tue, 26 May 2009 19:19:53 +0000

 Module Name:	src
 Committed By:	snj
 Date:		Tue May 26 19:19:53 UTC 2009

 Modified Files:
 	src/sys/arch/sparc64/sparc64 [netbsd-5]: vm_machdep.c

 Log Message:
 Pull up following revision(s) (requested by martin in ticket #774):
 	sys/arch/sparc64/sparc64/vm_machdep.c: revision 1.88
 Deja Vu: when preparing the initial trap frame for a new forked lwp,
 explicitly clear condition code. Otherwise we might catch a signal
 (handlers are inherited from the parent) before we ever return to
 userland. The current trapframe is converted into a ucontext and after
 the signal handler returns, the lwp stays in userland and directly
 uses the ucontext to return to the fork call.
 Fixes PR 41302.


 To generate a diff of this commit:
 cvs rdiff -u -r1.84 -r1.84.4.1 src/sys/arch/sparc64/sparc64/vm_machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:
