NetBSD Problem Report #44500

From chuck@xxx.pdl.cmu.edu  Mon Jan 31 20:38:09 2011
Return-Path: <chuck@xxx.pdl.cmu.edu>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 8EF6763B873
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 31 Jan 2011 20:38:09 +0000 (UTC)
Message-Id: <20110131203754.19AA116134@xxx.pdl.cmu.edu>
Date: Mon, 31 Jan 2011 15:37:54 -0500 (EST)
From: chuck@ece.cmu.edu
Reply-To: chuck@ece.cmu.edu
To: gnats-bugs@gnats.NetBSD.org
Subject: 4.0 sa threaded apps hard hang netbsd-5 and HEAD kernels on some ports [cpu_setfunc() related]
X-Send-Pr-Version: 3.95

>Number:         44500
>Category:       kern
>Synopsis:       4.0 sa threaded apps hard hang netbsd-5 and HEAD kernels on some ports [cpu_setfunc() related]
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    martin
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Jan 31 20:40:00 +0000 2011
>Closed-Date:    Sun Oct 06 10:52:54 +0000 2013
>Last-Modified:  Sun Oct 06 10:52:54 +0000 2013
>Originator:     Chuck Cranor
>Release:        netbsd-5 branch, June 10th 2009 and later, also HEAD if SA compat support is enabled (kern.no_sa_support=0)
>Organization:
Carnegie Mellon University
>Environment:
	NetBSD/alpha
	netbsd-5: 
	NetBSD 5.0_STABLE (GENERIC-$Revision: 1.325 $) #8: Mon Jan 31 13:24:52 EST 2011
	chuck@xxx.pdl.cmu.edu:/.amd/flow/home/chuck/src/netbsd/cur/src/sys/arch/alpha/compile/GENERIC


	HEAD: 
	NetBSD 5.99.44 (GENERIC-$Revision: 1.338 $) #1: Sun Jan 30 13:20:09 EST 2011
	chuck@xxxcdc:/Users/chuck/local/cvsco/netbsd/cur/src/sys/arch/alpha/compile/GENERIC

Architecture: alpha
Machine: alpha
>Description:

The problem was introduced to the netbsd-5 branch via netbsd-5 ticket
number 798: 
    http://releng.netbsd.org/cgi-bin/req-5.cgi?show=798

So a NetBSD 5.0 kernel is OK, but a NetBSD 5.0.2 kernel is not.

    The changes to vm_machdep.c appear to have removed the
call to cpu_setfunc() from cpu_lwp_fork() and replaced it
with the actual content of the old cpu_setfunc() function.
The net result here is that the behavior of cpu_lwp_fork() 
does not change, but it no longer calls cpu_setfunc().

    The old cpu_setfunc() is now replaced with a new stripped-down
version that calls setfunc_trampoline() instead of
lwp_trampoline()  [the s3 register is no longer set up or used].
The only thing that calls cpu_setfunc() is now compat_sa.c
( cpu_lwp_fork() no longer calls it ).

    The main difference between lwp_trampoline() and the new
setfunc_trampoline() is that setfunc_trampoline() no longer
calls lwp_startup().   Removing the call to lwp_startup() causes
the alpha to hang hard if you run a 4.0 threaded app like "dig"...

    So, lwp_startup() does something that keeps the system from
hanging.   To figure out what that was, I started adding bits
of lwp_startup() to setfunc_trampoline() until the system
stopped hanging.   It turns out the two critical bits are:

void
xlwp_startup(struct lwp *prev, struct lwp *new)
{
        if (prev != NULL) {
                curcpu()->ci_mtx_count++;  /*YES*/
                prev->l_ctxswtch = 0;      /*YES*/
        }
}

    Put that much of lwp_startup() back into setfunc_trampoline(), and 
the system no longer hangs when you run "dig"... a complete diff
that applies to a netbsd-5 branch checked out on date 10-Jun-2009
(e.g. with "cvs -q update -r netbsd-5 -dP -D 10-Jun-2009") is included
at the end.

    You need both the l_ctxswtch and ci_mtx_count statements.
If you comment out the "l_ctxswtch" statement, the system hangs
as soon as you run "dig".    If you comment out the ci_mtx_count
statement, the system runs "dig" (it prints an error message to
console) but then hangs when "dig" exits.   I couldn't get into DDB
in either case.


    The hard hang occurs in mi_switch() ... the kernel gets stuck
in an endless loop here (I added the debugging printf):

                /*
                 * We may need to spin-wait for if 'newl' is still
                 * context switching on another CPU.
                 */

               if (newl->l_ctxswtch != 0) {
                        u_int count;
                        count = SPINLOCK_BACKOFF_MIN;
                        while (newl->l_ctxswtch) {
                                SPINLOCK_BACKOFF(count);
printf("POINTA\n");  /*XXXCDC*/
                        }
                }

It just prints "POINTA" endlessly.   Note that my system has only one
CPU (so the case the comment describes does not apply).  Because
interrupts are disabled, it is not possible to break into DDB while
stuck in that while() loop; the system is hung.


Looking at HEAD, the current state of the tree is not uniform:

arch	cpu_setfunc calls       does it call lwp_startup?  when changed?
------- ----------------------  ----------------------------------------
acorn26 lwp_trampoline		yes 
alpha	setfunc_trampoline	no (vm_machdep.c 1.100, 2009/06/01)
arm32	lwp_trampoline		yes
hppa	setfunc_trampoline	no (vm_machdep.c 1.36, 2009/06/03)
m68k	setfunc_trampoline	no (vm_machdep.c 1.28, 2009/05/30)
mips	setfunc_trampoline	no (vm_machdep.c 1.123, 2009/05/30)
powerpc	setfunc_trampoline	no (vm_machdep.c 1.77, 2009/06/07)
sh3	lwp_setfunc_trampoline	no (never called lwp_startup?)
sparc	lwp_setfunc_trampoline	no (vm_machdep.c 1.100, 2009/05/29)
sparc64	lwp_setfunc_trampoline	no (vm_machdep.c 1.89, 2009/05/30)
x86	lwp_trampoline		yes

The "no" ports are likely to have problems with compat_sa binaries,
I think.


>How-To-Repeat:

	Find a NetBSD 4.0 binary that uses SA threads.  A statically
	linked version of /usr/bin/dig will do... here is an alpha one:

	http://yogi.pdl.cmu.edu/~chuck/tmp/dig.static.gz

	Boot the system single-user, enable the SA compat code (if on HEAD),
	and run the binary.

	>>> boot -file testin -fl s
        ...
        # mount -r /usr
        # sysctl -w kern.no_sa_support=0
        # /root/dig.static
        << system hangs, power cycle required to recover >>


>Fix:


This is just a workaround, not a fix:

Index: arch/alpha/alpha/locore.s
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/locore.s,v
retrieving revision 1.113.10.1
diff -u -r1.113.10.1 locore.s
--- arch/alpha/alpha/locore.s	9 Jun 2009 17:38:38 -0000	1.113.10.1
+++ arch/alpha/alpha/locore.s	30 Jan 2011 03:47:33 -0000
@@ -752,6 +752,9 @@
  * Simplified version of above: don't call lwp_startup()
  */
 LEAF_NOPROFILE(setfunc_trampoline, 0)
+	mov	v0, a0   /* NEW */
+	mov	s3, a1   /* NEW */
+	CALL(xlwp_startup)   /* NEW */
 	mov	s0, pv
 	mov	s1, ra
 	mov	s2, a0
Index: arch/alpha/alpha/vm_machdep.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/vm_machdep.c,v
retrieving revision 1.96.30.1
diff -u -r1.96.30.1 vm_machdep.c
--- arch/alpha/alpha/vm_machdep.c	9 Jun 2009 17:38:39 -0000	1.96.30.1
+++ arch/alpha/alpha/vm_machdep.c	30 Jan 2011 03:47:33 -0000
@@ -228,6 +228,8 @@
 	    (u_int64_t)exception_return;	/* s1: ra */
 	up->u_pcb.pcb_context[2] =
 	    (u_int64_t)arg;			/* s2: arg */
+	up->u_pcb.pcb_context[3] =
+	    (u_int64_t)l;			/* s3: lwp */
 	up->u_pcb.pcb_context[7] =
 	    (u_int64_t)setfunc_trampoline;	/* ra: assembly magic */
 }	
Index: kern/kern_lwp.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_lwp.c,v
retrieving revision 1.126.2.2
diff -u -r1.126.2.2 kern_lwp.c
--- kern/kern_lwp.c	8 Mar 2009 03:15:36 -0000	1.126.2.2
+++ kern/kern_lwp.c	30 Jan 2011 03:48:08 -0000
@@ -706,6 +706,22 @@
 	}
 }

+
+/*
+ * Called by MD code when a new LWP begins execution.  Must be called
+ * with the previous LWP locked (so at splsched), or if there is no
+ * previous LWP, at splsched.
+ */
+void xlwp_startup(struct lwp *prev, struct lwp *new);
+void
+xlwp_startup(struct lwp *prev, struct lwp *new)
+{
+	if (prev != NULL) {
+		curcpu()->ci_mtx_count++;  /*YES*/
+		prev->l_ctxswtch = 0;      /*YES*/
+	}
+}
+
 /*
  * Exit an LWP.
  */

>Release-Note:

>Audit-Trail:
From: "Valeriy E. Ushakov" <uwe@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/44500 CVS commit: src/sys/arch/sh3/sh3
Date: Tue, 1 Feb 2011 01:54:15 +0000

 Module Name:	src
 Committed By:	uwe
 Date:		Tue Feb  1 01:54:14 UTC 2011

 Modified Files:
 	src/sys/arch/sh3/sh3: locore_subr.S vm_machdep.c

 Log Message:
 cpu_setfunc() must use lwp_trampoline to arrange for the recycled lwp
 to go through lwp_startup() the first time it's switched to.

 This makes NetBSD-4.x SA binaries work on current.  Tested with dig(1)
 in NetBSD-4.x chroot on landisk.

 Looks like this mistake of mine was picked up and replicated in
 several other ports, sorry.

 Reported by chuck@ in PR kern/44500


 To generate a diff of this commit:
 cvs rdiff -u -r1.53 -r1.54 src/sys/arch/sh3/sh3/locore_subr.S
 cvs rdiff -u -r1.69 -r1.70 src/sys/arch/sh3/sh3/vm_machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: kern-bug-people->closed
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Tue, 01 Feb 2011 11:53:16 +0000
Responsible-Changed-Why:
I'll take care to cleanup and request pullups


Responsible-Changed-From-To: closed->martin
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Tue, 01 Feb 2011 11:54:14 +0000
Responsible-Changed-Why:
hi, my name is martin


From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/44500: 4.0 sa threaded apps hard hang netbsd-5 and HEAD
 kernels on some ports [cpu_setfunc() related]
Date: Tue, 1 Feb 2011 21:32:05 +0100

 On Mon, Jan 31, 2011 at 08:40:01PM +0000, chuck@ece.cmu.edu wrote:
 > [...]
 > Looking at HEAD, the current state of the tree is not uniform:
 > 
 > arch	cpu_setfunc calls       does it call lpw_startup?  when changed?
 > ------- ----------------------  ----------------------------------------
 > acorn26 lwp_trampoline		yes 
 > alpha	setfunc_trampoline	no (vm_machdep.1.100, 2009/06/01)
 > arm32	lwp_trampoline		yes
 > hppa	setfunc_trampoline	no (vm_machdep.c 1.36, 2009/06/03)
 > m68k	setfunc_trampoline	no (vm_machdep.c 1.28, 2009/05/30)
 > mips	setfunc_trampoline	no (vm_machdep.c 1.123, 2009/05/30)
 > powerpc	setfunc_trampoline	no (vm_machdep.c 1.77, 2009/06/07)
 > sh3	lwp_setfunc_trampoline	no (never called lpw_startup?)
 > sparc	lwp_setfunc_trampoline	no (vm_machdep.c 1.100, 2009/05/29)
 > sparc64	lwp_setfunc_trampoline	no (vm_machep.c 1.89, 2009/05/30)
 > x86	lwp_trampoline		yes
 > 
 > the "no" ports are likely to have problems with compat_sa binaries,
 > I think.

 I think I've seen this on a sparc64 kernel, building 4.0 sparc binary
 packages. I have to switch between 5.0 and 5.1 kernels depending on
 which package is built.
 BTW, sparc64 can still enter ddb ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: "Matt Thomas" <matt@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/44500 CVS commit: src/sys/arch/powerpc/powerpc
Date: Wed, 2 Feb 2011 09:02:40 +0000

 Module Name:	src
 Committed By:	matt
 Date:		Wed Feb  2 09:02:40 UTC 2011

 Modified Files:
 	src/sys/arch/powerpc/powerpc: vm_machdep.c

 Log Message:
 Always call cpu_lwp_bootstrap even in cpu_setfunc.  Let cpu_lwp_fork use
 cpu_setfunc instead of duplicating code.  Simplify stack setup.
 PR 44500


 To generate a diff of this commit:
 cvs rdiff -u -r1.81 -r1.82 src/sys/arch/powerpc/powerpc/vm_machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Nick Hudson" <skrll@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/44500 CVS commit: src/sys/arch
Date: Mon, 7 Feb 2011 12:19:36 +0000

 Module Name:	src
 Committed By:	skrll
 Date:		Mon Feb  7 12:19:35 UTC 2011

 Modified Files:
 	src/sys/arch/hp700/hp700: locore.S
 	src/sys/arch/hp700/include: cpu.h
 	src/sys/arch/hppa/hppa: vm_machdep.c

 Log Message:
 Fix PR/44500 for hppa.


 To generate a diff of this commit:
 cvs rdiff -u -r1.54 -r1.55 src/sys/arch/hp700/hp700/locore.S
 cvs rdiff -u -r1.65 -r1.66 src/sys/arch/hp700/include/cpu.h
 cvs rdiff -u -r1.46 -r1.47 src/sys/arch/hppa/hppa/vm_machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 06 Oct 2013 10:52:54 +0000
State-Changed-Why:
I think at this point where/if this issue isn't fixed it's a dead article
regardless.


>Unformatted:
