NetBSD Problem Report #37611

From martin@duskware.de  Tue Dec 25 12:41:06 2007
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 2B6FC63B852
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 25 Dec 2007 12:41:06 +0000 (UTC)
Message-Id: <20071225124103.B1C2533AA2@mail.duskware.de>
Date: Tue, 25 Dec 2007 13:41:03 +0100 (CET)
From: martin@duskware.de
Reply-To: martin@duskware.de
To: gnats-bugs@NetBSD.org
Subject: macs with cuda adb don't boot -current
X-Send-Pr-Version: 3.95

>Number:         37611
>Category:       port-mac68k
>Synopsis:       macs with cuda adb don't boot -current
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    nat
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Dec 25 12:45:01 +0000 2007
>Closed-Date:    Mon Nov 20 10:36:55 +0000 2023
>Last-Modified:  Mon Nov 20 10:36:55 +0000 2023
>Originator:     martin@duskware.de
>Release:        NetBSD 4.99.45
>Organization:
>Environment:
System: NetBSD mac-beth.duskware.de 4.99.45 NetBSD 4.99.45 (MAC-BETH) #14: Tue Dec 25 13:12:03 CET 2007 martin@night-porter.duskware.de:/usr/src/sys/arch/mac68k/compile/MAC-BETH mac68k
Architecture: m68k
Machine: mac68k
>Description:

Trying to boot -current hangs in adb_read_date_time().
I suspect it hangs in this loop:

        case ADB_HW_IISI:
                output[0] = 0x02;       /* 2 byte message */
                output[1] = 0x01;       /* to pram/rtc device */
                output[2] = 0x03;       /* read date/time */
                result = send_adb_IIsi((u_char *)output, (u_char *)output,
                    (void *)adb_op_comprout, __UNVOLATILE(&flag), (int)0);
                if (result != 0)        /* exit if not sent */
                        return -1;

--->>>          while (0 == flag)       /* wait for result */
                        ;

                *curtime = (long)(*(long *)(output + 1));

but either ddb is lying to me or I'm overlooking something, as it says the
hang is a bit earlier (which does not make sense).

>How-To-Repeat:
Just boot -current on a machine with cuda?

>Fix:
The patch below avoids the problem, but there is more to cleanup.

Index: adb_direct.c
===================================================================
RCS file: /cvsroot/src/sys/arch/mac68k/dev/adb_direct.c,v
retrieving revision 1.60
diff -c -u -p -r1.60 adb_direct.c
cvs diff: conflicting specifications of output style
--- adb_direct.c        3 Dec 2007 15:33:52 -0000       1.60
+++ adb_direct.c        25 Dec 2007 12:33:22 -0000
@@ -2720,6 +2720,8 @@ adb_read_date_time(unsigned long *curtim
        int result;
        volatile int flag = 0;

+       return -1;
+
        switch (adbHardware) {
        case ADB_HW_II:
                return -1;

First using "volatile int flag" and then passing __UNVOLATILE(&flag) around
looks like a poor man's condvar to me. Second the code plays tricks behind
the compilers back:

void 
adb_op_comprout(void)
{
        __asm("movw     #1,%a2@                 | update flag value");
}

(this is the code that sets flag to 1)

>Release-Note:

>Audit-Trail:
From: Scott Reynolds <scottr@NetBSD.org>
To: gnats-bugs@NetBSD.org,
 Martin Husemann <martin@duskware.de>
Cc: 
Subject: Re: port-mac68k/37611: macs with cuda adb don't boot -current
Date: Sun, 30 Mar 2008 21:18:44 -0500

 I've looked at this several times with a varying degree of  
 thoroughness, but haven't come with anything conclusive.

 While there are certainly similarities to a condvar or a semaphore,  
 this is a somewhat unique situation in that the ADB completion code is  
 shared by both the MRG (ROM) and direct drivers. Further, the flag is  
 set in an interrupt context, either directly or indirectly, so certain  
 accommodations must be made.

 Simply put, it shouldn't be an issue to pass the address with  
 __UNVOLATILE since it seems to only be dereferenced in compiled code  
 within the function where it's declared volatile.

 This doesn't answer the question as to why it's broken; there could  
 certainly be something that I'm missing. I am interested in knowing  
 which machine you're having trouble with, though.

 --scott

From: Martin Husemann <martin@duskware.de>
To: Scott Reynolds <scottr@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-mac68k/37611: macs with cuda adb don't boot -current
Date: Mon, 31 Mar 2008 17:29:18 +0200

 On Sun, Mar 30, 2008 at 09:18:44PM -0500, Scott Reynolds wrote:
 > This doesn't answer the question as to why it's broken; there could  
 > certainly be something that I'm missing. I am interested in knowing  
 > which machine you're having trouble with, though.

 I have no idea how ROM interoperates here, so I am probably missing big
 parts of the picture.

 Could you explain how adb_op_comprout() is supposed to work? AFAICT it changes
 a random word on the stack, probably something where previous versions of the
 compiler used to store something important. I suppose it should modify 
 that "something" via a global pointer instead.

 Martin

From: Scott Reynolds <scottr@NetBSD.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-mac68k/37611: macs with cuda adb don't boot -current
Date: Tue, 1 Apr 2008 00:44:22 -0500

 On Mar 31, 2008, at 10:29 AM, Martin Husemann wrote:

 > On Sun, Mar 30, 2008 at 09:18:44PM -0500, Scott Reynolds wrote:
 >> This doesn't answer the question as to why it's broken; there could
 >> certainly be something that I'm missing. I am interested in knowing
 >> which machine you're having trouble with, though.
 >
 > I have no idea how ROM interoperates here, so I am probably missing  
 > big
 > parts of the picture.
 >
 > Could you explain how adb_op_comprout() is supposed to work? AFAICT  
 > it changes
 > a random word on the stack, probably something where previous  
 > versions of the
 > compiler used to store something important. I suppose it should modify
 > that "something" via a global pointer instead.

 The completion routine is called at interrupt time with the a2  
 register set to the completion data buffer. This data buffer is  
 registered on a per-call basis to the various "send" functions. You  
 will find that most of the time the completion data buffer is actually  
 associated with the specific ADB command sent (it's a part of struct  
 adbCommand); this allows multiple asynchronous commands to be sent,  
 e.g. one to the keyboard and one to the mouse, which can respond with  
 completion independently of each other.

 There is another case in adb_op_sync() similar to the one you see in  
 adb_read_date_time(). The adb_op_sync() version times out, though. I  
 suspect that's what is needed here, especially if the TOD clock is  
 malfunctioning somehow, or doesn't exist.

From: Martin Husemann <martin@duskware.de>
To: Scott Reynolds <scottr@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-mac68k/37611: macs with cuda adb don't boot -current
Date: Tue, 1 Apr 2008 11:59:04 +0200

 On Tue, Apr 01, 2008 at 12:44:22AM -0500, Scott Reynolds wrote:
 > The completion routine is called at interrupt time with the a2  
 > register set to the completion data buffer.

 I see the call now. I'll instrument a bit and see where the compData
 pointer gets lost.

 Is there any good reason why comprout does not use standard C calling
 conventions?

 Martin

From: Martin Husemann <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/37611 CVS commit: src/sys/arch/mac68k/dev
Date: Tue,  1 Apr 2008 12:02:52 +0000 (UTC)

 Module Name:	src
 Committed By:	martin
 Date:		Tue Apr  1 12:02:52 UTC 2008

 Modified Files:
 	src/sys/arch/mac68k/dev: adb_direct.c

 Log Message:
 Add timeout to busy loops waiting for ADB command completition.
 Turns the hard hang in PR port-mac68k/37611 into a warning.


 To generate a diff of this commit:
 cvs rdiff -r1.60 -r1.61 src/sys/arch/mac68k/dev/adb_direct.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Martin Husemann <martin@duskware.de>
To: Scott Reynolds <scottr@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-mac68k/37611: macs with cuda adb don't boot -current
Date: Tue, 1 Apr 2008 14:13:09 +0200

 On Tue, Apr 01, 2008 at 12:44:22AM -0500, Scott Reynolds wrote:
 > The completion routine is called at interrupt time with the a2  
 > register set to the completion data buffer.

 See below for why this does not currently work. Minor nit: shouldn't
 it do a

 	movl #1,%a2@

 instead? We pass an int pointer, and after the command is done the flag
 value is 0x00010000 ;-)

 > There is another case in adb_op_sync() similar to the one you see in  
 > adb_read_date_time(). The adb_op_sync() version times out, though.

 Ok, I copied over the timeout code, and now it only warns loud if something
 goes wrong, but does not hang the whole kernel any more. Way more conveninent
 while debugging the root cause.

 I traced where this all fails: we end up in adb_pass_up(), comprout and 
 compdata are still correct. We then defer to the softint handler - but since
 the caller is busy waiting, we have no chance to ever run that handler before
 we timeout. Setting adb_polling=1 during this call makes it work early in
 boot, but unfortunately later hangs the whole kernel when running ntpdate
 (can't even break into ddb) - so I guess that does something very bad to the
 adb bus (which I otherwise do not use, I have serial console).

 Maybe adb commands should have a per command flag that indicates we are 
 polling for the result, in addition to the global flag? Or I just did
 something wrong when playing with the global flag.


 Martin

From: Scott Reynolds <scottr@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/37611 CVS commit: src/sys/arch/mac68k/dev
Date: Thu,  3 Apr 2008 05:03:23 +0000 (UTC)

 Module Name:	src
 Committed By:	scottr
 Date:		Thu Apr  3 05:03:23 UTC 2008

 Modified Files:
 	src/sys/arch/mac68k/dev: adb.c adb_direct.c adbvar.h

 Log Message:
 Factor out ADB spin-wait timeout loop, and use it for synchronous
 operations to access the PRAM RTC, etc. in the IIsi and Cuda cases.
 Thanks to Martin Husemann for pointing out the flaw.

 This is a slightly more thorough workaround for the issue Martin found
 in PR 37611, as it affected more than just adb_read_date_time().


 To generate a diff of this commit:
 cvs rdiff -r1.51 -r1.52 src/sys/arch/mac68k/dev/adb.c
 cvs rdiff -r1.61 -r1.62 src/sys/arch/mac68k/dev/adb_direct.c
 cvs rdiff -r1.24 -r1.25 src/sys/arch/mac68k/dev/adbvar.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: port-mac68k-maintainer->nat
Responsible-Changed-By: nat@NetBSD.org
Responsible-Changed-When: Sun, 12 Nov 2023 09:00:44 +0000
Responsible-Changed-Why:
Take.


State-Changed-From-To: open->feedback
State-Changed-By: nat@NetBSD.org
State-Changed-When: Sun, 12 Nov 2023 09:00:44 +0000
State-Changed-Why:
Martin is this ok to close or is it still an issue?


State-Changed-From-To: feedback->closed
State-Changed-By: martin@NetBSD.org
State-Changed-When: Mon, 20 Nov 2023 10:36:55 +0000
State-Changed-Why:
The commit fixed it, thanks!
(do I now get a price for longes bug in feedback state?)


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: gnats-precook-prs,v 1.4 2018/12/21 14:20:20 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.