NetBSD Problem Report #48171

From martin@duskware.de  Sun Sep  1 11:26:05 2013
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 1D89171FE5
	for <gnats-bugs@gnats.NetBSD.org>; Sun,  1 Sep 2013 11:26:05 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@gnats.NetBSD.org
Subject: -current does not boot
X-Send-Pr-Version: 3.95

>Number:         48171
>Category:       port-mac68k
>Synopsis:       -current does not boot
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    martin
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Sep 01 11:30:00 +0000 2013
>Closed-Date:    Thu Oct 30 21:33:42 +0000 2014
>Last-Modified:  Thu Oct 30 21:33:42 +0000 2014
>Originator:     Martin Husemann
>Release:        NetBSD 6.99.23
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: mac68k
Architecture: m68k
Machine: mac68k
>Description:

Trying to boot a -current kernel dies in:

panic(1bfee6,3c,212d00,70e09c,70e504) + c       
trap(2a1e1c,8,505,1f212364) + 154        
callout_halt(1f212354,0,212280,212354,3c) + 18
sleepq_block(3c,0,212280,212280,1c11fa,21318c,70e09c,70e504) + c8
kpause(1c11fa,0,3c,0) + a6                                       
uvm_scheduler(70efb4,70e450,70efb4,204f0,15ef50) + 6c
main(1,103edc,650000,0,1000) + 7a8                   

within the first few seconds after / has been mounted. The address of
curlwp->l_timeout_ch seems to be bogus, as the first argument to
callout_halt() is not accessible in ddb.

>How-To-Repeat:
Just try to boot.

>Fix:
No idea yet

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: port-mac68k-maintainer->martin
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Sun, 01 Sep 2013 18:53:04 +0000
Responsible-Changed-Why:
take


From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-mac68k/48171: -current does not boot
Date: Sun, 22 Jun 2014 16:38:58 +0200

 I initially believed this to be a gcc problem, but I am not sure anymore.
 It still happens with gcc 4.8.3 and the crash is here:

 root file system type: ffs
 uvm_fault(0x227980, 0x93208000, 0x1) -> 0xe
   type 8, code [mmu,,ssw]: 505             
 trap type 8, code = 0x505, v = 0x93208564
 kernel program counter = 0x9bb90         
 kernel: MMU fault trap          
 pid = 0, lid = 1, pc = 0009BB90, ps = 2008, sfc = 1, dfc = 1
 Registers:                                                  
              0        1        2        3        4        5        6        7
 dreg: 00000001 0000000C 04CFCC9D 93208554 0070909C 0070966C 001C6D82 00000000
 areg: 00000000 002087D0 93208554 00081B54 00081B54 00081C94 0029BEB0 FFFFCFFC
 [..]
 db> bt                                                                    
 cpu_Debugger(?)
 db_panic(8,2000,0,1096d6,29bd4c) + 6
 vpanic(1c5b56,29bd58,29bde4,132606,1c5b56) + 122
 panic(1c5b56,4cfcc9d,93208554,70909c,70966c) + c
 trap(29bdfc,8,505,93208564) + 6e4               
 callout_halt(93208554,0,208480,208554,3c) + 18
 sleepq_block(3c,0,208480,208480,1c6d82,209364,70909c,70966c) + b2
 kpause(1c6d82,0,3c,0) + cc                                       
 uvm_scheduler(709fb4,7095ac,709fb4,0,53a6ed05) + 6c
 main(1,1049a6,64a000,0,1000) + 79c                 

 The issue comes from inside sleepq_block. We are running lwp0 here, and
 go into this code:

    0x90c8c <sleepq_block+168>:  jsr 0x93650 <mi_switch>
    0x90c92 <sleepq_block+174>:  clrl %sp@-
    0x90c94 <sleepq_block+176>:  movel %d3,%sp@-
    0x90c96 <sleepq_block+178>:  jsr 0x9bb78 <callout_halt>

 and the C code is:

         if (timo) {
                 callout_schedule(&l->l_timeout_ch, timo);
         }
         mi_switch(l);

         /* The LWP and sleep queue are now unlocked. */
         if (timo) {
                 /*
                  * Even if the callout appears to have fired, we need to
                  * stop it in order to synchronise with other CPUs.
                  */
                 if (callout_halt(&l->l_timeout_ch, NULL))
                         error = EWOULDBLOCK;
         }

 Before calling mi_switch(l), gcc has placed &l->l_timeout_ch in register
 %d3 and invoked callout_schedule with that argument. At this time, the
 address was fine.

 Now after mi_switch, %d3 is different but gcc does not reload it and uses
 the altered value for the callout_halt() call - and we crash there.
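
The values in the two backtraces can be compared directly. The sketch below only does arithmetic on numbers quoted in this report: the bad first argument to callout_halt(), and two neighbouring stack words that ddb printed. The helper names and the struct are invented for illustration, and treating the third word as the lwp pointer is an assumption. In both traces the bad pointer matches a nearby stack word in its low 24 bits and differs only in the top byte. That may hint at a partial overwrite of the saved register rather than a random value, but this is an inference, not something the reporter states.

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low 24 bits of a 32-bit value. */
static uint32_t low24(uint32_t v)
{
	return v & 0x00ffffffu;
}

/* Values copied from the two backtraces in this report. */
struct sample {
	uint32_t bad_arg;    /* first argument shown for callout_halt() */
	uint32_t stack_word; /* fourth word ddb printed for the same frame */
	uint32_t lwp_word;   /* third word; assumed here to be the lwp pointer */
};

static const struct sample samples[] = {
	{ 0x1f212354u, 0x00212354u, 0x00212280u }, /* trace from 2013 */
	{ 0x93208554u, 0x00208554u, 0x00208480u }, /* trace from 2014 */
};

/* Returns 1 if the bad pointer differs from the stack word only in its top byte. */
static int only_top_byte_differs(const struct sample *s)
{
	return s->bad_arg != s->stack_word &&
	    low24(s->bad_arg) == low24(s->stack_word);
}

/* Distance from the assumed lwp pointer to the plausible callout address. */
static uint32_t callout_offset(const struct sample *s)
{
	return s->stack_word - s->lwp_word;
}
```

Both samples satisfy only_top_byte_differs(), and callout_offset() gives the same distance for each, so the two crashes appear to share one corruption pattern.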

 As %d3 is callee-saved (and mi_switch code has all the proper push/pop),
 this must mean that something overwrites lwp0's stack. Other m68k machines
 are reported to work (and this all started before the gcc transition),
 so the "something" overwriting the stack *might* be mac68k specific.
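
The callee-saved argument above can be mimicked in ordinary C. This is a toy model, not kernel code: fake_d3, fake_stack, callee() and the clobber flag are all invented for illustration. A callee saves a "register" to a stack slot on entry and restores it on exit. If anything writes to that slot in between, the caller gets back a different value even though the save and restore code is correct.

```c
#include <stdint.h>

static uint32_t fake_d3;        /* stands in for the callee-saved %d3 */
static uint32_t fake_stack[8];  /* stands in for lwp0's kernel stack */

/*
 * Toy callee with a correct prologue and epilogue.
 * If clobber is non-zero, a stray write hits the saved slot
 * between the save and the restore.
 */
static void callee(int clobber)
{
	fake_stack[0] = fake_d3;              /* prologue: save %d3 */
	fake_d3 = 0;                          /* body uses the register freely */
	if (clobber)
		fake_stack[0] |= 0x93000000u; /* stray write (illustrative pattern) */
	fake_d3 = fake_stack[0];              /* epilogue: restore %d3 */
}

/* The caller keeps an address in d3 across the call, as gcc did in sleepq_block(). */
static uint32_t value_after_call(uint32_t addr, int clobber)
{
	fake_d3 = addr;
	callee(clobber);
	return fake_d3;
}
```

With clobber set to 0 the caller sees its original address again. With clobber set to 1 it sees a value that is no longer a valid address, which is the situation callout_halt() ran into.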

 Martin

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-mac68k/48171: -current does not boot
Date: Wed, 25 Jun 2014 14:27:58 +0200

 Probably the same as #48293

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-mac68k/48171: -current does not boot
Date: Wed, 25 Jun 2014 23:47:38 +0200

 Assuming this was stack smashing, I enabled makeoptions USE_SSP=yes
 for the kernel. This makes it work.

 Same for options DIAGNOSTIC.
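
For readers unfamiliar with the option: stack-smashing protection places a guard value next to a function's local buffers and checks it before the function returns. The sketch below imitates that check by hand in portable C. struct guarded_frame, GUARD_VALUE and write_and_check() are illustrative names only; real SSP checks are emitted by the compiler, not written like this. Enabling either option also changes the generated code and the stack layout, so the crash disappearing does not by itself confirm or rule out an overwrite.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_VALUE 0xdeadc0deu   /* arbitrary illustrative guard value */

struct guarded_frame {
	unsigned char buf[16];    /* local buffer */
	uint32_t guard;           /* guard word placed after the buffer */
};

/*
 * Write n bytes starting at the beginning of the frame, then check the guard.
 * Returns 1 if the guard survived, 0 if it was overwritten.
 * n is clamped to the frame size so the write never leaves the struct.
 */
static int write_and_check(size_t n)
{
	struct guarded_frame f;
	unsigned char *raw = (unsigned char *)&f;

	f.guard = GUARD_VALUE;
	if (n > sizeof(f))
		n = sizeof(f);
	memset(raw, 0x41, n);     /* may run past buf into guard */
	return f.guard == GUARD_VALUE;
}
```

A write that stays within the 16-byte buffer leaves the guard intact. Any longer write changes at least one byte of the guard, and the check reports it.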

 Ideas welcome, I'm a bit puzzled.

 Martin

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-mac68k/48171: -current does not boot
Date: Thu, 26 Jun 2014 06:12:29 +0000

 On Wed, Jun 25, 2014 at 09:50:01PM +0000, Martin Husemann wrote:
  >  Assuming this would be a stack smashing, I enabled makeoptions USE_SSP=yes
  >  for the kernel. This makes it work.
  >  
  >  Same for options DIAGNOSTIC.
  >  
  >  Ideas welcome, I'm a bit puzzled.

 ...try turning those on (or off) only for the module that contains
 mi_switch (kern_synch.c). If there's a compiler problem, that's
 probably where it's happening.

 -- 
 David A. Holland
 dholland@netbsd.org

State-Changed-From-To: open->closed
State-Changed-By: martin@NetBSD.org
State-Changed-When: Thu, 30 Oct 2014 21:33:42 +0000
State-Changed-Why:
This was fixed by the latest gcc 4.8 import


>Unformatted:

Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.