NetBSD Problem Report #48171
From martin@duskware.de Sun Sep 1 11:26:05 2013
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id 1D89171FE5
for <gnats-bugs@gnats.NetBSD.org>; Sun, 1 Sep 2013 11:26:05 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@gnats.NetBSD.org
Subject: -current does not boot
X-Send-Pr-Version: 3.95
>Number: 48171
>Category: port-mac68k
>Synopsis: -current does not boot
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: martin
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Sep 01 11:30:00 +0000 2013
>Closed-Date: Thu Oct 30 21:33:42 +0000 2014
>Last-Modified: Thu Oct 30 21:33:42 +0000 2014
>Originator: Martin Husemann
>Release: NetBSD 6.99.23
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: mac68k
Architecture: m68k
Machine: mac68k
>Description:
A -current kernel panics during boot in:
panic(1bfee6,3c,212d00,70e09c,70e504) + c
trap(2a1e1c,8,505,1f212364) + 154
callout_halt(1f212354,0,212280,212354,3c) + 18
sleepq_block(3c,0,212280,212280,1c11fa,21318c,70e09c,70e504) + c8
kpause(1c11fa,0,3c,0) + a6
uvm_scheduler(70efb4,70e450,70efb4,204f0,15ef50) + 6c
main(1,103edc,650000,0,1000) + 7a8
within the first few seconds after / has been mounted. The address of
curlwp->l_timeout_ch seems to be bogus, as the first argument to
callout_halt() is not accessible in ddb.
>How-To-Repeat:
Just try to boot.
>Fix:
No idea yet
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: port-mac68k-maintainer->martin
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Sun, 01 Sep 2013 18:53:04 +0000
Responsible-Changed-Why:
take
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-mac68k/48171: -current does not boot
Date: Sun, 22 Jun 2014 16:38:58 +0200
I initially believed this to be a gcc problem, but I am not sure anymore.
It still happens with gcc 4.8.3 and the crash is here:
root file system type: ffs
uvm_fault(0x227980, 0x93208000, 0x1) -> 0xe
type 8, code [mmu,,ssw]: 505
trap type 8, code = 0x505, v = 0x93208564
kernel program counter = 0x9bb90
kernel: MMU fault trap
pid = 0, lid = 1, pc = 0009BB90, ps = 2008, sfc = 1, dfc = 1
Registers:
0 1 2 3 4 5 6 7
dreg: 00000001 0000000C 04CFCC9D 93208554 0070909C 0070966C 001C6D82 00000000
areg: 00000000 002087D0 93208554 00081B54 00081B54 00081C94 0029BEB0 FFFFCFFC
[..]
db> bt
cpu_Debugger(?)
db_panic(8,2000,0,1096d6,29bd4c) + 6
vpanic(1c5b56,29bd58,29bde4,132606,1c5b56) + 122
panic(1c5b56,4cfcc9d,93208554,70909c,70966c) + c
trap(29bdfc,8,505,93208564) + 6e4
callout_halt(93208554,0,208480,208554,3c) + 18
sleepq_block(3c,0,208480,208480,1c6d82,209364,70909c,70966c) + b2
kpause(1c6d82,0,3c,0) + cc
uvm_scheduler(709fb4,7095ac,709fb4,0,53a6ed05) + 6c
main(1,1049a6,64a000,0,1000) + 79c
The issue comes from inside sleepq_block. We are running lwp0 here, and
go into this code:
0x90c8c <sleepq_block+168>: jsr 0x93650 <mi_switch>
0x90c92 <sleepq_block+174>: clrl %sp@-
0x90c94 <sleepq_block+176>: movel %d3,%sp@-
0x90c96 <sleepq_block+178>: jsr 0x9bb78 <callout_halt>
and the C code is:
    if (timo) {
        callout_schedule(&l->l_timeout_ch, timo);
    }
    mi_switch(l);

    /* The LWP and sleep queue are now unlocked. */
    if (timo) {
        /*
         * Even if the callout appears to have fired, we need to
         * stop it in order to synchronise with other CPUs.
         */
        if (callout_halt(&l->l_timeout_ch, NULL))
            error = EWOULDBLOCK;
    }
Before calling mi_switch(l), gcc has placed &l->l_timeout_ch in register
%d3 and invoked callout_schedule with that argument. At this time, the
address was fine.
Now after mi_switch, %d3 is different but gcc does not reload it and uses
the altered value for the callout_halt() call - and we crash there.
As %d3 is callee-saved (and mi_switch code has all the proper push/pop),
this must mean that something overwrites lwp0's stack. Other m68k machines
are reported to work (and this all started before the gcc transition),
so the "something" overwriting the stack *might* be mac68k specific.
Martin
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-mac68k/48171: -current does not boot
Date: Wed, 25 Jun 2014 14:27:58 +0200
Probably the same as #48293
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-mac68k/48171: -current does not boot
Date: Wed, 25 Jun 2014 23:47:38 +0200
Assuming this would be a stack smashing, I enabled makeoptions USE_SSP=yes
for the kernel. This makes it work.
Same for options DIAGNOSTIC.
Ideas welcome, I'm a bit puzzled.
Martin
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-mac68k/48171: -current does not boot
Date: Thu, 26 Jun 2014 06:12:29 +0000
On Wed, Jun 25, 2014 at 09:50:01PM +0000, Martin Husemann wrote:
> Assuming this would be a stack smashing, I enabled makeoptions USE_SSP=yes
> for the kernel. This makes it work.
>
> Same for options DIAGNOSTIC.
>
> Ideas welcome, I'm a bit puzzled.
...try turning those on (or off) only for the module that contains
mi_switch (kern_synch.c). If there's a compiler problem, that's
probably where it's happening.
--
David A. Holland
dholland@netbsd.org
State-Changed-From-To: open->closed
State-Changed-By: martin@NetBSD.org
State-Changed-When: Thu, 30 Oct 2014 21:33:42 +0000
State-Changed-Why:
This was fixed by the latest gcc 4.8 import
>Unformatted: