NetBSD Problem Report #16154
Received: (qmail 23485 invoked from network); 1 Apr 2002 13:29:27 -0000
Message-Id: <20020401132941.226631111A@www.netbsd.org>
Date: Mon, 1 Apr 2002 05:29:41 -0800 (PST)
From: manu@netbsd.org
Sender: nobody@netbsd.org
Reply-To: manu@netbsd.org
To: gnats-bugs@gnats.netbsd.org
Subject: any user can hang the machine by masking SIGSEGV and faulting
X-Send-Pr-Version: www-1.0
>Number: 16154
>Category: port-mips
>Synopsis: any user can hang the machine by masking SIGSEGV and faulting
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: port-mips-maintainer
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Apr 01 13:30:01 +0000 2002
>Closed-Date: Sat Jun 19 10:43:01 +0000 2004
>Last-Modified: Sat Jun 19 10:43:01 +0000 2004
>Originator: Emmanuel Dreyfus
>Release: NetBSD-current
>Organization:
The NetBSD Project
>Environment:
NetBSD plume 1.5ZC NetBSD 1.5ZC (IRIX3) #69: Mon Apr 1 13:28:17 CEST 2002 manu@plume:/cvs/src/sys/arch/sgimips/compile/IRIX3 sgimips
>Description:
When a program ignores SIGSEGV (SIG_IGN) and then triggers a page fault,
NetBSD/mips hangs. It is possible to drop into ddb and send a kill -9 to
the offending process; this restores the machine to a fully functional state.
>How-To-Repeat:
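#include <stdio.h>
#include <signal.h>

int main (void) {
	/* Address 0xc is not mapped in the process: any access faults. */
	char *p = (char *)0xc;

	/* Ignore SIGSEGV so that the fault cannot kill the process. */
	signal(SIGSEGV, SIG_IGN);
	printf("let's go\n");

	/* Faults; the ignored signal is discarded and the instruction restarts. */
	*p = *p + 1;
	printf("still alive?\n");	/* never reached */
	return 0;
}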
>Fix:
I have not yet fully pinned down the problem. However, here is the
information I gathered:
The normal behavior would be to loop on the fault: on *p access, we
get a page fault. There is no valid mapping at the requested address,
hence we attempt to send a SIGSEGV. It is blocked, so we return to
userland and restart the offending instruction. We fault again and we
loop here forever.
This should not hang the machine since on return to userland, we can
schedule another process to run. The problem on mips ports is that
the offending process is *always* re-scheduled to run.
More information: on the page fault, mips3_UserGenException is invoked.
From there, the code path is approximately as follows (determined by
instrumenting my kernel with printf calls):
mips3_UserGenException
trap
uvm_fault
uvmfault_lookup
trapsignal
psignal1
ast
preempt
mi_switch
...
mi_switch always selects the offending process to run again, thus hanging
the machine.
>Release-Note:
>Audit-Trail:
From: manu@netbsd.org (Emmanuel Dreyfus)
To: gnats-bugs@netbsd.org
Cc: uch@vnop.net (UCHIYAMA Yasushi),
nathanw@wasabisystems.com (Nathan J. Williams)
Subject: port-mips/16154
Date: Wed, 3 Apr 2002 07:58:54 +0200
More info about the problem:
- it occurs on an R5000; the kernel has options MIPS3 and options
MIPS3_L2CACHE_ABSENT
- the machine only hangs when the offending process is the only runnable
process. If there is a
while(1);
running in the background, then launching the offending process will not
hang the machine. If the while(1); is suspended or killed, we get an
immediate hang.
- when the machine is hung, schedcpu() is no longer called every second.
As soon as the offending process is killed from ddb, schedcpu() works
again. In ddb we can see that schedcpu is still in the callout structs
(show all callout), but for an unknown reason it does not run
(interrupts disabled?)
--
Emmanuel Dreyfus.
Sryvpvgngvbaf!
Ibhf irarm qr creqer ibger grzcf n qrpbqre har fvtangher fnaf vagrerg.
manu@netbsd.org
From: UCHIYAMA Yasushi <uch@vnop.net>
To: manu@netbsd.org
Cc: gnats-bugs@netbsd.org, nathanw@wasabisystems.com
Subject: Re: port-mips/16154
Date: Wed, 03 Apr 2002 18:04:53 +0900 (JST)
Does it reproduce on an older kernel,
e.g. ftp.netbsd.org/pub/NetBSD/arch/sgimips/netbsd.ip22 (1.5W)?
---
UCHIYAMA Yasushi
uch@vnop.net
From: manu@netbsd.org (Emmanuel Dreyfus)
To: uch@vnop.net (UCHIYAMA Yasushi)
Cc: gnats-bugs@netbsd.org, nathanw@wasabisystems.com
Subject: Re: port-mips/16154
Date: Wed, 3 Apr 2002 22:08:21 +0200
> Does it reproduce on the old kernel
> e.g. ftp.netbsd.org/pub/NetBSD/arch/sgimips/netbsd.ip22 (1.5W)?
Yes, it does.
--
Emmanuel Dreyfus.
Ugly one-liners -- http://gizmo.minet.net:8080/sh
manu@netbsd.org
From: manu@netbsd.org (Emmanuel Dreyfus)
To: stephenm@employees.org (Stephen Ma)
Cc: gnats-bugs@netbsd.org
Subject: Re: port-mips/16154
Date: Thu, 4 Apr 2002 22:06:12 +0200
By setting breakpoints in hardclock and softclock, I can tell that while
the machine is hung, hardclock() is called but softclock() is not.
Here is the code path for hardclock()
mips3_KernIntr -> cpu_intr -> ip22_intr -> hardclock
softclock() is never called while hung.
--
Emmanuel Dreyfus
manu@netbsd.org
From: manu@netbsd.org (Emmanuel Dreyfus)
To: gnats-bugs@netbsd.org
Cc: stephenm@employees.org (Stephen Ma), nathanw@wasabisystems.com,
uch@vnop.net (UCHIYAMA Yasushi)
Subject: Re: port-mips/16154
Date: Fri, 5 Apr 2002 22:17:33 +0200
More debugging:
While hung, hardclock is called but softclock is not, because the
MIPS3_CLKF_BASEPRI test in hardclock evaluates to zero. At that time, the
SR stored in the trapframe is 0xfc03.
For MIPS3, we have this:
#define MIPS3_CLKF_BASEPRI(framep) \
((~(framep)->sr & (MIPS_INT_MASK | MIPS_SR_INT_IE)) == 0)
MIPS_INT_MASK | MIPS_SR_INT_IE = 0xff00 | 0x01 = 0xff01
~SR = ~0xfc03 = 0x03fc
~SR & (MIPS_INT_MASK | MIPS_SR_INT_IE) = 0x0300
so hardclock does not call softclock(); it goes into softintr_schedule instead.
When another process is eating some CPU, we don't hang. In this
case, SR is sometimes 0xff03 and sometimes 0xfc03. When it is 0xff03, we
go into softclock().
Does this ring a bell for anyone?
--
Emmanuel Dreyfus.
NetBSD, because I'm worth it.
manu@netbsd.org
From: stephenm@employees.org (Stephen Ma)
To: manu@netbsd.org (Emmanuel Dreyfus)
Cc: gnats-bugs@netbsd.org, stephenm@employees.org (Stephen Ma),
nathanw@wasabisystems.com, uch@vnop.net (UCHIYAMA Yasushi)
Subject: Re: port-mips/16154
Date: Fri, 5 Apr 2002 20:51:11 -0800
>>>>> "manu" == Emmanuel Dreyfus <manu@netbsd.org> writes:
manu> More debugging: While hang, hardclock is called, but not
manu> softclock, because the MIPS3_CLKF_BASEPRI test in hardclock
manu> turns into a zero. At that time, SR stored on the trapframe is
manu> 0xfc03.
manu> For MIPS3, we have this: #define MIPS3_CLKF_BASEPRI(framep) \
manu> When there is another process eating some CPU, we don't hang. In
manu> this case, sometime SR= 0xff03, and sometime 0xfc03. When it's
manu> 0xff03, we go into softclock().
manu> Does this speaks to someone?
0xfc03 means all hard interrupts are enabled but no soft interrupts,
whereas 0xff03 means both hard and soft interrupts are enabled. There's
code at the beginning of trap() that enables just the hard interrupts.
Does anyone know why soft interrupts are disabled for trap()?
- S
From: UCHIYAMA Yasushi <uch@vnop.net>
To: stephenm@employees.org
Cc: manu@netbsd.org, gnats-bugs@netbsd.org, nathanw@wasabisystems.com
Subject: Re: port-mips/16154
Date: Sun, 07 Apr 2002 01:18:51 +0900 (JST)
| Does anyone know why soft interrupts are disabled for trap()?
From Rev. 1.1 comments.
/*
* Enable hardware interrupts if they were on before.
* We only respond to software interrupts when returning to user mode.
*/
if (statusReg & MACH_SR_INT_ENA_PREV)
splx((statusReg & MACH_HARD_INT_MASK) | MACH_SR_INT_ENA_CUR);
I've updated the kernel, userland and toolchain to -current; since then,
my R4000 Indy frequently and randomly gets segmentation faults. I can't
figure out the cause of this problem yet.
---
UCHIYAMA Yasushi
uch@vnop.net
From: stephenm@employees.org (Stephen Ma)
To: UCHIYAMA Yasushi <uch@vnop.net>
Cc: stephenm@employees.org, manu@netbsd.org, gnats-bugs@netbsd.org,
nathanw@wasabisystems.com
Subject: Re: port-mips/16154
Date: Sun, 7 Apr 2002 11:08:22 -0700
> | Does anyone know why soft interrupts are disabled for trap()?
> From Rev. 1.1 comments.
> /*
> * Enable hardware interrupts if they were on before.
> * We only respond to software interrupts when returning to user mode.
> */
From the "more wild guesses" drawer, I would guess that this is
causing the problem.
Let's say that the only runnable process is the segfaulting one. Since
this process is in a permanent segfault loop, the CPU is never in
user mode: it is always processing the segfault exception. Since
trap() doesn't enable soft interrupts, no soft interrupts can get
processed until after the CPU returns from exception processing.
On return to user-mode, the CPU hits the segfaulting instruction
again, and because (on the R5000, at least) this happens in the same
pipeline stage as interrupts, the segfault (specifically, the TLB read
miss exception) takes precedence, so the CPU doesn't get a chance to
handle any pending soft interrupts.
Without soft interrupts, softclock() never gets called. Also, the
serial ports send rx data up to the kernel via soft interrupts,
so you're unable to wake any processes waiting for serial
input. The same applies for processes waiting for network data.
The fix would be to find some location in the trap() call flow where
it's safe to process pending soft interrupts. Possibly this could be
done after trap() returns to UserGenException(). There could be other
safe places to let soft interrupts be processed, but for most of the
other NetBSD architectures, it's usually done just before a return
from kernel to user mode.
Does this make sense?
- S
From: stephenm@employees.org (Stephen Ma)
To: manu@netbsd.org (Emmanuel Dreyfus)
Cc: gnats-bugs@netbsd.org
Subject: port-mips/16154
Date: Sat, 16 Nov 2002 00:08:46 -0800
Here's a quick attempt at a fix. I think this is a safe point to
re-enable interrupts. It compiles for the SGI Indy, but I don't have a
MIPS box on which to test this patch.
$NetBSD: mipsX_subr.S,v 1.10 2002/11/12 14:00:41 nisimura Exp $
- S
--- mipsX_subr.S.orig Fri Nov 15 17:40:55 2002
+++ mipsX_subr.S Sun Nov 17 13:57:47 2002
@@ -737,6 +737,20 @@
COP0_SYNC
jal _C_LABEL(trap)
sw a3, CALLFRAME_SIZ-4(sp) # for debugging
+#ifndef IPL_ICU_MASK
+/*
+ * Allow any pending soft interrupts to run. This is needed in the case
+ * of an exception occurring immediately after the return from exception
+ * which would prevent the soft interrupt triggering.
+ */
+ mfc0 t2, MIPS_COP_0_STATUS
+ REG_L t0, CALLFRAME_SIZ + FRAME_SR(sp)
+ and t0, t0, MIPS_INT_MASK
+ DYNAMIC_STATUS_MASK_TOUSER(t0, t1) # machine dependent masking
+ or t0, t0, t2
+ mtc0 t0, MIPS_COP_0_STATUS
+ COP0_SYNC
+#endif
/*
* Check pending asynchronous traps.
*/
State-Changed-From-To: open->closed
State-Changed-By: manu
State-Changed-When: Sat Jun 19 10:42:42 UTC 2004
State-Changed-Why:
the problem cannot be observed anymore
>Unformatted:
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.