NetBSD Problem Report #16154

Received: (qmail 23485 invoked from network); 1 Apr 2002 13:29:27 -0000
Message-Id: <20020401132941.226631111A@www.netbsd.org>
Date: Mon,  1 Apr 2002 05:29:41 -0800 (PST)
From: manu@netbsd.org
Sender: nobody@netbsd.org
Reply-To: manu@netbsd.org
To: gnats-bugs@gnats.netbsd.org
Subject: any user can hang the machine by masking SIGSEGV and faulting
X-Send-Pr-Version: www-1.0

>Number:         16154
>Category:       port-mips
>Synopsis:       any user can hang the machine by masking SIGSEGV and faulting
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    port-mips-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 01 13:30:01 +0000 2002
>Closed-Date:    Sat Jun 19 10:43:01 +0000 2004
>Last-Modified:  Sat Jun 19 10:43:01 +0000 2004
>Originator:     Emmanuel Dreyfus
>Release:        NetBSD-current
>Organization:
The NetBSD Project
>Environment:
NetBSD plume 1.5ZC NetBSD 1.5ZC (IRIX3) #69: Mon Apr  1 13:28:17 CEST 2002     manu@plume:/cvs/src/sys/arch/sgimips/compile/IRIX3 sgimips
>Description:
When a program masks SIGSEGV and then takes a page fault, NetBSD/mips hangs.
It is possible to drop into ddb and send a kill -9 to the offending
process; this restores the machine to a fully functional state.
>How-To-Repeat:
#include <stdio.h>
#include <signal.h>

int main (void) {
        char *p = (char *)0xc;

        signal(SIGSEGV, SIG_IGN);

        printf("let's go\n");
        *p = *p + 1;
        printf("still alive?\n");
        return 0;
}
>Fix:
I have not yet fully identified the problem. However, here is the
information I have gathered:

The normal behavior would be to loop on the fault: on *p access, we
get a page fault. There is no valid mapping at the requested address, 
hence we attempt to send a SIGSEGV. It is blocked, so we return to 
userland and restart the offending instruction. We fault again and we 
loop here forever.

This should not hang the machine since on return to userland, we can
schedule another process to run. The problem on mips ports is that 
the offending process is *always* re-scheduled to run.

More information: on the page fault, mips3_UserGenException is invoked.
From there, the code path is approximately as follows (determined by
bloating my kernel with printf's):
mips3_UserGenException
  trap
    uvm_fault
      uvmfault_lookup
    trapsignal
      psignal1
  ast
    preempt
      mi_switch
        ...

mi_switch always selects the offending process to run again, thus hanging
the machine. 
>Release-Note:
>Audit-Trail:

From: manu@netbsd.org (Emmanuel Dreyfus)
To: gnats-bugs@netbsd.org
Cc: uch@vnop.net (UCHIYAMA Yasushi),
 nathanw@wasabisystems.com (Nathan J. Williams)
Subject: port-mips/16154
Date: Wed, 3 Apr 2002 07:58:54 +0200

 More info about the problem:

 - it occurs on a R5000, kernel has options MIPS3 and options
 MIPS3_L2CACHE_ABSENT

 - the machine only hangs when the offending process is the only runnable
 process. If there is a 
 while(1); 
 running in the background, then launching the offending process will not
 hang the machine. If the while(1); is suspended or killed, we get an
 immediate hang.

 - when the machine is hung, schedcpu() is no longer called every second.
 As soon as the offending process is killed from ddb, schedcpu() works
 again. In ddb we can see that schedcpu is still in the callout structs
 (show all callout), but for an unknown reason it does not run
 (interrupts disabled?)

 -- 
 Emmanuel Dreyfus.
 Sryvpvgngvbaf!
 Ibhf irarm qr creqer ibger grzcf n qrpbqre har fvtangher fnaf vagrerg.
 manu@netbsd.org

From: UCHIYAMA Yasushi <uch@vnop.net>
To: manu@netbsd.org
Cc: gnats-bugs@netbsd.org, nathanw@wasabisystems.com
Subject: Re: port-mips/16154
Date: Wed, 03 Apr 2002 18:04:53 +0900 (JST)

 Does it reproduce on the old kernel
 e.g. ftp.netbsd.org/pub/NetBSD/arch/sgimips/netbsd.ip22 (1.5W)?
 ---
 UCHIYAMA Yasushi
 uch@vnop.net


From: manu@netbsd.org (Emmanuel Dreyfus)
To: uch@vnop.net (UCHIYAMA Yasushi)
Cc: gnats-bugs@netbsd.org, nathanw@wasabisystems.com
Subject: Re: port-mips/16154
Date: Wed, 3 Apr 2002 22:08:21 +0200

 > Does it reproduce on the old kernel
 > e.g. ftp.netbsd.org/pub/NetBSD/arch/sgimips/netbsd.ip22 (1.5W)?

 Yes, it does.

 -- 
 Emmanuel Dreyfus.
 Ugly one-liners -- http://gizmo.minet.net:8080/sh
 manu@netbsd.org

From: manu@netbsd.org (Emmanuel Dreyfus)
To: stephenm@employees.org (Stephen Ma)
Cc: gnats-bugs@netbsd.org
Subject: Re: port-mips/16154
Date: Thu, 4 Apr 2002 22:06:12 +0200

 By adding breakpoints in hardclock and softclock, I can tell that while
 the machine is hung, hardclock() is called, but not softclock().

 Here is the code path for hardclock()
 mips3_KernIntr -> cpu_intr -> ip22_intr -> hardclock

 softclock() is never called while hung.

 -- 
 Emmanuel Dreyfus
 manu@netbsd.org

From: manu@netbsd.org (Emmanuel Dreyfus)
To: gnats-bugs@netbsd.org
Cc: stephenm@employees.org (Stephen Ma), nathanw@wasabisystems.com,
 uch@vnop.net (UCHIYAMA Yasushi)
Subject: Re: port-mips/16154
Date: Fri, 5 Apr 2002 22:17:33 +0200

 More debugging:

 While the machine is hung, hardclock is called, but not softclock,
 because the MIPS3_CLKF_BASEPRI test in hardclock evaluates to zero. At
 that time, the SR stored on the trapframe is 0xfc03.

 For MIPS3, we have this:
 #define MIPS3_CLKF_BASEPRI(framep)      \
         ((~(framep)->sr & (MIPS_INT_MASK | MIPS_SR_INT_IE)) == 0)

 MIPS_INT_MASK | MIPS_SR_INT_IE  = 0xff00 | 0x01 = 0xff01
 ~SR = ~0xfc03 = 0x03fc

 ~SR & (MIPS_INT_MASK | MIPS_SR_INT_IE) = 0x0300
 and we don't get into softclock, but we go into softintr_schedule.


 When there is another process eating some CPU, we don't hang. In this
 case, SR is sometimes 0xff03 and sometimes 0xfc03. When it is 0xff03, we
 go into softclock().
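
The arithmetic above can be re-checked in user space with a small sketch. The two constants are the values quoted in this message rather than copies of the kernel headers, only the bits discussed here are modelled, and clkf_missing() and clkf_basepri() are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

/* Values as quoted in this PR, not copied from the kernel headers. */
#define MIPS_INT_MASK   0xff00u
#define MIPS_SR_INT_IE  0x0001u

/* Interrupt-enable bits that are clear in sr (hypothetical helper). */
uint32_t clkf_missing(uint32_t sr)
{
	return ~sr & (MIPS_INT_MASK | MIPS_SR_INT_IE);
}

/* Same expression as MIPS3_CLKF_BASEPRI, applied to a plain SR value. */
int clkf_basepri(uint32_t sr)
{
	return clkf_missing(sr) == 0;
}
```

With SR = 0xfc03, clkf_missing() yields 0x0300 and clkf_basepri() is false, matching the hung case; with SR = 0xff03 nothing is missing and the test is true.
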

 Does this mean anything to anyone?

 -- 
 Emmanuel Dreyfus.
 NetBSD, because I'm worth it.
 manu@netbsd.org

From: stephenm@employees.org (Stephen Ma)
To: manu@netbsd.org (Emmanuel Dreyfus)
Cc: gnats-bugs@netbsd.org, stephenm@employees.org (Stephen Ma),
   nathanw@wasabisystems.com, uch@vnop.net (UCHIYAMA Yasushi)
Subject: Re: port-mips/16154
Date: Fri, 5 Apr 2002 20:51:11 -0800

 >>>>> "manu" == Emmanuel Dreyfus <manu@netbsd.org> writes:

 manu> More debugging: While hang, hardclock is called, but not
 manu> softclock, because the MIPS3_CLKF_BASEPRI test in hardclock
 manu> turns into a zero. At that time, SR stored on the trapframe is
 manu> 0xfc03.

 manu> For MIPS3, we have this: #define MIPS3_CLKF_BASEPRI(framep) \
 manu> When there is another process eating some CPU, we don't hang. In
 manu> this case, sometime SR= 0xff03, and sometime 0xfc03. When it's
 manu> 0xff03, we go into softclock().

 manu> Does this speaks to someone?

 0xfc03 means all hard interrupts are enabled but no soft interrupts,
 whereas 0xff03 means both hard and soft interrupts are enabled. There's
 code at the beginning of trap() that enables just the hard interrupts.
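
As a rough illustration of that reading, the sketch below splits an SR value into its interrupt-related fields. The 0x0300/0xfc00 split between soft and hard mask bits is an assumption chosen to be consistent with the description above (it is not quoted from the headers in this PR), and struct sr_view and decode_sr() are hypothetical names.

```c
#include <stdint.h>

#define SR_INT_IE     0x0001u	/* global interrupt enable (value from this PR) */
#define SR_SOFT_MASK  0x0300u	/* assumed: the two software interrupt bits */
#define SR_HARD_MASK  0xfc00u	/* assumed: the six hardware interrupt bits */

/* Hypothetical summary of the interrupt state encoded in SR. */
struct sr_view {
	int ie;		/* interrupts globally enabled */
	int hard_all;	/* every hard interrupt unmasked */
	int soft_all;	/* every soft interrupt unmasked */
};

struct sr_view decode_sr(uint32_t sr)
{
	struct sr_view v;

	v.ie = (sr & SR_INT_IE) != 0;
	v.hard_all = (sr & SR_HARD_MASK) == SR_HARD_MASK;
	v.soft_all = (sr & SR_SOFT_MASK) == SR_SOFT_MASK;
	return v;
}
```

Under these assumptions decode_sr(0xfc03) reports hard interrupts enabled and soft interrupts masked, while decode_sr(0xff03) reports both enabled.
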

 Does anyone know why soft interrupts are disabled for trap()?

 - S


From: UCHIYAMA Yasushi <uch@vnop.net>
To: stephenm@employees.org
Cc: manu@netbsd.org, gnats-bugs@netbsd.org, nathanw@wasabisystems.com
Subject: Re: port-mips/16154
Date: Sun, 07 Apr 2002 01:18:51 +0900 (JST)

  | Does anyone know why soft interrupts are disabled for trap()?

 From Rev. 1.1 comments.
         /*
          * Enable hardware interrupts if they were on before.
          * We only respond to software interrupts when returning to user mode.
          */
         if (statusReg & MACH_SR_INT_ENA_PREV)
                 splx((statusReg & MACH_HARD_INT_MASK) | MACH_SR_INT_ENA_CUR);

 Since I updated the kernel, userland and toolchain to -current, my
 R4000 Indy frequently and randomly gets segmentation faults. I have not
 figured out the cause of this problem yet.
 ---
 UCHIYAMA Yasushi
 uch@vnop.net


From: stephenm@employees.org (Stephen Ma)
To: UCHIYAMA Yasushi <uch@vnop.net>
Cc: stephenm@employees.org, manu@netbsd.org, gnats-bugs@netbsd.org,
   nathanw@wasabisystems.com
Subject: Re: port-mips/16154
Date: Sun, 7 Apr 2002 11:08:22 -0700

 >  | Does anyone know why soft interrupts are disabled for trap()?
 > From Rev. 1.1 comments.
 >         /*
 >          * Enable hardware interrupts if they were on before.
 >          * We only respond to software interrupts when returning to user mode.
 >          */

 From the "more wild guesses" drawer, I would guess that this is
 causing the problem.

 Let's say that the only runnable process is the segfaulting one. Since
 this process is in a permanent segfault loop, the CPU is never in
 user-mode - it's always processing the segfault exception. Since
 trap() doesn't enable soft interrupts, no soft interrupts can get
 processed until after the CPU returns from exception processing.

 On return to user-mode, the CPU hits the segfaulting instruction
 again, and because (on the R5000, at least), this happens in the same
 pipeline stage as interrupts, the segfault (specifically, the TLB read
 miss exception) takes precedence, so the CPU doesn't get a chance to
 handle any pending soft interrupts.

 Without soft interrupts, softclock() never gets called. Also, the
 serial ports send rx data up to the kernel via soft interrupts,
 so you're unable to wake any processes waiting for serial
 input. The same applies for processes waiting for network data.

 The fix would be to find some location in the trap() call flow where
 it's safe to process pending soft interrupts. Possibly this could be
 done after trap() returns to UserGenException(). There could be other
 safe places to let soft interrupts be processed, but for most of the
 other NetBSD architectures, it's usually done just before a return
 from kernel to user mode.
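
The starvation described above can be mimicked with a toy model. This is only a sketch of the reasoning, not kernel code: struct toy_cpu, fault_once() and run_fault_loop() are hypothetical names, and the model assumes that each pass through the fault loop sees one clock tick that schedules a soft interrupt.

```c
#include <stdbool.h>

/* Toy model state; these are not kernel data structures. */
struct toy_cpu {
	int pending_soft;	/* soft interrupts waiting to run */
	int handled_soft;	/* soft interrupts that actually ran */
};

/*
 * One pass through the fault loop. A clock tick posts a soft interrupt,
 * the faulting instruction traps at once, and trap() runs with soft
 * interrupts masked. If unmask_after_trap is set, pending soft
 * interrupts are drained before the return to user mode.
 */
void fault_once(struct toy_cpu *cpu, bool unmask_after_trap)
{
	cpu->pending_soft++;	/* hardclock() schedules a soft interrupt */

	/* trap() itself enables hard interrupts only: nothing is drained. */

	if (unmask_after_trap) {
		cpu->handled_soft += cpu->pending_soft;
		cpu->pending_soft = 0;
	}

	/* Back in user mode the fault wins over any pending soft interrupt. */
}

struct toy_cpu run_fault_loop(int iterations, bool unmask_after_trap)
{
	struct toy_cpu cpu = { 0, 0 };
	int i;

	for (i = 0; i < iterations; i++)
		fault_once(&cpu, unmask_after_trap);
	return cpu;
}
```

In this model run_fault_loop(1000, false) never handles a soft interrupt and leaves all 1000 pending, while run_fault_loop(1000, true) handles every one, which is the behaviour the proposed fix aims for.
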

 Does this make sense?

 - S


From: stephenm@employees.org (Stephen Ma)
To: manu@netbsd.org (Emmanuel Dreyfus)
Cc: gnats-bugs@netbsd.org
Subject: port-mips/16154
Date: Sat, 16 Nov 2002 00:08:46 -0800

 Here's a quick attempt at a fix. I think this is a safe point to
 reenable interrupts. It compiles for the SGI indy, but I don't have a
 MIPS box on which to test this patch.

 $NetBSD: mipsX_subr.S,v 1.10 2002/11/12 14:00:41 nisimura Exp $

 - S

 --- mipsX_subr.S.orig	Fri Nov 15 17:40:55 2002
 +++ mipsX_subr.S	Sun Nov 17 13:57:47 2002
 @@ -737,6 +737,20 @@
  	COP0_SYNC
  	jal	_C_LABEL(trap)
  	sw	a3, CALLFRAME_SIZ-4(sp)		# for debugging
 +#ifndef IPL_ICU_MASK
 +/*
 + * Allow any pending soft interrupts to run. This is needed in the case
 + * of an exception occurring immediately after the return from exception
 + * which would prevent the soft interrupt triggering.
 + */
 +	mfc0	t2, MIPS_COP_0_STATUS
 +	REG_L	t0, CALLFRAME_SIZ + FRAME_SR(sp)
 +	and	t0, t0, MIPS_INT_MASK
 +	DYNAMIC_STATUS_MASK_TOUSER(t0, t1)	# machine dependent masking
 +	or	t0, t0, t2
 +	mtc0	t0, MIPS_COP_0_STATUS
 +	COP0_SYNC
 +#endif
  /*
   * Check pending asynchronous traps.
   */
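
Read as C, the instructions added by this patch merge the interrupt mask bits of the saved SR back into the current status register once trap() has returned. The sketch below models only that arithmetic: status_after_trap() is a hypothetical name, MIPS_INT_MASK uses the value quoted earlier in this PR, and the machine-dependent DYNAMIC_STATUS_MASK_TOUSER step is deliberately left out.

```c
#include <stdint.h>

#define MIPS_INT_MASK 0xff00u	/* value quoted earlier in this PR */

/*
 * Model of the patched sequence: new status = current status OR the
 * interrupt mask bits of the SR saved in the trap frame.
 * DYNAMIC_STATUS_MASK_TOUSER is machine dependent and not modelled.
 */
uint32_t status_after_trap(uint32_t cur_sr, uint32_t saved_sr)
{
	return cur_sr | (saved_sr & MIPS_INT_MASK);
}
```

For example, with an illustrative current status of 0xfc01 (hard interrupts only) and a saved SR of 0xff03, the result is 0xff01, so the soft interrupt bits are unmasked again before the pending-AST check.
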

State-Changed-From-To: open->closed 
State-Changed-By: manu 
State-Changed-When: Sat Jun 19 10:42:42 UTC 2004 
State-Changed-Why:  
the problem cannot be observed anymore 
>Unformatted:
