NetBSD Problem Report #34101

From jld@panix.com  Fri Jul 28 03:14:00 2006
Return-Path: <jld@panix.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 1BC6063B850
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 28 Jul 2006 03:14:00 +0000 (UTC)
Message-Id: <200607280313.k6S3Dvp26825@byzantium.nyc.access.net>
Date: Thu, 27 Jul 2006 23:13:57 -0400 (EDT)
From: jld@panix.com
Reply-To: jld@panix.com
To: gnats-bugs@NetBSD.org
Subject: ltsleep during panic hangs system
X-Send-Pr-Version: 3.95

>Number:         34101
>Category:       kern
>Synopsis:       ltsleep during panic hangs system
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jul 28 03:15:00 +0000 2006
>Originator:     Jed Davis
>Release:        NetBSD 3.0
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD panix3.panix.com 3.0 NetBSD 3.0 (PANIX-FIVE) #0: Fri Apr 14 21:05:29 EDT 2006  root@juggler.panix.com:/devel/netbsd/3.0/src/sys/arch/i386/compile/PANIX-FIVE i386
Architecture: i386
Machine: i386
>Description:

The top of ltsleep() contains this:

        /*
         * XXXSMP
         * This is probably bogus.  Figure out what the right
         * thing to do here really is.
         * Note that not sleeping if ltsleep is called with curlwp == NULL
         * in the shutdown case is disgusting but partly necessary given
         * how shutdown (barely) works.
         */
        if (cold || (doing_shutdown && (panicstr || (l == NULL)))) {
                /*
                 * After a panic, or during autoconfiguration,
                 * just give interrupts a chance, then just return;
                 * don't run any other procs or panic below,
                 * in case this is the idle process and already asleep.
                 */

The problem is that, if the system is panicking and trying to reboot
(which may include an attempt to sync disks), and a kernel thread that
loops calling ltsleep to wait for work (e.g., aiodoned, or i386's MD
apm_thread) gets woken up, every subsequent ltsleep call returns
immediately instead of sleeping, so the thread spins forever and the
system never succeeds in rebooting.
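
To make the failure mode concrete, here is a minimal user-space sketch of
the condition quoted above. It is not kernel code: mock_ltsleep(),
worker_loop(), real_sleeps and max_iters are hypothetical names invented
for illustration, and the globals are plain stand-ins for the kernel
variables of the same name. It only models the early-return test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-ins for the kernel globals tested at the top of ltsleep(). */
static bool cold = false;
static bool doing_shutdown = false;
static const char *panicstr = NULL;

/* Counts how many times the mock actually "slept". */
static unsigned long real_sleeps = 0;

/*
 * Hypothetical mock of ltsleep(): models only the early-return test quoted
 * above.  When the test fires it returns at once without sleeping;
 * otherwise it records one sleep.
 */
static int mock_ltsleep(const void *curlwp)
{
    if (cold || (doing_shutdown && (panicstr || curlwp == NULL)))
        return 0;       /* no sleep: the caller resumes immediately */
    real_sleeps++;
    return 0;
}

/*
 * Hypothetical worker in the style of a kernel thread that loops calling
 * ltsleep to wait for work.  max_iters bounds the simulation; a real
 * thread has no such bound.  Returns how many iterations did not sleep.
 */
static unsigned long worker_loop(unsigned long max_iters)
{
    int self;           /* any non-NULL address stands in for curlwp */
    unsigned long spins = 0;

    for (unsigned long i = 0; i < max_iters; i++) {
        unsigned long before = real_sleeps;
        (void)mock_ltsleep(&self);
        if (real_sleeps == before)
            spins++;    /* returned without sleeping */
    }
    return spins;
}
```

With doing_shutdown and panicstr both set, every pass through the loop
returns without sleeping, which is the spin described above; with them
clear, every pass sleeps.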

However, the check appears to be there for a reason, so the correct
solution is probably not to just yank it out and try to sleep normally.


PR port-i386/33353 was opened for the specific instance of this problem
with apm_thread.  In that special case it might be reasonable to have
the affected thread just exit if it's woken during a panic, but that
doesn't seem like the right solution (even if it would work).
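
For reference, the per-thread approach mentioned above could look roughly
like the following user-space sketch.  It is an assumption about the shape
of such a change, not the actual apm_thread code: mock_sleep(),
worker_until_panic() and its parameters are hypothetical, and panicstr is
a plain stand-in for the kernel global.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static const char *panicstr = NULL;     /* stand-in for the kernel global */

/* Hypothetical stand-in for a sleep that returns immediately during a panic. */
static void mock_sleep(void)
{
}

/*
 * Hypothetical worker: handles one unit of work per wakeup, but leaves
 * its loop as soon as it finds itself woken while a panic is in progress.
 * panic_at is the iteration at which the simulated panic occurs.
 * Returns the number of work items handled before leaving the loop.
 */
static unsigned long worker_until_panic(unsigned long panic_at,
    unsigned long max_iters)
{
    unsigned long handled = 0;

    for (unsigned long i = 0; i < max_iters; i++) {
        if (i == panic_at)
            panicstr = "simulated panic";   /* the panic happens "elsewhere" */
        mock_sleep();
        if (panicstr != NULL)
            break;      /* in a kernel thread, this is where it would exit */
        handled++;
    }
    return handled;
}
```

This stops the one thread from spinning, but every other thread that
loops on ltsleep would need the same check, which is part of why it does
not feel like a general fix.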

>How-To-Repeat:

This happens most of the time when a host at Panix experiences a panic;
often enough that we've had to locally modify swwdog(4) to pass
RB_NOSYNC and rely on it as a workaround.

>Fix:

That's what I'm filing this PR to find out.  A somewhat distasteful
workaround is noted above. 
