NetBSD Problem Report #38673
From martin@aprisoft.de Fri May 16 08:50:33 2008
Return-Path: <martin@aprisoft.de>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id 61A8C63B293
for <gnats-bugs@gnats.NetBSD.org>; Fri, 16 May 2008 08:50:33 +0000 (UTC)
Message-Id: <20080516085030.7C247AF580F@emmas.aprisoft.de>
Date: Fri, 16 May 2008 10:50:30 +0200 (CEST)
From: martin@duskware.de
Reply-To: martin@duskware.de
To: gnats-bugs@gnats.NetBSD.org
Subject: race condition in block device handling on w/o fast softints
X-Send-Pr-Version: 3.95
>Number: 38673
>Category: port-sparc64
>Synopsis: all sbus interrupts considered mpsafe
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: martin
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri May 16 08:55:00 +0000 2008
>Closed-Date: Sun May 18 23:06:58 +0000 2008
>Last-Modified: Sun May 18 23:06:58 +0000 2008
>Originator: Martin Husemann
>Release: NetBSD 4.99.62
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD nelly.aprisoft.de 4.99.62 NetBSD 4.99.62 (NELLY) #15: Fri May 16 10:39:52 CEST 2008 martin@emmas.aprisoft.de:/nelly/usr/src/sys/arch/sparc64/compile/NELLY sparc64
Architecture: sparc64
Machine: sparc64
>Description:
On a dual U2, trying to read a whole disk partition (like: dd if=/dev/rsd0a
of=/dev/zero bs=1m) fails after a while:
panic: sdstart(): dequeued wrong buf
Stopped in pid 0.4 (system) at netbsd:cpu_Debugger+0x4: nop
db{0}> bt
sdstart(3a44800, 3a09200, 5bf, 11f04a0, 1814000, 3a89e00) at netbsd:sdstart+0x37
4
scsipi_put_xs(3a44800, 0, 14396d0, 1198ae0, 7, ff898000) at netbsd:scsipi_put_xs
+0xe0
scsipi_complete(3a09200, 0, 46, 3a07ea0, c1, b) at netbsd:scsipi_complete+0x1c4
scsipi_done(3a40890, 3a07b30, ff888000, ff800000, ffffe000, 900040007ace8012) at
netbsd:scsipi_done+0x1a4
ncr53c9x_done(3a40800, 3a07b30, 44, 0, 10000, 1) at netbsd:ncr53c9x_done+0xf8
ncr53c9x_intr(3a07b30, 0, e0017ed0, 10000, 1037600, 101) at netbsd:ncr53c9x_intr
+0x9a4
sparc_interrupt(0, 7, 0, 0, 1814000, 3fff) at netbsd:sparc_interrupt+0x23c
_kernel_lock(10575, 0, fffffff, 146f800, d04f2c0, 1) at netbsd:_kernel_lock+0x11
4
biodone2(1, 11f02a0, 5fd, 11f0490, 1814000, d04fa90) at netbsd:biodone2+0x6c
biointr(0, d047ec0, 3, 6, 1814000, 180c000) at netbsd:biointr+0xa4
softint_thread(d68e008, d04f2c0, 11e9800, 11e9800, 11e9400, 11d6800) at netbsd:s
oftint_thread+0xd0
lwp_trampoline(f005eaf0, 111400, fffb1e28, 110418, fffb1df8, 1) at netbsd:lwp_tr
ampoline+0x8
db{0}> mach cpu 1
db{1}> bt
physio(0, 0, 1108, 100000, f, eb37bf0) at netbsd:physio+0x2f4
cdev_read(6, eb37bf0, 0, 11ec000, de98000, 4030dbf0) at netbsd:cdev_read+0x60
spec_read(eb37a48, 1166bf0, 11d0400, ea36fa0, de98000, 1) at netbsd:spec_read+0x
1e0
nfsspec_read(eb37a48, 10001, badcafe, 146f800, ea36fa0, 1) at netbsd:nfsspec_rea
d+0x38
VOP_READ(e86e720, eb37bf0, 0, d04bd40, badcafe, badcafe) at netbsd:VOP_READ+0x40
vn_read(e7c8180, e7c8180, eb37bf0, d04bd40, 1, 11ea000) at netbsd:vn_read+0x88
dofileread(16, e7c8180, 40a00000, 100000, 3, 1) at netbsd:dofileread+0x60
sys_read(3, eb37dc0, eb37e00, badcafe, badcafe, badcafe) at netbsd:sys_read+0x60
syscall_plain(eb37ed0, 3, 4073c5a4, 166, 4073c5a4, 800) at netbsd:syscall_plain+
0x11c
?(3, 40a00000, 100000, 20, 0, 4030dbf0) at 0x10092fc
>How-To-Repeat:
s/a
>Fix:
n/a
>Release-Note:
>Audit-Trail:
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/38673: race condition in block device handling on w/o fast softints
Date: Sat, 17 May 2008 13:39:50 +0100
On Fri, May 16, 2008 at 08:55:00AM +0000, martin@duskware.de wrote:
> >Synopsis: race condition in block device handling on w/o fast softints
I think it's not related to the soft interrupt.
> panic: sdstart(): dequeued wrong buf
> Stopped in pid 0.4 (system) at netbsd:cpu_Debugger+0x4: nop
> db{0}> bt
> sdstart(3a44800, 3a09200, 5bf, 11f04a0, 1814000, 3a89e00) at netbsd:sdstart+0x37
> 4
> scsipi_put_xs(3a44800, 0, 14396d0, 1198ae0, 7, ff898000) at netbsd:scsipi_put_xs
> +0xe0
> scsipi_complete(3a09200, 0, 46, 3a07ea0, c1, b) at netbsd:scsipi_complete+0x1c4
> scsipi_done(3a40890, 3a07b30, ff888000, ff800000, ffffe000, 900040007ace8012) at
> netbsd:scsipi_done+0x1a4
> ncr53c9x_done(3a40800, 3a07b30, 44, 0, 10000, 1) at netbsd:ncr53c9x_done+0xf8
> ncr53c9x_intr(3a07b30, 0, e0017ed0, 10000, 1037600, 101) at netbsd:ncr53c9x_intr
> +0x9a4
> sparc_interrupt(0, 7, 0, 0, 1814000, 3fff) at netbsd:sparc_interrupt+0x23c
> _kernel_lock(10575, 0, fffffff, 146f800, d04f2c0, 1) at netbsd:_kernel_lock+0x11
> 4
> biodone2(1, 11f02a0, 5fd, 11f0490, 1814000, d04fa90) at netbsd:biodone2+0x6c
> biointr(0, d047ec0, 3, 6, 1814000, 180c000) at netbsd:biointr+0xa4
> softint_thread(d68e008, d04f2c0, 11e9800, 11e9800, 11e9400, 11d6800) at netbsd:s
> oftint_thread+0xd0
> lwp_trampoline(f005eaf0, 111400, fffb1e28, 110418, fffb1df8, 1) at netbsd:lwp_tr
> ampoline+0x8
> db{0}> mach cpu 1
> db{1}> bt
> physio(0, 0, 1108, 100000, f, eb37bf0) at netbsd:physio+0x2f4
> cdev_read(6, eb37bf0, 0, 11ec000, de98000, 4030dbf0) at netbsd:cdev_read+0x60
> spec_read(eb37a48, 1166bf0, 11d0400, ea36fa0, de98000, 1) at netbsd:spec_read+0x
> 1e0
> nfsspec_read(eb37a48, 10001, badcafe, 146f800, ea36fa0, 1) at netbsd:nfsspec_rea
> d+0x38
> VOP_READ(e86e720, eb37bf0, 0, d04bd40, badcafe, badcafe) at netbsd:VOP_READ+0x40
>
> vn_read(e7c8180, e7c8180, eb37bf0, d04bd40, 1, 11ea000) at netbsd:vn_read+0x88
> dofileread(16, e7c8180, 40a00000, 100000, 3, 1) at netbsd:dofileread+0x60
> sys_read(3, eb37dc0, eb37e00, badcafe, badcafe, badcafe) at netbsd:sys_read+0x60
>
> syscall_plain(eb37ed0, 3, 4073c5a4, 166, 4073c5a4, 800) at netbsd:syscall_plain+
> 0x11c
> ?(3, 40a00000, 100000, 20, 0, 4030dbf0) at 0x10092fc
It seems that cpu1 should be holding kernel_lock because it's working on an
NFS vnode. You can verify that by digging 'vp' out of the arguments to
VOP_READ or vn_read, and then running 'show vnode' on it. If VV_MPSAFE is
clear, kernel_lock will have been taken. Or you can do 'show lock
kernel_lock' if a LOCKDEBUG kernel.
It also looks like cpu0 was waiting to acquire kernel_lock when the
interrupt occurred. kernel_lock should be acquired for ncr53c9x_intr(), but
I don't see intr_biglock_wrapper() in the backtrace. It would be useful to
verify which CPU holds the lock, and to verify that ncr53c9x_intr() is
actually occuring at IPL_VM.
It's difficult to tell what is going on, because the backtraces from all
CPUs but the one that panicked are always from some point after the event.
Andrew
Responsible-Changed-From-To: kern-bug-people->martin
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Sun, 18 May 2008 21:28:18 +0000
Responsible-Changed-Why:
my bug
From: Martin Husemann <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/38673 CVS commit: src/sys/arch/sparc64
Date: Sun, 18 May 2008 22:40:14 +0000 (UTC)
Module Name: src
Committed By: martin
Date: Sun May 18 22:40:14 UTC 2008
Modified Files:
src/sys/arch/sparc64/dev: psycho.c sbus.c
src/sys/arch/sparc64/include: cpu.h
src/sys/arch/sparc64/sparc64: clock.c intr.c machdep.c
Log Message:
Explicitly pass a "mpsafe" arg down to intr_establish, as at that point
we do not have the original ipl passed in around to check for mpsafeness.
Fixes PR port-sparc64/38673. Thanks to Andrew for pointing at the problem.
To generate a diff of this commit:
cvs rdiff -r1.85 -r1.86 src/sys/arch/sparc64/dev/psycho.c
cvs rdiff -r1.79 -r1.80 src/sys/arch/sparc64/dev/sbus.c
cvs rdiff -r1.80 -r1.81 src/sys/arch/sparc64/include/cpu.h
cvs rdiff -r1.96 -r1.97 src/sys/arch/sparc64/sparc64/clock.c
cvs rdiff -r1.59 -r1.60 src/sys/arch/sparc64/sparc64/intr.c
cvs rdiff -r1.221 -r1.222 src/sys/arch/sparc64/sparc64/machdep.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: martin@NetBSD.org
State-Changed-When: Sun, 18 May 2008 23:06:58 +0000
State-Changed-Why:
fixed
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.