NetBSD Problem Report #38673

From martin@aprisoft.de  Fri May 16 08:50:33 2008
Return-Path: <martin@aprisoft.de>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 61A8C63B293
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 16 May 2008 08:50:33 +0000 (UTC)
Message-Id: <20080516085030.7C247AF580F@emmas.aprisoft.de>
Date: Fri, 16 May 2008 10:50:30 +0200 (CEST)
From: martin@duskware.de
Reply-To: martin@duskware.de
To: gnats-bugs@gnats.NetBSD.org
Subject: race condition in block device handling on w/o fast softints
X-Send-Pr-Version: 3.95

>Number:         38673
>Category:       port-sparc64
>Synopsis:       all sbus interrupts considered mpsafe
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    martin
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri May 16 08:55:00 +0000 2008
>Closed-Date:    Sun May 18 23:06:58 +0000 2008
>Last-Modified:  Sun May 18 23:06:58 +0000 2008
>Originator:     Martin Husemann
>Release:        NetBSD 4.99.62
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD nelly.aprisoft.de 4.99.62 NetBSD 4.99.62 (NELLY) #15: Fri May 16 10:39:52 CEST 2008 martin@emmas.aprisoft.de:/nelly/usr/src/sys/arch/sparc64/compile/NELLY sparc64
Architecture: sparc64
Machine: sparc64
>Description:

On a dual U2, trying to read a whole disk partition (like: dd if=/dev/rsd0a
of=/dev/zero bs=1m) fails after a while:

panic: sdstart(): dequeued wrong buf
Stopped in pid 0.4 (system) at  netbsd:cpu_Debugger+0x4:        nop
db{0}> bt
sdstart(3a44800, 3a09200, 5bf, 11f04a0, 1814000, 3a89e00) at netbsd:sdstart+0x374
scsipi_put_xs(3a44800, 0, 14396d0, 1198ae0, 7, ff898000) at netbsd:scsipi_put_xs+0xe0
scsipi_complete(3a09200, 0, 46, 3a07ea0, c1, b) at netbsd:scsipi_complete+0x1c4
scsipi_done(3a40890, 3a07b30, ff888000, ff800000, ffffe000, 900040007ace8012) at netbsd:scsipi_done+0x1a4
ncr53c9x_done(3a40800, 3a07b30, 44, 0, 10000, 1) at netbsd:ncr53c9x_done+0xf8
ncr53c9x_intr(3a07b30, 0, e0017ed0, 10000, 1037600, 101) at netbsd:ncr53c9x_intr+0x9a4
sparc_interrupt(0, 7, 0, 0, 1814000, 3fff) at netbsd:sparc_interrupt+0x23c
_kernel_lock(10575, 0, fffffff, 146f800, d04f2c0, 1) at netbsd:_kernel_lock+0x114
biodone2(1, 11f02a0, 5fd, 11f0490, 1814000, d04fa90) at netbsd:biodone2+0x6c
biointr(0, d047ec0, 3, 6, 1814000, 180c000) at netbsd:biointr+0xa4
softint_thread(d68e008, d04f2c0, 11e9800, 11e9800, 11e9400, 11d6800) at netbsd:softint_thread+0xd0
lwp_trampoline(f005eaf0, 111400, fffb1e28, 110418, fffb1df8, 1) at netbsd:lwp_trampoline+0x8
db{0}> mach cpu 1
db{1}> bt
physio(0, 0, 1108, 100000, f, eb37bf0) at netbsd:physio+0x2f4
cdev_read(6, eb37bf0, 0, 11ec000, de98000, 4030dbf0) at netbsd:cdev_read+0x60
spec_read(eb37a48, 1166bf0, 11d0400, ea36fa0, de98000, 1) at netbsd:spec_read+0x1e0
nfsspec_read(eb37a48, 10001, badcafe, 146f800, ea36fa0, 1) at netbsd:nfsspec_read+0x38
VOP_READ(e86e720, eb37bf0, 0, d04bd40, badcafe, badcafe) at netbsd:VOP_READ+0x40
vn_read(e7c8180, e7c8180, eb37bf0, d04bd40, 1, 11ea000) at netbsd:vn_read+0x88
dofileread(16, e7c8180, 40a00000, 100000, 3, 1) at netbsd:dofileread+0x60
sys_read(3, eb37dc0, eb37e00, badcafe, badcafe, badcafe) at netbsd:sys_read+0x60
syscall_plain(eb37ed0, 3, 4073c5a4, 166, 4073c5a4, 800) at netbsd:syscall_plain+0x11c
?(3, 40a00000, 100000, 20, 0, 4030dbf0) at 0x10092fc
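
The panic message suggests a check of the form "look at the head of the
queue, work on it, then dequeue and compare". The following user-space
sketch models how such a check can fail when a second CPU touches the queue
without holding a common lock. It is an illustration only: struct bufq, the
bufq_* helpers, start_once() and steal_head() are hypothetical names written
for this example and are not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

#define QLEN 8

/* Hypothetical, minimal FIFO of request ids; stands in for a buffer queue. */
struct bufq {
	int	items[QLEN];
	size_t	head, tail;
};

/* Append an id; the caller must not overfill the queue. */
static void
bufq_put(struct bufq *q, int id)
{
	assert(q->tail - q->head < QLEN);
	q->items[q->tail++ % QLEN] = id;
}

/* Look at the head without removing it; -1 if the queue is empty. */
static int
bufq_peek(const struct bufq *q)
{
	return q->head == q->tail ? -1 : q->items[q->head % QLEN];
}

/* Remove and return the head; -1 if the queue is empty. */
static int
bufq_get(struct bufq *q)
{
	return q->head == q->tail ? -1 : q->items[q->head++ % QLEN];
}

/*
 * Model of a start routine: peek, do some work, then dequeue and check.
 * 'intruder' simulates a second CPU modifying the queue in between,
 * which a shared lock would normally prevent.
 * Returns 1 if the dequeued item matches the peeked one, 0 otherwise.
 */
static int
start_once(struct bufq *q, void (*intruder)(struct bufq *))
{
	int peeked = bufq_peek(q);

	if (intruder != NULL)
		intruder(q);
	return bufq_get(q) == peeked;
}

/* The "other CPU": consumes the head while the first is mid-operation. */
static void
steal_head(struct bufq *q)
{
	(void)bufq_get(q);
}
```

With no intruder, start_once() always dequeues what it peeked. Passing
steal_head makes the comparison fail, which corresponds to the condition
the panic reports.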


>How-To-Repeat:
See above.

>Fix:
n/a

>Release-Note:

>Audit-Trail:
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/38673: race condition in block device handling on w/o fast softints
Date: Sat, 17 May 2008 13:39:50 +0100

 On Fri, May 16, 2008 at 08:55:00AM +0000, martin@duskware.de wrote:

 > >Synopsis:       race condition in block device handling on w/o fast softints

 I think it's not related to the soft interrupt.

 > panic: sdstart(): dequeued wrong buf
 > Stopped in pid 0.4 (system) at  netbsd:cpu_Debugger+0x4:        nop
 > db{0}> bt
 > sdstart(3a44800, 3a09200, 5bf, 11f04a0, 1814000, 3a89e00) at netbsd:sdstart+0x374
 > scsipi_put_xs(3a44800, 0, 14396d0, 1198ae0, 7, ff898000) at netbsd:scsipi_put_xs+0xe0
 > scsipi_complete(3a09200, 0, 46, 3a07ea0, c1, b) at netbsd:scsipi_complete+0x1c4
 > scsipi_done(3a40890, 3a07b30, ff888000, ff800000, ffffe000, 900040007ace8012) at netbsd:scsipi_done+0x1a4
 > ncr53c9x_done(3a40800, 3a07b30, 44, 0, 10000, 1) at netbsd:ncr53c9x_done+0xf8
 > ncr53c9x_intr(3a07b30, 0, e0017ed0, 10000, 1037600, 101) at netbsd:ncr53c9x_intr+0x9a4
 > sparc_interrupt(0, 7, 0, 0, 1814000, 3fff) at netbsd:sparc_interrupt+0x23c
 > _kernel_lock(10575, 0, fffffff, 146f800, d04f2c0, 1) at netbsd:_kernel_lock+0x114
 > biodone2(1, 11f02a0, 5fd, 11f0490, 1814000, d04fa90) at netbsd:biodone2+0x6c
 > biointr(0, d047ec0, 3, 6, 1814000, 180c000) at netbsd:biointr+0xa4
 > softint_thread(d68e008, d04f2c0, 11e9800, 11e9800, 11e9400, 11d6800) at netbsd:softint_thread+0xd0
 > lwp_trampoline(f005eaf0, 111400, fffb1e28, 110418, fffb1df8, 1) at netbsd:lwp_trampoline+0x8
 > db{0}> mach cpu 1
 > db{1}> bt
 > physio(0, 0, 1108, 100000, f, eb37bf0) at netbsd:physio+0x2f4
 > cdev_read(6, eb37bf0, 0, 11ec000, de98000, 4030dbf0) at netbsd:cdev_read+0x60
 > spec_read(eb37a48, 1166bf0, 11d0400, ea36fa0, de98000, 1) at netbsd:spec_read+0x1e0
 > nfsspec_read(eb37a48, 10001, badcafe, 146f800, ea36fa0, 1) at netbsd:nfsspec_read+0x38
 > VOP_READ(e86e720, eb37bf0, 0, d04bd40, badcafe, badcafe) at netbsd:VOP_READ+0x40
 > vn_read(e7c8180, e7c8180, eb37bf0, d04bd40, 1, 11ea000) at netbsd:vn_read+0x88
 > dofileread(16, e7c8180, 40a00000, 100000, 3, 1) at netbsd:dofileread+0x60
 > sys_read(3, eb37dc0, eb37e00, badcafe, badcafe, badcafe) at netbsd:sys_read+0x60
 > syscall_plain(eb37ed0, 3, 4073c5a4, 166, 4073c5a4, 800) at netbsd:syscall_plain+0x11c
 > ?(3, 40a00000, 100000, 20, 0, 4030dbf0) at 0x10092fc

 It seems that cpu1 should be holding kernel_lock because it's working on an
 NFS vnode. You can verify that by digging 'vp' out of the arguments to
 VOP_READ or vn_read, and then running 'show vnode' on it. If VV_MPSAFE is
 clear, kernel_lock will have been taken. Alternatively, run 'show lock
 kernel_lock' if this is a LOCKDEBUG kernel.

 It also looks like cpu0 was waiting to acquire kernel_lock when the
 interrupt occurred. kernel_lock should be acquired for ncr53c9x_intr(), but
 I don't see intr_biglock_wrapper() in the backtrace. It would be useful to
 verify which CPU holds the lock, and to verify that ncr53c9x_intr() is
 actually occurring at IPL_VM.

 It's difficult to tell what is going on, because the backtraces from all
 CPUs but the one that panicked are always from some point after the event.

 Andrew

Responsible-Changed-From-To: kern-bug-people->martin
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Sun, 18 May 2008 21:28:18 +0000
Responsible-Changed-Why:
my bug


From: Martin Husemann <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/38673 CVS commit: src/sys/arch/sparc64
Date: Sun, 18 May 2008 22:40:14 +0000 (UTC)

 Module Name:	src
 Committed By:	martin
 Date:		Sun May 18 22:40:14 UTC 2008

 Modified Files:
 	src/sys/arch/sparc64/dev: psycho.c sbus.c
 	src/sys/arch/sparc64/include: cpu.h
 	src/sys/arch/sparc64/sparc64: clock.c intr.c machdep.c

 Log Message:
 Explicitly pass a "mpsafe" arg down to intr_establish, because at that
 point the originally requested ipl is no longer available to check for
 mpsafeness.
 Fixes PR port-sparc64/38673. Thanks to Andrew for pointing at the problem.


 To generate a diff of this commit:
 cvs rdiff -r1.85 -r1.86 src/sys/arch/sparc64/dev/psycho.c
 cvs rdiff -r1.79 -r1.80 src/sys/arch/sparc64/dev/sbus.c
 cvs rdiff -r1.80 -r1.81 src/sys/arch/sparc64/include/cpu.h
 cvs rdiff -r1.96 -r1.97 src/sys/arch/sparc64/sparc64/clock.c
 cvs rdiff -r1.59 -r1.60 src/sys/arch/sparc64/sparc64/intr.c
 cvs rdiff -r1.221 -r1.222 src/sys/arch/sparc64/sparc64/machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
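
The idea in the log message, that the caller states explicitly whether a
handler is MP-safe and the establish routine wraps handlers that are not,
can be sketched in user space roughly as follows. This is a minimal model
under stated assumptions, not the committed code: struct intrhand's layout,
establish(), dispatch(), biglock_wrapper(), report_lock() and the
biglock_depth counter (a stand-in for kernel_lock) are all hypothetical
names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for kernel_lock: a hold counter, not a real lock. */
static int biglock_depth;

/* Hypothetical handler record. */
struct intrhand {
	int	(*ih_fun)(void *);	/* what the dispatcher calls */
	void	*ih_arg;
	int	(*ih_realfun)(void *);	/* original handler, if wrapped */
	void	*ih_realarg;
};

/* Run the real handler with the big lock held. */
static int
biglock_wrapper(void *vp)
{
	struct intrhand *ih = vp;
	int rv;

	assert(ih->ih_realfun != NULL);
	biglock_depth++;
	rv = ih->ih_realfun(ih->ih_realarg);
	biglock_depth--;
	return rv;
}

/*
 * Hypothetical establish routine. The caller, which still knows the
 * level it was asked for, decides 'mpsafe' and passes it explicitly,
 * so this routine never has to derive it.
 */
static void
establish(struct intrhand *ih, int (*fun)(void *), void *arg, bool mpsafe)
{
	if (mpsafe) {
		/* MP-safe handlers are called directly. */
		ih->ih_fun = fun;
		ih->ih_arg = arg;
	} else {
		/* Everything else goes through the big-lock wrapper. */
		ih->ih_realfun = fun;
		ih->ih_realarg = arg;
		ih->ih_fun = biglock_wrapper;
		ih->ih_arg = ih;
	}
}

/* Dispatch as an interrupt path would: call whatever was installed. */
static int
dispatch(struct intrhand *ih)
{
	return ih->ih_fun(ih->ih_arg);
}

/* Example handler: reports whether the big lock was held when it ran. */
static int
report_lock(void *arg)
{
	(void)arg;
	return biglock_depth > 0;
}
```

Establishing report_lock with mpsafe set to false makes dispatch() return 1
(lock held during the handler); with mpsafe set to true it returns 0. A
handler wrongly treated as MP-safe runs without the lock, which is the
situation the synopsis describes.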

State-Changed-From-To: open->closed
State-Changed-By: martin@NetBSD.org
State-Changed-When: Sun, 18 May 2008 23:06:58 +0000
State-Changed-Why:
fixed


>Unformatted: