NetBSD Problem Report #41974
From spz@NetBSD.org Wed Sep 2 05:56:21 2009
Return-Path: <spz@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id 6BBAE63BB08
for <gnats-bugs@gnats.NetBSD.org>; Wed, 2 Sep 2009 05:56:21 +0000 (UTC)
Message-Id: <20090902055618.EC3191E4E000@build.netbsd.org>
Date: Wed, 2 Sep 2009 05:56:18 +0000 (UTC)
From: spz@NetBSD.org
Reply-To: spz@NetBSD.org
To: gnats-bugs@gnats.NetBSD.org
Subject: panic in cpu_in_cksum / likely NFS issue
X-Send-Pr-Version: 3.95
>Number: 41974
>Category: kern
>Synopsis: panic in cpu_in_cksum
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Sep 02 06:00:00 +0000 2009
>Last-Modified: Thu Feb 26 00:45:00 +0000 2015
>Originator: S.P.Zeidler
>Release: NetBSD 5.0_STABLE
>Organization:
TNF
>Environment:
System: NetBSD b2.netbsd.org 5.0_STABLE NetBSD 5.0_STABLE (BUILD) #5: Sun Aug 9 20:35:32 UTC 2009 tls@ADMIN:/chroots/netbsd-5/src/sys/arch/amd64/compile/obj/BUILD amd64
Architecture: x86_64
Machine: amd64
>Description:
This has been greeting me at least 4 times now:
uvm_fault(0xffffffff80b662c0, 0xffff80012da62000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff80465425 cs 8 rflags 10206 cr2 ffff80012da62000 cpl 4 rsp ffff80003232c2b8
kernel: page fault trap, code=0
Stopped in pid 0.47 (system) at netbsd:cpu_in_cksum+0xa5: movl 0(%rbx),%ecx
db{0}> bt
cpu_in_cksum() at netbsd:cpu_in_cksum+0xa5
uvm_fault(0xffffffff80b87e60, 0x0, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff80466f54 cs 8 rflags 10246 cr2 53c cpl 8 rsp ffff80003232bc60
kernel: page fault trap, code=0
Faulted in DDB; continuing...
db{0}> sh reg
ds 0xd000
es 0x7008
fs 0xd000
gs 0x7a00
rdi 0xffff800009f4a400
rsi 0
rbp 0x534
rbx 0xffff80012da62000
rdx 0
rcx 0x800000
rax 0
r8 0x5010c889d
r9 0
r10 0
r11 0x5c8
r12 0x14
r13 0xffff800007cdd000
r14 0x2
r15 0xffff80003232c340
rip 0xffffffff80465425 cpu_in_cksum+0xa5
cs 0x8
rflags 0x10206
rsp 0xffff80003232c2b8
ss 0x10
netbsd:cpu_in_cksum+0xa5: movl 0(%rbx),%ecx
db{0}> mach cpu
addr dev id flags ipis curlwp fpcurlwp
0xffffffff80ad0a00 cpu0 0 3009 0 0xffff80003231f7c0 0x0
0xffff800007bd1000 cpu1 1 f002 0 0xffff80003231f3e0 0xffff8000323767e0
db{0}> mach cpu 1
using CPU 1
db{0}> bt
x86_pause() at netbsd:x86_pause
_kernel_lock() at netbsd:_kernel_lock+0xc7
sleepq_block() at netbsd:sleepq_block+0x1b3
cv_timedwait() at netbsd:cv_timedwait+0xb0
nfs_rcvlock() at netbsd:nfs_rcvlock+0xa2
nfs_request() at netbsd:nfs_request+0x421
nfs_writerpc() at netbsd:nfs_writerpc+0x42e
nfs_doio() at netbsd:nfs_doio+0x4d4
nfssvc_iod() at netbsd:nfssvc_iod+0x193
db{0}> sh reg
ds 0
es 0
fs 0
gs 0
rdi 0
rsi 0xffff800007bd1000
rbp 0xffff80003232f740
rbx 0
rdx 0x6
rcx 0
rax 0
r8 0
r9 0x60
r10 0xffff800007bd1080
r11 0
r12 0xffff800007bd1000
r13 0x1
r14 0xffff80003231f3e0
r15 0
rip 0xffffffff80465758 x86_pause
cs 0x8
rflags 0x246
rsp 0xffff80003232f6f8
ss 0x10
netbsd:x86_pause: repe nop
dumping won't work, alas. Any other info to grab next time?
>How-To-Repeat:
run builds and wait
>Fix:
>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Wed, 2 Sep 2009 16:28:01 +0200
On Wed, Sep 02, 2009 at 06:00:00AM +0000, spz@NetBSD.org wrote:
> Machine: amd64
> >Description:
> This has been greeting me at least 4 times now:
>
> uvm_fault(0xffffffff80b662c0, 0xffff80012da62000, 1) -> e
> fatal page fault in supervisor mode
> trap type 6 code 0 rip ffffffff80465425 cs 8 rflags 10206 cr2 ffff80012da62000 cpl 4 rsp ffff80003232c2b8
> kernel: page fault trap, code=0
> Stopped in pid 0.47 (system) at netbsd:cpu_in_cksum+0xa5: movl 0(%rbx),%ecx
> db{0}> bt
> cpu_in_cksum() at netbsd:cpu_in_cksum+0xa5
> uvm_fault(0xffffffff80b87e60, 0x0, 1) -> e
Looks a lot like what I'm seeing on a UP alpha NFS client. This seems to
be related to the server going away and coming back (I suspect it's the
same for this client).
See http://mail-index.netbsd.org/current-users/2008/12/23/msg006816.html
and followups
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: "S.P.Zeidler" <spz@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/41974
Date: Sun, 30 Sep 2012 07:42:23 +0000
Just for the record: The issue has survived into netbsd-6.
From: "Greg A. Woods" <woods@planix.ca>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>,
NetBSD GNATS Administrator <gnats-admin@NetBSD.org>
Cc:
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Mon, 23 Feb 2015 18:53:18 -0800
Some more info, possibly useful....
I recently, and finally, switched one of my servers from i386 to amd64
and suddenly I get these same cpu_in_cksum uvm_fault panics almost any
time I try to write (i.e. copy a large file) to an NFS mount point. Not
with every write, but it doesn't seem to take very many tries to
reproduce.
I never ever saw this problem before with the i386 kernel.
Both the before (i386) and after (amd64) systems were built from the
same source tree, which is on the very tip of the netbsd-5 branch.
These are running bare-metal on a Dell PE2950 (2x8-core, 32GB RAM).
It doesn't make any difference whether hardware assisted check-summing
capabilities are enabled in the ethernet interface or not. Initial
panics were observed with caps_enabled=0, but panics have continued with
the following config:
$ /sbin/ifconfig bnx1
bnx1: flags=8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
capabilities=3f00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx>
caps_enabled=3f00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx>
address: 00:1d:09:35:3c:09
media: Ethernet autoselect (1000baseT full-duplex)
status: active
inet 10.0.1.129 netmask 0xffffff00 broadcast 10.0.1.255
There's no trouble reading from remote NFS servers -- only writing to
them as an NFS client, and perhaps only with larger files/writes. I've
done several full builds, and a bunch of pkgsrc builds, with sources on
the same NFS server which fails when written to, and I've never had any
problem with the read-only access to src and pkgsrc. Manual tests with
'dd' reading large files with large reads work A-OK as well
(i.e. reading with the amd64 kernel as a client, or reading from the
other machine with the amd64 kernel as a server).
I.e.: note that the amd64 kernel happily serves NFS without
encountering this error.
Assuming the new PE2950 that arrived today is in working order then soon
I should be able to test if this happens in a Xen domU, and with
NetBSD-current.
One other possibly interesting point: The server in this case has been
an older PE2650 running NetBSD 4.0_STABLE, and it has a weird "tick" in
its RAID controller and/or driver (see PR# kern/35769), which means it
sometimes doesn't respond to NFS requests in the most timely
manner. I.e. perhaps this bug is more easily tickled when the NFS
server is slow, and/or the network connection is poor, or similar.
Perhaps I will try using an NFS mount of my iMac; and soon I should also
be able to cross-mount the PE2950s for testing as well (especially if
the bug is reproducible in a Xen kernel).
--
Greg A. Woods
Planix, Inc.
<woods@planix.com> +1 250 762-7675 http://www.planix.com/
From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Tue, 24 Feb 2015 12:28:07 +0100
On Tue, Feb 24, 2015 at 04:15:00AM +0000, Greg A. Woods wrote:
> I recently, and finally, switched one of my servers from i386 to amd64
> and suddenly I get these same cpu_in_cksum uvm_fault panics almost any
> time I try to write (i.e. copy a large file) to an NFS mount point. Not
> with every write, but it doesn't seem to take very many tries to
> reproduce.
Can you be more specific about what uvm_fault panics you see?
Joerg
From: "Greg A. Woods" <woods@planix.ca>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: Joerg Sonnenberger <joerg@britannica.bec.de>,
NetBSD GNATS Administrator <gnats-admin@NetBSD.org>
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Tue, 24 Feb 2015 10:11:28 -0800
Joerg asked:
>
> Can you be more specific about what uvm_fault panics you see?
Sorry, but they're just very much like the ones reported in the PR.
They are apparently from a fault in cpu_in_cksum(), and any attempt to
print a full stack trace causes another fault to happen in ddb, and the
system needs a hard reset if you try to dump or sync or reboot with sync
from ddb. "reboot 0x4" works of course, albeit with the same loss to
the mounted filesystems as a hard reset.
Here's a copy of a couple of different crashes from my console log.
uvm_fault(0xffffffff80d5bd80, 0xffff8001806fa000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff80571b65 cs 8 rflags 10202 cr2 ffff8001806fa000 cpl 4 rsp ffff80008723b328
kernel: page fault trap, code=0
Stopped in pid 0.95 (system) at netbsd:cpu_in_cksum+0xa5: movl 0(%rbx),%ecx
db{4}> bt
cpu_in_cksum() at netbsd:cpu_in_cksum+0xa5
uvm_fault(0xffffffff80d8ca00, 0x8000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff805736d4 cs 8 rflags 10246 cr2 8008 cpl 8 rsp ffff80008723ace0
kernel: page fault trap, code=0
Faulted in DDB; continuing...
db{4}>
uvm_fault(0xffffffff80d5bd80, 0xffff8001806fa000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff80571b65 cs 8 rflags 10202 cr2 ffff8001806fa000 cpl 4 rsp ffff8000b5053328
kernel: page fault trap, code=0
Stopped in pid 0.96 (system) at netbsd:cpu_in_cksum+0xa5: movl 0(%rbx),%ecx
db{7}> trace
cpu_in_cksum() at netbsd:cpu_in_cksum+0xa5
uvm_fault(0xffffffff80d8ca00, 0x8000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff805736d4 cs 8 rflags 10246 cr2 8008 cpl 8 rsp ffff8000b5052ce0
kernel: page fault trap, code=0
Faulted in DDB; continuing...
db{7}>
--
Greg A. Woods
Planix, Inc.
<woods@planix.com> +1 250 762-7675 http://www.planix.com/
From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>, "Greg A. Woods" <woods@planix.ca>
Cc:
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Tue, 24 Feb 2015 20:38:10 +0100
On Tue, Feb 24, 2015 at 10:11:28AM -0800, Greg A. Woods wrote:
> Joerg asked:
> >
> > Can you be more specific about what uvm_fault panics you see?
>
> Sorry, but they're just very much like the ones reported in the PR.
Can you disable the CPU_IN_CKSUM option and try to trigger it again?
The C version would be easier to instrument to see why the mbuf chain is
bad...
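
To illustrate the kind of instrumentation meant here, the following is a minimal user-space sketch in C, not the kernel's code. `struct seg`, `chain_covers()` and `chain_cksum()` are hypothetical names that stand in for an mbuf chain and its checksum loop. The idea is to confirm that the chain really holds `off + len` bytes before summing, so that a short chain is reported instead of being read past its end. In a kernel the `assert()` would presumably become a diagnostic that prints the chain; that part is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf: one segment of a buffer chain. */
struct seg {
	struct seg *next;
	const uint8_t *data;
	size_t len;
};

/*
 * Return 0 if the chain holds at least off + len bytes, -1 if it is short.
 * A checksum loop that trusts len without this check will run past the
 * last segment when the caller's length and the chain disagree.
 */
static int
chain_covers(const struct seg *s, size_t off, size_t len)
{
	size_t avail = 0;

	for (; s != NULL; s = s->next)
		avail += s->len;
	if (off > avail || len > avail - off)
		return -1;
	return 0;
}

/*
 * 16-bit one's complement Internet checksum (RFC 1071) over the byte
 * range [off, off + len) of the chain, handling words split across
 * segments.  The result is returned as a host-order number.
 */
static uint16_t
chain_cksum(const struct seg *s, size_t off, size_t len)
{
	uint64_t sum = 0;
	int odd = 0;	/* nonzero when the next byte is the low half of a word */

	/* Instrumentation point: refuse to sum a chain that is too short. */
	assert(chain_covers(s, off, len) == 0);

	for (; s != NULL && len > 0; s = s->next) {
		size_t i;

		if (off >= s->len) {	/* segment lies entirely before off */
			off -= s->len;
			continue;
		}
		i = off;
		off = 0;
		for (; i < s->len && len > 0; i++, len--) {
			sum += odd ? s->data[i] : (uint64_t)s->data[i] << 8;
			odd = !odd;
		}
	}
	while (sum >> 16)	/* fold carries back into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

With the RFC 1071 sample bytes 00 01 f2 03 f4 f5 f6 f7 split across two segments, `chain_cksum()` yields 0x220d, while asking `chain_covers()` for nine bytes from that eight-byte chain reports the shortfall.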
Joerg
From: "Greg A. Woods" <woods@planix.ca>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: Joerg Sonnenberger <joerg@britannica.bec.de>
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Wed, 25 Feb 2015 16:41:01 -0800
At Tue, 24 Feb 2015 20:38:10 +0100, Joerg Sonnenberger <joerg@britannica.bec.de> wrote:
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
>
> Can you disable the CPU_IN_CKSUM option and try to trigger it again?
> The C version would be easier to instrument to see why the mbuf chain is
> bad...
Well, it seemed quite a lot more difficult to trigger the crash with the
C version, but eventually, just when I didn't think it would happen....
The back-trace looks better, and the machine didn't hang quite so badly,
but syncing still caused a second panic, and so there was no crash dump
either.
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff8054021e cs 8 rflags 10206 cr2 ffff8001806fa000 cpl 4 rsp ffff800086b58870
kernel: page fault trap, code=0
Stopped in pid 0.5 (system) at netbsd:cpu_in_cksum+0xce: movl 0xffffffffffffffc0(%rsi),%edx
db{0}> bt
cpu_in_cksum() at netbsd:cpu_in_cksum+0xce
in_delayed_cksum() at netbsd:in_delayed_cksum+0x3c
ip_output() at netbsd:ip_output+0xd68
udp_output() at netbsd:udp_output+0x221
udp_usrreq() at netbsd:udp_usrreq+0x31c
udp_usrreq_wrapper() at netbsd:udp_usrreq_wrapper+0x51
nfs_timer() at netbsd:nfs_timer+0x41f
callout_softclock() at netbsd:callout_softclock+0x21b
softint_dispatch() at netbsd:softint_dispatch+0xc3
DDB lost frame for netbsd:Xsoftintr+0x50, trying 0xffff800086b58d70
Xsoftintr() at netbsd:Xsoftintr+0x50
--- interrupt ---
0:
db{0}> sh reg
ds 0
es 0
fs 0
gs 0
rdi 0
rsi 0xffff8001806fa040
rbp 0xffff800086b58890
rbx 0
rdx 0
rcx 0
rax 0x8000
r8 0x7fc0
r9 0xc000000000000000
r10 0x43ab4701
r11 0x8000
r12 0xffff80003ec23400
r13 0x8000
r14 0
r15 0
rip 0xffffffff8054021e cpu_in_cksum+0xce
cs 0x8
rflags 0x10206
rsp 0xffff800086b58870
ss 0
netbsd:cpu_in_cksum+0xce: movl 0xffffffffffffffc0(%rsi),%edx
db{0}> sh uvmexp
Current UVM status:
pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
8129044 VM pages: 692704 active, 111 inactive, 2027 wired, 7047912 free
pages 127267 anon, 562283 file, 5292 exec
freemin=2048, free-target=2730, wired-max=2709681
faults=8436894, traps=8435463, intrs=69970442, ctxswitch=249513713
softint=96511309, syscalls=171036039, swapins=14, swapouts=23
fault counts:
noram=0, noanon=0, pgwait=0, pgrele=0
ok relocks(total)=4570(4571), anget(retrys)=107575(0), amapcopy=242698
neighbor anon/obj pg=148737/251970, gets(lock/unlock)=61640/4571
cases: anon=71314, anoncow=36270, obj=58530, prcopy=3110, przero=3520508
daemon and swap counts:
woke=521, revs=521, scans=4741657, obscans=4627347, anscans=0
busy=0, freed=4627347, reactivate=243, deactivate=7295399
pageouts=0, pending=0, nswget=0
nswapdev=1, swpgavail=12582911
swpages=12582911, swpginuse=0, swpgonly=0, paging=0
db{0}> reboot
syncing disks... panic: assert_sleepable: softint caller=0xffffffff804f8269
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff80571f55 cs 8 rflags 246 cr2 ffff8001806fa000 cpl 0 rsp ffff800086b58140
Stopped in pid 0.5 (system) at netbsd:breakpoint+0x5: leave
db{0}> bt
breakpoint() at netbsd:breakpoint+0x5
panic() at netbsd:panic+0x24d
assert_sleepable() at netbsd:assert_sleepable+0x80
_fstrans_start() at netbsd:_fstrans_start+0x29
ffs_sync() at netbsd:ffs_sync+0x6f
VFS_SYNC() at netbsd:VFS_SYNC+0x33
sys_sync() at netbsd:sys_sync+0xeb
vfs_shutdown() at netbsd:vfs_shutdown+0x50
cpu_reboot() at netbsd:cpu_reboot+0x100
db_reboot_cmd() at netbsd:db_reboot_cmd+0x47
db_command() at netbsd:db_command+0xb0
db_command_loop() at netbsd:db_command_loop+0xf8
db_trap() at netbsd:db_trap+0x114
kdb_trap() at netbsd:kdb_trap+0xf3
trap() at netbsd:trap+0x34a
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff80573e7b cs 8 rflags 10247 cr2 b8 cpl 8 rsp ffff800086b57b40
kernel: page fault trap, code=0
Faulted in DDB; continuing...
db{0}> sh reg
ds 0x8150
es 0
fs 0x8100
gs 0x7
rdi 0
rsi 0x3f8
rbp 0xffff800086b58140
rbx 0xffff800086b58150
rdx 0x8
rcx 0
rax 0x1
r8 0xffffffff80aa619a copyright+0xb381a
r9 0x1
r10 0xffff800086b58060
r11 0xffffffff802ea7f0 comcnputc
r12 0x104
r13 0xffffffff80a41eb0 copyright+0x4f530
r14 0x1
r15 0xffff80003ec49900
rip 0xffffffff80571f55 breakpoint+0x5
cs 0x8
rflags 0x246
rsp 0xffff800086b58140
ss 0
netbsd:breakpoint+0x5: leave
db{0}> reboot
rebooting...
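
As a cross-check on the first register dump above, the small C sketch below recomputes the operand address of the faulting instruction, `movl 0xffffffffffffffc0(%rsi),%edx`, from `rsi` and compares it with `cr2`. The helper names `effective_addr()` and `page_aligned()` are made up for this illustration, and the 4096-byte page size is taken from the `sh uvmexp` output. With `rsi` = 0xffff8001806fa040 the displacement of -0x40 gives 0xffff8001806fa000, which equals `cr2` and lies exactly on a page boundary. That would be consistent with a read that moved from one page into an unmapped neighbour, though the dump alone does not establish the cause.

```c
#include <stdint.h>

/* Assumed page size; matches pagesize=4096 in the uvmexp output. */
#define PAGE_SIZE_ASSUMED 4096u

/*
 * Effective address of an x86-64 "disp(%base)" operand: the base register
 * plus the sign-extended displacement, with wrap-around at 64 bits.
 */
static uint64_t
effective_addr(uint64_t base, int64_t disp)
{
	return base + (uint64_t)disp;
}

/* Nonzero when addr is the first byte of a page. */
static int
page_aligned(uint64_t addr)
{
	return (addr & (PAGE_SIZE_ASSUMED - 1)) == 0;
}
```

The same check applies to the original report, where `movl 0(%rbx),%ecx` faulted with `rbx` and `cr2` both equal to 0xffff80012da62000, again a page-aligned address.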
--
Greg A. Woods
Planix, Inc.
<woods@planix.com> +1 250 762-7675 http://www.planix.com/
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.