NetBSD Problem Report #41974

From spz@NetBSD.org  Wed Sep  2 05:56:21 2009
Return-Path: <spz@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 6BBAE63BB08
	for <gnats-bugs@gnats.NetBSD.org>; Wed,  2 Sep 2009 05:56:21 +0000 (UTC)
Message-Id: <20090902055618.EC3191E4E000@build.netbsd.org>
Date: Wed,  2 Sep 2009 05:56:18 +0000 (UTC)
From: spz@NetBSD.org
Reply-To: spz@NetBSD.org
To: gnats-bugs@gnats.NetBSD.org
Subject: panic in cpu_in_cksum / likely NFS issue
X-Send-Pr-Version: 3.95

>Number:         41974
>Category:       kern
>Synopsis:       panic in cpu_in_cksum
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Sep 02 06:00:00 +0000 2009
>Last-Modified:  Thu Feb 26 00:45:00 +0000 2015
>Originator:     S.P.Zeidler
>Release:        NetBSD 5.0_STABLE
>Organization:
	TNF
>Environment:
System: NetBSD b2.netbsd.org 5.0_STABLE NetBSD 5.0_STABLE (BUILD) #5: Sun Aug 9 20:35:32 UTC 2009 tls@ADMIN:/chroots/netbsd-5/src/sys/arch/amd64/compile/obj/BUILD amd64
Architecture: x86_64
Machine: amd64
>Description:
	This has been greeting me at least 4 times now:

uvm_fault(0xffffffff80b662c0, 0xffff80012da62000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff80465425 cs 8 rflags 10206 cr2  ffff80012da62000 cpl 4 rsp ffff80003232c2b8
kernel: page fault trap, code=0
Stopped in pid 0.47 (system) at netbsd:cpu_in_cksum+0xa5:       movl    0(%rbx),%ecx
db{0}> bt
cpu_in_cksum() at netbsd:cpu_in_cksum+0xa5
uvm_fault(0xffffffff80b87e60, 0x0, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff80466f54 cs 8 rflags 10246 cr2  53c cpl 8 rsp ffff80003232bc60
kernel: page fault trap, code=0
Faulted in DDB; continuing...
db{0}> sh reg
ds          0xd000
es          0x7008
fs          0xd000
gs          0x7a00
rdi         0xffff800009f4a400
rsi         0
rbp         0x534
rbx         0xffff80012da62000
rdx         0
rcx         0x800000
rax         0
r8          0x5010c889d
r9          0
r10         0
r11         0x5c8
r12         0x14
r13         0xffff800007cdd000
r14         0x2
r15         0xffff80003232c340
rip         0xffffffff80465425  cpu_in_cksum+0xa5
cs          0x8
rflags      0x10206
rsp         0xffff80003232c2b8
ss          0x10
netbsd:cpu_in_cksum+0xa5:       movl    0(%rbx),%ecx
db{0}> mach cpu
addr            dev     id      flags   ipis    curlwp          fpcurlwp
0xffffffff80ad0a00      cpu0    0       3009    0       0xffff80003231f7c0             0x0
0xffff800007bd1000      cpu1    1       f002    0       0xffff80003231f3e0      0xffff8000323767e0
db{0}> mach cpu 1
using CPU 1
db{0}> bt
x86_pause() at netbsd:x86_pause
_kernel_lock() at netbsd:_kernel_lock+0xc7
sleepq_block() at netbsd:sleepq_block+0x1b3
cv_timedwait() at netbsd:cv_timedwait+0xb0
nfs_rcvlock() at netbsd:nfs_rcvlock+0xa2
nfs_request() at netbsd:nfs_request+0x421
nfs_writerpc() at netbsd:nfs_writerpc+0x42e
nfs_doio() at netbsd:nfs_doio+0x4d4
nfssvc_iod() at netbsd:nfssvc_iod+0x193
db{0}> sh reg
ds          0
es          0
fs          0
gs          0
rdi         0
rsi         0xffff800007bd1000
rbp         0xffff80003232f740
rbx         0
rdx         0x6
rcx         0
rax         0
r8          0
r9          0x60
r10         0xffff800007bd1080
r11         0
r12         0xffff800007bd1000
r13         0x1
r14         0xffff80003231f3e0
r15         0
rip         0xffffffff80465758  x86_pause
cs          0x8
rflags      0x246
rsp         0xffff80003232f6f8
ss          0x10
netbsd:x86_pause:       repe nop

	dumping won't work, alas. Any other info to grab next time?

>How-To-Repeat:
	run builds and wait
>Fix:


>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Wed, 2 Sep 2009 16:28:01 +0200

 On Wed, Sep 02, 2009 at 06:00:00AM +0000, spz@NetBSD.org wrote:
 > Machine: amd64
 > >Description:
 > 	This has been greeting me at least 4 times now:
 > 
 > uvm_fault(0xffffffff80b662c0, 0xffff80012da62000, 1) -> e
 > fatal page fault in supervisor mode
 > trap type 6 code 0 rip ffffffff80465425 cs 8 rflags 10206 cr2  ffff80012da62000 cpl 4 rsp ffff80003232c2b8
 > kernel: page fault trap, code=0
 > Stopped in pid 0.47 (system) at netbsd:cpu_in_cksum+0xa5:       movl    0(%rbx),%ecx
 > db{0}> bt
 > cpu_in_cksum() at netbsd:cpu_in_cksum+0xa5
 > uvm_fault(0xffffffff80b87e60, 0x0, 1) -> e

 Looks a lot like what I'm seeing on a UP alpha NFS client. This seems to
 be related to the server going away and coming back (I suspect it's the
 same for this client).

 See http://mail-index.netbsd.org/current-users/2008/12/23/msg006816.html
 and followups


 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: "S.P.Zeidler" <spz@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/41974
Date: Sun, 30 Sep 2012 07:42:23 +0000

 Just for the record: The issue has survived into netbsd-6.

From: "Greg A. Woods" <woods@planix.ca>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>,
    NetBSD GNATS Administrator <gnats-admin@NetBSD.org>
Cc: 
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Mon, 23 Feb 2015 18:53:18 -0800

 Some more info, possibly useful....

 I recently, and finally, switched one of my servers from i386 to amd64
 and suddenly I get these same cpu_in_cksum uvm_fault panics almost any
 time I try to write (i.e. copy a large file) to an NFS mount point.  Not
 with every write, but it doesn't seem to take very many tries to
 reproduce.

 I never ever saw this problem before with the i386 kernel.

 Both the before (i386) and after (amd64) systems were built from the
 same source tree, which is on the very tip of the netbsd-5 branch.

 These are running bare-metal on a Dell PE2950 (2x8-core, 32GB RAM).

 It doesn't make any difference whether hardware assisted check-summing
 capabilities are enabled in the ethernet interface or not.  Initial
 panics were observed with caps_enabled=0, but panics have continued with
 the following config:

 $ /sbin/ifconfig bnx1
 bnx1: flags=8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
         capabilities=3f00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx>
         caps_enabled=3f00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx>
         address: 00:1d:09:35:3c:09
         media: Ethernet autoselect (1000baseT full-duplex)
         status: active
         inet 10.0.1.129 netmask 0xffffff00 broadcast 10.0.1.255


 There's no trouble reading from remote NFS servers -- only writing to
 them as an NFS client, and perhaps only with larger files/writes.  I've
 done several full builds, and a bunch of pkgsrc builds, with sources on
 the same NFS server which fails when written to, and I've never had any
 problem with the read-only access to src and pkgsrc.  Manual tests with
 'dd' reading large files with large reads work A-OK as well
 (i.e. reading with the amd64 kernel as a client, or reading from the
 other machine with the amd64 kernel as a server).

 I.e.:  note that the amd64 kernel happily serves NFS without
 encountering this error.

 Assuming the new PE2950 that arrived today is in working order then soon
 I should be able to test if this happens in a Xen domU, and with
 NetBSD-current.


 One other possibly interesting point:  The server in this case has been
 an older PE2650 running NetBSD 4.0_STABLE, and it has a weird "tick" in
 its RAID controller and/or driver (see PR# kern/35769), which means it
 doesn't always respond to NFS requests in a timely manner.  I.e. perhaps
 this bug is more easily tickled when the NFS server is slow, and/or the
 network connection is poor, or similar.
 Perhaps I will try using an NFS mount of my iMac; and soon I should also
 be able to cross-mount the PE2950s for testing as well (especially if
 the bug is reproducible in a Xen kernel).

 -- 
 						Greg A. Woods
 						Planix, Inc.

 <woods@planix.com>       +1 250 762-7675        http://www.planix.com/

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Tue, 24 Feb 2015 12:28:07 +0100

 On Tue, Feb 24, 2015 at 04:15:00AM +0000, Greg A. Woods wrote:
 >  I recently, and finally, switched one of my servers from i386 to amd64
 >  and suddenly I get these same cpu_in_cksum uvm_fault panics almost any
 >  time I try to write (i.e. copy a large file) to an NFS mount point.  Not
 >  with every write, but it doesn't seem to take very many tries to
 >  reproduce.

 Can you be more specific about what uvm_fault panics you see?

 Joerg

From: "Greg A. Woods" <woods@planix.ca>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: Joerg Sonnenberger <joerg@britannica.bec.de>,
    NetBSD GNATS Administrator <gnats-admin@NetBSD.org>
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Tue, 24 Feb 2015 10:11:28 -0800

 Joerg asked:
 > 
 > Can you be more specific about what uvm_fault panics you see?

 Sorry, but they're just very much like the ones reported in the PR.

 They are apparently from a fault in cpu_in_cksum(), and any attempt to
 print a full stack trace causes another fault to happen in ddb, and the
 system needs a hard reset if you try to dump or sync or reboot with sync
 from ddb.  "reboot 0x4" works of course, albeit with the same loss to
 the mounted filesystems as a hard reset.

 Here's a copy of a couple of different crashes from my console log.

 uvm_fault(0xffffffff80d5bd80, 0xffff8001806fa000, 1) -> e
 fatal page fault in supervisor mode
 trap type 6 code 0 rip ffffffff80571b65 cs 8 rflags 10202 cr2  ffff8001806fa000 cpl 4 rsp ffff80008723b328
 kernel: page fault trap, code=0
 Stopped in pid 0.95 (system) at netbsd:cpu_in_cksum+0xa5:       movl    0(%rbx),%ecx
 db{4}> bt
 cpu_in_cksum() at netbsd:cpu_in_cksum+0xa5
 uvm_fault(0xffffffff80d8ca00, 0x8000, 1) -> e
 fatal page fault in supervisor mode
 trap type 6 code 0 rip ffffffff805736d4 cs 8 rflags 10246 cr2  8008 cpl 8 rsp ffff80008723ace0
 kernel: page fault trap, code=0
 Faulted in DDB; continuing...
 db{4}> 

 uvm_fault(0xffffffff80d5bd80, 0xffff8001806fa000, 1) -> e
 fatal page fault in supervisor mode
 trap type 6 code 0 rip ffffffff80571b65 cs 8 rflags 10202 cr2  ffff8001806fa000 cpl 4 rsp ffff8000b5053328
 kernel: page fault trap, code=0
 Stopped in pid 0.96 (system) at netbsd:cpu_in_cksum+0xa5:       movl    0(%rbx),%ecx
 db{7}> trace
 cpu_in_cksum() at netbsd:cpu_in_cksum+0xa5
 uvm_fault(0xffffffff80d8ca00, 0x8000, 1) -> e
 fatal page fault in supervisor mode
 trap type 6 code 0 rip ffffffff805736d4 cs 8 rflags 10246 cr2  8008 cpl 8 rsp ffff8000b5052ce0
 kernel: page fault trap, code=0
 Faulted in DDB; continuing...
 db{7}> 

 -- 
 						Greg A. Woods
 						Planix, Inc.

 <woods@planix.com>       +1 250 762-7675        http://www.planix.com/

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>, "Greg A. Woods" <woods@planix.ca>
Cc: 
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Tue, 24 Feb 2015 20:38:10 +0100

 On Tue, Feb 24, 2015 at 10:11:28AM -0800, Greg A. Woods wrote:
 > Joerg asked:
 > > 
 > > Can you be more specific about what uvm_fault panics you see?
 > 
 > Sorry, but they're just very much like the ones reported in the PR.

 Can you disable the CPU_IN_CKSUM option and try to trigger it again?
 The C version would be easier to instrument to see why the mbuf chain is
 bad...

 Joerg
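
 For anyone instrumenting the C path as suggested above, here is a
 minimal user-space sketch of the idea. It is not the kernel's
 cpu_in_cksum(): it is a plain 16-bit ones-complement Internet checksum
 over a chain of buffers that refuses to read past what the chain
 actually holds. `struct seg` and `chain_cksum()` are hypothetical names
 standing in for an mbuf chain and its checksum routine; the point is
 only the "requested length vs. bytes really present" check, which is
 the kind of inconsistency a bad chain would show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf: one contiguous piece of a packet. */
struct seg {
	const uint8_t *data;
	size_t len;
	const struct seg *next;
};

/*
 * Checksum `len` bytes starting `off` bytes into the chain, treating the
 * data as big-endian 16-bit words. Returns 0 and stores the checksum in
 * *out, or returns -1 if the chain holds fewer than off + len bytes.
 */
static int
chain_cksum(const struct seg *s, size_t off, size_t len, uint16_t *out)
{
	uint64_t sum = 0;
	int odd = 0;	/* nonzero when the next byte is the low half of a word */

	assert(out != NULL);

	/* Skip whole segments that lie entirely before the start offset. */
	while (s != NULL && off >= s->len) {
		off -= s->len;
		s = s->next;
	}

	for (; s != NULL && len > 0; s = s->next, off = 0) {
		size_t n = s->len - off;

		if (n > len)
			n = len;
		for (size_t i = 0; i < n; i++) {
			uint8_t b = s->data[off + i];

			sum += odd ? (uint64_t)b : (uint64_t)b << 8;
			odd = !odd;
		}
		len -= n;
	}

	/* The caller asked for more bytes than the chain contains. */
	if (len != 0)
		return -1;

	/* Fold the carries back in, then take the ones complement. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	*out = (uint16_t)~sum;
	return 0;
}
```

 A byte-at-a-time loop like this is slow but handles segments that end
 on an odd byte without any special casing, which keeps the
 instrumentation easy to reason about.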

From: "Greg A. Woods" <woods@planix.ca>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: Joerg Sonnenberger <joerg@britannica.bec.de>
Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
Date: Wed, 25 Feb 2015 16:41:01 -0800

 At Tue, 24 Feb 2015 20:38:10 +0100, Joerg Sonnenberger <joerg@britannica.bec.de> wrote:
 Subject: Re: kern/41974: panic in cpu_in_cksum / likely NFS issue
 > 
 > Can you disable the CPU_IN_CKSUM option and try to trigger it again?
 > The C version would be easier to instrument to see why the mbuf chain is
 > bad...

 Well, it seemed quite a lot more difficult to trigger the crash with the
 C version, but eventually, just when I didn't think it would happen....

 The back-trace looks better, and the machine didn't hang quite so badly,
 but syncing still caused a second panic, and so there was no crash dump
 either.

 fatal page fault in supervisor mode
 trap type 6 code 0 rip ffffffff8054021e cs 8 rflags 10206 cr2  ffff8001806fa000 cpl 4 rsp ffff800086b58870
 kernel: page fault trap, code=0
 Stopped in pid 0.5 (system) at  netbsd:cpu_in_cksum+0xce:       movl    0xffffffffffffffc0(%rsi),%edx
 db{0}> bt
 cpu_in_cksum() at netbsd:cpu_in_cksum+0xce
 in_delayed_cksum() at netbsd:in_delayed_cksum+0x3c
 ip_output() at netbsd:ip_output+0xd68
 udp_output() at netbsd:udp_output+0x221
 udp_usrreq() at netbsd:udp_usrreq+0x31c
 udp_usrreq_wrapper() at netbsd:udp_usrreq_wrapper+0x51
 nfs_timer() at netbsd:nfs_timer+0x41f
 callout_softclock() at netbsd:callout_softclock+0x21b
 softint_dispatch() at netbsd:softint_dispatch+0xc3
 DDB lost frame for netbsd:Xsoftintr+0x50, trying 0xffff800086b58d70
 Xsoftintr() at netbsd:Xsoftintr+0x50
 --- interrupt ---
 0:
 db{0}> sh reg
 ds          0
 es          0
 fs          0
 gs          0
 rdi         0
 rsi         0xffff8001806fa040
 rbp         0xffff800086b58890
 rbx         0
 rdx         0
 rcx         0
 rax         0x8000
 r8          0x7fc0
 r9          0xc000000000000000
 r10         0x43ab4701
 r11         0x8000
 r12         0xffff80003ec23400
 r13         0x8000
 r14         0
 r15         0
 rip         0xffffffff8054021e  cpu_in_cksum+0xce
 cs          0x8
 rflags      0x10206
 rsp         0xffff800086b58870
 ss          0
 netbsd:cpu_in_cksum+0xce:       movl    0xffffffffffffffc0(%rsi),%edx
 db{0}> sh uvmexp
 Current UVM status:
   pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
   8129044 VM pages: 692704 active, 111 inactive, 2027 wired, 7047912 free
   pages  127267 anon, 562283 file, 5292 exec
   freemin=2048, free-target=2730, wired-max=2709681
   faults=8436894, traps=8435463, intrs=69970442, ctxswitch=249513713
   softint=96511309, syscalls=171036039, swapins=14, swapouts=23
   fault counts:
     noram=0, noanon=0, pgwait=0, pgrele=0
     ok relocks(total)=4570(4571), anget(retrys)=107575(0), amapcopy=242698
     neighbor anon/obj pg=148737/251970, gets(lock/unlock)=61640/4571
     cases: anon=71314, anoncow=36270, obj=58530, prcopy=3110, przero=3520508
   daemon and swap counts:
     woke=521, revs=521, scans=4741657, obscans=4627347, anscans=0
     busy=0, freed=4627347, reactivate=243, deactivate=7295399
     pageouts=0, pending=0, nswget=0
     nswapdev=1, swpgavail=12582911
     swpages=12582911, swpginuse=0, swpgonly=0, paging=0
 db{0}> reboot
 syncing disks... panic: assert_sleepable: softint caller=0xffffffff804f8269
 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip ffffffff80571f55 cs 8 rflags 246 cr2  ffff8001806fa000 cpl 0 rsp ffff800086b58140
 Stopped in pid 0.5 (system) at  netbsd:breakpoint+0x5:  leave
 db{0}> bt
 breakpoint() at netbsd:breakpoint+0x5
 panic() at netbsd:panic+0x24d
 assert_sleepable() at netbsd:assert_sleepable+0x80
 _fstrans_start() at netbsd:_fstrans_start+0x29
 ffs_sync() at netbsd:ffs_sync+0x6f
 VFS_SYNC() at netbsd:VFS_SYNC+0x33
 sys_sync() at netbsd:sys_sync+0xeb
 vfs_shutdown() at netbsd:vfs_shutdown+0x50
 cpu_reboot() at netbsd:cpu_reboot+0x100
 db_reboot_cmd() at netbsd:db_reboot_cmd+0x47
 db_command() at netbsd:db_command+0xb0
 db_command_loop() at netbsd:db_command_loop+0xf8
 db_trap() at netbsd:db_trap+0x114
 kdb_trap() at netbsd:kdb_trap+0xf3
 trap() at netbsd:trap+0x34a
 fatal page fault in supervisor mode
 trap type 6 code 0 rip ffffffff80573e7b cs 8 rflags 10247 cr2  b8 cpl 8 rsp ffff800086b57b40
 kernel: page fault trap, code=0
 Faulted in DDB; continuing...
 db{0}> sh reg
 ds          0x8150
 es          0
 fs          0x8100
 gs          0x7
 rdi         0
 rsi         0x3f8
 rbp         0xffff800086b58140
 rbx         0xffff800086b58150
 rdx         0x8
 rcx         0
 rax         0x1
 r8          0xffffffff80aa619a  copyright+0xb381a
 r9          0x1
 r10         0xffff800086b58060
 r11         0xffffffff802ea7f0  comcnputc
 r12         0x104
 r13         0xffffffff80a41eb0  copyright+0x4f530
 r14         0x1
 r15         0xffff80003ec49900
 rip         0xffffffff80571f55  breakpoint+0x5
 cs          0x8
 rflags      0x246
 rsp         0xffff800086b58140
 ss          0
 netbsd:breakpoint+0x5:  leave
 db{0}> reboot
 rebooting...
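
 A small cross-check of the numbers in the first trap above, as a
 sketch only. The faulting instruction is
 movl 0xffffffffffffffc0(%rsi),%edx with rsi = 0xffff8001806fa040; the
 displacement is -0x40 in two's complement, so the load targets
 0xffff8001806fa000, which is the cr2 value and the first byte of a
 4 KiB page (pagemask=0xfff per the uvmexp output). That would be
 consistent with the checksum loop running off the end of a mapped
 buffer into the next page, though the trace alone does not prove it.
 The helper names below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_MASK_4K 0xfffULL	/* from "pagemask=0xfff" in the uvmexp output */

/*
 * Effective address of disp(%reg). The displacement is passed as ddb
 * prints it (sign-extended to 64 bits); unsigned wraparound makes the
 * addition equivalent to adding the signed value.
 */
static uint64_t
effective_addr(uint64_t base, uint64_t disp_as_printed)
{
	return base + disp_as_printed;
}

/* True when addr is the first byte of a 4 KiB page. */
static int
is_page_aligned(uint64_t addr)
{
	return (addr & PAGE_MASK_4K) == 0;
}
```

 The same page-alignment test also holds for the fault address in the
 original report (cr2 = rbx = 0xffff80012da62000).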

 -- 
 						Greg A. Woods
 						Planix, Inc.

 <woods@planix.com>       +1 250 762-7675        http://www.planix.com/

Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.