NetBSD Problem Report #35448
From agrier@poofygoof.com Sat Jan 20 03:54:28 2007
Return-Path: <agrier@poofygoof.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id D70A263B8CE
for <gnats-bugs@gnats.NetBSD.org>; Sat, 20 Jan 2007 03:54:28 +0000 (UTC)
Message-Id: <20070120035419.E991233143@arwen.poofy.goof.com>
Date: Fri, 19 Jan 2007 19:54:18 -0800 (PST)
From: agrier@poofygoof.com
Reply-To: agrier@poofygoof.com
To: gnats-bugs@NetBSD.org
Subject: memory management fault trap during heavy network I/O
X-Send-Pr-Version: 3.95
>Number: 35448
>Category: port-alpha
>Synopsis: memory management fault trap during heavy network I/O
>Confidential: no
>Severity: critical
>Priority: medium
>Responsible: mhitch
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Jan 20 03:55:00 +0000 2007
>Closed-Date: Wed Sep 09 06:31:38 +0000 2015
>Last-Modified: Wed Sep 09 06:31:38 +0000 2015
>Originator: agrier@poofygoof.com
>Release: NetBSD 4.99.8
>Organization:
Aaron J. Grier | "Not your ordinary poofy goof." | agrier@poofygoof.com
>Environment:
System: NetBSD arwen.poofy.goof.com 4.99.8 NetBSD 4.99.8 (ARWEN) #0: Thu Jan 18 23:03:09 PST 2007 agrier@arwen.poofy.goof.com:/var/obj/ARWEN alpha
Architecture: alpha
Machine: alpha
ARWEN is an alphaserver 1000A 5/400.
the ARWEN kernel is GENERIC with hardcoded line to attach root at ld0.
>Description:
- the trap:
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0xfffffe0108266000
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc00007ecde0
CPU 0 ra = 0xfffffc000035f9ac
CPU 0 pv = 0x0
CPU 0 curlwp = 0xfffffc000fcd2660
CPU 0 pid = 335, comm = nfsio
panic: trap
Begin traceback...
alpha trace requires known PC =eject=
End traceback...
syncing disks... 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5 5 5 5 giving up
- the backtrace:
(gdb) bt
#0 0xfffffc00007df888 in dumpsys ()
at /projects/NetBSD/src/sys/arch/alpha/alpha/machdep.c:1229
#1 0xfffffc00007dfdb0 in cpu_reboot ()
at /projects/NetBSD/src/sys/arch/alpha/alpha/machdep.c:1048
#2 0xfffffc0000644a50 in panic ()
at /projects/NetBSD/src/sys/kern/subr_prf.c:246
#3 0xfffffc00007e7248 in trap ()
at /projects/NetBSD/src/sys/arch/alpha/alpha/trap.c:601
#4 0xfffffc00003003e8 in XentMM ()
at /projects/NetBSD/src/sys/arch/alpha/alpha/locore.s:492
#5 0xfffffc000035f9ac in in_delayed_cksum ()
at /projects/NetBSD/src/sys/netinet/ip_output.c:1123
can not access 0xfffffffd, invalid translation (invalid L1 PTE)
can not access 0xfffffffd, invalid translation (invalid L1 PTE)
Cannot access memory at address 0xfffffffffffffffd
- some poking:
(gdb) frame 5
#5 0xfffffc000035f9ac in in_delayed_cksum ()
at /projects/NetBSD/src/sys/netinet/ip_output.c:1123
1123 csum = in4_cksum(m, 0, offset, ntohs(ip->ip_len) - offset);
(gdb) proc 0xfffffc000fcd2660 # curlwp from the trap
(gdb) bt
#0 0xfffffc000062a730 in mi_switch ()
at /projects/NetBSD/src/sys/kern/kern_synch.c:997
(gdb) list *0xfffffc00007ecde0 # pc from the trap
0xfffffc00007ecde0 is in in4_cksum
(/projects/NetBSD/src/sys/netinet/in4_cksum.c:175).
- dmesg
NetBSD 4.99.8 (ARWEN) #0: Thu Jan 18 23:03:09 PST 2007
agrier@arwen.poofy.goof.com:/var/obj/ARWEN
AlphaServer 1000A 5/400, 400MHz, s/n
8192 byte page size, 1 processor.
total memory = 256 MB
(2016 KB reserved for PROM, 254 MB used by NetBSD)
avail memory = 241 MB
mainbus0 (root)
cpu0 at mainbus0: ID 0 (primary), 21164A-2
cpu0: Architecture extensions: 1<BWX>
cia0 at mainbus0: DECchip 2117x Core Logic Chipset (ALCOR/ALCOR2), pass 3
cia0: extended capabilities: 21<DWEN,BWEN>
cia0: using BWX for PCI config access
pci0 at cia0 bus 0
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pceb0 at pci0 dev 7 function 0: Intel 82375EB/SB PCI-EISA Bridge (rev. 0x05)
ppb0 at pci0 dev 8 function 0: Digital Equipment DC21050 PCI-PCI Bridge (rev. 0x02)
pci1 at ppb0 bus 2
pci1: i/o space, memory space enabled, rd/line, wr/inv ok
isp0 at pci1 dev 0 function 0: QLogic 1020 Fast Wide SCSI HBA
isp0: interrupting at dec_1000a irq 0
scsibus0 at isp0: 16 targets, 8 luns per target
tlp0 at pci0 dev 11 function 0: DECchip 21140 Ethernet, pass 1.2
tlp0: interrupting at dec_1000a irq 1
tlp0: DEC DE500-XA, Ethernet address 00:00:f8:02:06:a5
tlp0: 10baseT, 100baseTX, 100baseTX-FDX, 10baseT-FDX
mlx0 at pci0 dev 12 function 0: Mylex RAID (v2 interface)
mlx0: interrupting at dec_1000a irq 3
mlx0: DAC960P/PD, 3 channels, firmware 2.70-0-00, 32MB RAM
ld0 at mlx0 unit 0: RAID5, online
ld0: 16380 MB, 8320 cyl, 64 head, 63 sec, 512 bytes/sect x 33546240 sectors
ld1 at mlx0 unit 1: RAID5, online
ld1: 32768 MB, 8322 cyl, 128 head, 63 sec, 512 bytes/sect x 67108864 sectors
ld2 at mlx0 unit 2: RAID5, online
ld2: 32768 MB, 8322 cyl, 128 head, 63 sec, 512 bytes/sect x 67108864 sectors
ld3 at mlx0 unit 3: RAID5, online
ld3: 4536 MB, 2304 cyl, 64 head, 63 sec, 512 bytes/sect x 9289728 sectors
eisa0 at pceb0
eisa0: can't map I/O space for slot 9
isa0 at pceb0
lpt0 at isa0 port 0x3bc-0x3bf irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
attimer0 at isa0 port 0x40-0x43: AT Timer
vga0 at isa0 port 0x3b0-0x3df iomem 0xa0000-0xbffff
wsdisplay0 at vga0 kbdmux 1
wsmux1: connecting to wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker (CPU-intensive output)
spkr0 at pcppi0
isabeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
mcclock0 at isa0 port 0x70-0x71: mc146818 or compatible
pcppi0: attached to attimer0
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 0 lun 0: <DEC, RZ28M (C) DEC, 0568> disk fixed
sd0: async, 8-bit transfers
sd0: 2007 MB, 3045 cyl, 16 head, 84 sec, 512 bytes/sect x 4110480 sectors
sd0: sync (100.00ns offset 12), 8-bit (10.000MB/s) transfers, tagged queueing
cd0 at scsibus0 target 4 lun 0: <DEC, RRD45 (C) DEC, 1645> cdrom removable
cd0: async, 8-bit transfers
WARNING: can't figure what device matches "RAID 0 12 0 0 0 0 0"
root on ld0a dumps on sd0b
- other misc foo
ps won't grok the coredump:
arwen$ ps -N netbsd.gdb -M /var/crash/netbsd.0.core
ps: can't read proc credentials at 0xfffffc000ade3480: Undefined error: 0
>How-To-Repeat:
it seems to be triggered by syncing a remotely mounted mailbox from
within pine or mutt.
>Fix:
figure out what is causing the trap? maybe a stack smash, based on
previous port-alpha mailing list entries. perhaps
options KSTACK_CHECK_MAGIC
is in order?
>Release-Note:
>Audit-Trail:
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: port-alpha/35448: memory management fault trap during heavy
network I/O
Date: Mon, 22 Jan 2007 14:36:25 -0700 (MST)
On Sat, 20 Jan 2007, agrier@poofygoof.com wrote:
> - the trap:
>
> CPU 0: fatal kernel trap:
>
> CPU 0 trap entry = 0x2 (memory management fault)
> CPU 0 a0 = 0xfffffe0108266000
> CPU 0 a1 = 0x1
> CPU 0 a2 = 0x0
> CPU 0 pc = 0xfffffc00007ecde0
> CPU 0 ra = 0xfffffc000035f9ac
> CPU 0 pv = 0x0
> CPU 0 curlwp = 0xfffffc000fcd2660
> CPU 0 pid = 335, comm = nfsio
...
> (gdb) list *0xfffffc00007ecde0 # pc from the trap
> 0xfffffc00007ecde0 is in in4_cksum
> (/projects/NetBSD/src/sys/netinet/in4_cksum.c:175).
A preliminary analysis seems to indicate the trap occurred where
in4_cksum is summing 16 words in an unrolled loop. If I understand the
trap registers correctly, it looks like the address causing the trap is
0xfffffe0108266000 (the contents of a0 above). Running pmap(1) against
the coredump and kernel file seems to indicate that the address is not
within the current kernel's mapped address space. The gdb backtrace
fails, so it's a little hard to figure out where it came from. I'm going
to start groveling through the stack myself to see if I can dig out the
parameters to the in4_cksum() call, and if I can follow the traceback
manually.
It might be helpful if a backtrace from ddb could be obtained (although
in my recent experience a 4.0_BETA [not 4.0_BETA2 yet] kernel was unable
to produce a good backtrace on my own machine).
> - other misc foo
>
> ps won't grok the coredump:
>
> arwen$ ps -N netbsd.gdb -M /var/crash/netbsd.0.core
> ps: can't read proc credentials at 0xfffffc000ade3480: Undefined error: 0
There's an xps gdb script in src/sys/gdbscripts that is able to display
some process information (and could be extended to show more process
details).
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: port-alpha-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, agrier@poofygoof.com
Subject: Re: port-alpha/35448: memory management fault trap during heavy
network I/O
Date: Mon, 29 Jan 2007 11:09:01 -0700 (MST)
On Mon, 22 Jan 2007, Michael L. Hitch wrote:
> fails, so it's a little hard to figure out where it came from. I'm going
> to start groveling through the stack myself to see if I can dig out the
> parameters to the in4_cksum() call, and if I can follow the traceback
> manually.
OK, I've dug out more information from the raw stack dump. I located
the address of the mbuf and found that it has the same bad address in
mh_data:
(gdb) print (struct mbuf)*0xfffffc000ef7be18
can not access 0x8266000, invalid translation (invalid L2 PTE)
can not access 0x8266000, invalid translation (invalid L2 PTE)
can not access 0x8266000, invalid translation (invalid L2 PTE)
can not access 0x8266000, invalid translation (invalid L2 PTE)
can not access 0x8266000, invalid translation (invalid L2 PTE)
can not access 0x8266000, invalid translation (invalid L2 PTE)
$2 = {m_hdr = {mh_next = 0x0, mh_nextpkt = 0x0,
mh_data = 0xfffffe0108266000 <Address 0xfffffe0108266000 out of
bounds>,
mh_owner = 0x4e4f5a414d412d58, mh_len = 4096, mh_flags = 67108865,
mh_paddr = 251117080, mh_type = 1}, M_dat = {MH = {MH_pkthdr = {
rcvif = 0xfffffe000005a080, tags = {slh_first = 0x0}, len = 188,
csum_flags = 0, csum_data = 0, segsz = 0}, MH_dat = {MH_ext = {
can not access 0x8266000, invalid translation (invalid L2 PTE)
can not access 0x8266000, invalid translation (invalid L2 PTE)
can not access 0x8266000, invalid translation (invalid L2 PTE)
can not access 0x8266000, invalid translation (invalid L2 PTE)
can not access 0x8266000, invalid translation (invalid L2 PTE)
can not access 0x8266000, invalid translation (invalid L2 PTE)
ext_buf = 0xfffffe0108266000 <Address 0xfffffe0108266000 out of
bounds>, ext_fr$
ext_arg = 0xfffffe000c617cb8, ext_size = 4096,
ext_type = 0xfffffc0000a62558, ext_nextref = 0xfffffc000ef7b118,
ext_prevref = 0xfffffc000ef7a218, ext_un = {
extun_paddr = 14733978372531027968, extun_pgs = {
On a whim, I took a look at the data located at 0xfffffe0008266000 and
found what looks like data that might be expected, and Aaron confirmed
that the data was part of a mailbox file that was being synched. So it
looked like something had corrupted the address used by the mbuf. I
followed the stack back to nfs_writerpc, which can use the address of data
being sent as the external data address for the mbuf. I dug out the
address of the uio and iovec structures used at that point and found:
(gdb) print (struct uio)*0xfffffe000c617e70
$8 = {uio_iov = 0xfffffe000c617e60, uio_iovcnt = 1, uio_offset = 102400,
uio_resid = 18446744069414588416, uio_rw = UIO_WRITE,
uio_vmspace = 0xfffffc0000abc018}
(gdb) print (struct iovec)*0xfffffe000c617e60
$9 = {iov_base = 0xfffffe0108267000, iov_len = 18446744069414588416}
(gdb) x/2gx 0xfffffe000c617e60
0xfffffe000c617e60: 0xfffffe0108267000 0xffffffff00001000
The buffer address in iov_base is corrupt as well. In addition, the
iov_len field appears corrupted.
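The two corrupted values can be checked by hand. The sketch below is only an
illustration built from the numbers printed by gdb above; the helper names
and the TWO_TO_32 constant are made up for this note and are not kernel code.
It shows that the bad mbuf address sits exactly 2^32 above the address where
the mailbox data was actually found, and that the decimal iov_len printed by
gdb is the same bit pattern as 0xffffffff00001000, i.e. 0x1000 with 2^32
subtracted (modulo 2^64).

```c
#include <stdint.h>

/* Illustrative constant, not taken from the kernel sources. */
#define TWO_TO_32 (UINT64_C(1) << 32)

/*
 * mh_data from the mbuf minus the address where the expected mailbox
 * data was found in the dump.
 */
static uint64_t
dump_addr_delta(void)
{
	return UINT64_C(0xfffffe0108266000) - UINT64_C(0xfffffe0008266000);
}

/*
 * iov_len as printed in decimal by gdb, with 2^32 added back.
 * Unsigned arithmetic wraps modulo 2^64, so this yields a small length.
 */
static uint64_t
dump_len_plus_two_to_32(void)
{
	return UINT64_C(18446744069414588416) + TWO_TO_32;
}
```

dump_addr_delta() comes out as 2^32 and dump_len_plus_two_to_32() as 0x1000,
so the address is too large by 2^32 and the length too small by 2^32.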
Following the stack back further, I get to nfs_doio and get the address
of the struct buf that was used to generate the uio/iovec data:
(gdb) print (struct buf)*0xfffffc00052b8dc0
$3 = {b_u = {u_actq = {tqe_next = 0xdeadbeef, tqe_prev =
0xfffffc00052b88b8},
u_work = {wk_entry = {sqe_next = 0xdeadbeef}}}, b_interlock = {
lock_data = 86745072}, b_flags = 85, b_error = 0, b_prio = 0,
b_bufsize = 8192, b_bcount = 8192, b_resid = 8192, b_dev = 4294967295,
b_un = {
b_addr = 0xfffffe0008266000 "ntent-Transfer-Encoding:Message-ID;\n
b=T2nY8PninSOLy9W$
b_iodone = 0xfffffc00005bd600 <uvm_aio_biodone>,
b_proc = 0xfffffc0000abc4a0, b_vp = 0xfffffc000bea53c0, b_dep = {
lh_first = 0x0}, b_saveaddr = 0x0, b_fspriv = {
bf_private = 0xfffffc00052b95a8, bf_dcookie = -4397959768664}, b_hash
= {
le_next = 0x16, le_prev = 0x0}, b_vnbufs = {le_next = 0x87654321,
le_prev = 0x4}, b_freelist = {tqe_next = 0x0,
tqe_prev = 0xfffffe0000263700}, b_lblkno = 0, b_freelistindex = 0}
Lo and behold, it has the correct address of the data! So somewhere
between nfs_doio() and nfs_writerpc(), the iov_base and iov_len values
get clobbered (in an apparently fairly consistent way).
Since the bad address was easy to check for, I inserted a number of
KASSERT() statements in nfs_doio(), nfs_doio_write, and nfs_writerpc().
I was able to induce this failure on my own alpha at this point. I found
that the address was good at the entry of nfs_writerpc(), but had been
corrupted at the start of the loop sending out the data. This seemed odd,
since there didn't appear to be anything that would cause the type of
corruption I was seeing. While trying to figure out where some of the
local variables in nfs_writerpc() were located on the stack, I noticed
there was a 'retry:' label before the output loop. Finding where that
label was used shed some light on things. Certain conditions (which I'm
not too clear on, since I don't understand NFS all that well) would cause
a resend of the entire data buffer, and if that clobbered the data address
and length, would result in what I was seeing. Indeed, that was the case;
a few more KASSERT() statements showed that the UIO_ADVANCE() at line 1547
of nfs_vnops.c was clobbering the iovec data.
Closer examination of what UIO_ADVANCE() was doing, and of the generated
code, showed what the problem was.
The alpha has 64 bit pointers, and the iov_len value is also 64 bits.
The variable 'backup' used to adjust the iovec data is an unsigned 32 bit
value. The changes for version 1.225 appear to have introduced a problem
that only showed up on the alpha. Prior to that, the unsigned value of
'backup' was being subtracted from iov_base, and added to iov_len. In
version 1.225, that was changed to use the macro UIO_ADVANCE(), passing
a negated value of 'backup' to the macro. The compiler thus negated the
32 bit unsigned value of 'backup' and zero-extended the result to 64 bits,
which was added to iov_base and subtracted from iov_len, resulting in the
clobbered values.
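A minimal user-space sketch of that arithmetic follows. It is not the kernel
code: struct demo_iov, DEMO_ADVANCE() and demo_backup_unsigned() are
hypothetical stand-ins, iov_base is held as a 64-bit integer so the
arithmetic is well defined outside the kernel, and uio_resid is left out.
It assumes uint32_t is unsigned int, so that -backup is evaluated in 32 bits.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct iovec on a 64-bit platform. */
struct demo_iov {
	uint64_t iov_base;	/* pointer value kept as an integer */
	uint64_t iov_len;
};

/* Same shape as the UIO_ADVANCE() macro, minus uio_resid. */
#define DEMO_ADVANCE(iov, siz) \
	(void)((iov)->iov_base += (siz), (iov)->iov_len -= (siz))

/*
 * Back up by 'backup' bytes the way the failing code did: negate an
 * unsigned 32-bit value and hand it to the advance macro.
 */
static struct demo_iov
demo_backup_unsigned(struct demo_iov iov, uint32_t backup)
{
	/*
	 * -backup is 2^32 - backup, computed in 32 bits. It is then
	 * zero-extended to 64 bits, so the base moves forward by almost
	 * 2^32 and the length wraps below zero.
	 */
	DEMO_ADVANCE(&iov, -backup);
	return iov;
}
```

With illustrative inputs chosen to be consistent with the dump (an assumption,
not values recovered from it), iov_base = 0xfffffe0008268000, iov_len = 0 and
backup = 0x1000, the function returns iov_base = 0xfffffe0108267000 and
iov_len = 0xffffffff00001000, the same values gdb printed for the iovec.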
Changing the UIO_ADVANCE() to a UIO_RETREAT() which passed 'backup'
directly and subtracted that from iov_base, and added it to iov_len gave
me a kernel which did not crash when nfs_writerpc() resent the data. I've
also just verified that simply making 'backup' a signed 32 bit also works
using the UIO_ADVANCE() macro.
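Both approaches can be sketched in user space as follows. This is an
illustration under assumptions, not the committed change: struct fix_iov,
fix_retreat() and fix_advance_signed() are hypothetical names, iov_base is
held as a 64-bit integer, and uio_resid is omitted.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct iovec on a 64-bit platform. */
struct fix_iov {
	uint64_t iov_base;	/* pointer value kept as an integer */
	uint64_t iov_len;
};

/*
 * First approach: an explicit retreat. The count is never negated, so
 * its width and signedness no longer matter.
 */
static void
fix_retreat(struct fix_iov *iov, uint64_t n)
{
	iov->iov_base -= n;
	iov->iov_len += n;
}

/*
 * Second approach: keep "advance by a negative amount", but with a
 * signed 32-bit count. The negative value is widened to 64 bits before
 * the conversion to unsigned, so the addition wraps to a subtraction.
 */
static void
fix_advance_signed(struct fix_iov *iov, int32_t backup)
{
	uint64_t siz = (uint64_t)(-(int64_t)backup);

	iov->iov_base += siz;
	iov->iov_len -= siz;
}
```

Starting from iov_base = 0xfffffe0008268000 and iov_len = 0 (illustrative
values), either function with a count of 0x1000 gives
iov_base = 0xfffffe0008267000 and iov_len = 0x1000.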
---
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
Responsible-Changed-From-To: port-alpha-maintainer->mhitch
Responsible-Changed-By: mhitch@netbsd.org
Responsible-Changed-When: Mon, 29 Jan 2007 18:33:38 +0000
Responsible-Changed-Why:
I've analyzed it, so I'll take it.
State-Changed-From-To: open->analyzed
State-Changed-By: mhitch@netbsd.org
State-Changed-When: Mon, 29 Jan 2007 18:33:38 +0000
State-Changed-Why:
I've analyzed the problem.
From: Christian Biere <christianbiere@gmx.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-alpha/35448: memory management fault trap during heavy network I/O
Date: Thu, 1 Feb 2007 03:37:52 +0100
Michael L. Hitch wrote:
> Changing the UIO_ADVANCE() to a UIO_RETREAT() which passed 'backup'
> directly and subtracted that from iov_base, and added it to iov_len gave
> me a kernel which did not crash when nfs_writerpc() resent the data. I've
> also just verified that simply making 'backup' a signed 32 bit also works
> using the UIO_ADVANCE() macro.
I'd prefer the former because it's cleaner. I take "advance" as a strong
emphasis that it's meant to move forward. At least I've written a similar
function before and decided against the more flexible term "add" for
exactly this reason.
--- sys/nfs/nfs_vnops.c.orig 2007-01-26 21:52:50.000000000 +0100
+++ sys/nfs/nfs_vnops.c 2007-02-01 03:21:00.000000000 +0100
@@ -251,10 +251,22 @@ extern const nfstype nfsv3_type[9];
int nfs_numasync = 0;
#define DIRHDSIZ _DIRENT_NAMEOFF(dp)
-#define UIO_ADVANCE(uio, siz) \
- (void)((uio)->uio_resid -= (siz), \
- (uio)->uio_iov->iov_base = (char *)(uio)->uio_iov->iov_base + (siz), \
- (uio)->uio_iov->iov_len -= (siz))
+
+static __inline void
+UIO_ADVANCE(struct uio *uio, size_t n)
+{
+ uio->uio_resid -= n;
+ uio->uio_iov->iov_base = (char *)uio->uio_iov->iov_base + n;
+ uio->uio_iov->iov_len -= n;
+}
+
+static __inline void
+UIO_RETREAT(struct uio *uio, size_t n)
+{
+ uio->uio_resid += n;
+ uio->uio_iov->iov_base = (char *)uio->uio_iov->iov_base - n;
+ uio->uio_iov->iov_len += n;
+}
static void nfs_cache_enter(struct vnode *, struct vnode *,
struct componentname *);
@@ -1420,7 +1432,7 @@ retry:
break;
} else if (rlen < len) {
backup = len - rlen;
- UIO_ADVANCE(uiop, -backup);
+ UIO_RETREAT(uiop, backup);
uiop->uio_offset -= backup;
len = rlen;
}
@@ -1482,7 +1494,7 @@ retry:
* then, we should resend them to nfsd.
*/
backup = origresid - tsiz;
- UIO_ADVANCE(uiop, -backup);
+ UIO_RETREAT(uiop, backup);
uiop->uio_offset -= backup;
tsiz = origresid;
goto retry;
From: "Aaron J. Grier" <agrier@poofygoof.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-alpha/35448: memory management fault trap during heavy network I/O
Date: Mon, 26 Mar 2007 22:43:52 -0700
a modified version of the provided patch has kept my 1000A from
panicking, and this PR can be closed.
I'm now back where I started with netbsd-2, with NFS service
disappearing during the weekly cron, with no apparent retransmissions or
recovery, but that's a different PR.
hooray?
--
Aaron J. Grier | "Not your ordinary poofy goof." | agrier@poofygoof.com
"silly brewer, saaz are for pils!" -- virt
From: "Aaron J. Grier" <agrier@poofygoof.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-alpha/35448
Date: Sun, 16 Dec 2012 16:28:08 -0800
flogging a dead horse here, but figured it was time for an update.
the RETREAT / ADVANCE patch in this PR definitely helped me with stability,
although this problem itself may be alpha-specific.
the underlying problem (which exacerbated this problem) was traced to a
duplex mismatch in my network, heavily exercising the NFS retransmit
code. :) it has since been addressed, and the system in question (with
the patch above) has been running stably under constant low-level use.
4:20PM up 623 days, 23:26, 14 users, load averages: 0.05, 0.04, 0.00
the only idea I have for flushing out further bugs with the NFS
retransmit code would be pounding on a UDP mount over a lossy link, IE
purposely set mismatched duplex, crappy hub or switch, or a packet loss
simulator.
since this patch does no harm and improves at least alpha, could it be
applied to current and pulled-up to NetBSD-6?
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-alpha/35448
Date: Tue, 18 Dec 2012 08:54:58 -0700 (MST)
On Mon, 17 Dec 2012, Aaron J. Grier wrote:
> flogging a dead horse here, but figured it was time for an update.
>
> the RETREAT / ADVANCE patch in this PR definitely helped me with stability,
> although this problem itself may be alpha-specific.
As I remember, it only showed up with the alpha gcc, but did not seem to
be there for amd64 (I can't remember if I tried to check the sparc64 code
or not).
I think the discussion about this change had someone not liking the
RETREAT/ADVANCE patch (or names) I had at the time. I had thought about a
patch using a different macro name (ADJUSTUIO or something similar) so
that there was a single macro that would indicate the adjustment could go
either direction. I've been a bit (!) negligent on following up on this.
> the only idea I have for flushing out further bugs with the NFS
> retransmit code would be pounding on a UDP mount over a lossy link, IE
> purposely set mismatched duplex, crappy hub or switch, or a packet loss
> simulator.
I think (but it has been some time) I was able to replicate the problem
with just a cross-over link between my two CS20 machines. Only one of
them is running at this time, so I'm not sure how easily I could
replicate it now. Maybe that would be a good opportunity to power up my
1000A and update it.
> since this patch does no harm and improves at least alpha, could it be
> applied to current and pulled-up to NetBSD-6?
I think NetBSD-6 has a newer version of gcc since then, and I would want
to verify that the patch is still needed before committing anything.
Mike
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: "Chuck Silvers" <chs@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/35448 CVS commit: src/sys/nfs
Date: Thu, 14 May 2015 17:35:54 +0000
Module Name: src
Committed By: chs
Date: Thu May 14 17:35:54 UTC 2015
Modified Files:
src/sys/nfs: nfs_vnops.c
Log Message:
in nfs_writerpc(), avoid a signed/unsigned problem in computing the
number of bytes to back up in the uio when we need to resend a write RPC
(eg. after a server crash) on a 64-bit platform. should fix PR 35448.
To generate a diff of this commit:
cvs rdiff -u -r1.307 -r1.308 src/sys/nfs/nfs_vnops.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Soren Jacobsen" <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/35448 CVS commit: [netbsd-7] src/sys/nfs
Date: Tue, 19 May 2015 04:56:46 +0000
Module Name: src
Committed By: snj
Date: Tue May 19 04:56:46 UTC 2015
Modified Files:
src/sys/nfs [netbsd-7]: nfs_vnops.c
Log Message:
Pull up following revision(s) (requested by chs in ticket #769):
sys/nfs/nfs_vnops.c: revision 1.308
in nfs_writerpc(), avoid a signed/unsigned problem in computing the
number of bytes to back up in the uio when we need to resend a write RPC
(eg. after a server crash) on a 64-bit platform. should fix PR 35448.
To generate a diff of this commit:
cvs rdiff -u -r1.306 -r1.306.2.1 src/sys/nfs/nfs_vnops.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: analyzed->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Tue, 08 Sep 2015 09:06:12 +0000
State-Changed-Why:
Did the commit made in May fix the problem?
From: "Aaron J. Grier" <agrier@poofygoof.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-alpha/35448 (memory management fault trap during heavy network I/O)
Date: Tue, 8 Sep 2015 13:23:06 -0700
I have been running a version of mhitch's original ADVANCE/RETREAT patch
for the last 8 years (now with NetBSD-5) on my Alpha without seeing
the bug that the patch was created to fix.
"works for me."
--
Aaron J. Grier | "Not your ordinary poofy goof." | agrier@poofygoof.com
State-Changed-From-To: feedback->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 09 Sep 2015 06:31:38 +0000
State-Changed-Why:
Call it fixed... if the committed patch is substantively different and
turns out not to work, be sure to let us know.
>Unformatted:
sources CVSed 2007-01-18
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.