NetBSD Problem Report #35448

From agrier@poofygoof.com  Sat Jan 20 03:54:28 2007
Return-Path: <agrier@poofygoof.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id D70A263B8CE
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 20 Jan 2007 03:54:28 +0000 (UTC)
Message-Id: <20070120035419.E991233143@arwen.poofy.goof.com>
Date: Fri, 19 Jan 2007 19:54:18 -0800 (PST)
From: agrier@poofygoof.com
Reply-To: agrier@poofygoof.com
To: gnats-bugs@NetBSD.org
Subject: memory management fault trap during heavy network I/O
X-Send-Pr-Version: 3.95

>Number:         35448
>Category:       port-alpha
>Synopsis:       memory management fault trap during heavy network I/O
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    mhitch
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 20 03:55:00 +0000 2007
>Closed-Date:    Wed Sep 09 06:31:38 +0000 2015
>Last-Modified:  Wed Sep 09 06:31:38 +0000 2015
>Originator:     agrier@poofygoof.com
>Release:        NetBSD 4.99.8
>Organization:
  Aaron J. Grier | "Not your ordinary poofy goof." | agrier@poofygoof.com
>Environment:


System: NetBSD arwen.poofy.goof.com 4.99.8 NetBSD 4.99.8 (ARWEN) #0: Thu Jan 18 23:03:09 PST 2007 agrier@arwen.poofy.goof.com:/var/obj/ARWEN alpha
Architecture: alpha
Machine: alpha

ARWEN is an alphaserver 1000A 5/400.

the ARWEN kernel is GENERIC with a hardcoded line to attach root at ld0.
>Description:

- the trap:

CPU 0: fatal kernel trap:

CPU 0    trap entry = 0x2 (memory management fault)
CPU 0    a0         = 0xfffffe0108266000
CPU 0    a1         = 0x1
CPU 0    a2         = 0x0
CPU 0    pc         = 0xfffffc00007ecde0
CPU 0    ra         = 0xfffffc000035f9ac
CPU 0    pv         = 0x0
CPU 0    curlwp    = 0xfffffc000fcd2660
CPU 0        pid = 335, comm = nfsio

panic: trap
Begin traceback...
alpha trace requires known PC =eject=
End traceback...
syncing disks... 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5 5 5 5 giving up

- the backtrace:

(gdb) bt
#0  0xfffffc00007df888 in dumpsys ()
    at /projects/NetBSD/src/sys/arch/alpha/alpha/machdep.c:1229
#1  0xfffffc00007dfdb0 in cpu_reboot ()
    at /projects/NetBSD/src/sys/arch/alpha/alpha/machdep.c:1048
#2  0xfffffc0000644a50 in panic ()
    at /projects/NetBSD/src/sys/kern/subr_prf.c:246
#3  0xfffffc00007e7248 in trap ()
    at /projects/NetBSD/src/sys/arch/alpha/alpha/trap.c:601
#4  0xfffffc00003003e8 in XentMM ()
    at /projects/NetBSD/src/sys/arch/alpha/alpha/locore.s:492
#5  0xfffffc000035f9ac in in_delayed_cksum ()
    at /projects/NetBSD/src/sys/netinet/ip_output.c:1123
can not access 0xfffffffd, invalid translation (invalid L1 PTE)
can not access 0xfffffffd, invalid translation (invalid L1 PTE)
Cannot access memory at address 0xfffffffffffffffd

- some poking:

(gdb) frame 5
#5  0xfffffc000035f9ac in in_delayed_cksum ()
    at /projects/NetBSD/src/sys/netinet/ip_output.c:1123
1123            csum = in4_cksum(m, 0, offset, ntohs(ip->ip_len) - offset);
(gdb) proc 0xfffffc000fcd2660 # curlwp from the trap
(gdb) bt
#0  0xfffffc000062a730 in mi_switch ()
    at /projects/NetBSD/src/sys/kern/kern_synch.c:997
(gdb) list *0xfffffc00007ecde0 # pc from the trap
0xfffffc00007ecde0 is in in4_cksum
(/projects/NetBSD/src/sys/netinet/in4_cksum.c:175).

- dmesg

NetBSD 4.99.8 (ARWEN) #0: Thu Jan 18 23:03:09 PST 2007
	agrier@arwen.poofy.goof.com:/var/obj/ARWEN
AlphaServer 1000A 5/400, 400MHz, s/n 
8192 byte page size, 1 processor.
total memory = 256 MB
(2016 KB reserved for PROM, 254 MB used by NetBSD)
avail memory = 241 MB
mainbus0 (root)
cpu0 at mainbus0: ID 0 (primary), 21164A-2
cpu0: Architecture extensions: 1<BWX>
cia0 at mainbus0: DECchip 2117x Core Logic Chipset (ALCOR/ALCOR2), pass 3
cia0: extended capabilities: 21<DWEN,BWEN>
cia0: using BWX for PCI config access
pci0 at cia0 bus 0
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pceb0 at pci0 dev 7 function 0: Intel 82375EB/SB PCI-EISA Bridge (rev. 0x05)
ppb0 at pci0 dev 8 function 0: Digital Equipment DC21050 PCI-PCI Bridge (rev. 0x02)
pci1 at ppb0 bus 2
pci1: i/o space, memory space enabled, rd/line, wr/inv ok
isp0 at pci1 dev 0 function 0: QLogic 1020 Fast Wide SCSI HBA
isp0: interrupting at dec_1000a irq 0
scsibus0 at isp0: 16 targets, 8 luns per target
tlp0 at pci0 dev 11 function 0: DECchip 21140 Ethernet, pass 1.2
tlp0: interrupting at dec_1000a irq 1
tlp0: DEC DE500-XA, Ethernet address 00:00:f8:02:06:a5
tlp0: 10baseT, 100baseTX, 100baseTX-FDX, 10baseT-FDX
mlx0 at pci0 dev 12 function 0: Mylex RAID (v2 interface)
mlx0: interrupting at dec_1000a irq 3
mlx0: DAC960P/PD, 3 channels, firmware 2.70-0-00, 32MB RAM
ld0 at mlx0 unit 0: RAID5, online
ld0: 16380 MB, 8320 cyl, 64 head, 63 sec, 512 bytes/sect x 33546240 sectors
ld1 at mlx0 unit 1: RAID5, online
ld1: 32768 MB, 8322 cyl, 128 head, 63 sec, 512 bytes/sect x 67108864 sectors
ld2 at mlx0 unit 2: RAID5, online
ld2: 32768 MB, 8322 cyl, 128 head, 63 sec, 512 bytes/sect x 67108864 sectors
ld3 at mlx0 unit 3: RAID5, online
ld3: 4536 MB, 2304 cyl, 64 head, 63 sec, 512 bytes/sect x 9289728 sectors
eisa0 at pceb0
eisa0: can't map I/O space for slot 9
isa0 at pceb0
lpt0 at isa0 port 0x3bc-0x3bf irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
attimer0 at isa0 port 0x40-0x43: AT Timer
vga0 at isa0 port 0x3b0-0x3df iomem 0xa0000-0xbffff
wsdisplay0 at vga0 kbdmux 1
wsmux1: connecting to wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker (CPU-intensive output)
spkr0 at pcppi0
isabeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
mcclock0 at isa0 port 0x70-0x71: mc146818 or compatible
pcppi0: attached to attimer0
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 0 lun 0: <DEC, RZ28M    (C) DEC, 0568> disk fixed
sd0: async, 8-bit transfers
sd0: 2007 MB, 3045 cyl, 16 head, 84 sec, 512 bytes/sect x 4110480 sectors
sd0: sync (100.00ns offset 12), 8-bit (10.000MB/s) transfers, tagged queueing
cd0 at scsibus0 target 4 lun 0: <DEC, RRD45   (C) DEC, 1645> cdrom removable
cd0: async, 8-bit transfers
WARNING: can't figure what device matches "RAID 0 12 0 0 0 0 0"
root on ld0a dumps on sd0b

- other misc foo

ps won't grok the coredump:

arwen$ ps -N netbsd.gdb -M /var/crash/netbsd.0.core
ps: can't read proc credentials at 0xfffffc000ade3480: Undefined error: 0

>How-To-Repeat:
it seems to be triggered by syncing a remotely mounted mailbox from
within pine or mutt.
>Fix:
figure out what is causing the trap?  maybe a stack smash, based on
previous port-alpha mailing list entries.  perhaps

options KSTACK_CHECK_MAGIC

is in order?

>Release-Note:

>Audit-Trail:
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: port-alpha/35448: memory management fault trap during heavy
 network I/O
Date: Mon, 22 Jan 2007 14:36:25 -0700 (MST)

 On Sat, 20 Jan 2007, agrier@poofygoof.com wrote:

 > - the trap:
 >
 > CPU 0: fatal kernel trap:
 >
 > CPU 0    trap entry = 0x2 (memory management fault)
 > CPU 0    a0         = 0xfffffe0108266000
 > CPU 0    a1         = 0x1
 > CPU 0    a2         = 0x0
 > CPU 0    pc         = 0xfffffc00007ecde0
 > CPU 0    ra         = 0xfffffc000035f9ac
 > CPU 0    pv         = 0x0
 > CPU 0    curlwp    = 0xfffffc000fcd2660
 > CPU 0        pid = 335, comm = nfsio
 ...
 > (gdb) list *0xfffffc00007ecde0 # pc from the trap
 > 0xfffffc00007ecde0 is in in4_cksum
 > (/projects/NetBSD/src/sys/netinet/in4_cksum.c:175).

    A preliminary analysis seems to indicate the trap occurred where 
 in4_cksum is summing 16 words in an unrolled loop.  If I understand the 
 trap registers correctly, it looks like the address causing the trap is 
 0xfffffe0108266000 (the contents of a0 above).  Running pmap(1) against 
 the coredump and kernel file seems to indicate that the address is not 
 within the current kernel's mapped address space.  The gdb backtrace 
 fails, so it's a little hard to figure out where it came from.  I'm going 
 to start groveling through the stack myself to see if I can dig out the 
 parameters to the in4_cksum() call, and if I can follow the traceback 
 manually.

    It might be helpful if a backtrace from ddb could be obtained (although 
 in my recent experience a 4.0_BETA [not 4.0_BETA2 yet] kernel was unable 
 to produce a good backtrace on my own machine).

 > - other misc foo
 >
 > ps won't grok the coredump:
 >
 > arwen$ ps -N netbsd.gdb -M /var/crash/netbsd.0.core
 > ps: can't read proc credentials at 0xfffffc000ade3480: Undefined error: 0

    There's an xps gdb script in src/sys/gdbscripts that is able to display 
 some process information (and could be extended to show more process
 details).

 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: port-alpha-maintainer@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, agrier@poofygoof.com
Subject: Re: port-alpha/35448: memory management fault trap during heavy
 network I/O
Date: Mon, 29 Jan 2007 11:09:01 -0700 (MST)

 On Mon, 22 Jan 2007, Michael L. Hitch wrote:

 > fails, so it's a little hard to figure out where it came from.  I'm going
 > to start groveling through the stack myself to see if I can dig out the
 > parameters to the in4_cksum() call, and if I can follow the traceback
 > manually.

    OK, I've dug out more information from the raw stack dump.  I located 
 the address of the mbuf and found that it has the same bad address in 
 mh_data:

 (gdb) print (struct mbuf)*0xfffffc000ef7be18
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 $2 = {m_hdr = {mh_next = 0x0, mh_nextpkt = 0x0,
      mh_data = 0xfffffe0108266000 <Address 0xfffffe0108266000 out of 
 bounds>,
      mh_owner = 0x4e4f5a414d412d58, mh_len = 4096, mh_flags = 67108865,
      mh_paddr = 251117080, mh_type = 1}, M_dat = {MH = {MH_pkthdr = {
          rcvif = 0xfffffe000005a080, tags = {slh_first = 0x0}, len = 188,
          csum_flags = 0, csum_data = 0, segsz = 0}, MH_dat = {MH_ext = {
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 can not access 0x8266000, invalid translation (invalid L2 PTE)
 can not access 0x8266000, invalid translation (invalid L2 PTE)
            ext_buf = 0xfffffe0108266000 <Address 0xfffffe0108266000 out of 
 bounds>, ext_fr$
            ext_arg = 0xfffffe000c617cb8, ext_size = 4096,
            ext_type = 0xfffffc0000a62558, ext_nextref = 0xfffffc000ef7b118,
            ext_prevref = 0xfffffc000ef7a218, ext_un = {
              extun_paddr = 14733978372531027968, extun_pgs = {

    On a whim, I took a look at the data located at 0xfffffe0008266000 and 
 found what looks like data that might be expected, and Aaron confirmed 
 that the data was part of a mailbox file that was being synched.  So it 
 looked like something had corrupted the address used by the mbuf.  I 
 followed the stack back to nfs_writerpc, which can use the address of data 
 being sent as the external data address for the mbuf.  I dug out the 
 address of the uio and iovec structures used at that point and found:

 (gdb) print (struct uio)*0xfffffe000c617e70
 $8 = {uio_iov = 0xfffffe000c617e60, uio_iovcnt = 1, uio_offset = 102400,
    uio_resid = 18446744069414588416, uio_rw = UIO_WRITE,
    uio_vmspace = 0xfffffc0000abc018}
 (gdb) print (struct iovec)*0xfffffe000c617e60
 $9 = {iov_base = 0xfffffe0108267000, iov_len = 18446744069414588416}
 (gdb) x/2gx 0xfffffe000c617e60
 0xfffffe000c617e60:     0xfffffe0108267000      0xffffffff00001000

    The buffer address in iov_base is corrupt as well.  In addition, the 
 iov_len field appears corrupted.

    Following the stack back further, I get to nfs_doio and get the address 
 of the struct buf that was used to generate the uio/iovec data:

 (gdb) print (struct buf)*0xfffffc00052b8dc0
 $3 = {b_u = {u_actq = {tqe_next = 0xdeadbeef, tqe_prev = 
 0xfffffc00052b88b8},
      u_work = {wk_entry = {sqe_next = 0xdeadbeef}}}, b_interlock = {
      lock_data = 86745072}, b_flags = 85, b_error = 0, b_prio = 0,
    b_bufsize = 8192, b_bcount = 8192, b_resid = 8192, b_dev = 4294967295,
    b_un = {
      b_addr = 0xfffffe0008266000 "ntent-Transfer-Encoding:Message-ID;\n 
 b=T2nY8PninSOLy9W$
    b_iodone = 0xfffffc00005bd600 <uvm_aio_biodone>,
    b_proc = 0xfffffc0000abc4a0, b_vp = 0xfffffc000bea53c0, b_dep = {
      lh_first = 0x0}, b_saveaddr = 0x0, b_fspriv = {
      bf_private = 0xfffffc00052b95a8, bf_dcookie = -4397959768664}, b_hash 
 = {
      le_next = 0x16, le_prev = 0x0}, b_vnbufs = {le_next = 0x87654321,
      le_prev = 0x4}, b_freelist = {tqe_next = 0x0,
      tqe_prev = 0xfffffe0000263700}, b_lblkno = 0, b_freelistindex = 0}

    Lo and behold, it has the correct address of the data!  So somewhere 
 between nfs_doio() and nfs_writerpc(), the iov_base and iov_len values
 get clobbered (in an apparently fairly consistent way).

    Since the bad address was easy to check for, I inserted a number of 
 KASSERT() statements in nfs_doio(), nfs_doio_write(), and nfs_writerpc().
 I was able to induce this failure on my own alpha at this point.  I found 
 that the address was good at the entry of nfs_writerpc(), but had been 
 corrupted at the start of the loop sending out the data.  This seemed odd,
 since there didn't appear to be anything that would cause the type of 
 corruption I was seeing.  While trying to figure out where some of the
 local variables in nfs_writerpc() were located on the stack, I noticed 
 there was a 'retry:' label before the output loop.  Finding where that 
 label was used shed some light on things.  Certain conditions (which I'm 
 not too clear on, since I don't understand NFS all that well) would cause 
 a resend of the entire data buffer, and if that clobbered the data address 
 and length, would result in what I was seeing.  Indeed, that was the case;
 a few more KASSERT() statements showed that the UIO_ADVANCE() at line 1547 
 of nfs_vnops.c was clobbering the iovec data.

    Closer examination of what UIO_ADVANCE() was doing, and of the 
 generated code, showed what the problem was.

    The alpha has 64 bit pointers, and the iov_len value is also 64 bits. 
 The variable 'backup' used to adjust the iovec data is an unsigned 32 bit 
 value.  The changes for version 1.225 appear to have introduced a problem 
 that only showed up on the alpha.  Prior to that, the unsigned value of 
 'backup' was being subtracted from iov_base, and added to iov_len.  In 
 version 1.225, that was changed to use the macro UIO_ADVANCE(), passing 
 a negated value of 'backup' to the macro.  The compiler thus negated the 
 32 bit unsigned value of 'backup' and zero-extended the result to 64 bits, 
 which was added to iov_base and subtracted from iov_len, resulting in the 
 clobbered values.

    Changing the UIO_ADVANCE() to a UIO_RETREAT() which passed 'backup' 
 directly and subtracted that from iov_base, and added it to iov_len gave 
 me a kernel which did not crash when nfs_writerpc() resent the data.  I've 
 also just verified that simply making 'backup' a signed 32 bit also works 
 using the UIO_ADVANCE() macro.

 ---
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

Responsible-Changed-From-To: port-alpha-maintainer->mhitch
Responsible-Changed-By: mhitch@netbsd.org
Responsible-Changed-When: Mon, 29 Jan 2007 18:33:38 +0000
Responsible-Changed-Why:
I've analyzed it, so I'll take it.


State-Changed-From-To: open->analyzed
State-Changed-By: mhitch@netbsd.org
State-Changed-When: Mon, 29 Jan 2007 18:33:38 +0000
State-Changed-Why:
I've analyzed the problem.


From: Christian Biere <christianbiere@gmx.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-alpha/35448: memory management fault trap during heavy network I/O
Date: Thu, 1 Feb 2007 03:37:52 +0100

 Michael L. Hitch wrote:
 >     Changing the UIO_ADVANCE() to a UIO_RETREAT() which passed 'backup' 
 >  directly and subtracted that from iov_base, and added it to iov_len gave 
 >  me a kernel which did not crash when nfs_writerpc() resent the data.  I've 
 >  also just verified that simply making 'backup' a signed 32 bit also works 
 >  using the UIO_ADVANCE() macro.

 I'd prefer the former because it's cleaner. I take "advance" as a strong
 emphasis that it's meant to move forward. At least I've written a similar
 function before and decided against the more flexible term "add" for
 exactly this reason.


 --- sys/nfs/nfs_vnops.c.orig	2007-01-26 21:52:50.000000000 +0100
 +++ sys/nfs/nfs_vnops.c	2007-02-01 03:21:00.000000000 +0100
 @@ -251,10 +251,22 @@ extern const nfstype nfsv3_type[9];

  int nfs_numasync = 0;
  #define	DIRHDSIZ	_DIRENT_NAMEOFF(dp)
 -#define UIO_ADVANCE(uio, siz) \
 -    (void)((uio)->uio_resid -= (siz), \
 -    (uio)->uio_iov->iov_base = (char *)(uio)->uio_iov->iov_base + (siz), \
 -    (uio)->uio_iov->iov_len -= (siz))
 +
 +static __inline void
 +UIO_ADVANCE(struct uio *uio, size_t n)
 +{
 +	uio->uio_resid -= n;
 +	uio->uio_iov->iov_base = (char *)uio->uio_iov->iov_base + n;
 +    	uio->uio_iov->iov_len -= n;
 +}
 +
 +static __inline void
 +UIO_RETREAT(struct uio *uio, size_t n)
 +{
 +	uio->uio_resid += n;
 +	uio->uio_iov->iov_base = (char *)uio->uio_iov->iov_base - n;
 +    	uio->uio_iov->iov_len += n;
 +}

  static void nfs_cache_enter(struct vnode *, struct vnode *,
      struct componentname *);
 @@ -1420,7 +1432,7 @@ retry:
  					break;
  				} else if (rlen < len) {
  					backup = len - rlen;
 -					UIO_ADVANCE(uiop, -backup);
 +					UIO_RETREAT(uiop, backup);
  					uiop->uio_offset -= backup;
  					len = rlen;
  				}
 @@ -1482,7 +1494,7 @@ retry:
  				 * then, we should resend them to nfsd.
  				 */
  				backup = origresid - tsiz;
 -				UIO_ADVANCE(uiop, -backup);
 +				UIO_RETREAT(uiop, backup);
  				uiop->uio_offset -= backup;
  				tsiz = origresid;
  				goto retry;

From: "Aaron J. Grier" <agrier@poofygoof.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-alpha/35448: memory management fault trap during heavy network I/O
Date: Mon, 26 Mar 2007 22:43:52 -0700

 a modified version of the provided patch has kept my 1000A from
 panicking, and this PR can be closed.

 I'm now back where I started with netbsd-2, with NFS service
 disappearing during the weekly cron, with no apparent retransmissions or
 recovery, but that's a different PR.

 hooray?

 -- 
   Aaron J. Grier | "Not your ordinary poofy goof." | agrier@poofygoof.com
               "silly brewer, saaz are for pils!"  --  virt

From: "Aaron J. Grier" <agrier@poofygoof.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-alpha/35448
Date: Sun, 16 Dec 2012 16:28:08 -0800

 flogging a dead horse here, but figured it was time for an update.

 the RETREAT / ADVANCE patch in this PR definitely helped me with stability,
 although this problem itself may be alpha-specific.

 the underlying problem (which exacerbated this problem) was traced to a
 duplex mismatch in my network, heavily exercising the NFS retransmit
 code.  :)  it has since been addressed, and the system in question (with
 the patch above) has been running stably under constant low-level use.

  4:20PM  up 623 days, 23:26, 14 users, load averages: 0.05, 0.04, 0.00

 the only idea I have for flushing out further bugs with the NFS
 retransmit code would be pounding on a UDP mount over a lossy link, IE
 purposely set mismatched duplex, crappy hub or switch, or a packet loss
 simulator.

 since this patch does no harm and improves at least alpha, could it be
 applied to current and pulled-up to NetBSD-6?

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-alpha/35448
Date: Tue, 18 Dec 2012 08:54:58 -0700 (MST)

 On Mon, 17 Dec 2012, Aaron J. Grier wrote:

 > flogging a dead horse here, but figured it was time for an update.
 >
 > the RETREAT / ADVANCE patch in this PR definitely helped me with stability,
 > although this problem itself may be alpha-specific.

    As I remember, it only showed up with the alpha gcc, but did not seem to 
 be there for amd64 (I can't remember if I tried to check the sparc64 code 
 or not).

    I think the discussion about this change had someone not liking the 
 RETREAT/ADVANCE patch (or names) I had at the time.  I had thought about a 
 patch using a different macro name (ADJUSTUIO or something similar) so 
 that there was a single macro that would indicate the adjustment could go 
 either direction.  I've been a bit (!) negligent on following up on this.

 > the only idea I have for flushing out further bugs with the NFS
 > retransmit code would be pounding on a UDP mount over a lossy link, IE
 > purposely set mismatched duplex, crappy hub or switch, or a packet loss
 > simulator.

    I think (but it has been some time) I was able to replicate the problem 
 with just a cross-over link between my two CS20 machines.  Only one of 
 them is running at this time, so I'm not sure how easily I could 
 replicate it now.  Maybe that would be a good opportunity to power up my 
 1000A and update it.

 > since this patch does no harm and improves at least alpha, could it be
 > applied to current and pulled-up to NetBSD-6?

    I think NetBSD-6 has a newer version of gcc since then, and I would want 
 to verify that the patch is still needed before committing anything.

 Mike


 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: "Chuck Silvers" <chs@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/35448 CVS commit: src/sys/nfs
Date: Thu, 14 May 2015 17:35:54 +0000

 Module Name:	src
 Committed By:	chs
 Date:		Thu May 14 17:35:54 UTC 2015

 Modified Files:
 	src/sys/nfs: nfs_vnops.c

 Log Message:
 in nfs_writerpc(), avoid a signed/unsigned problem in computing the
 number of bytes to back up in the uio when we need to resend a write RPC
 (eg. after a server crash) on a 64-bit platform.  should fix PR 35448.


 To generate a diff of this commit:
 cvs rdiff -u -r1.307 -r1.308 src/sys/nfs/nfs_vnops.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Soren Jacobsen" <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/35448 CVS commit: [netbsd-7] src/sys/nfs
Date: Tue, 19 May 2015 04:56:46 +0000

 Module Name:	src
 Committed By:	snj
 Date:		Tue May 19 04:56:46 UTC 2015

 Modified Files:
 	src/sys/nfs [netbsd-7]: nfs_vnops.c

 Log Message:
 Pull up following revision(s) (requested by chs in ticket #769):
 	sys/nfs/nfs_vnops.c: revision 1.308
 in nfs_writerpc(), avoid a signed/unsigned problem in computing the
 number of bytes to back up in the uio when we need to resend a write RPC
 (eg. after a server crash) on a 64-bit platform.  should fix PR 35448.


 To generate a diff of this commit:
 cvs rdiff -u -r1.306 -r1.306.2.1 src/sys/nfs/nfs_vnops.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: analyzed->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Tue, 08 Sep 2015 09:06:12 +0000
State-Changed-Why:
Did the commit made in May fix the problem?


From: "Aaron J. Grier" <agrier@poofygoof.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-alpha/35448 (memory management fault trap during heavy network I/O)
Date: Tue, 8 Sep 2015 13:23:06 -0700

 I have been running a version of mhitch's original ADVANCE/RETREAT patch
 for the last 8 years (now with NetBSD-5) on my Alpha without seeing
 the bug that the patch was created to fix.

 "works for me."

 -- 
   Aaron J. Grier | "Not your ordinary poofy goof." | agrier@poofygoof.com

State-Changed-From-To: feedback->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 09 Sep 2015 06:31:38 +0000
State-Changed-Why:
Call it fixed... if the committed patch is substantively different and
turns out not to work, be sure to let us know.


>Unformatted:


 sources CVSed 2007-01-18
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.