NetBSD Problem Report #46463

From www@NetBSD.org  Thu May 17 23:16:11 2012
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id 90F7E63B89C
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 17 May 2012 23:16:11 +0000 (UTC)
Message-Id: <20120517231610.9804C63B882@www.NetBSD.org>
Date: Thu, 17 May 2012 23:16:10 +0000 (UTC)
From: rhansen@bbn.com
Reply-To: rhansen@bbn.com
To: gnats-bugs@NetBSD.org
Subject: netbsd-6:  panic and filesystem corruption running tmux
X-Send-Pr-Version: www-1.0

>Number:         46463
>Category:       kern
>Synopsis:       netbsd-6:  panic and filesystem corruption running tmux
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    martin
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu May 17 23:20:00 +0000 2012
>Closed-Date:    Tue Jun 26 19:49:44 +0000 2012
>Last-Modified:  Sat Nov 24 21:45:02 +0000 2012
>Originator:     Richard Hansen
>Release:        6.0_BETA
>Organization:
>Environment:
NetBSD 6.0_BETA NetBSD 6.0_BETA (GENERIC) i386
>Description:
On i386 GENERIC netbsd-6 (nightly build from around 2012-04-28), I get a panic and significant filesystem corruption (ffs with log, noatime) if I do the following:

  1. ssh to the netbsd-6 machine
  2. install the tmux-1.4nb1 pkgsrc package
  3. run 'tmux new-session`
  4. ssh to the same machine from a different terminal
  5. run 'tmux list-sessions 0<&-'

For some reason, closing stdin in step #5 above is required to trigger the panic.

Using a lightly modified and older (2012-03-28) snapshot, I get the following backtrace in the close() syscall():

#0  maybe_dump (howto=260) at /usr/src/sys/arch/i386/i386/machdep.c:880
        s = 0
#1  0xc071c8ca in cpu_reboot (howto=260, bootstr=0x0)
    at /usr/src/sys/arch/i386/i386/machdep.c:899
        syncdone = false
        s = 0
#2  0xc09a79c4 in vpanic (fmt=0xc0deb79c "kernel %sassertion \"%s\" failed: file \"%s\", line %d ", 
    ap=0xddcd8b58 "&#65533;\267300\024\270300 \267300v\005")
    at /usr/src/sys/kern/subr_prf.c:308
        cii = 0
        ci = 0x0
        oci = 0x0
        bootopt = 260
        scratchstr = "kernel diagnostic assertion \"mutex_owned(&fdp->fd_lock)\" failed: file \"/usr/src/sys/kern/kern_event.c\", line 1398 ", '\000' <repeats 97 times>
#3  0xc0bbb8c3 in kern_assert (fmt=0xc0deb79c "kernel %sassertion \"%s\" failed: file \"%s\", line %d ")
    at /usr/src/sys/lib/libkern/kern_assert.c:50
        ap = 0xddcd8b58 "&#65533;\267300\024\270300 \267300v\005"
#4  0xc0679e3c in kqueue_doclose (kq=0xc6487a58, list=0xc63b194c, fd=0)
    at /usr/src/sys/kern/kern_event.c:1398
        kn = 0xc0c0
        fdp = 0xc64b1cc0
#5  0xc0679f00 in kqueue_close (fp=0xc5ecc200) at /usr/src/sys/kern/kern_event.c:1430
        kq = 0xc6487a58
        fdp = 0xc63b1940
        ff = 0xc63b1940
        i = 0
#6  0xc06737fb in closef (fp=0xc5ecc200) at /usr/src/sys/kern/kern_descrip.c:824
        lf = {l_start = -4184506505861690440, l_len = -4160666348713672704, l_pid = 0, l_type = 0, l_whence = 0}
        error = 0
#7  0xc0673411 in fd_close (fd=13) at /usr/src/sys/kern/kern_descrip.c:709
        lf = {l_start = -4629759700944188308, l_len = 28, l_pid = -573731704, l_type = -29420, l_whence = -8755}
        fdp = 0xc63b1940
        ff = 0xc5eda580
        fp = 0xc5ecc200
        p = 0xc5dea370
        l = 0xc6425800
        refcnt = 0
#8  0xc09c6e5d in sys_close (l=0xc6425800, uap=0xddcd8cec, retval=0xddcd8d14)
    at /usr/src/sys/kern/sys_descrip.c:486
No locals.
#9  0xc09d687b in sy_call (sy=0xc0f35608, l=0xc6425800, uap=0xddcd8cec, rval=0xddcd8d14)
    at /usr/src/sys/sys/syscallvar.h:61
        error = 0
#10 0xc09d6c1e in syscall (frame=0xddcd8d48) at /usr/src/sys/arch/x86/x86/syscall.c:179
        callp = 0xc0f35608
        p = 0xc5dea370
        l = 0xc6425800
        error = 0
        code = 6
        rval = {0, 0}
        rip_call = -1146338251
        args = {13, 11, -1077945480, -1145308313, 64, -1077945516, -1077945524, -1077945524, -1147969380, -1077945492}
#11 0xc01005d6 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
>How-To-Repeat:
  1. ssh to the netbsd-6 machine
  2. install the tmux-1.4nb1 pkgsrc package
  3. run 'tmux new-session`
  4. ssh to the same machine from a different terminal
  5. run 'tmux list-sessions 0<&-'
>Fix:

>Release-Note:

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: christos@NetBSD.org
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
Date: Fri, 18 May 2012 12:06:27 +0200

 On Thu, May 17, 2012 at 11:20:00PM +0000, rhansen@bbn.com wrote:
 > #4  0xc0679e3c in kqueue_doclose (kq=0xc6487a58, list=0xc63b194c, fd=0)
 >     at /usr/src/sys/kern/kern_event.c:1398
 >         kn = 0xc0c0
 >         fdp = 0xc64b1cc0

 Here fdp is kq->kq_fdp. We expect it to be locked.

 > #5  0xc0679f00 in kqueue_close (fp=0xc5ecc200) at /usr/src/sys/kern/kern_event.c:1430
 >         kq = 0xc6487a58
 >         fdp = 0xc63b1940
 >         ff = 0xc63b1940
 >         i = 0

 Here fdp is curlwp->l_fdp, and we lock it.

 Sounds clearly wrong to me, why is curlwp involved here (and why does a kqueue
 have its own filedesc_t)?

 Martin

From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running
 tmux
Date: Fri, 18 May 2012 08:50:10 -0600

 On Thu, 17 May 2012 23:20:00 +0000 (UTC)
 rhansen@bbn.com wrote:

 > >How-To-Repeat:
 >   1. ssh to the netbsd-6 machine
 >   2. install the tmux-1.4nb1 pkgsrc package
 >   3. run 'tmux new-session`
 >   4. ssh to the same machine from a different terminal
 >   5. run 'tmux list-sessions 0<&-'
 > >Fix:

 Step 2 is not needed -- tmux is in /usr/bin on netbsd-6, and replicates
 the issue...

 Later...

 Greg Oster

From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running
 tmux
Date: Fri, 18 May 2012 09:06:35 -0600

 On Thu, 17 May 2012 23:20:00 +0000 (UTC)
 rhansen@bbn.com wrote:

 > >Number:         46463
 > >Category:       kern
 > >Synopsis:       netbsd-6:  panic and filesystem corruption running
 > >tmux Confidential:   no
 > >Severity:       critical
 > >Priority:       high
 > >Responsible:    kern-bug-people
 > >State:          open
 > >Class:          sw-bug
 > >Submitter-Id:   net
 > >Arrival-Date:   Thu May 17 23:20:00 +0000 2012
 > >Originator:     Richard Hansen
 > >Release:        6.0_BETA
 > >Organization:
 > >Environment:
 > NetBSD 6.0_BETA NetBSD 6.0_BETA (GENERIC) i386
 > >Description:
 > On i386 GENERIC netbsd-6 (nightly build from around 2012-04-28), I
 > get a panic and significant filesystem corruption (ffs with log,
 > noatime) if I do the following:
 > 
 >   1. ssh to the netbsd-6 machine
 >   2. install the tmux-1.4nb1 pkgsrc package
 >   3. run 'tmux new-session`
 >   4. ssh to the same machine from a different terminal
 >   5. run 'tmux list-sessions 0<&-'
 > 
 > For some reason, closing stdin in step #5 above is required to
 > trigger the panic.
 > 
 > Using a lightly modified and older (2012-03-28) snapshot, I get the
 > following backtrace in the close() syscall():
 > 
 > #0  maybe_dump (howto=260)
 > at /usr/src/sys/arch/i386/i386/machdep.c:880 s = 0
 > #1  0xc071c8ca in cpu_reboot (howto=260, bootstr=0x0)
 >     at /usr/src/sys/arch/i386/i386/machdep.c:899
 >         syncdone = false
 >         s = 0
 > #2  0xc09a79c4 in vpanic (fmt=0xc0deb79c "kernel %sassertion \"%s\"
 > failed: file \"%s\", line %d ", ap=0xddcd8b58
 > "&#65533;\267300\024\270300 \267300v\005")
 > at /usr/src/sys/kern/subr_prf.c:308 cii = 0
 >         ci = 0x0
 >         oci = 0x0
 >         bootopt = 260
 >         scratchstr = "kernel diagnostic assertion
 > \"mutex_owned(&fdp->fd_lock)\" failed: file
 > \"/usr/src/sys/kern/kern_event.c\", line 1398 ", '\000' <repeats 97
 > times> #3  0xc0bbb8c3 in kern_assert (fmt=0xc0deb79c "kernel
 > times> %sassertion \"%s\" failed: file \"%s\", line %d ")
 > times> at /usr/src/sys/lib/libkern/kern_assert.c:50 ap = 0xddcd8b58
 > times> "&#65533;\267300\024\270300 \267300v\005"
 > #4  0xc0679e3c in kqueue_doclose (kq=0xc6487a58, list=0xc63b194c,
 > fd=0) at /usr/src/sys/kern/kern_event.c:1398
 >         kn = 0xc0c0
 >         fdp = 0xc64b1cc0
 > #5  0xc0679f00 in kqueue_close (fp=0xc5ecc200)
 > at /usr/src/sys/kern/kern_event.c:1430 kq = 0xc6487a58
 >         fdp = 0xc63b1940
 >         ff = 0xc63b1940
 >         i = 0
 > #6  0xc06737fb in closef (fp=0xc5ecc200)
 > at /usr/src/sys/kern/kern_descrip.c:824 lf = {l_start =
 > -4184506505861690440, l_len = -4160666348713672704, l_pid = 0, l_type
 > = 0, l_whence = 0} error = 0 #7  0xc0673411 in fd_close (fd=13)
 > at /usr/src/sys/kern/kern_descrip.c:709 lf = {l_start =
 > -4629759700944188308, l_len = 28, l_pid = -573731704, l_type =
 > -29420, l_whence = -8755} fdp = 0xc63b1940 ff = 0xc5eda580
 >         fp = 0xc5ecc200
 >         p = 0xc5dea370
 >         l = 0xc6425800
 >         refcnt = 0
 > #8  0xc09c6e5d in sys_close (l=0xc6425800, uap=0xddcd8cec,
 > retval=0xddcd8d14) at /usr/src/sys/kern/sys_descrip.c:486
 > No locals.
 > #9  0xc09d687b in sy_call (sy=0xc0f35608, l=0xc6425800,
 > uap=0xddcd8cec, rval=0xddcd8d14) at /usr/src/sys/sys/syscallvar.h:61
 >         error = 0
 > #10 0xc09d6c1e in syscall (frame=0xddcd8d48)
 > at /usr/src/sys/arch/x86/x86/syscall.c:179 callp = 0xc0f35608
 >         p = 0xc5dea370
 >         l = 0xc6425800
 >         error = 0
 >         code = 6
 >         rval = {0, 0}
 >         rip_call = -1146338251
 >         args = {13, 11, -1077945480, -1145308313, 64, -1077945516,
 > -1077945524, -1077945524, -1147969380, -1077945492} #11 0xc01005d6
 > in ?? () No symbol table info available.
 > Backtrace stopped: previous frame inner to this frame (corrupt stack?)
 > >How-To-Repeat:
 >   1. ssh to the netbsd-6 machine
 >   2. install the tmux-1.4nb1 pkgsrc package
 >   3. run 'tmux new-session`
 >   4. ssh to the same machine from a different terminal
 >   5. run 'tmux list-sessions 0<&-'
 > >Fix:

 The following patch from Martin Husemann gets rid of the panic for me.

 Any one else have comments on this patch?  (we think it's right, but
 hopefully someone who understands this code better will chime in...)

 Later...

 Greg Oster


 Index: kern_event.c
 ===================================================================
 RCS file: /cvsroot/src/sys/kern/kern_event.c,v
 retrieving revision 1.75
 diff -c -r1.75 kern_event.c
 *** kern_event.c        25 Jan 2012 00:28:35 -0000      1.75
 --- kern_event.c        18 May 2012 15:03:42 -0000
 ***************
 *** 1421,1427 ****
         int i;

         kq = fp->f_data;
 !       fdp = curlwp->l_fd;

         mutex_enter(&fdp->fd_lock);
         for (i = 0; i <= fdp->fd_lastkqfile; i++) {
 --- 1421,1427 ----
         int i;

         kq = fp->f_data;
 !       fdp = kq->kq_fdp;

         mutex_enter(&fdp->fd_lock);
         for (i = 0; i <= fdp->fd_lastkqfile; i++) {

From: "Jukka Ruohonen" <jruoho@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: src
Date: Fri, 18 May 2012 15:25:26 +0000

 Module Name:	src
 Committed By:	jruoho
 Date:		Fri May 18 15:25:26 UTC 2012

 Modified Files:
 	src/distrib/sets/lists/tests: mi
 	src/etc/mtree: NetBSD.dist.tests
 	src/tests/usr.bin: Makefile
 Added Files:
 	src/tests/usr.bin/tmux: Makefile t_tmux.sh

 Log Message:
 Add a test case for PR kern/46463. From Richard Hansen.


 To generate a diff of this commit:
 cvs rdiff -u -r1.467 -r1.468 src/distrib/sets/lists/tests/mi
 cvs rdiff -u -r1.75 -r1.76 src/etc/mtree/NetBSD.dist.tests
 cvs rdiff -u -r1.13 -r1.14 src/tests/usr.bin/Makefile
 cvs rdiff -u -r0 -r1.1 src/tests/usr.bin/tmux/Makefile \
     src/tests/usr.bin/tmux/t_tmux.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jukka Ruohonen" <jruoho@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: src/tests/usr.bin/tmux
Date: Sat, 19 May 2012 07:30:38 +0000

 Module Name:	src
 Committed By:	jruoho
 Date:		Sat May 19 07:30:38 UTC 2012

 Modified Files:
 	src/tests/usr.bin/tmux: t_tmux.sh

 Log Message:
 Make the test pass. It appears that this however does not reproduce the
 PR kern/46463, even though the command is exactly the same.


 To generate a diff of this commit:
 cvs rdiff -u -r1.1 -r1.2 src/tests/usr.bin/tmux/t_tmux.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Richard Hansen <rhansen@bbn.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running
 tmux
Date: Tue, 22 May 2012 17:13:44 -0400

 On 2012-05-18 11:10, Greg Oster wrote:
 >   The following patch from Martin Husemann gets rid of the panic for me.
 >
 >   Any one else have comments on this patch?  (we think it's right, but
 >   hopefully someone who understands this code better will chime in...)

 There are other functions in src/sys/kern/kern_event.c that have the line:

      fdp = curlwp->l_fd;

 Are those also wrong?

 Thanks,
 Richard

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
Date: Wed, 23 May 2012 00:16:08 +0200

 On Tue, May 22, 2012 at 09:15:04PM +0000, Richard Hansen wrote:
 >  There are other functions in src/sys/kern/kern_event.c that have the line:
 >  
 >       fdp = curlwp->l_fd;
 >  
 >  Are those also wrong?

 I think the one in knote_fdclose() is correct, the other are probably wrong
 too.

 It only matters if the same kernel object is watched via kqueue from
 different processes, which seems to be an uncommon operation. We need to
 add test cases for it...

 Martin

From: Richard Hansen <rhansen@bbn.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running
 tmux
Date: Tue, 22 May 2012 17:33:50 -0400

 I think I've found a way to test the same underlying bug without causing 
 a panic:

    sh -c 'tmux new-session -d && tmux list-sessions && read'

 The above should sit and wait for a line of input on stdin, but instead 
 /bin/sh exits with the following error:

    read: arg count

 (Not a very helpful error message.)  If I run the above with bash 4.2.0 
 from pkgsrc, I get:

    -bash: read: read error: 0: Resource temporarily unavailable

 This error is what led me to do 'tmux list-sessions 0<&-' in the first 
 place.  I had dismissed the issue as unrelated until I realized that 
 Linux doesn't have this problem (but NetBSD 5 does).

 I can't reproduce the above error when stdin is something other than a tty.

 This could be a completely unrelated issue.  I haven't yet tried the 
 proposed patch to see if it makes this error go away.

 -Richard

From: Richard Hansen <rhansen@bbn.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running
 tmux
Date: Tue, 22 May 2012 21:12:07 -0400

 On 2012-05-22 17:33, Richard Hansen wrote:
 > I think I've found a way to test the same underlying bug without causing
 > a panic:
 >
 >     sh -c 'tmux new-session -d && tmux list-sessions && read'
 >
 > The above should sit and wait for a line of input on stdin, but instead
 > /bin/sh exits with the following error:
 >
 >     read: arg count
 >
 > (Not a very helpful error message.) If I run the above with bash 4.2.0
 > from pkgsrc, I get:
 >
 >     -bash: read: read error: 0: Resource temporarily unavailable

 With Martin's patch the panic is gone, but I can still reproduce the 
 above error.  Perhaps the above error is unrelated to the panic.

 -Richard

Responsible-Changed-From-To: kern-bug-people->martin
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Wed, 23 May 2012 06:51:00 +0000
Responsible-Changed-Why:
Not that simple, needs closer investigation.


From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
Date: Thu, 31 May 2012 12:08:22 +0200

 So it turns out (subject to interpretation in detail and to be discussed
 on tech-kern) that we see a side effect of two bugs here.

 First, at the application level, in tmux, there is something wrong.
 The client attaching to a already established session does:
  (1) call kqueue() to be able to receive kevents
  (2) connect via a local socket to the already running tmux instance
  (3) pass stdin/stdout/stderr vis sendmsg with SCM_RIGHTS to the session mux

 Now in the case that triggered this PR, tmux was invoked with stdin closed.
 Not suprising, kqueue() in (1) returns 0. Later in (3), this file descriptor
 is passed to the mux (which expects to receive stdin).

 So, adding something like this:

   while ((fd = open("/dev/null", O_RDWR)) < 3)
      ;
   close(fd);

 early in tmux would make sure this does not happen, replacing the missing
 descriptors with /dev/null.

 Now for the kernel part. A kqueue was intended to be a local-proc-only thing.
 There is no sense passing it from one process to another. That is why they
 are not inherited during fork() for example. However, and I think this is a
 bug, they can be passed via SCM_RIGHTS messages.

 I can reproduce the crash with this test program:

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <err.h>
 #include <sys/event.h>
 #include <sys/time.h>
 #include <sys/socket.h>
 #include <sys/wait.h>

 static int
 child_proc(int fd)
 {
 	int kq, storage = -1;
 	struct cmsghdr *msg;
 	struct iovec iov;
 	struct msghdr m;

 	msg = malloc(CMSG_SPACE(sizeof(int)));
 	iov.iov_base = &storage;
 	iov.iov_len = sizeof(int);
 	m.msg_iov = &iov;
 	m.msg_iovlen = 1;

 	if (recvmsg(fd, &m, 0) == -1)
 		err(1, "child: could not recvmsg");

 	kq = *(int *)CMSG_DATA(msg);
 	printf("child (pid %d): received kq fd %d\n", getpid(), kq);

 	return 0;
 }

 int main(void)
 {
 	pid_t child;
 	int s[2], storage, status, kq;
 	struct cmsghdr *msg;
 	struct iovec iov;
 	struct msghdr m;
 	struct kevent ev;

 	kq = kqueue();
 	if (kq < 0)
 		err(1, "could not create kqueue");

 	if (socketpair(AF_LOCAL, SOCK_STREAM, 0, s) == -1)
 		err(1, "could not create a socket pair");


 	child = fork();
 	if (child == 0) {
 		close(s[0]);
 		return child_proc(s[1]);
 	}
 	close(s[1]);

 	msg = malloc(CMSG_SPACE(sizeof(int)));
 	iov.iov_base = &storage;
 	iov.iov_len = sizeof(int);

 	msg->cmsg_level = SOL_SOCKET;
 	msg->cmsg_type = SCM_RIGHTS;
 	msg->cmsg_len = CMSG_LEN(sizeof(int));

 	m.msg_iov = &iov;
 	m.msg_iovlen = 1;
 	m.msg_name = NULL;
 	m.msg_namelen = 0;
 	m.msg_control = msg;
 	m.msg_controllen = CMSG_SPACE(sizeof(int));
 	*(int *)CMSG_DATA(msg) = kq;

 	EV_SET(&ev, 1, EVFILT_TIMER, EV_ADD|EV_ENABLE, 0, 1, 0);
 	if (kevent(kq, &ev, 1, NULL, 0, NULL) == -1)
 		err(1, "could not add timer event");

 	printf("parent (pid %d): sending kq fd %d\n", getpid(), kq);
 	if (sendmsg(s[0], &m, 0) == -1)
 		err(1, "could not sendmsg");

 	close(kq);

 	waitpid(child, &status, 0);

 	return 0;
 }


 IMHO the correct fix would be to filter kqueue descriptors out of SCM_RIGHTS
 at the sending side and document this (in unp_internalize break out and
 return EPERM if there is a file_t with f_type == DTYPE_KQUEUE).

 Martin

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, martin@NetBSD.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org, rhansen@bbn.com
Cc: 
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
Date: Thu, 31 May 2012 09:18:53 -0400

 On May 31, 10:10am, martin@duskware.de (Martin Husemann) wrote:
 -- Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption runnin

 | The following reply was made to PR kern/46463; it has been noted by GNATS.
 | 
 | From: Martin Husemann <martin@duskware.de>
 | To: gnats-bugs@NetBSD.org
 | Cc: 
 | Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
 | Date: Thu, 31 May 2012 12:08:22 +0200
 | 
 |  So it turns out (subject to interpretation in detail and to be discussed
 |  on tech-kern) that we see a side effect of two bugs here.
 |  
 |  First, at the application level, in tmux, there is something wrong.
 |  The client attaching to a already established session does:
 |   (1) call kqueue() to be able to receive kevents
 |   (2) connect via a local socket to the already running tmux instance
 |   (3) pass stdin/stdout/stderr vis sendmsg with SCM_RIGHTS to the session mux
 |  
 |  Now in the case that triggered this PR, tmux was invoked with stdin closed.
 |  Not suprising, kqueue() in (1) returns 0. Later in (3), this file descriptor
 |  is passed to the mux (which expects to receive stdin).
 |  
 |  So, adding something like this:
 |  
 |    while ((fd = open("/dev/null", O_RDWR)) < 3)
 |       ;
 |    close(fd);
 |  
 |  early in tmux would make sure this does not happen, replacing the missing
 |  descriptors with /dev/null.
 |  
 |  Now for the kernel part. A kqueue was intended to be a local-proc-only thing.
 |  There is no sense passing it from one process to another. That is why they
 |  are not inherited during fork() for example. However, and I think this is a
 |  bug, they can be passed via SCM_RIGHTS messages.
 |  
 |  I can reproduce the crash with this test program:
 |  
 |  #include <stdio.h>
 |  #include <stdlib.h>
 |  #include <string.h>
 |  #include <unistd.h>
 |  #include <err.h>
 |  #include <sys/event.h>
 |  #include <sys/time.h>
 |  #include <sys/socket.h>
 |  #include <sys/wait.h>
 |  
 |  static int
 |  child_proc(int fd)
 |  {
 |  	int kq, storage = -1;
 |  	struct cmsghdr *msg;
 |  	struct iovec iov;
 |  	struct msghdr m;
 |  
 |  	msg = malloc(CMSG_SPACE(sizeof(int)));
 |  	iov.iov_base = &storage;
 |  	iov.iov_len = sizeof(int);
 |  	m.msg_iov = &iov;
 |  	m.msg_iovlen = 1;
 |  
 |  	if (recvmsg(fd, &m, 0) == -1)
 |  		err(1, "child: could not recvmsg");
 |  
 |  	kq = *(int *)CMSG_DATA(msg);
 |  	printf("child (pid %d): received kq fd %d\n", getpid(), kq);
 |  
 |  	return 0;
 |  }
 |  
 |  int main(void)
 |  {
 |  	pid_t child;
 |  	int s[2], storage, status, kq;
 |  	struct cmsghdr *msg;
 |  	struct iovec iov;
 |  	struct msghdr m;
 |  	struct kevent ev;
 |  
 |  	kq = kqueue();
 |  	if (kq < 0)
 |  		err(1, "could not create kqueue");
 |  
 |  	if (socketpair(AF_LOCAL, SOCK_STREAM, 0, s) == -1)
 |  		err(1, "could not create a socket pair");
 |  
 |  
 |  	child = fork();
 |  	if (child == 0) {
 |  		close(s[0]);
 |  		return child_proc(s[1]);
 |  	}
 |  	close(s[1]);
 |  
 |  	msg = malloc(CMSG_SPACE(sizeof(int)));
 |  	iov.iov_base = &storage;
 |  	iov.iov_len = sizeof(int);
 |  
 |  	msg->cmsg_level = SOL_SOCKET;
 |  	msg->cmsg_type = SCM_RIGHTS;
 |  	msg->cmsg_len = CMSG_LEN(sizeof(int));
 |  
 |  	m.msg_iov = &iov;
 |  	m.msg_iovlen = 1;
 |  	m.msg_name = NULL;
 |  	m.msg_namelen = 0;
 |  	m.msg_control = msg;
 |  	m.msg_controllen = CMSG_SPACE(sizeof(int));
 |  	*(int *)CMSG_DATA(msg) = kq;
 |  
 |  	EV_SET(&ev, 1, EVFILT_TIMER, EV_ADD|EV_ENABLE, 0, 1, 0);
 |  	if (kevent(kq, &ev, 1, NULL, 0, NULL) == -1)
 |  		err(1, "could not add timer event");
 |  
 |  	printf("parent (pid %d): sending kq fd %d\n", getpid(), kq);
 |  	if (sendmsg(s[0], &m, 0) == -1)
 |  		err(1, "could not sendmsg");
 |  
 |  	close(kq);
 |  
 |  	waitpid(child, &status, 0);
 |  
 |  	return 0;
 |  }
 |  
 |  
 |  IMHO the correct fix would be to filter kqueue descriptors out of SCM_RIGHTS
 |  at the sending side and document this (in unp_internalize break out and
 |  return EPERM if there is a file_t with f_type == DTYPE_KQUEUE).

 What's the harm in letting them through and making them work properly in
 the new process?

 christos

From: Martin Husemann <martin@duskware.de>
To: Christos Zoulas <christos@zoulas.com>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
Date: Thu, 31 May 2012 15:23:18 +0200

 On Thu, May 31, 2012 at 09:18:53AM -0400, Christos Zoulas wrote:
 > What's the harm in letting them through and making them work properly in
 > the new process?

 None, probably. Needs a bit of carefull review. However, I don't see
 any use case where this would make sense from a userland perspective.

 Martin

From: christos@zoulas.com (Christos Zoulas)
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
Date: Thu, 31 May 2012 09:30:44 -0400

 On May 31,  3:23pm, martin@duskware.de (Martin Husemann) wrote:
 -- Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption runnin

 | On Thu, May 31, 2012 at 09:18:53AM -0400, Christos Zoulas wrote:
 | > What's the harm in letting them through and making them work properly in
 | > the new process?
 | 
 | None, probably. Needs a bit of carefull review. However, I don't see
 | any use case where this would make sense from a userland perspective.

 It would make sense if you wanted an unprivileged process monitor a
 file descriptor monitor something what does not have access to, without
 making the client setuid. From the orthogonality POV, I don't see why
 file descriptor handling should be special-cased. It should just work
 as expected, i.e. a file descriptor passed from one process to another
 should function properly regardless of type.

 christos

From: Martin Husemann <martin@duskware.de>
To: Christos Zoulas <christos@zoulas.com>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
Date: Thu, 31 May 2012 15:54:34 +0200

 On Thu, May 31, 2012 at 09:30:44AM -0400, Christos Zoulas wrote:
 > It would make sense if you wanted an unprivileged process monitor a
 > file descriptor monitor something what does not have access to, without
 > making the client setuid.

 OK, so I'll fix it that way.

 Martin

From: christos@zoulas.com (Christos Zoulas)
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
Date: Thu, 31 May 2012 10:49:31 -0400

 On May 31,  3:54pm, martin@duskware.de (Martin Husemann) wrote:
 -- Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption runnin

 | On Thu, May 31, 2012 at 09:30:44AM -0400, Christos Zoulas wrote:
 | > It would make sense if you wanted an unprivileged process monitor a
 | > file descriptor monitor something what does not have access to, without
 | > making the client setuid.
 | 
 | OK, so I'll fix it that way.

 Yes, and we can remove the close-on-fork part too I guess.

 christos

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
Date: Thu, 31 May 2012 17:09:31 +0200

 On Thu, May 31, 2012 at 02:50:06PM +0000, Christos Zoulas wrote:
 >  Yes, and we can remove the close-on-fork part too I guess.

 I'm not sure we should do that by default, but we could add a flag for
 kqueue1() to make it inheritable. Otherwise it might be a suprise with
 unclear consequences on existing apps (like now suddenly running into
 RLIMIT_NOFILE).

 However, is this realy worth the effort? All other OSes seem to not inherit
 this descriptors as well.

 Martin

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, martin@NetBSD.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org, rhansen@bbn.com
Cc: 
Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption running tmux
Date: Thu, 31 May 2012 11:28:18 -0400

 On May 31,  3:10pm, martin@duskware.de (Martin Husemann) wrote:
 -- Subject: Re: kern/46463: netbsd-6:  panic and filesystem corruption runnin

 |  On Thu, May 31, 2012 at 02:50:06PM +0000, Christos Zoulas wrote:
 |  >  Yes, and we can remove the close-on-fork part too I guess.
 |  
 |  I'm not sure we should do that by default, but we could add a flag for
 |  kqueue1() to make it inheritable. Otherwise it might be a suprise with
 |  unclear consequences on existing apps (like now suddenly running into
 |  RLIMIT_NOFILE).

 I don't think that this will happen in reality; there are very few programs
 using kqueue, and I think that most people don't treat the kqueue file
 descriptors specially. I.e. if you ask jill-random if kqueue descriptors are
 inherited on fork, she will probably say yes.

 |  However, is this realy worth the effort? All other OSes seem to not inherit
 |  this descriptors as well.

 I don't know. I just dislike the special treatment. I think that the power
 of unix comes from the fact that file-descriptors have all treated equally
 for common operations. Yes, I know we have added all kinds of warts through
 the years, but...

 christos

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: src/external/bsd/tmux/dist
Date: Thu, 31 May 2012 19:14:57 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Thu May 31 19:14:56 UTC 2012

 Modified Files:
 	src/external/bsd/tmux/dist: tmux.c

 Log Message:
 Make sure to have file descriptors 0-2 open before doing further setup.
 Otherwise strange invocations could trick us into passing the event
 descriptor as, for example, stdin to the server.
 See PR 46463 for details.


 To generate a diff of this commit:
 cvs rdiff -u -r1.1.1.2 -r1.2 src/external/bsd/tmux/dist/tmux.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: src/tests/lib/libc/sys
Date: Thu, 31 May 2012 20:31:07 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Thu May 31 20:31:07 UTC 2012

 Modified Files:
 	src/tests/lib/libc/sys: t_kevent.c

 Log Message:
 Add a (skipped for now) test case for PR 46463


 To generate a diff of this commit:
 cvs rdiff -u -r1.2 -r1.3 src/tests/lib/libc/sys/t_kevent.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: src/sys/kern
Date: Sat, 2 Jun 2012 16:16:16 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Sat Jun  2 16:16:16 UTC 2012

 Modified Files:
 	src/sys/kern: uipc_usrreq.c

 Log Message:
 Stopgap fix for PR kern/46463: disallow passing of kqueue descriptors
 via SCM_RIGHT anxiliary socket messages.


 To generate a diff of this commit:
 cvs rdiff -u -r1.136 -r1.137 src/sys/kern/uipc_usrreq.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->pending-pullups
State-Changed-By: martin@NetBSD.org
State-Changed-When: Sat, 02 Jun 2012 16:50:56 +0000
State-Changed-Why:
[pullup-6 #304], [pullup-5 #1766]


From: "Julian Coleman" <jdc@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: [netbsd-5] src/sys/kern
Date: Sun, 3 Jun 2012 08:47:28 +0000

 Module Name:	src
 Committed By:	jdc
 Date:		Sun Jun  3 08:47:28 UTC 2012

 Modified Files:
 	src/sys/kern [netbsd-5]: uipc_usrreq.c

 Log Message:
 Pull up revision 1.137 (requested by martin in ticket #1766).

 Stopgap fix for PR kern/46463: disallow passing of kqueue descriptors
 via SCM_RIGHT anxiliary socket messages.


 To generate a diff of this commit:
 cvs rdiff -u -r1.119.4.4 -r1.119.4.5 src/sys/kern/uipc_usrreq.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Julian Coleman" <jdc@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: [netbsd-5-0] src/sys/kern
Date: Sun, 3 Jun 2012 08:47:30 +0000

 Module Name:	src
 Committed By:	jdc
 Date:		Sun Jun  3 08:47:30 UTC 2012

 Modified Files:
 	src/sys/kern [netbsd-5-0]: uipc_usrreq.c

 Log Message:
 Pull up revision 1.137 (requested by martin in ticket #1766).

 Stopgap fix for PR kern/46463: disallow passing of kqueue descriptors
 via SCM_RIGHT anxiliary socket messages.


 To generate a diff of this commit:
 cvs rdiff -u -r1.119.4.2 -r1.119.4.2.2.1 src/sys/kern/uipc_usrreq.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Julian Coleman" <jdc@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: [netbsd-5-1] src/sys/kern
Date: Sun, 3 Jun 2012 08:47:36 +0000

 Module Name:	src
 Committed By:	jdc
 Date:		Sun Jun  3 08:47:36 UTC 2012

 Modified Files:
 	src/sys/kern [netbsd-5-1]: uipc_usrreq.c

 Log Message:
 Pull up revision 1.137 (requested by martin in ticket #1766).

 Stopgap fix for PR kern/46463: disallow passing of kqueue descriptors
 via SCM_RIGHT anxiliary socket messages.


 To generate a diff of this commit:
 cvs rdiff -u -r1.119.4.3 -r1.119.4.3.2.1 src/sys/kern/uipc_usrreq.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: [netbsd-6] src/sys/kern
Date: Mon, 11 Jun 2012 23:20:38 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Mon Jun 11 23:20:38 UTC 2012

 Modified Files:
 	src/sys/kern [netbsd-6]: uipc_usrreq.c

 Log Message:
 Pull up following revision(s) (requested by martin in ticket #304):
 	sys/kern/uipc_usrreq.c: revision 1.137
 Stopgap fix for PR kern/46463: disallow passing of kqueue descriptors
 via SCM_RIGHT anxiliary socket messages.


 To generate a diff of this commit:
 cvs rdiff -u -r1.136 -r1.136.8.1 src/sys/kern/uipc_usrreq.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Tue, 26 Jun 2012 19:49:44 +0000
State-Changed-Why:
Crash is fixed and pulled up.


From: "Julian Coleman" <jdc@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: [netbsd-6] src
Date: Sat, 24 Nov 2012 21:40:03 +0000

 Module Name:	src
 Committed By:	jdc
 Date:		Sat Nov 24 21:40:03 UTC 2012

 Modified Files:
 	src/lib/libc/sys [netbsd-6]: kqueue.2
 	src/sys/kern [netbsd-6]: kern_descrip.c kern_event.c
 	src/tests/lib/libc/sys [netbsd-6]: t_kevent.c

 Log Message:
 Pull up revisions:
   src/sys/kern/kern_event.c revision 1.79
   src/sys/kern/kern_descrip.c revision 1.219
   src/lib/libc/sys/kqueue.2 revision 1.33
   src/tests/lib/libc/sys/t_kevent.c revision 1.2-1.5
 (requested by christos in ticket #716).

 - initialize kn_id
 - in close, invalidate f_data and f_type early to prevent accidental re-use
 - add a DIAGNOSTIC for when we use unsupported fd's and a KASSERT for f_event
   being NULL.

 Return EOPNOTSUPP for fnullop_kqfilter to prevent registration of unsupported
 fds. XXX: We should really fix the fd's to be supported in the future.
 Unsupported fd's have a NULL f_event, so registering crashes the kernel with
 a NULL function dereference of f_event.

 mention that kevent returns now EOPNOTSUPP.

 Move the references to PRs from code comments to the test description. Once
 ATF has the ability to output the metadata in the HTML reports, it should be
 easy to traverse between releng and gnats -reports via links.

 Add a (skipped for now) test case for PR 46463

 adapt to new reality

 Add a test for adding an event to an unsupported fd.


 To generate a diff of this commit:
 cvs rdiff -u -r1.32 -r1.32.2.1 src/lib/libc/sys/kqueue.2
 cvs rdiff -u -r1.218 -r1.218.2.1 src/sys/kern/kern_descrip.c
 cvs rdiff -u -r1.75 -r1.75.2.1 src/sys/kern/kern_event.c
 cvs rdiff -u -r1.1 -r1.1.2.1 src/tests/lib/libc/sys/t_kevent.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Julian Coleman" <jdc@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46463 CVS commit: [netbsd-6-0] src
Date: Sat, 24 Nov 2012 21:40:08 +0000

 Module Name:	src
 Committed By:	jdc
 Date:		Sat Nov 24 21:40:08 UTC 2012

 Modified Files:
 	src/lib/libc/sys [netbsd-6-0]: kqueue.2
 	src/sys/kern [netbsd-6-0]: kern_descrip.c kern_event.c
 	src/tests/lib/libc/sys [netbsd-6-0]: t_kevent.c

 Log Message:
 Pull up revisions:
   src/sys/kern/kern_event.c revision 1.79
   src/sys/kern/kern_descrip.c revision 1.219
   src/lib/libc/sys/kqueue.2 revision 1.33
   src/tests/lib/libc/sys/t_kevent.c revision 1.2-1.5
 (requested by christos in ticket #716).

 - initialize kn_id
 - in close, invalidate f_data and f_type early to prevent accidental re-use
 - add a DIAGNOSTIC for when we use unsupported fd's and a KASSERT for f_event
   being NULL.

 Return EOPNOTSUPP for fnullop_kqfilter to prevent registration of unsupported
 fds. XXX: We should really fix the fd's to be supported in the future.
 Unsupported fd's have a NULL f_event, so registering crashes the kernel with
 a NULL function dereference of f_event.

 mention that kevent returns now EOPNOTSUPP.

 Move the references to PRs from code comments to the test description. Once
 ATF has the ability to output the metadata in the HTML reports, it should be
 easy to traverse between releng and gnats -reports via links.

 Add a (skipped for now) test case for PR 46463

 adapt to new reality

 Add a test for adding an event to an unsupported fd.


 To generate a diff of this commit:
 cvs rdiff -u -r1.32 -r1.32.8.1 src/lib/libc/sys/kqueue.2
 cvs rdiff -u -r1.218 -r1.218.8.1 src/sys/kern/kern_descrip.c
 cvs rdiff -u -r1.75 -r1.75.6.1 src/sys/kern/kern_event.c
 cvs rdiff -u -r1.1 -r1.1.6.1 src/tests/lib/libc/sys/t_kevent.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.