NetBSD Problem Report #50430

From paul@pokey.whooppee.com  Sun Nov 15 10:27:10 2015
Return-Path: <paul@pokey.whooppee.com>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id ACF1BA654F
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 15 Nov 2015 10:27:10 +0000 (UTC)
Message-Id: <20151115102706.8CE4A16E7D@pokey.whooppee.com>
Date: Sun, 15 Nov 2015 18:27:06 +0800 (PHT)
From: paul@whooppee.com
Reply-To: paul@whooppee.com
To: gnats-bugs@NetBSD.org
Subject: syscall_disestablish() can remove active syscalls
X-Send-Pr-Version: 3.95

>Number:         50430
>Category:       kern
>Synopsis:       syscall_disestablish() can remove active syscalls
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    pgoyette
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Nov 15 10:30:00 +0000 2015
>Closed-Date:    Wed Nov 18 23:00:21 +0000 2015
>Last-Modified:  Wed Nov 18 23:00:21 +0000 2015
>Originator:     paul@whooppee.com
>Release:        NetBSD 7.99.21
>Organization:
+------------------+--------------------------+-------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+------------------+--------------------------+-------------------------+
>Environment:


System: NetBSD pokey.whooppee.com 7.99.21 NetBSD 7.99.21 (POKEY 2015-11-14 05:10:46) #0: Sat Nov 14 17:06:13 PHT 2015 paul@pokey.whooppee.com:/build/netbsd-local/obj/amd64/sys/arch/amd64/compile/POKEY amd64
Architecture: x86_64
Machine: amd64
>Description:
syscall_disestablish() checks only the last syscall (stored in struct lwp
member l_sysent) when checking to see if a syscall is still active.  In
most cases this works fine.

But if the syscall was stalled or blocked for some reason (waiting for an
I/O device for example), and a signal is subsequently delivered to the
lwp, the lwp's signal handler can call another syscall prior to return
from the first one.  In the case, the new syscall overwrites the l_sysent
member and there is no record of the original syscall, other than in the
process's stack.

If the original syscall is provided by a dynamically-loaded module, and
that module is now requested to be unloaded, syscall_disestablish() will
not be able to return EBUSY, so the module will be unloaded.  When the
original syscall is resumed, it will execute at an address which will no
longer contain the syscall code, likely resulting in some sort of trap.

On the other hand, it is also possible for syscall_disestablish() to
return a "false positive" result, indicating that a syscall is still
active when in fact it has been deactivated (consider an application
that calls logjmp(3) to return to a procedure beyond/above the one that
called the syscall).  In such case, the l_sysent member will still
contain an entry for a syscall that will never return.


>How-To-Repeat:

See above.
>Fix:
Unknown.  Perhaps the l_sysent member should be replaced with a "stack"
of entries and an associated stack-pointer, so that all nested syscalls
can be recorded.  But what would be the "right" size for the stack?  And
would we really want to provide for re-sizing it if needed?

If the stack approach were taken, and if the stack were re-sizable, that
could provide an opportunity for a malicious program to consume large
amounts of kernel memory, possibly resulting in a denial-of-service.  :(



>Release-Note:

>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Mon, 16 Nov 2015 07:23:49 +1100

 until bug is fixed, can we disable the autounload?


 .mrg.

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, paul@whooppee.com
Cc: 
Subject: re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Sun, 15 Nov 2015 16:45:37 -0500

 On Nov 15,  8:25pm, mrg@eterna.com.au (matthew green) wrote:
 -- Subject: re: kern/50430: syscall_disestablish() can remove active syscalls

 |  until bug is fixed, can we disable the autounload?

 Sure, that sounds prudent. It is difficult to fix properly. One way to do
 this would be to mark all lwps that have used compat syscalls with a bit
 depending on the module they have used, and refuse to unload the module
 until the lwp is gone.

 - when load a module that has compat syscalls, assign to it a bit.
 - mark a flags field of all syscalls that were loaded with that module
   with that bit.
 - or the lwp flags with the syscall flags on each syscall.
 - when it is time to unload check that no lwp has that bit in the flags set.
 - instead of keeping l_sysent, keep l_sysmodflags or something.

 christos

From: Paul Goyette <paul@vps1.whooppee.com>
To: Christos Zoulas <christos@zoulas.com>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org, 
    netbsd-bugs@netbsd.org
Subject: re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Mon, 16 Nov 2015 06:12:01 +0800 (PHT)

 On Sun, 15 Nov 2015, Christos Zoulas wrote:

 > Sure, that sounds prudent. ...

 Also realize that disabling autounload won't prevent manual unloading
 of modules via modunload(8).


 >                      ... It is difficult to fix properly. One way to do
 > this would be to mark all lwps that have used compat syscalls with a bit
 > depending on the module they have used, and refuse to unload the module
 > until the lwp is gone.
 >
 > - when load a module that has compat syscalls, assign to it a bit.
 > - mark a flags field of all syscalls that were loaded with that module
 >  with that bit.
 > - or the lwp flags with the syscall flags on each syscall.
 > - when it is time to unload check that no lwp has that bit in the flags
 >  set.
 > - instead of keeping l_sysent, keep l_sysmodflags or something.

 It's not only "compat" modules that need to worry about this.  It is
 potentially any unloadable module that uses syscall_establish().  (We
 could skip allocating/setting the flag for built-in modules, but I
 don't think that syscall_establish() is aware of this.)


 +------------------+--------------------------+-------------------------+
 | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
 | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
 +------------------+--------------------------+-------------------------+

From: matthew green <mrg@eterna.com.au>
To: paul@whooppee.com
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
    gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
    Christos Zoulas <christos@zoulas.com>
Subject: re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Mon, 16 Nov 2015 09:22:46 +1100

 Paul Goyette writes:
 > On Sun, 15 Nov 2015, Christos Zoulas wrote:
 > 
 > > Sure, that sounds prudent. ...
 > 
 > Also realize that disabling autounload won't prevent manual unloading
 > of modules via modunload(8).

 we also don't prevent root from modloading a call to panic() ;)

 > It's not only "compat" modules that need to worry about this.  It is
 > potentially any unloadable module that uses syscall_establish().  (We
 > could skip allocating/setting the flag for built-in modules, but I
 > don't think that syscall_establish() is aware of this.)

 yup.  this problem is pretty severe and i'm quite amazed we've
 not noticed problems from it way more often.

 thanks.


 .mrg.

From: Paul Goyette <paul@vps1.whooppee.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Mon, 16 Nov 2015 06:39:03 +0800 (PHT)

 From: matthew green <mrg@eterna.com.au>
 Reply-To: gnats-bugs@NetBSD.org
 To: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
      netbsd-bugs@netbsd.org, paul@whooppee.com


 On Sun, 15 Nov 2015, Christos Zoulas wrote:

 > - when load a module that has compat syscalls, assign to it a bit.
 > - mark a flags field of all syscalls that were loaded with that module
 >  with that bit.
 > - or the lwp flags with the syscall flags on each syscall.
 > - when it is time to unload check that no lwp has that bit in the flags set.
 > - instead of keeping l_sysent, keep l_sysmodflags or something.

 Just curious about any suggestions you might have for allocating the
 flag bit to be used....

 1. Seems to me we would need to keep a 32- or 64-bit mask for each
     syscall.  When a new request to sc_establish() comes in, we could
     calculate the logical-OR of all current bits, complement the
     result, and find-least-significant-bit-set.

 2. What do we do if we run out of bits?  Perhaps not too likely,
     but we would need to handle it.  What would be an appropriate
     value for sc_establish() to return if no bit is available?

 If everyone agrees on this approach

  	* replace l_sysent with a bitmask l_sc_bitmask,
  	* add complementary sy_bitmask to struct sysent (or just a
  	  short bit_number sy_mask_bitnum) to the syscall table, and
  	* have sc_disestablish() return an error if any active lwp
  	  has _ever_ called a related syscall)

 I can get started immediately on making this happen.


 +------------------+--------------------------+-------------------------+
 | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
 | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
 +------------------+--------------------------+-------------------------+

From: Paul Goyette <paul@vps1.whooppee.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Mon, 16 Nov 2015 09:36:53 +0800 (PHT)

 On Mon, 16 Nov 2015, Paul Goyette wrote:

 > 1. Seems to me we would need to keep a 32- or 64-bit mask for each
 >   syscall.  When a new request to sc_establish() comes in, we could
 >   calculate the logical-OR of all current bits, complement the
 >   result, and find-least-significant-bit-set.

 Calculating the logical-OR of all current bits requires us to know how
 many entries there are in the emul->e_sysent table.  Many ports have
 __HAVE_MINIMAL_EMUL defined, which prevents creation of the sy_nsysent
 member.

 For now, I am making the sy_nsysent member unconditional;  this affects
 a number of compat_* and rump/* files, but it appears that they were
 all prepared for this.

 > 2. What do we do if we run out of bits?  Perhaps not too likely,
 >   but we would need to handle it.  What would be an appropriate
 >   value for sc_establish() to return if no bit is available?

 For now, I've selected ENOFILE; simple to change if a better value is
 agreed upon.

From: Masao Uebayashi <uebayasi@gmail.com>
To: Christos Zoulas <christos@zoulas.com>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org, Paul Goyette <paul@whooppee.com>
Subject: Re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Mon, 16 Nov 2015 17:08:42 +0900

 On Mon, Nov 16, 2015 at 6:45 AM, Christos Zoulas <christos@zoulas.com> wrote:
 > On Nov 15,  8:25pm, mrg@eterna.com.au (matthew green) wrote:
 > -- Subject: re: kern/50430: syscall_disestablish() can remove active syscalls
 >
 > |  until bug is fixed, can we disable the autounload?
 >
 > Sure, that sounds prudent. It is difficult to fix properly. One way to do
 > this would be to mark all lwps that have used compat syscalls with a bit
 > depending on the module they have used, and refuse to unload the module
 > until the lwp is gone.
 >
 > - when load a module that has compat syscalls, assign to it a bit.
 > - mark a flags field of all syscalls that were loaded with that module
 >   with that bit.
 > - or the lwp flags with the syscall flags on each syscall.
 > - when it is time to unload check that no lwp has that bit in the flags set.
 > - instead of keeping l_sysent, keep l_sysmodflags or something.

 What happens if signal handler does longjmp(3) and interrupted syscall
 never returns?

From: Paul Goyette <paul@vps1.whooppee.com>
To: Masao Uebayashi <uebayasi@gmail.com>
Cc: Christos Zoulas <christos@zoulas.com>, gnats-bugs@netbsd.org, 
    kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Mon, 16 Nov 2015 16:20:55 +0800 (PHT)

 On Mon, 16 Nov 2015, Masao Uebayashi wrote:

 >> Sure, that sounds prudent. It is difficult to fix properly. One way to do
 >> this would be to mark all lwps that have used compat syscalls with a bit
 >> depending on the module they have used, and refuse to unload the module
 >> until the lwp is gone.
 >>
 >> - when load a module that has compat syscalls, assign to it a bit.
 >> - mark a flags field of all syscalls that were loaded with that module
 >>   with that bit.
 >> - or the lwp flags with the syscall flags on each syscall.
 >> - when it is time to unload check that no lwp has that bit in the flags set.
 >> - instead of keeping l_sysent, keep l_sysmodflags or something.
 >
 > What happens if signal handler does longjmp(3) and interrupted syscall
 > never returns?

 Whether or not the interrupted syscall returns, as long as the lwp is
 still alive it will prevent the syscall from being disestablished.

 In effect, it is a "false positive" but it allows us to err on the side
 of caution.  I'd rather have the module remain loaded even if nothing is
 currently referencing its resources, rather than have it get unloaded
 and then the kernel crashes.



 +------------------+--------------------------+-------------------------+
 | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
 | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
 +------------------+--------------------------+-------------------------+

From: Masao Uebayashi <uebayasi@gmail.com>
To: Paul Goyette <paul@whooppee.com>
Cc: Christos Zoulas <christos@zoulas.com>, gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Mon, 16 Nov 2015 17:42:04 +0900

 On Mon, Nov 16, 2015 at 5:20 PM, Paul Goyette <paul@vps1.whooppee.com> wrote:
 > On Mon, 16 Nov 2015, Masao Uebayashi wrote:
 >
 >>> Sure, that sounds prudent. It is difficult to fix properly. One way to do
 >>> this would be to mark all lwps that have used compat syscalls with a bit
 >>> depending on the module they have used, and refuse to unload the module
 >>> until the lwp is gone.
 >>>
 >>> - when load a module that has compat syscalls, assign to it a bit.
 >>> - mark a flags field of all syscalls that were loaded with that module
 >>>   with that bit.
 >>> - or the lwp flags with the syscall flags on each syscall.
 >>> - when it is time to unload check that no lwp has that bit in the flags
 >>> set.
 >>> - instead of keeping l_sysent, keep l_sysmodflags or something.
 >>
 >>
 >> What happens if signal handler does longjmp(3) and interrupted syscall
 >> never returns?
 >
 >
 > Whether or not the interrupted syscall returns, as long as the lwp is
 > still alive it will prevent the syscall from being disestablished.
 >
 > In effect, it is a "false positive" but it allows us to err on the side
 > of caution.  I'd rather have the module remain loaded even if nothing is
 > currently referencing its resources, rather than have it get unloaded
 > and then the kernel crashes.

 Ah.  That's simpler. :)

From: christos@zoulas.com (Christos Zoulas)
To: paul@whooppee.com, gnats-bugs@NetBSD.org
Cc: 
Subject: re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Mon, 16 Nov 2015 07:55:30 -0500

 On Nov 16,  6:39am, paul@vps1.whooppee.com (Paul Goyette) wrote:
 -- Subject: re: kern/50430: syscall_disestablish() can remove active syscalls

 | On Sun, 15 Nov 2015, Christos Zoulas wrote:
 | 
 | > - when load a module that has compat syscalls, assign to it a bit.
 | > - mark a flags field of all syscalls that were loaded with that module
 | >  with that bit.
 | > - or the lwp flags with the syscall flags on each syscall.
 | > - when it is time to unload check that no lwp has that bit in the flags set.
 | > - instead of keeping l_sysent, keep l_sysmodflags or something.
 | 
 | Just curious about any suggestions you might have for allocating the
 | flag bit to be used....

 Just look for a free one. The bit is per module, not per syscall. It is
 allocated only for modules that do syscall_establish(). If you run out,
 you don't load the module.

 | 
 | 1. Seems to me we would need to keep a 32- or 64-bit mask for each
 |     syscall.  When a new request to sc_establish() comes in, we could
 |     calculate the logical-OR of all current bits, complement the
 |     result, and find-least-significant-bit-set.

 Yes. We need a 32 or 64 bit mask per lwp. On each syscall we "or" the
 flag of the syscall (which is set when the module loads it) with the
 mask in lwp. A 64 bit mask would allow us to load 64 modules that allocate
 syscalls.

 On module unload, we scan the lwps to make sure that none has the bit of
 the module we are trying to unload set.

 | 2. What do we do if we run out of bits?  Perhaps not too likely,
 |     but we would need to handle it.  What would be an appropriate
 |     value for sc_establish() to return if no bit is available?

 ESRCH/ENOMEM?

 | If everyone agrees on this approach
 | 
 |  	* replace l_sysent with a bitmask l_sc_bitmask,
 |  	* add complementary sy_bitmask to struct sysent (or just a
 |  	  short bit_number sy_mask_bitnum) to the syscall table, and
 |  	* have sc_disestablish() return an error if any active lwp
 |  	  has _ever_ called a related syscall)
 | 
 | I can get started immediately on making this happen.

 I don't know, perhaps someone else has a better idea.

 christos

From: christos@zoulas.com (Christos Zoulas)
To: Masao Uebayashi <uebayasi@gmail.com>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, 
	Paul Goyette <paul@whooppee.com>
Subject: Re: kern/50430: syscall_disestablish() can remove active syscalls
Date: Mon, 16 Nov 2015 08:10:01 -0500

 On Nov 16,  5:08pm, uebayasi@gmail.com (Masao Uebayashi) wrote:
 -- Subject: Re: kern/50430: syscall_disestablish() can remove active syscalls

 | On Mon, Nov 16, 2015 at 6:45 AM, Christos Zoulas <christos@zoulas.com> wrote:
 | > On Nov 15,  8:25pm, mrg@eterna.com.au (matthew green) wrote:
 | > -- Subject: re: kern/50430: syscall_disestablish() can remove active syscalls
 | >
 | > |  until bug is fixed, can we disable the autounload?
 | >
 | > Sure, that sounds prudent. It is difficult to fix properly. One way to do
 | > this would be to mark all lwps that have used compat syscalls with a bit
 | > depending on the module they have used, and refuse to unload the module
 | > until the lwp is gone.
 | >
 | > - when load a module that has compat syscalls, assign to it a bit.
 | > - mark a flags field of all syscalls that were loaded with that module
 | >   with that bit.
 | > - or the lwp flags with the syscall flags on each syscall.
 | > - when it is time to unload check that no lwp has that bit in the flags set.
 | > - instead of keeping l_sysent, keep l_sysmodflags or something.
 | 
 | What happens if signal handler does longjmp(3) and interrupted syscall
 | never returns?

 It does not matter. The flag is never cleared (even when that does not
 happen). Once an lwp executes a syscall from a compat module, that module
 cannot be unloaded until the lwp exits.

 christos

Responsible-Changed-From-To: kern-bug-people->pgoyette
Responsible-Changed-By: pgoyette@NetBSD.org
Responsible-Changed-When: Wed, 18 Nov 2015 22:57:33 +0000
Responsible-Changed-Why:
I've been working on it.


State-Changed-From-To: open->closed
State-Changed-By: pgoyette@NetBSD.org
State-Changed-When: Wed, 18 Nov 2015 23:00:21 +0000
State-Changed-Why:
Based on further investigation, this PR seems not to be a problem
after all!

With a fairly simple test program, and a few extra printf()s in
strategic locations, I've determined that we never have "nested"
active syscalls.  If a signal gets delivered to the process, any
currently-active syscall is terminated, with status of either
EINTR or ERESTART.  In the ERESTART case, the syscall will be
automatically restarted.  (I'm still trying to find the code that
implements these empirical results.)


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.