NetBSD Problem Report #42724

From www@NetBSD.org  Wed Feb  3 00:07:01 2010
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 392B263C462
	for <gnats-bugs@gnats.NetBSD.org>; Wed,  3 Feb 2010 00:07:01 +0000 (UTC)
Message-Id: <20100203000700.EA98563B886@www.NetBSD.org>
Date: Wed,  3 Feb 2010 00:07:00 +0000 (UTC)
From: eravin@panix.com
Reply-To: eravin@panix.com
To: gnats-bugs@NetBSD.org
Subject: select(2)  and poll(2) can return non-error status on bad file descriptors
X-Send-Pr-Version: www-1.0

>Number:         42724
>Category:       kern
>Synopsis:       select(2)  and poll(2) can return non-error status on bad file descriptors
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Feb 03 00:10:00 +0000 2010
>Originator:     Ed Ravin
>Release:        5.0.1
>Organization:
PANIX Public Access Networks Corp
>Environment:
NetBSD panix3.panix.com 5.0.1 NetBSD 5.0.1 (PANIX-USER) #0: Thu Nov  5 22:13:39
EST 2009  root@juggler.panix.com:/devel/netbsd/5.0.1/src/sys/arch/i386/compile/PANIX-USER i386
>Description:
We repeatedly see programs such as emacs, mutt, elm, pine, trn, and nn go into infinite loops polling for input after the end user has lost their telnet or ssh session.

Here's a sample ktrace:
 19399      1 emacs-21.3 select(0x1, 0x8211000, 0, 0, 0xbf7fe7e8) = 1
 19399      1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d1744) Err#9 EBADF
 19399      1 emacs-21.3 getpid()                  = 19399, 7766
 19399      1 emacs-21.3 kill(0x4bc7, 0x1)         = 0
 19399      1 emacs-21.3 read(0, 0xbf7d1748, 0xfff) = 0
       ""
 19399      1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d174c) Err#9 EBADF
 19399      1 emacs-21.3 getpid()                  = 19399, 7766
 19399      1 emacs-21.3 kill(0x4bc7, 0x1)         = 0
 19399      1 emacs-21.3 read(0, 0xbf7d1750, 0xfff) = 0
       ""
 19399      1 emacs-21.3 select(0x1, 0x8211000, 0, 0, 0xbf7fe7e8) = 1
 19399      1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d1744) Err#9 EBADF
 19399      1 emacs-21.3 getpid()                  = 19399, 7766
 19399      1 emacs-21.3 kill(0x4bc7, 0x1)         = 0
 19399      1 emacs-21.3 read(0, 0xbf7d1748, 0xfff) = 0
       ""
 19399      1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d174c) Err#9 EBADF
 19399      1 emacs-21.3 getpid()                  = 19399, 7766
 19399      1 emacs-21.3 kill(0x4bc7, 0x1)         = 0
 19399      1 emacs-21.3 read(0, 0xbf7d1750, 0xfff) = 0
       ""


And so on ad infinitum.  Note that file descriptor #0 has been closed:
# fstat -p 19399
USER     CMD          PID   FD MOUNT       INUM MODE         SZ|DV R/W
zzz      emacs-21.3 19399   wd /net/u   6552785 drwx------    8192 r
zzz      emacs-21.3 19399    0 -         -        none    -
zzz      emacs-21.3 19399    1 -         -        none    -
zzz      emacs-21.3 19399    2 -         -        none    -

And here's the FD list:

(gdb) x/32 0x8211000
0x8211000:      0x00000001      0x00000000      0x00000000      0x00000000
0x8211010:      0x00000000      0x00000000      0x00000000      0x00000000
0x8211020:      0x1821cc34      0x00000000      0x00000000      0x00000000
0x8211030:      0x00000000      0x00000000      0x00000000      0x00000000
0x8211040:      0x00000001      0x00000000      0x00000000      0x00000000
0x8211050:      0x00000000      0x00000000      0x00000000      0x00000000
0x8211060:      0x00000000      0x00000000      0x00000000      0x00000000
0x8211070:      0x00000000      0x00000000      0x00000000      0x00000000
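
As a rough aid to reading the dump above: the first word, 0x00000001, is what one would expect if only descriptor 0 is set in the read set. That reading assumes fd_set is laid out as a bitmask with descriptor 0 in the least significant bit of the first word, which is an assumption about the implementation and not something this report states. The C sketch below sticks to the standard FD_* macros; lowest_set_fd is a hypothetical helper name.

```c
#include <sys/select.h>

/* Hypothetical helper: return the lowest descriptor below nfds that is
 * set in *set, or -1 if none is set. */
int lowest_set_fd(fd_set *set, int nfds)
{
    int fd;

    if (nfds > FD_SETSIZE)
        nfds = FD_SETSIZE;
    for (fd = 0; fd < nfds; fd++) {
        if (FD_ISSET(fd, set))
            return fd;
    }
    return -1;
}
```

With nfds = 1, as in the select(0x1, ...) calls in the ktrace, only descriptor 0 is ever examined.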

The version of lsof we have on this box does not seem to fully understand the broken file descriptors:
root@panix2 ~: # lsof-NetBSD-i386-5.0_BETA -p 19399
lsof-NetBSD-i386-5.0_BETA: WARNING: compiled for NetBSD release 5.0_BETA; this is 5.0.1.
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
emacs-21. 19399  zzz  cwd   VDIR   11,3     8192 6552785 /net/u/1/k/zzz/News
emacs-21. 19399  zzz  txt   VREG  142,0  4561480  638894 /usr/local/bin/emacs-21.3
emacs-21. 19399  zzz  txt   VREG  142,0  1120316  249256 /lib/libc.so.12.164
emacs-21. 19399  zzz  txt   VREG  142,0   125014  249277 /lib/libm.so.0.6
emacs-21. 19399  zzz  txt   VREG  142,0     3790  249279 /lib/libm387.so.0.1
emacs-21. 19399  zzz  txt   VREG  142,0    12875  249268 /lib/libtermcap.so.0.6
emacs-21. 19399  zzz  txt   VREG  142,0    11263  636496 /usr/lib/libossaudio.so.0.0
emacs-21. 19399  zzz  txt   VREG  142,0    65173  635885 /libexec/ld.elf_so
emacs-21. 19399  zzz    0u                               unknown file system type: 0
emacs-21. 19399  zzz    1u                               unknown file system type: 0
emacs-21. 19399  zzz    2u                               unknown file system type: 0



Note that process 19399 has lost its telnetd or sshd and has only a controlling shell which is parented by init:

#  pstree -p 19399
-+= 00000 root [system]
 \-+= 00001 root init
   \-+= 07766 zzz -tcsh (tcsh-6.13.00)
     \--= 19399 zzz emacs (emacs-21.3)

Here's what I believe the scenario to be: when a user gets disconnected abnormally from an ssh or telnet session, the process should receive a HUP signal.  Perhaps select(2) or poll(2) is sleeping, waiting on input, at the time, and something goes wrong.  In any case the HUP does not get processed properly, and the process continues with its select/read loop, assuming that select(2) will sleep until input is available.

However, select(2) keeps returning 1, indicating that one FD is ready to read, even though the FD supplied to select(2) was invalid.  The process tries to read and gets zero bytes (that doesn't sound right either; shouldn't read(2) return EBADF here?), then goes back to select(2) to try again.  Since the process expected select(2) to sleep until I/O was available, and select(2) is now returning immediately, the process goes into a tight loop and hogs the CPU.
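
To illustrate the loop described above, here is a minimal C sketch of a select-then-read step in which the caller can tell data, end-of-file, and errors apart. It is an application-side illustration under stated assumptions, not a fix for the kernel behaviour reported here; wait_and_read is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: block until fd is readable, then read once.
 * Returns >0 for bytes read, 0 on end-of-file, -1 on error (errno set). */
ssize_t wait_and_read(int fd, char *buf, size_t len)
{
    fd_set rfds;
    int n;

    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EBADF;
        return -1;
    }
    do {
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        n = select(fd + 1, &rfds, NULL, NULL, NULL);
    } while (n == -1 && errno == EINTR);
    if (n == -1)
        return -1;              /* e.g. EBADF for a closed descriptor */
    return read(fd, buf, len);  /* 0 means EOF: the caller must stop polling */
}
```

A caller that leaves its loop whenever the result is 0 or -1 cannot spin the way the ktrace shows, since a zero-length read ends the loop instead of leading back to select(2).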

Although it's clear that emacs in this case has a chance to see that something is wrong (note the ioctl call that returns EBADF), I don't think the application is really at fault, since, as previously stated, this happens to multiple applications and they all exhibit the same symptoms.

We have also seen this with the poll(2) syscall.
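
One point worth keeping in mind for the poll(2) case: POSIX has poll() report an invalid descriptor through POLLNVAL in revents instead of failing with EBADF, so a non-error return is expected there and callers need to inspect revents. The C sketch below shows such a check; fd_is_invalid is a hypothetical helper name, and treating negative descriptors as invalid is a choice made here, since poll() itself ignores them.

```c
#include <poll.h>

/* Hypothetical helper: returns 1 if fd is negative or poll() flags it
 * with POLLNVAL, 0 if poll() accepts it, -1 if poll() itself fails. */
int fd_is_invalid(int fd)
{
    struct pollfd pfd;

    if (fd < 0)
        return 1;       /* poll() skips negative fds, so check here */
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 0) == -1)     /* timeout 0: do not block */
        return -1;
    return (pfd.revents & POLLNVAL) ? 1 : 0;
}
```

This only covers a descriptor that is not open at all. Whether the descriptors in this report would be flagged the same way is not something the sketch can establish.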



>How-To-Repeat:
Run a multi-user system with many shell users using interactive programs like emacs, mutt, elm, pine, trn, and nn.

Wait for some of them to get accidentally disconnected.

Eventually, this will happen.  We usually see it once every few days.
>Fix:
Have select(2) return EBADF when it is given an invalid or closed FD in its list.

read(2) should also return EBADF when it is given an invalid or closed FD.
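
A small check of the behaviour requested above could look like the following C sketch. It covers only the simple case of a descriptor that has been explicitly closed; the descriptors in this report (shown by fstat with type "none") may be in a different state that this sketch does not reproduce. All function names are hypothetical.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns the number of a descriptor that was just closed, or -1. */
static int make_closed_fd(void)
{
    int p[2];

    if (pipe(p) == -1)
        return -1;
    close(p[1]);
    close(p[0]);
    return p[0];
}

/* Returns errno from select() on a closed fd, 0 if select() did not
 * fail, or -1 if the test descriptor could not be set up. */
int select_errno_on_closed_fd(void)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };   /* poll once, do not block */
    int fd = make_closed_fd();

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) == -1)
        return errno;
    return 0;
}

/* Same idea for read(): errno on failure, 0 if read() did not fail. */
int read_errno_on_closed_fd(void)
{
    char c;
    int fd = make_closed_fd();

    if (fd < 0)
        return -1;
    if (read(fd, &c, 1) == -1)
        return errno;
    return 0;
}
```

On a system that behaves as requested, both functions return EBADF.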
