NetBSD Problem Report #57053

From wiz@yt.nih.at  Wed Oct 12 12:37:17 2022
Return-Path: <wiz@yt.nih.at>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 0BFC91A9239
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 12 Oct 2022 12:37:17 +0000 (UTC)
Message-Id: <20221012110228.4B8811CB6AF4@yt.nih.at>
Date: Wed, 12 Oct 2022 13:02:28 +0200 (CEST)
From: Thomas Klausner <wiz@NetBSD.org>
Reply-To: Thomas Klausner <wiz@NetBSD.org>
To: gnats-bugs@NetBSD.org
Subject: continuation problem in shell pipelines
X-Send-Pr-Version: 3.95

>Number:         57053
>Category:       pkg
>Synopsis:       continuation problem in shell pipelines
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    pkg-manager
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Oct 12 12:40:00 +0000 2022
>Closed-Date:    Sun Nov 06 20:59:23 +0000 2022
>Last-Modified:  Sun Nov 06 20:59:23 +0000 2022
>Originator:     Thomas Klausner
>Release:        NetBSD 9.99.100
>Organization:

>Environment:


Architecture: x86_64
Machine: amd64
>Description:
I've been using the following shell function for ages:

dir() { ls -al "$@" | less; }

On -current (9.99.100 kernel from Oct 9, Userland from Sep 21, zsh
from May), when I CTRL-Z the less(1) and then want to go back in, it
doesn't work and I see the following:

> dir
zsh: done       ls -al "$@" |
zsh: suspended
> fg
[1]  + done       ls -al "$@" |
       continued
zsh: done                    ls -al "$@" |
zsh: suspended (tty output)
zsh: done                    ls -al "$@" |
zsh: suspended (tty output)

That happens every time I try to 'fg' it.

This was working fine not so long ago, but I don't remember exactly
when it started happening.

Others can reproduce this problem (see current-users) in /bin/sh and /bin/zsh.

/bin/bash and ksh93 seem to work fine.

RVP mentioned:
> Since less sucks up its entire input, the ls command is "done"
> and has exited in that pipeline (if you page forward a few screens).
> The ls exiting in the pipeline seems to confuse zsh and /bin/sh.

> If you run the function on a large dir. and suspend it at the 1st
> screen, then /bin/sh also works because ls is still running and
> can be suspended.

> ksh also says "Done", but it still allows the pipeline to be fg'd
> correctly:

> [1] + Done                 ls -al "$@" |
> Stopped              less

>How-To-Repeat:
See above.

>Fix:
Please

>Release-Note:

>Audit-Trail:
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/57053: continuation problem in shell pipelines
Date: Wed, 12 Oct 2022 20:31:01 +0700

 As I indicated in the e-mail exchange that preceded this PR, the
 issue with /bin/sh is completely different, and unrelated to the
 problem being observed with zsh.

 sh is failing to find the process to restart - I know why that is,
 and it had already been fixed (not yet committed) before this issue
 was reported.

 zsh is restarting the job, after which it stops again.   That will
 be a process group related issue - when restarting a job, the controlling
 tty needs to have its pgrp changed to the pgrp of the job being given
 control.   Otherwise, when less attempts the ioctl to change the modes of
 the terminal back to cbreak mode (or what used to be called that), the
 process will stop.

 Either zsh isn't using the correct method to change the tty pgrp, or
 isn't doing it quickly enough.   Since it used to work (and is an old
 binary apparently) the only explanation for the former I can think of
 would be if zsh is adapting its behaviour based upon the kernel version
 (ioctls are one of the few userland interfaces where that makes sense)
 and has been confused by the change to 9.99.100.

 Perhaps someone who knows the zsh sources might be able to check that out.

 The latter would indicate that the timings have changed somehow, though
 I am having trouble imagining a change which could cause that - maybe it
 used to be that after sending a signal, the sending process kept running, and
 zsh is doing things in the wrong order - sending SIGCONT before changing
 the tty pgrp - and now the awoken process is getting to run before the
 parent (zsh) gets a chance to continue, so the child (less) is starting
 while the tty pgrp still belongs to zsh.

 FWIW, in addition to the shells that others mentioned, I also tested
 the FreeBSD sh (running on NetBSD), dash, yash, bosh and mksh, and they
 all work fine as well.   The pgrp problem is not a problem with /bin/sh
 either, once its "find the process" problem is fixed.

 kre
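
The ordering described above (hand the terminal to the job's process
group first, send SIGCONT second) can be sketched as follows. This is an
illustration only, not zsh or sh source. It uses Python's os module, which
wraps the same tcsetpgrp(3) and killpg(2) calls; the names
resume_in_foreground and demo are hypothetical. The demo assumes a system
where pty.fork() works, and uses SIGSTOP as a stand-in for ^Z.

```python
import os
import pty
import signal
import time


def resume_in_foreground(tty_fd, pgid):
    """Give the terminal to the job first, and only then wake it up."""
    os.tcsetpgrp(tty_fd, pgid)       # 1. job becomes the tty's foreground pgrp
    os.killpg(pgid, signal.SIGCONT)  # 2. only now let the job run again


def demo():
    """Run a toy 'shell' on a fresh pty; return True if the job owns the tty."""
    shell, master = pty.fork()
    if shell == 0:
        # Toy shell: session leader, the pty slave is fd 0/1/2.
        status = 1
        try:
            # Shells ignore SIGTTOU so they may call tcsetpgrp() at any time.
            signal.signal(signal.SIGTTOU, signal.SIG_IGN)
            job = os.fork()
            if job == 0:
                # The "job": its own process group, then just sit there.
                try:
                    os.setpgid(0, 0)
                    time.sleep(60)
                finally:
                    os._exit(0)
            os.setpgid(job, job)          # also done in the parent to avoid a race
            os.kill(job, signal.SIGSTOP)  # stand-in for ^Z
            os.waitpid(job, os.WUNTRACED)
            resume_in_foreground(0, job)  # the equivalent of "fg"
            if os.tcgetpgrp(0) == job:
                status = 0
            os.kill(job, signal.SIGKILL)
            os.waitpid(job, 0)
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(shell, 0)
    os.close(master)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
```

If the two calls in resume_in_foreground were swapped, a job that touches
the terminal modes could run while the tty pgrp still belonged to the
shell, and would then be stopped again, which is the symptom hypothesised
above for zsh.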

From: "Robert Elz" <kre@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57053 CVS commit: src/bin/sh
Date: Sun, 30 Oct 2022 01:46:17 +0000

 Module Name:	src
 Committed By:	kre
 Date:		Sun Oct 30 01:46:17 UTC 2022

 Modified Files:
 	src/bin/sh: jobs.c

 Log Message:
 PR bin/57053 is related (peripherally) here.

 sh has been remembering the process group of a job for a while now, but
 using that for almost nothing.

 The old way to resume a job was to try each pid in the job with a
 SIGCONT (using it as the process group identifier via killpg()) until
 one worked (or none did, in which case resuming would be impossible,
 but that never actually happened).   This wasn't as bad as it seems,
 as in practice the first process attempted was *always* the correct
 one.  Why the loop was considered necessary I am not sure.  Nothing
 but the first could possibly work.

 This worked until a fix for an obscure possible bug was added a
 while ago - now a process which has already finished, and had its
 zombie collected via wait*() is no longer ever considered to have
 a pid which is a candidate for use in any system call.  That's
 because the kernel might have reassigned that pid for some newly
 created process (we have no idea how much time might have passed
 since the pid was returned to the kernel for reuse, it might have
 happened weeks ago).

 This is where the example in bin/57053 revealed a problem.

 That PR is really about a quite different problem in zsh (from pkgsrc)
 and should be pkg/57053, but as the test case also hit the problem
 here, it was assumed (by some) they were the same issue.

 The example is (in a small directory)
 	ls | less
 which is then suspended (^Z), and resumed (fg).   Since the directory
 is small, ls will be finished, and reaped by sh - so the code would
 now refuse to use its pid for the killpg() call to send the SIGCONT.
 The (useless) loop would attempt to use less's pid for this purpose
 (it is still alive at this point) but that would fail, as that pid
 is not the process group identifier of anything.   Hence the job
 could not be resumed.

 Before the PR (or preceding mailing list discussion) the change here
 had already been made (part of a much bigger set of changes, some of
 which might follow - sometime).   We now actually use the job's
 remembered process group identifier when we want the process group
 identifier, instead of trying to guess which pid it happens to be
 (which actually never took any guessing, it was, and is always the
 pid of the first process created for the job).   A couple of minor
 fixes to how the pgrp is obtained, and used, accompany the changes
 to use it when appropriate.


 To generate a diff of this commit:
 cvs rdiff -u -r1.116 -r1.117 src/bin/sh/jobs.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
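
The failure described in the log message above can be reproduced outside
the shell. The following is a minimal sketch in Python (os.killpg wraps
killpg(2)), not the jobs.c code; the function name resume_demo is
hypothetical, and two sleeping children stand in for ls and less. The
first child is the process group leader and is killed and reaped, as ls
would be after finishing; the second stays alive in the same group.

```python
import errno
import os
import signal
import time


def resume_demo():
    """Return (errno from killpg(surviving pid), whether killpg(pgid) worked)."""
    first = os.fork()
    if first == 0:
        # Stand-in for "ls": first process of the job, so it leads the group.
        try:
            os.setpgid(0, 0)
            time.sleep(60)
        finally:
            os._exit(0)
    os.setpgid(first, first)  # also done in the parent to avoid a race

    second = os.fork()
    if second == 0:
        # Stand-in for "less": joins the group led by the first process.
        try:
            os.setpgid(0, first)
            time.sleep(60)
        finally:
            os._exit(0)
    os.setpgid(second, first)

    pgid = first  # the job's remembered process group identifier

    # The first process finishes and its zombie is collected.
    os.kill(first, signal.SIGKILL)
    os.waitpid(first, 0)

    # Old approach once the first pid is off limits: try the surviving pid
    # as if it were a process group identifier.
    old_errno = 0
    try:
        os.killpg(second, signal.SIGCONT)
    except OSError as exc:
        old_errno = exc.errno

    # New approach: use the remembered process group identifier.
    new_ok = True
    try:
        os.killpg(pgid, signal.SIGCONT)
    except OSError:
        new_ok = False

    # Clean up the surviving child.
    os.kill(second, signal.SIGKILL)
    os.waitpid(second, 0)
    return old_errno, new_ok
```

Under these assumptions the first killpg() call is expected to fail with
ESRCH, because no process group has the second child's pid as its
identifier, while the call using the remembered pgid still reaches the
surviving process even though the original leader is gone.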

Responsible-Changed-From-To: bin-bug-people->pkg-manager
Responsible-Changed-By: kre@NetBSD.org
Responsible-Changed-When: Sun, 30 Oct 2022 01:58:47 +0000
Responsible-Changed-Why:
This PR is about a problem in shells/zsh (from pkgsrc).
It belongs in category pkg, not bin.
A different problem (even different symptom) but revealed
by the same test case in /bin/sh has been corrected.
.


From: Thomas Klausner <wiz@NetBSD.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/57053 (continuation problem in shell pipelines)
Date: Sun, 30 Oct 2022 08:59:27 +0100

 The zsh bug has been filed upstream, but it hasn't been resolved yet.

 Start of the thread:

 https://zsh.org/mla/workers/2022/msg01115.html
  Thomas

State-Changed-From-To: open->closed
State-Changed-By: wiz@NetBSD.org
State-Changed-When: Sun, 06 Nov 2022 20:59:23 +0000
State-Changed-Why:
Fixed in zsh 5.9nb2.


>Unformatted:
Copyright © 1994-2022 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.