NetBSD Problem Report #56422

From gson@gson.org  Wed Sep 29 14:01:21 2021
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 95DA51A921F
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 29 Sep 2021 14:01:21 +0000 (UTC)
Message-Id: <20210929140105.E771525417E@guava.gson.org>
Date: Wed, 29 Sep 2021 17:01:05 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: zgrep -l sometimes hangs
X-Send-Pr-Version: 3.95

>Number:         56422
>Category:       bin
>Synopsis:       zgrep -l sometimes hangs
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    hannken
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Sep 29 14:05:00 +0000 2021
>Closed-Date:    Sun Oct 10 08:14:18 +0000 2021
>Last-Modified:  Sun Oct 10 08:14:18 +0000 2021
>Originator:     Andreas Gustafsson
>Release:        NetBSD 9.2 and -current
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

When I run "zgrep -l *.gz" on a large collection of compressed log
files, it sometimes hangs.

>How-To-Repeat:

This is difficult to reproduce exactly as originally experienced
because a given log file may reliably trigger the problem on one
machine but not on another.

However, I have come up with an analog test case that reliably
reproduces the problem on multiple machines under NetBSD 9.2 and
-current.  It uses synthetic data generated using lptest(1) in place of
the log files, and a pipeline of "gunzip | head -1" in place of "zgrep -l":

  for n in $(seq 100000); do lptest 10 $n | gzip | gunzip | head -1; done >log

Run the above command in one terminal and observe the size of the log
file in another terminal by repeatedly running

  wc -l log

The expected behavior is that the log file slowly grows to a size of
100000 lines and the shell command that generated it exits.  The
observed behavior is that at some point before that, the log file
stops growing, and the shell command generating it has hung with the
"head" process in the "DE" state:

# ps -glaxwd
UID   PID PPID   CPU PRI NI   VSZ    RSS WCHAN   STAT TTY      TIME COMMAND
[...]
  0  7920 1604 28360  79  0 24012   3732 wait    I    tty00 0:13.81 |   |-- -sh
  0 20716 7920     0   0  0 22232   1724 -       DE   tty00 0:00.00 |   | |-- (head)
  0 22105 7920 28360  79  0 24584   1768 pipe_wr I    tty00 0:00.00 |   | `-- gunzip

>Fix:

>Release-Note:

>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/56422: zgrep -l sometimes hangs
Date: Wed, 29 Sep 2021 20:25:42 -0000 (UTC)

 gson@gson.org (Andreas Gustafsson) writes:

 >  0  7920 1604 28360  79  0 24012   3732 wait    I    tty00 0:13.81 |   |-- -sh
 >  0 20716 7920     0   0  0 22232   1724 -       DE   tty00 0:00.00 |   | |-- (head)
 >  0 22105 7920 28360  79  0 24584   1768 pipe_wr I    tty00 0:00.00 |   | `-- gunzip


 head is exiting waiting for rpipe->pipe_draincv:

 fp ffffc000ba1d7b30 mi_switch() at ffffc0000057435c mi_switch+0x1fc
 fp ffffc000ba1d7b90 sleepq_block() at ffffc00000570cb8 sleepq_block+0xa8
 fp ffffc000ba1d7bd0 cv_wait_sig() at ffffc00000536e14 cv_wait_sig+0x44
 fp ffffc000ba1d7c00 pipeclose.part.0() at ffffc000005bcab0 pipeclose.part.0+0xac
 fp ffffc000ba1d7c30 pipe_close() at ffffc000005bccf8 pipe_close+0x18
 fp ffffc000ba1d7c40 closef() at ffffc0000053a554 closef+0x60
 fp ffffc000ba1d7c80 fd_free() at ffffc0000053cb68 fd_free+0x104
 fp ffffc000ba1d7d00 exit1() at ffffc00000548130 exit1+0xfc
 fp ffffc000ba1d7df0 sys_exit() at ffffc00000548a68 sys_exit+0x38
 fp ffffc000ba1d7e20 syscall() at ffffc000000adb14 syscall+0x194
 fp ffffc000ba1d7ed0 el0_trap() at ffffc000000b0ff0 el0_trap


 gunzip still waits writing to the pipe for wpipe->pipe_wcv.

 fp ffffc000ba85fc60 mi_switch() at ffffc0000057435c mi_switch+0x1fc
 fp ffffc000ba85fcc0 sleepq_block() at ffffc00000570cb8 sleepq_block+0xa8
 fp ffffc000ba85fd00 cv_wait_sig() at ffffc00000536e14 cv_wait_sig+0x44
 fp ffffc000ba85fd30 pipe_write() at ffffc000005bd224 pipe_write+0x1d4
 fp ffffc000ba85fda0 dofilewrite() at ffffc000005b7d60 dofilewrite+0x70
 fp ffffc000ba85fe20 syscall() at ffffc000000adb14 syscall+0x194
 fp ffffc000ba85fed0 el0_trap() at ffffc000000b0ff0 el0_trap


 killing gunzip with SIGPIPE releases the lock and gunzip and head
 exit. But unlike the first event after more than 8000 loops, the
 next hang happens after a few iterations.

From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/56422: zgrep -l sometimes hangs
Date: Thu, 30 Sep 2021 14:23:40 +0000 (UTC)

 On Wed, 29 Sep 2021, Andreas Gustafsson wrote:

 > When I run "zgrep -l *.gz" on a large collection of compressed log
 > files, it sometimes hangs.
 >

 Can be reproduced even with cat(1):

 $ for n in $(seq 90000 100000)
 do      lptest 10 $n | cat | cat | cat | head -n1
          printf '%d\r' $n >&2
 done > /dev/null
 90166
 ^C
 $

 -RVP

From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/56422: zgrep -l sometimes hangs
Date: Fri, 1 Oct 2021 07:13:56 +0000 (UTC)

 Well, compiling the kernel with PIPE_SOCKETPAIR does seem to
 fix _this_ issue, but then we run afoul of PR #55690.

 -RVP

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: NetBSD GNATS <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: bin/56422: zgrep -l sometimes hangs
Date: Fri, 1 Oct 2021 10:24:35 +0200

 --Apple-Mail=_F96C3087-5ABC-4251-A0DF-3EE8023164D3
 Content-Type: multipart/mixed;
 	boundary="Apple-Mail=_8A66DE72-488D-452C-8D71-E5933D0728BA"


 --Apple-Mail=_8A66DE72-488D-452C-8D71-E5933D0728BA
 Content-Transfer-Encoding: 7bit
 Content-Type: text/plain;
 	charset=us-ascii

 This deadlock happens whenever the writer has more data and no space
 while the reader is closing the pipe.

 Simply swapping the EOF test and the wait_sig in pipe_write() should
 fix it.  Diff attached, please give it a try.

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

 --Apple-Mail=_8A66DE72-488D-452C-8D71-E5933D0728BA
 Content-Disposition: attachment;
 	filename=002_pipe_stall.diff
 Content-Type: application/octet-stream;
 	x-unix-mode=0644;
 	name="002_pipe_stall.diff"
 Content-Transfer-Encoding: 7bit

 pipe_stall

 Fix a deadlock where one thread writes to a pipe, has more data
 and no space in the pipe and waits on "pipe_wcv" while the reader
 is closing the pipe and waits on "pipe_draincv".

 Swap the test for "PIPE_EOF" and the "cv_wait_sig()" in "pipe_write()".

 PR bin/56422 "zgrep -l sometimes hangs"

 diff -r c0d659b6a82c -r 4b3d02a50e12 sys/kern/sys_pipe.c
 --- sys/kern/sys_pipe.c
 +++ sys/kern/sys_pipe.c
 @@ -694,11 +694,6 @@ pipe_write(file_t *fp, off_t *offset, st
  				break;
  			}

 -			pipeunlock(wpipe);
 -			error = cv_wait_sig(&wpipe->pipe_wcv, lock);
 -			(void)pipelock(wpipe, false);
 -			if (error != 0)
 -				break;
  			/*
  			 * If read side wants to go away, we just issue a signal
  			 * to ourselves.
 @@ -707,6 +702,12 @@ pipe_write(file_t *fp, off_t *offset, st
  				error = EPIPE;
  				break;
  			}
 +
 +			pipeunlock(wpipe);
 +			error = cv_wait_sig(&wpipe->pipe_wcv, lock);
 +			(void)pipelock(wpipe, false);
 +			if (error != 0)
 +				break;
  			wakeup_state = wpipe->pipe_state;
  		}
  	}

 --Apple-Mail=_8A66DE72-488D-452C-8D71-E5933D0728BA--

 --Apple-Mail=_F96C3087-5ABC-4251-A0DF-3EE8023164D3
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
 	filename=signature.asc
 Content-Type: application/pgp-signature;
 	name=signature.asc
 Content-Description: Message signed with OpenPGP

 -----BEGIN PGP SIGNATURE-----

 iQEzBAEBCAAdFiEE2BL3ha7Xao4WUZVYKoaVJdNr+uEFAmFWxcMACgkQKoaVJdNr
 +uFP8wf+OGOuRE1cfL/BBDkC6cLJkLoEG38KTRB+CpFgpeck+PSpMzpAStiYSzb0
 mSa4Ukomg/WOk6zbgGygtX8ZIMG54kJJJUh4dmMD51KunyzGu5RjYCOMvxni2Rup
 U0T2i1hIoE+3EW2WTpizXEBTShYmJdGUFO9FLR4qvWRYNY7ml42Cuz5ofQRabN0v
 6lUp/ony86XfnKuBMaezFPAUR67g2dSyf30Mdr8WgrPuOiwxha2DkBUdC/9eTkIR
 zttzoAkiC970x12VHAV4VDWi/gPhZhvMa+ZOqs3U/OhwvXd5BHv8d9bEu/8KKNCI
 zIBA8/rNS+3jcx5NKStznCYtGdQs2A==
 =ODAS
 -----END PGP SIGNATURE-----

 --Apple-Mail=_F96C3087-5ABC-4251-A0DF-3EE8023164D3--

From: Andreas Gustafsson <gson@gson.org>
To: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: bin/56422: zgrep -l sometimes hangs
Date: Fri, 1 Oct 2021 19:28:05 +0300

 J. Hannken-Illjes wrote:
 >  This deadlock happens whenever the writer has more data and no space
 >  while the reader is closing the pipe.

 Thank you for investigating this.

 >  Simply swapping the EOF test and the wait_sig in pipe_write() should
 >  fix it.  Diff attached, please give it a try.

 With the patch, my test passes.  I also ran the ATF tests and there
 were no new failures.
 -- 
 Andreas Gustafsson, gson@gson.org

From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: re: bin/56422: zgrep -l sometimes hangs
Date: Sat, 02 Oct 2021 14:11:36 +1000

 hannken's description of what is going on makes me think that this
 is related to an uncommon issue i where where "command | less"
 may hang when i try to ^Z it.  it doesn't happen often, but when
 it does, the "command" has not yet finished output and less is
 still reading from it.  i've never seen the hang after the command
 has exited.  when it happens, the whole shell is unavailable and
 i have to kill the command from another shell.

 i'll test this (either before commit or after), but my problem was
 only a more than weekly event so it's never been easy to test fixes
 (christos tried a couple of times.)


 .mrg.

From: "Juergen Hannken-Illjes" <hannken@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56422 CVS commit: src/sys/kern
Date: Sat, 2 Oct 2021 07:35:41 +0000

 Module Name:	src
 Committed By:	hannken
 Date:		Sat Oct  2 07:35:41 UTC 2021

 Modified Files:
 	src/sys/kern: sys_pipe.c

 Log Message:
 Fix a deadlock where one thread writes to a pipe, has more data
 and no space in the pipe and waits on "pipe_wcv" while the reader
 is closing the pipe and waits on "pipe_draincv".

 Swap the test for "PIPE_EOF" and the "cv_wait_sig()" in "pipe_write()".

 PR bin/56422 "zgrep -l sometimes hangs"


 To generate a diff of this commit:
 cvs rdiff -u -r1.156 -r1.157 src/sys/kern/sys_pipe.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->needs-pullups
State-Changed-By: gson@NetBSD.org
State-Changed-When: Sat, 02 Oct 2021 09:48:48 +0000
State-Changed-Why:
hannken committed a fix.


Responsible-Changed-From-To: bin-bug-people->hannken
Responsible-Changed-By: hannken@NetBSD.org
Responsible-Changed-When: Thu, 07 Oct 2021 08:36:04 +0000
Responsible-Changed-Why:
Take.


State-Changed-From-To: needs-pullups->pending-pullups
State-Changed-By: hannken@NetBSD.org
State-Changed-When: Thu, 07 Oct 2021 08:36:04 +0000
State-Changed-Why:
Pullups requested: -8 as 1698, -9 as 1357.


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56422 CVS commit: [netbsd-9] src/sys/kern
Date: Fri, 8 Oct 2021 14:58:08 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Fri Oct  8 14:58:08 UTC 2021

 Modified Files:
 	src/sys/kern [netbsd-9]: sys_pipe.c

 Log Message:
 Pull up following revision(s) (requested by hannken in ticket #1357):

 	sys/kern/sys_pipe.c: revision 1.157

 Fix a deadlock where one thread writes to a pipe, has more data
 and no space in the pipe and waits on "pipe_wcv" while the reader
 is closing the pipe and waits on "pipe_draincv".

 Swap the test for "PIPE_EOF" and the "cv_wait_sig()" in "pipe_write()".

 PR bin/56422 "zgrep -l sometimes hangs"


 To generate a diff of this commit:
 cvs rdiff -u -r1.148 -r1.148.2.1 src/sys/kern/sys_pipe.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56422 CVS commit: [netbsd-8] src/sys/kern
Date: Fri, 8 Oct 2021 14:59:59 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Fri Oct  8 14:59:59 UTC 2021

 Modified Files:
 	src/sys/kern [netbsd-8]: sys_pipe.c

 Log Message:
 Pull up following revision(s) (requested by hannken in ticket #1698):

 	sys/kern/sys_pipe.c: revision 1.157

 Fix a deadlock where one thread writes to a pipe, has more data
 and no space in the pipe and waits on "pipe_wcv" while the reader
 is closing the pipe and waits on "pipe_draincv".

 Swap the test for "PIPE_EOF" and the "cv_wait_sig()" in "pipe_write()".

 PR bin/56422 "zgrep -l sometimes hangs"


 To generate a diff of this commit:
 cvs rdiff -u -r1.140.12.1 -r1.140.12.2 src/sys/kern/sys_pipe.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: hannken@NetBSD.org
State-Changed-When: Sun, 10 Oct 2021 08:14:18 +0000
State-Changed-Why:
Pullups to -8 and -9 complete -- autobuild donbe.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.