NetBSD Problem Report #59766

From dholland@netbsd.org  Sun Nov 16 19:18:00 2025
Return-Path: <dholland@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 66A911A9239
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 16 Nov 2025 19:18:00 +0000 (UTC)
Message-Id: <20251116191759.908DF85681@mail.netbsd.org>
Date: Sun, 16 Nov 2025 19:17:59 +0000 (UTC)
From: dholland@NetBSD.org
Reply-To: dholland@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: awk does not handle RS="\0"
X-Send-Pr-Version: 3.95

>Number:         59766
>Category:       bin
>Synopsis:       awk does not handle RS="\0"
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Nov 16 19:20:00 +0000 2025
>Last-Modified:  Tue Nov 18 06:00:01 +0000 2025
>Originator:     David A. Holland
>Release:        NetBSD 11.99.4 (20251116)
>Organization:
Windmill containment office
>Environment:
System: n/a
Architecture: irrelevant
Machine: irrelevant

>Description:

Awk does the wrong thing if you try to set the record separator to
NUL; it prints only the first record. This would be useful to have
working in conjunction with find -print0.

>How-To-Repeat:

valkyrie% mkdir /tmp/test
valkyrie% cd /tmp/test
valkyrie% mkdir foo
valkyrie% mkdir bar
valkyrie% mkdir bar/baz
valkyrie% touch foo/foo.c
valkyrie% touch bar/baz.c
valkyrie% find . -type f -print0 | hexdump -C
00000000  2e 2f 66 6f 6f 2f 66 6f  6f 2e 63 00 2e 2f 62 61  |./foo/foo.c../ba|
00000010  72 2f 62 61 7a 2e 63 00                           |r/baz.c.|
00000018
valkyrie% find . -type f -print0 | awk 'BEGIN { RS="\0"; } { print; }'
./foo/foo.c
valkyrie% 

(In fact, that's what it prints with the default RS too, probably
because it treats the whole input as one line but loses the part after
the first NUL)
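
A smaller reproduction that needs no directory tree, together with a stopgap for an awk that loses everything after the first NUL: translate NUL to newline with tr(1) before awk reads the input. This is a sketch only. The helper name nul_to_lines is made up here, and the approach assumes that no record contains a newline, which is the very case -print0 exists to handle, so it is a workaround rather than a fix.

```shell
# nul_to_lines: hypothetical helper; turns NUL-terminated records into
# newline-terminated ones. Only safe when no record contains a newline.
nul_to_lines() {
    tr '\000' '\n'
}

# Two NUL-terminated records, shaped like find -print0 output.
printf './foo/foo.c\0./bar/baz.c\0' | nul_to_lines | awk '{ print NR ": " $0 }'
```

With the translation in place awk sees ordinary text lines, so the default RS is enough and both records are numbered.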

>Fix:

Likely a pain :-)

>Audit-Trail:
From: Christos Zoulas <christos@zoulas.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Subject: Re: bin/59766: awk does not handle RS="\0"
Date: Sun, 16 Nov 2025 14:34:44 -0500

 It is trivial to fix:

 Index: lib.c
 ===================================================================
 RCS file: /cvsroot/src/external/historical/nawk/dist/lib.c,v
 retrieving revision 1.16
 diff -u -p -u -r1.16 lib.c
 --- lib.c       18 Aug 2024 16:51:05 -0000      1.16
 +++ lib.c       16 Nov 2025 19:33:24 -0000
 @@ -252,13 +252,7 @@ int readrec(char **pbuf, int *pbufsize,
                 isrec = found != 0 || *buf != '\0';
  
         } else {
 -               if ((sep = *rs) == 0) {
 -                       sep = '\n';
 -                       while ((c=getc(inf)) == '\n' && c != EOF)       /* skip leading \n's */
 -                               ;
 -                       if (c != EOF)
 -                               ungetc(c, inf);
 -               }
 +              sep = *rs;
                 for (rr = buf; ; ) {
                         for (; (c=getc(inf)) != sep && c != EOF; ) {
                                 if (rr-buf+1 > bufsize)
 @@ -267,7 +261,7 @@ int readrec(char **pbuf, int *pbufsize,
                                                 FATAL("input record `%.30s...' too long", buf);
                                 *rr++ = c;
                         }
 -                       if (*rs == sep || c == EOF)
 +                       if (c == EOF)
                                 break;
                         if ((c = getc(inf)) == '\n' || c == EOF)        /* 2 in a row */
                                 break;


 The question is why special-case it this way? gawk does not do this.

 christos

From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/59766: awk does not handle RS="\0"
Date: Mon, 17 Nov 2025 00:08:10 +0000 (UTC)

 On Sun, 16 Nov 2025, dholland@NetBSD.org wrote:

 >> Description:
 >
 > Awk does the wrong thing if you try to set the record separator to
 > NUL; it prints only the first record. This would be useful to have
 > working in conjunction with find -print0.
 >

 The usual way to do this portably is to use find(1) with `-exec' and turn the
 filenames into command-line arguments:

 ```
 find . -type f -exec awk 'BEGIN { for (i=1; i<ARGC; i++) print ARGV[i] }' {} +
 ```

 You can also use xargs(1) even if neither it nor find(1) supports `-0' or `-print0':

 ```
 $ cat qstr.awk
 #!/usr/bin/awk -f
 #
 # quote strings (typically, filenames) for use with xargs(1)

 BEGIN {
  	if (ARGC < 2)
  		exit 1
  	pat = "[\"'\\\\[:space:]]"	# " ' \ [:space:]
  	for (i = 1; i < ARGC; i++) {
  		if (match(ARGV[i], pat)) {
  			gsub(pat, "\\\\&", ARGV[i])

  			#for (j = 1; j <= length(ARGV[i]); j++) {
  			#	c = substr(ARGV[i], j, 1)
  			#	s = s ((c ~ pat) ? "\\"c : c)
  			#}
  		}
  		s = s ARGV[i] ((i < ARGC-1) ? " " : "")
  	}
  	print s
 }

 $ find . -type f -exec ./qstr.awk {} + | xargs printf '>%s<\n'
 >./polipo.pid<
 >./.X0-lock<
 >./qemu.sh<
 >./boot-com.iso<
 >./boot.iso<
 >./SHA512<
 >./qstr.awk<

 $ ./qstr.awk $'\n\nhello\nworld\n\n' $'a\n\\\nb\n*!$$\n\n' | xargs printf '>%s<\n'
 >

 hello
 world

 <
 >a
 \
 b
 *!$$

 <

 $
 ```

 HTH,

 -RVP

From: Martin Neitzel <neitzel@hackett.marshlabs.gaertner.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/59766: awk does not handle RS="\0"
Date: Mon, 17 Nov 2025 16:49:44 +0100 (CET)

 I'm not at all against fixing things which are easily fixed
 but for this report the following context should be kept
 in mind:

 (1) POSIX/Single Unix Specification says (in "Shell & Utilities", "awk"):

 	Input files to the awk program from any of the following sources
 	shall be text files:

 with "text files" having a specific meaning, clarified in the "Base
 Definitions", "Text File":

 	A file that contains characters organized into one or more
 	lines.  The lines do not contain NUL characters [...]

 That is, awk(1) is strictly speaking not the proper tool to deal
 with "find -print0" output in the first place, and any support for
 that would be a (non-portable) extension.


 (2) awk's RS has a special meaning when it "is NULL":  paragraphs
 (separated by empty lines) become the records, lines the fields.
 This was historically new with nawk ("the one true awk"), and
 POSIX demands it, too:

 RS
 	The first character of the string value of RS shall be the
 	input record separator; a <newline> by default. If RS
 	contains more than one character, the results are unspecified.
 	If RS is null, then records are separated by sequences
 	consisting of a <newline> plus one or more blank lines,
 	leading or trailing blank lines shall not result in empty
 	records at the beginning or end of the input, and a <newline>
 	shall always be a field separator, no matter what the value
 	of FS is.

 The NetBSD-9-stable awk(1) man page fails to point this out, but
 awk itself implements it just fine:

 $ man awk | awk -v RS= '/split/ {print NR, $0 "\n"}'
 man: Formatting manual page...
 14      An input line is normally made up of fields separated by white space, or
      by regular expression FS.  The fields are denoted $1, $2, ..., while $0
      refers to the entire line.  If FS is null, the input line is split into
      one field per character.

 49      split(s, a, [fs])
              splits the string s into array elements a[1], a[2], ..., a[n],
              and returns n.  The separation is done with the regular
              expression fs or with the field separator FS if fs is not given.
              An empty string as field separator splits the string into one
              array element per character.

 $ 

 Setting   RS=""   is the canonical way to enable this within a script,
 and I'd expect   RS="\0"   to behave no differently.


 I haven't had a look at the suggested patch but this "paragraph
 behaviour" should certainly not be broken.
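
 One way to check that a patched awk still honours the paragraph
 behaviour quoted above is to feed it input with leading, repeated and
 trailing blank lines and look at the records it produces. This is only
 a sketch: the function name paragraph_records and the AWK variable are
 made up here, and it checks nothing beyond the POSIX text for RS.

 ```shell
 # paragraph_records: hypothetical helper. Prints "NR:NF:first-field" for
 # each record when RS is the empty string (POSIX paragraph mode).
 # AWK may name the binary under test; it defaults to plain "awk".
 paragraph_records() {
     "${AWK:-awk}" 'BEGIN { RS = "" } { print NR ":" NF ":" $1 }'
 }

 # Leading and trailing blank lines must not yield empty records, and a
 # newline must split fields inside a paragraph.
 printf '\n\none two\nthree\n\n\nfour\n\n' | paragraph_records
 ```

 A conforming awk reports two records here, the first with three fields
 and the second with one. Running it as AWK=/path/to/patched/awk would
 exercise a freshly built binary instead of the installed one.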

 Martin

From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/59766: awk does not handle RS="\0"
Date: Tue, 18 Nov 2025 05:55:32 +0000 (UTC)

 On Mon, 17 Nov 2025, Martin Neitzel via gnats wrote:

 > (2) awk's RS has a special meaning when it "is NULL":  paragraphs
 > (separated by empty lines) become the records, lines the fields.
 > This was historically new with nawk ("the one true awk"), and
 > POSIX demands it, too:
 > [...]
 > An   RS=""   is the canonical way to set this within a script, and
 > I'd assume an   RS="\0"  to act not any different.
 >

 Yeah. You'd have to distinguish between `RS=""', `RS="\0"', `RS="[\0]"', at least.
 Possibly also, `RS="\0\0"', `RS="\0\0\0"', etc. Then too, assigning to RS from
 another variable. Doesn't look simple w/o major reworking of awk's innards.

 But, then, there's another way to do this right now: use multi-character
 (i.e. regex) delimiters:

 ```
 $ find . -type f -exec printf '%s__DelimiTEr__' {} + |
      awk -vRS=__DelimiTEr__ '{ printf ">%s<\n", $0 }'

 $ printf '%s__DelimiTEr__' $'hello\n\nworld\n\n' '1' '?' '!' '$$' '*' |
      awk -vRS=__DelimiTEr__ '{ printf ">%s<\n", $0 }'
 ```

 Any unique string would do (SHA512 hashes, UUID strings, ...).
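
 Because a multi-character RS is treated as a regular expression by the
 awks that support it (POSIX leaves the behaviour unspecified), the
 delimiter is best kept free of regex metacharacters. A minimal sketch
 along those lines follows; the names DELIM and split_records are
 hypothetical, and using the shell's PID is only a cheap stand-in for a
 properly unique string.

 ```shell
 # DELIM: hypothetical delimiter made of alphanumerics and underscores,
 # so it reads the same as a plain string and as a regex.
 DELIM="__REC_$$_END__"

 # split_records: print each DELIM-terminated record between > and <.
 split_records() {
     awk -v RS="$DELIM" '{ printf ">%s<\n", $0 }'
 }

 # Two records, each followed by the delimiter.
 printf "%s$DELIM" 'a b' 'c' | split_records
 ```

 The same pipeline shape works with find(1) on the producing side, as in
 the commands above, provided no filename happens to contain the
 delimiter.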

 -RVP

>Unformatted:
