NetBSD Problem Report #58619

From www@netbsd.org  Tue Aug 20 07:09:00 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits)
	 client-signature RSA-PSS (2048 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id EC8BB1A923F
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 20 Aug 2024 07:08:59 +0000 (UTC)
Message-Id: <20240820070858.A124D1A9242@mollari.NetBSD.org>
Date: Tue, 20 Aug 2024 07:08:58 +0000 (UTC)
From: rokuyama.rk@gmail.com
Reply-To: rokuyama.rk@gmail.com
To: gnats-bugs@NetBSD.org
Subject: nawk 2024-08-17 broken and incompatible for non-UTF-8 and non-C locales
X-Send-Pr-Version: www-1.0

>Number:         58619
>Category:       bin
>Synopsis:       nawk 2024-08-17 broken and incompatible for non-UTF-8 and non-C locales
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    bin-bug-people
>State:          analyzed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Aug 20 07:10:00 +0000 2024
>Closed-Date:    
>Last-Modified:  Tue Sep 03 01:35:28 +0000 2024
>Originator:     Rin Okuyama
>Release:        10.99.11
>Organization:
Internet Initiative Japan Inc.
>Environment:
NetBSD rp64 10.99.11 NetBSD 10.99.11 (GENERIC64) #2: Tue Aug 20 13:15:56 JST 2024  rin@dancena:/home/rin/src/sys/arch/evbarm/compile/GENERIC64 evbarm
>Description:
nawk 2024-08-17 has recently been imported as /usr/bin/awk.

This version is based on "2nd edition", but compatibility for
8-bit-clean single-byte locales like "C" seems to be improved:

https://github.com/onetrueawk/awk/commit/1087d46

(BTW, their documentation is *REALLY* poor.)

However, it still gives broken results for non-UTF-8 multibyte
locales. The results are also incompatible with older versions,
at least for non-8-bit-clean multibyte locales.

For example, in previous versions the length() builtin counts the
number of bytes for, e.g., ja_JP.eucJP. The new version instead
counts the number of characters, misinterpreting the input as UTF-8 :(
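
To illustrate why the two counts can differ, here is a minimal sketch
in C. It is a simplified model, not the actual nawk code: the function
name count_as_utf8() is hypothetical, and it assumes that every byte
which does not start a well-formed UTF-8 sequence is counted as one
character. It does not check for overlong forms or surrogates.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count "characters" in s[0..n-1] as a UTF-8
 * decoder might, counting each malformed byte as one character.
 */
size_t
count_as_utf8(const unsigned char *s, size_t n)
{
	size_t i = 0, count = 0;

	while (i < n) {
		size_t len, k;
		unsigned char c = s[i];

		if (c < 0x80)
			len = 1;	/* ASCII */
		else if (c >= 0xC2 && c <= 0xDF)
			len = 2;	/* 2-byte lead */
		else if (c >= 0xE0 && c <= 0xEF)
			len = 3;	/* 3-byte lead */
		else if (c >= 0xF0 && c <= 0xF4)
			len = 4;	/* 4-byte lead */
		else
			len = 1;	/* stray continuation or invalid lead */

		/* Every trailing byte must exist and look like 10xxxxxx. */
		for (k = 1; k < len; k++) {
			if (i + k >= n || (s[i + k] & 0xC0) != 0x80) {
				len = 1;	/* malformed: count lead byte alone */
				break;
			}
		}
		i += len;
		count++;
	}
	return count;
}
```

For the six bytes C6 FC CB DC B8 EC (assumed here to be a three-character
EUC-JP string), the byte count is 6, but this model returns 5, because
DC B8 happens to look like a valid two-byte UTF-8 sequence.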
>How-To-Repeat:
Try euc.txt, which I converted to EUC-JP from
http://www.jp.netbsd.org/ja/JP/index.html

---
$ ftp https://www.netbsd.org/~rin/euc.txt
...
$ env LC_CTYPE=ja_JP.eucJP \
awk 'BEGIN{sum = 0} {sum += length($0)} END{print sum}' euc.txt
---

Older versions and 2024-08-17 give 10978 and 10418, respectively.
>Fix:
A fix for just the example above:

https://gist.github.com/rokuyama/c7e6d12b6a7bcad0704f706c4f7e9569

However, I'm still not sure whether the "2nd edition" of
nawk should be used at all...

>Release-Note:

>Audit-Trail:
From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/58619: nawk 2024-08-17 broken and incompatible for non-UTF-8
 and non-C locales
Date: Tue, 20 Aug 2024 10:36:34 +0000 (UTC)

 On Tue, 20 Aug 2024, rokuyama.rk@gmail.com wrote:

 > (BTW, their documentation is *REALLY* poor.)
 >

 Ya, the BSD extensions aren't documented in the `bsd-features' branch man-page.

 > Try euc.txt, which I converted to EUC-JP from
 > http://www.jp.netbsd.org/ja/JP/index.html
 >
 > ---
 > $ ftp https://www.netbsd.org/~rin/euc.txt
 > ...
 > $ env LC_CTYPE=ja_JP.eucJP \
 > awk 'BEGIN{sum = 0} {sum += length($0)} END{print sum}' euc.txt
 > ---
 >
 > Older versions and 2024-08-17 give 10978 and 10418, respectively.
 >> Fix:
 > Just for example above:
 >
 > https://gist.github.com/rokuyama/c7e6d12b6a7bcad0704f706c4f7e9569
 >

 Well, I guess it's a pain prepending `LC_ALL=C' on all non-UTF-8 locales, so:

 ```
 diff -urN nawk.orig/dist/main.c nawk/dist/main.c
 --- nawk.orig/dist/main.c	2024-08-18 03:11:06.691688756 +0000
 +++ nawk/dist/main.c	2024-08-20 10:24:10.089804741 +0000
 @@ -32,6 +32,7 @@
   #include <stdio.h>
   #include <ctype.h>
   #include <locale.h>
 +#include <langinfo.h>
   #include <stdlib.h>
   #include <string.h>
   #include <signal.h>
 @@ -143,6 +144,8 @@

   	setlocale(LC_CTYPE, "");
   	setlocale(LC_NUMERIC, "C"); /* for parsing cmdline & prog */
 +	if (strcmp(nl_langinfo(CODESET), "UTF-8"))
 +		setlocale(LC_ALL, "C");	/* not UTF-8, force "C" */
   	awk_mb_cur_max = MB_CUR_MAX;
   	cmdname = argv[0];
   	if (argc == 1) {
 ```

 -RVP

From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/58619: nawk 2024-08-17 broken and incompatible for non-UTF-8
 and non-C locales
Date: Tue, 20 Aug 2024 11:00:01 +0000 (UTC)

 On Tue, 20 Aug 2024, gnats-admin@netbsd.org wrote:

 > +		setlocale(LC_ALL, "C");	/* not UTF-8, force "C" */
 >

 Make that:	setlocale(LC_CTYPE, "C");

 -RVP

From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
 RVP <rvp@SDF.ORG>
Cc: 
Subject: Re: bin/58619: nawk 2024-08-17 broken and incompatible for non-UTF-8
 and non-C locales
Date: Fri, 30 Aug 2024 18:01:36 +0900

 Ah, thanks. Your fix seems much smarter.

 I'm still not very convinced that we should have "2nd edition" awk.
 But it would be a good idea to have this fix in -current for now.

 We should upstream this also, but they seem to believe everyone
 uses UTF-8 or ASCII ;)

 Thanks,
 rin

From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/58619 CVS commit: src/external/historical/nawk/dist
Date: Sun, 1 Sep 2024 10:45:39 -0400

 Module Name:	src
 Committed By:	christos
 Date:		Sun Sep  1 14:45:39 UTC 2024

 Modified Files:
 	src/external/historical/nawk/dist: main.c

 Log Message:
 PR/58619: Rin Okuyama: Force C locale for non-utf-8 (from RVP)


 To generate a diff of this commit:
 cvs rdiff -u -r1.12 -r1.13 src/external/historical/nawk/dist/main.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
 "christos@netbsd.org" <christos@NetBSD.org>
Cc: 
Subject: Re: PR/58619 CVS commit: src/external/historical/nawk/dist
Date: Mon, 2 Sep 2024 00:17:10 +0900

 Hi,

 > From: "Christos Zoulas" <christos@netbsd.org>
 > To: gnats-bugs@gnats.NetBSD.org
 > Cc:
 > Subject: PR/58619 CVS commit: src/external/historical/nawk/dist
 > Date: Sun, 1 Sep 2024 10:45:39 -0400
 > 
 >   Module Name:	src
 >   Committed By:	christos
 >   Date:		Sun Sep  1 14:45:39 UTC 2024
 >   
 >   Modified Files:
 >   	src/external/historical/nawk/dist: main.c
 >   
 >   Log Message:
 >   PR/58619: Rin Okuyama: Force C locale for non-utf-8 (from RVP)
 >   
 >   
 >   To generate a diff of this commit:
 >   cvs rdiff -u -r1.12 -r1.13 src/external/historical/nawk/dist/main.c
 >   
 >   Please note that diffs are not public domain; they are subject to the
 >   copyright notices on the relevant files.

 Thanks, but `setlocale(LC_CTYPE, "")` preceding nl_langinfo(3) cannot
 be dropped. Without it, `nl_langinfo(CODESET)` always returns "646"
 (== ASCII), regardless of environment variables.
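
 The required order of calls can be sketched as follows. This is only
 an illustration, not the committed main.c: the helper names
 codeset_is_utf8() and init_ctype_locale() are hypothetical, and the
 "646" behaviour noted in the comment is what I see on NetBSD; other
 systems may report a different name for the "C" locale's codeset.

 ```c
 #include <langinfo.h>
 #include <locale.h>
 #include <string.h>

 /* Hypothetical helper: nonzero iff the codeset name is exactly "UTF-8". */
 int
 codeset_is_utf8(const char *codeset)
 {
 	return strcmp(codeset, "UTF-8") == 0;
 }

 /*
  * Hypothetical helper: returns 1 if the environment's LC_CTYPE was
  * kept (UTF-8), or 0 if LC_CTYPE was forced back to "C".
  */
 int
 init_ctype_locale(void)
 {
 	/*
 	 * Must come first.  Until this call the program is still in the
 	 * "C" locale, so nl_langinfo(CODESET) would not reflect the
 	 * environment (on NetBSD it returns "646").
 	 */
 	setlocale(LC_CTYPE, "");

 	if (codeset_is_utf8(nl_langinfo(CODESET)))
 		return 1;

 	/* Not UTF-8: fall back to byte semantics for LC_CTYPE only. */
 	setlocale(LC_CTYPE, "C");
 	return 0;
 }
 ```

 Only LC_CTYPE is reset here, following RVP's correction, so other
 categories are left untouched.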

 Thanks,
 rin

From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@NetBSD.org, netbsd-bugs@NetBSD.org
Cc: 
Subject: Re: PR/58619 CVS commit: src/external/historical/nawk/dist
Date: Tue, 3 Sep 2024 10:29:09 +0900

 -------- Forwarded Message --------
 Subject: CVS commit: src/external/historical/nawk/dist
 Date: Mon, 2 Sep 2024 01:04:35 +0000
 From: Rin Okuyama <rin@netbsd.org>
 Reply-To: source-changes-d@NetBSD.org
 To: source-changes-full@NetBSD.org

 Module Name:	src
 Committed By:	rin
 Date:		Mon Sep  2 01:04:35 UTC 2024

 Modified Files:
 	src/external/historical/nawk/dist: main.c

 Log Message:
 nawk: Fix previous for UTF-8 locales

 We need `setlocale(LC_CTYPE, "")` before `nl_langinfo(CODESET)`.
 Otherwise, `nl_langinfo(CODESET)` returns "646" (== ASCII)
 regardless of environment variables.

 This fixes regression for usr.bin/awk/t_awk:multibyte.


 To generate a diff of this commit:
 cvs rdiff -u -r1.13 -r1.14 src/external/historical/nawk/dist/main.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.


State-Changed-From-To: open->analyzed
State-Changed-By: rin@NetBSD.org
State-Changed-When: Tue, 03 Sep 2024 01:35:28 +0000
State-Changed-Why:
- Suggested patch committed
- Fallout for usr.bin/awk/t_awk:multibyte fixed
- No need to pullup to release branches

Let us keep watching carefully for other unexpected regressions in
2nd edition nawk, at least for a little while longer.

Thanks RVP again for improving the patch!


>Unformatted:
