NetBSD Problem Report #58619
From www@netbsd.org Tue Aug 20 07:09:00 2024
Message-Id: <20240820070858.A124D1A9242@mollari.NetBSD.org>
Date: Tue, 20 Aug 2024 07:08:58 +0000 (UTC)
From: rokuyama.rk@gmail.com
Reply-To: rokuyama.rk@gmail.com
To: gnats-bugs@NetBSD.org
Subject: nawk 2024-08-17 broken and incompatible for non-UTF-8 and non-C locales
X-Send-Pr-Version: www-1.0
>Number: 58619
>Category: bin
>Synopsis: nawk 2024-08-17 broken and incompatible for non-UTF-8 and non-C locales
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: bin-bug-people
>State: analyzed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Aug 20 07:10:00 +0000 2024
>Closed-Date:
>Last-Modified: Tue Sep 03 01:35:28 +0000 2024
>Originator: Rin Okuyama
>Release: 10.99.11
>Organization:
Internet Initiative Japan Inc.
>Environment:
NetBSD rp64 10.99.11 NetBSD 10.99.11 (GENERIC64) #2: Tue Aug 20 13:15:56 JST 2024 rin@dancena:/home/rin/src/sys/arch/evbarm/compile/GENERIC64 evbarm
>Description:
nawk 2024-08-17 was recently imported as /usr/bin/awk.
This version is based on the "2nd edition", but compatibility with
8-bit-clean single-byte locales such as "C" seems to have been improved:
https://github.com/onetrueawk/awk/commit/1087d46
(BTW, their documentation is *REALLY* poor.)
However, it still gives broken results for non-UTF-8 multibyte
locales. The results are not only broken but also incompatible with
older versions, at least for non-8-bit-clean multibyte locales.
For example, in previous versions, the length() builtin counts
the number of bytes for, e.g., ja_JP.eucJP. The new version, however,
counts the number of characters, misinterpreting the input as UTF-8 :(
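
To make the byte-versus-character discrepancy concrete, here is a minimal C sketch. It is not nawk's code: it assumes a UTF-8-style counter that treats every byte that is not a continuation byte (10xxxxxx) as the start of a character. The function names and the sample bytes are hypothetical, picked only because all six bytes lie in the EUC-JP two-byte range.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte length: what older versions reported for ja_JP.eucJP. */
size_t
count_bytes(const char *s)
{
	assert(s != NULL);
	return strlen(s);
}

/*
 * Hypothetical UTF-8-style count (an assumption for illustration,
 * not nawk's implementation): each byte that is not a continuation
 * byte (bit pattern 10xxxxxx) is taken as the start of a character.
 */
size_t
count_utf8_style(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/* Six sample bytes in the EUC-JP two-byte range, i.e. three characters. */
const char euc_sample[] = "\xC6\xFC\xCB\xDC\xB8\xEC";
```

For this sample, count_bytes(euc_sample) returns 6 and count_utf8_style(euc_sample) returns 5, because only 0xB8 happens to look like a continuation byte. Neither value matches the three EUC-JP characters actually present.
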
>How-To-Repeat:
Try euc.txt, which I converted to EUC-JP from
http://www.jp.netbsd.org/ja/JP/index.html
---
$ ftp https://www.netbsd.org/~rin/euc.txt
...
$ env LC_CTYPE=ja_JP.eucJP \
awk 'BEGIN{sum = 0} {sum += length($0)} END{print sum}' euc.txt
---
Older versions and 2024-08-17 give 10978 and 10418, respectively.
>Fix:
Just for example above:
https://gist.github.com/rokuyama/c7e6d12b6a7bcad0704f706c4f7e9569
However, I'm still not sure whether the "2nd edition" of
nawk should be used at all...
>Release-Note:
>Audit-Trail:
From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/58619: nawk 2024-08-17 broken and incompatible for non-UTF-8
and non-C locales
Date: Tue, 20 Aug 2024 10:36:34 +0000 (UTC)
On Tue, 20 Aug 2024, rokuyama.rk@gmail.com wrote:
> (BTW, their documentation is *REALLY* poor.)
>
Yes, the BSD extensions aren't documented in the man page on the `bsd-features' branch.
> Try euc.txt, which I converted to EUC-JP from
> http://www.jp.netbsd.org/ja/JP/index.html
>
> ---
> $ ftp https://www.netbsd.org/~rin/euc.txt
> ...
> $ env LC_CTYPE=ja_JP.eucJP \
> awk 'BEGIN{sum = 0} {sum += length($0)} END{print sum}' euc.txt
> ---
>
> Older versions and 2024-08-17 give 10978 and 10418, respectively.
> >Fix:
> Just for example above:
>
> https://gist.github.com/rokuyama/c7e6d12b6a7bcad0704f706c4f7e9569
>
Well, I guess it's a pain to prepend `LC_ALL=C' in every non-UTF-8 locale, so:
```
diff -urN nawk.orig/dist/main.c nawk/dist/main.c
--- nawk.orig/dist/main.c	2024-08-18 03:11:06.691688756 +0000
+++ nawk/dist/main.c	2024-08-20 10:24:10.089804741 +0000
@@ -32,6 +32,7 @@
 #include <stdio.h>
 #include <ctype.h>
 #include <locale.h>
+#include <langinfo.h>
 #include <stdlib.h>
 #include <string.h>
 #include <signal.h>
@@ -143,6 +144,8 @@
 	setlocale(LC_CTYPE, "");
 	setlocale(LC_NUMERIC, "C"); /* for parsing cmdline & prog */
+	if (strcmp(nl_langinfo(CODESET), "UTF-8"))
+		setlocale(LC_ALL, "C"); /* not UTF-8, force "C" */
 	awk_mb_cur_max = MB_CUR_MAX;
 	cmdname = argv[0];
 	if (argc == 1) {
```
-RVP
From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/58619: nawk 2024-08-17 broken and incompatible for non-UTF-8
and non-C locales
Date: Tue, 20 Aug 2024 11:00:01 +0000 (UTC)
On Tue, 20 Aug 2024, gnats-admin@netbsd.org wrote:
> + setlocale(LC_ALL, "C"); /* not UTF-8, force "C" */
>
Make that: setlocale(LC_CTYPE, "C");
-RVP
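
Put together, the patch plus the LC_CTYPE correction amount to roughly the following start-up logic. This is only a standalone sketch under stated assumptions, not the code committed to main.c: the function name is hypothetical, and the assert() merely documents the expectation that the "C" locale is single-byte.

```c
#define _POSIX_C_SOURCE 200809L	/* POSIX declarations under strict -std */
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper sketching the logic discussed in this thread.
 * Returns 1 if a UTF-8 LC_CTYPE from the environment was kept,
 * 0 if LC_CTYPE was forced back to "C".
 */
int
init_ctype_utf8_or_c(void)
{
	/* Adopt LC_CTYPE from the environment first. */
	setlocale(LC_CTYPE, "");
	if (strcmp(nl_langinfo(CODESET), "UTF-8") == 0)
		return 1;

	/* Not UTF-8: reset LC_CTYPE only, leaving other categories alone. */
	setlocale(LC_CTYPE, "C");
	assert(MB_CUR_MAX == 1);	/* "C" locale: one byte per character */
	return 0;
}
```

Resetting only LC_CTYPE, rather than LC_ALL, keeps the change limited to character classification and multibyte handling, which is the category the length() discrepancy depends on.
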
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
RVP <rvp@SDF.ORG>
Cc:
Subject: Re: bin/58619: nawk 2024-08-17 broken and incompatible for non-UTF-8
and non-C locales
Date: Fri, 30 Aug 2024 18:01:36 +0900
Ah, thanks. Your fix seems much smarter.
I'm still not convinced that we should have the "2nd edition" awk.
But it would be a good idea to have this fix in -current for now.
We should also upstream this, but they seem to believe everyone
uses UTF-8 or ASCII ;)
Thanks,
rin
From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/58619 CVS commit: src/external/historical/nawk/dist
Date: Sun, 1 Sep 2024 10:45:39 -0400
Module Name: src
Committed By: christos
Date: Sun Sep 1 14:45:39 UTC 2024
Modified Files:
src/external/historical/nawk/dist: main.c
Log Message:
PR/58619: Rin Okuyama: Force C locale for non-utf-8 (from RVP)
To generate a diff of this commit:
cvs rdiff -u -r1.12 -r1.13 src/external/historical/nawk/dist/main.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
"christos@netbsd.org" <christos@NetBSD.org>
Cc:
Subject: Re: PR/58619 CVS commit: src/external/historical/nawk/dist
Date: Mon, 2 Sep 2024 00:17:10 +0900
Hi,
Thanks, but `setlocale(LC_CTYPE, "")` preceding nl_langinfo(3) cannot
be dropped. Without it, `nl_langinfo(CODESET)` always returns "646"
(== ASCII), regardless of environment variables.
Thanks,
rin
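
The point about call order can be checked with a small C sketch. The function name is hypothetical; the sketch relies only on the C standard's rule that a program starts in the "C" locale, so without a prior setlocale(LC_CTYPE, "") the environment has no influence on what nl_langinfo(CODESET) reports.

```c
#define _POSIX_C_SOURCE 200809L	/* POSIX declarations under strict -std */
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: report whether the current codeset is UTF-8.
 * When adopt_env is nonzero, LC_CTYPE is first taken from the
 * environment; otherwise the locale is left untouched (the "C"
 * locale if nothing has called setlocale() yet).
 */
int
codeset_is_utf8(int adopt_env)
{
	const char *cs;

	if (adopt_env)
		setlocale(LC_CTYPE, "");
	cs = nl_langinfo(CODESET);
	assert(cs != NULL);	/* POSIX: always a valid string */
	return strcmp(cs, "UTF-8") == 0;
}
```

Calling codeset_is_utf8(0) at program start returns 0 even when the environment selects a UTF-8 locale, so a UTF-8 check made before setlocale(LC_CTYPE, "") would always take the "force C" path. The codeset name reported in that state is "646" on NetBSD, as noted above; other systems may use a different name for the same thing.
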
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@NetBSD.org, netbsd-bugs@NetBSD.org
Cc:
Subject: Re: PR/58619 CVS commit: src/external/historical/nawk/dist
Date: Tue, 3 Sep 2024 10:29:09 +0900
-------- Forwarded Message --------
Subject: CVS commit: src/external/historical/nawk/dist
Date: Mon, 2 Sep 2024 01:04:35 +0000
From: Rin Okuyama <rin@netbsd.org>
Reply-To: source-changes-d@NetBSD.org
To: source-changes-full@NetBSD.org
Module Name: src
Committed By: rin
Date: Mon Sep 2 01:04:35 UTC 2024
Modified Files:
src/external/historical/nawk/dist: main.c
Log Message:
nawk: Fix previous for UTF-8 locales
We need `setlocale(LC_CTYPE, "")` before `nl_langinfo(CODESET)`.
Otherwise, `nl_langinfo(CODESET)` returns "646" (== ASCII)
regardless of environment variables.
This fixes regression for usr.bin/awk/t_awk:multibyte.
To generate a diff of this commit:
cvs rdiff -u -r1.13 -r1.14 src/external/historical/nawk/dist/main.c
State-Changed-From-To: open->analyzed
State-Changed-By: rin@NetBSD.org
State-Changed-When: Tue, 03 Sep 2024 01:35:28 +0000
State-Changed-Why:
- Suggested patch committed
- Fallout for usr.bin/awk/t_awk:multibyte fixed
- No need to pull up to release branches
Let us watch carefully for other unexpected regressions in the 2nd edition nawk,
at least for a little while longer.
Thanks again to RVP for improving the patch!
>Unformatted:
Copyright © 1994-2024
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.