NetBSD Problem Report #57616
From www@netbsd.org Mon Sep 11 13:20:23 2023
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id CBAEF1A9238
for <gnats-bugs@gnats.NetBSD.org>; Mon, 11 Sep 2023 13:20:23 +0000 (UTC)
Message-Id: <20230911132022.06D941A9239@mollari.NetBSD.org>
Date: Mon, 11 Sep 2023 13:20:21 +0000 (UTC)
From: marc.fege@uni-bonn.de
Reply-To: marc.fege@uni-bonn.de
To: gnats-bugs@NetBSD.org
Subject: sed(1) is unable to process multibyte unicode characters properly
X-Send-Pr-Version: www-1.0
>Number: 57616
>Category: bin
>Synopsis: sed(1) is unable to process multibyte unicode characters properly
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Sep 11 13:25:00 +0000 2023
>Last-Modified: Mon Sep 11 16:55:01 +0000 2023
>Originator: Marc Fege
>Release: 9.3 evbarm/i386/amd64
>Organization:
>Environment:
NetBSD rpi 9.3 NetBSD 9.3 (RPI) #0: Thu Aug 4 15:30:37 UTC 2022 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/evbarm/compile/RPI evbarm
>Description:
Hello dear community,
sed(1) has a problem processing multibyte Unicode characters properly. For example, if you try to separate a sequence of characters into single characters, NetBSD's sed(1) fails when a character is longer than one byte, e.g. when faced with Unicode characters such as the German umlauts "Ä/ä", "Ö/ö" or "Ü/ü", which are two bytes long in UTF-8, or the capital German sharp s, "ẞ", which takes three bytes per character.
I played with the shell environment variables $LC_CTYPE and $LC_ALL, as well as $LANG, setting them both to "C" and to "de_DE.UTF-8". No effect so far.
All I am trying to achieve is to space out the characters of a string that a running shell script stores dynamically in a shell variable. The string is processed by a command comparable to the following:
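For reference, a minimal sketch of how these three variables interact, assuming the POSIX precedence rules apply (LC_ALL overrides LC_CTYPE, which overrides LANG). The helper name effective_ctype is made up for illustration; it only reports which setting wins and does not check whether that locale is actually installed on the system.

```shell
# Hypothetical helper: print the locale name that governs character
# classification, following the POSIX precedence LC_ALL > LC_CTYPE > LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: with nothing set, the "C" locale is in effect.
        printf '%s\n' "C"
    fi
}

# Show which setting is in effect in the current shell.
effective_ctype
```

If this prints "C" or "POSIX" although LANG is set to a UTF-8 locale, a stray LC_ALL or LC_CTYPE is overriding it.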
echo "abcÄÖÜxyz" | sed 's/./& /g'
I expect the following output format for further processing:
"a b c Ä Ö Ü x y z "
But NetBSD's sed(1) either deletes the multibyte characters in question or prints garbled "?" characters in their place, one per byte of each character it cannot process, depending on how the environment variables were set beforehand. sed(1) never produces proper output in the desired format shown above. So NetBSD's sed(1) outputs either
"a b c x y z " ,
"a b c x y z"
or
"a b c ? ? ? ? ? ? x y z " ,
rendering some shell scripts useless when they dare to expect full Unicode support from a shell userland in the 2020s.
I also tested the desired behaviour with current GNU sed (gsed) from pkgsrc, as well as with the sed implementation of FreeBSD's current 13.2 release on a proper FreeBSD system. Both handle multibyte characters out of the box: they treat each one as a single character, regardless of its byte length, and print it correctly. Unfortunately, NetBSD's sed(1) implementation needs to be fixed in that regard, in keeping with POLA (the principle of least astonishment): a typical user nowadays finds it annoying, if not confusing, when a common shell tool such as sed behaves this way although it was asked to edit an input stream with proper syntax.
Anyway: you are doing a great job, guys!
Thumbs up!
Best regards,
Marc.
>How-To-Repeat:
In a shell that understands Unicode (e.g. with env LANG="de_DE.UTF-8"), type German umlauts, echo them, and pipe the echoed output into sed(1):
echo "abcÄÖÜxyz" | sed 's/./& /g'
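To rule out the terminal or editor encoding as a factor, the same input can be written with octal escapes and its byte length checked. This is only a sketch under the assumption that the input is UTF-8; count_bytes is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: count the bytes arriving on stdin, independent of locale.
# tr strips the padding that some wc implementations put before the number.
count_bytes() {
    LC_ALL=C wc -c | tr -d ' '
}

# "abcÄÖÜxyz" as UTF-8: Ä = \303\204, Ö = \303\226, Ü = \303\234
printf 'abc\303\204\303\226\303\234xyz' | count_bytes    # prints 12

# "ẞ" (U+1E9E) as UTF-8: \341\272\236
printf '\341\272\236' | count_bytes                      # prints 3
```

Nine characters occupy twelve bytes here, so a tool that matches "." against single bytes sees twelve units, six of which are not valid characters on their own.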
>Fix:
>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/57616: sed(1) is unable to process multibyte unicode characters properly
Date: Mon, 11 Sep 2023 15:03:24 -0000 (UTC)
marc.fege@uni-bonn.de writes:
>NetBSD rpi 9.3 NetBSD 9.3 (RPI) #0: Thu Aug 4 15:30:37 UTC 2022 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/evbarm/compile/RPI evbarm
>sed(1) has a problem processing multibyte unicode characters properly.
> echo "abcÄÖÜxyz" | sed 's/./& /g'
>I expect the following output format for further processing:
> "a b c Ä Ö Ü x y z "
It's not actually about sed failing but what the underlying regexp
library can do.
Wide char support ("NLS") from FreeBSD was integrated in 2021 and
will be in NetBSD-10.
From: "Fege, Marc Daniel" <marc.fege@uni-bonn.de>
To: gnats-bugs@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: bin/57616: sed(1) is unable to process multibyte unicode characters properly
Date: Mon, 11 Sep 2023 17:40:15 +0200
Hello Michael,
thanks a lot for your quick reply.
> Wide char support ("NLS") from FreeBSD was integrated in 2021 and
> will be in NetBSD-10.
That's fantastic news. So it seems that I'm a little late with my bug
report, even though 9.3 is the most recent stable release and
the issue is at least valid for the 9.x branch.
However: are there plans to backport this to a possible NetBSD
9.4, or do we have to wait for the 10 release?
> It's not actually about sed failing but what the underlying regexp
> library can do.
Since I'm just an ordinary user, not a developer in
any way, I was unable to give details beyond a surface-level
diagnosis. What I see as a user, as the frontend of all that underlying
machinery, is just a program called sed(1). That's why I referred to
it in this particular use case.
Thanks a lot!
Copyright © 1994-2023
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.