NetBSD Problem Report #57616

Date: Mon, 11 Sep 2023 13:20:21 +0000 (UTC)
From: marc.fege@uni-bonn.de
Reply-To: marc.fege@uni-bonn.de
To: gnats-bugs@NetBSD.org
Subject: sed(1) is unable to process multibyte unicode characters properly
X-Send-Pr-Version: www-1.0

>Number:         57616
>Category:       bin
>Synopsis:       sed(1) is unable to process multibyte unicode characters properly
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Sep 11 13:25:00 +0000 2023
>Last-Modified:  Mon Sep 11 16:55:01 +0000 2023
>Originator:     Marc Fege
>Release:        9.3 evbarm/i386/amd64
>Organization:
>Environment:
NetBSD rpi 9.3 NetBSD 9.3 (RPI) #0: Thu Aug  4 15:30:37 UTC 2022  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/evbarm/compile/RPI evbarm
>Description:
Hello dear community,

sed(1) has a problem processing multibyte Unicode characters properly. For example, if you try to separate a sequence of characters into single characters, NetBSD's sed(1) fails whenever a character is longer than the single byte of an ASCII code. Examples are the German umlauts "Ä/ä, Ö/ö or Ü/ü", which are two bytes long in UTF-8, or the capital form of the German ß, "ẞ", which takes three bytes per character.
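
The byte lengths mentioned above can be checked independently of sed(1). The following is a minimal sketch and not part of the original report: byte_count is a hypothetical helper name, and the characters are written as octal escapes of their UTF-8 encodings so that the result does not depend on the terminal or editor in use. Only printf(1), wc(1) and tr(1) are assumed.

```shell
# Hypothetical helper (not from the report): count the bytes that a
# printf(1) format string expands to. tr(1) strips the padding that some
# wc(1) implementations print in front of the number.
byte_count() {
    printf "$1" | wc -c | tr -d ' '
}

byte_count 'a'              # ASCII letter
byte_count '\303\204'       # U+00C4, capital A with diaeresis, as UTF-8
byte_count '\341\272\236'   # U+1E9E, capital sharp s, as UTF-8
```

The three calls print 1, 2 and 3, matching the one-, two- and three-byte cases described in the report.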

I experimented with the shell environment variables $LC_CTYPE, $LC_ALL and $LANG, setting them to "C" as well as to "de_DE.UTF-8". Neither setting produced correct output.

All I am trying to do is insert a space after each character of a string that a running shell script has stored in a shell variable. The string is processed by a command comparable to the following:

     echo "abcÄÖÜxyz" | sed 's/./& /g'

I expect the following output format for further processing:
     "a b c Ä Ö Ü x y z "

Instead, depending on the environment variable settings, NetBSD's sed(1) either drops the multibyte characters in question or prints one garbled "?" for each byte of every multibyte character it cannot process. It never produces the desired output shown above. So NetBSD's sed(1) outputs either
     "a b c x y z "  ,
     "a b c       x y z"
           or
     "a b c ? ? ? ? ? ? x y z "  ,
rendering some shell scripts useless when they dare to expect full Unicode support from a shell userland in the 2020s.

I also tested the desired behaviour with the current GNU sed (gsed) from pkgsrc and with the sed implementation of FreeBSD's current 13.2 release on a proper FreeBSD system. Both understand multibyte characters out of the box, treat each one as a single character regardless of its byte length, and print them correctly. NetBSD's sed(1) implementation needs to be fixed in that regard, in keeping with POLA: users nowadays find it annoying, if not confusing, when a common shell tool such as sed behaves this way although they asked it to edit an input stream with the proper syntax.

Anyway: you are doing a great job, guys!
Thumbs up!

Best regards,
Marc.
>How-To-Repeat:
In a shell that understands Unicode (e.g. env LANG="de_DE.UTF-8"), type German umlauts, echo them, and pipe the echoed output into sed(1):

echo "abcÄÖÜxyz" | sed 's/./& /g'
>Fix:

>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/57616: sed(1) is unable to process multibyte unicode characters properly
Date: Mon, 11 Sep 2023 15:03:24 -0000 (UTC)

 marc.fege@uni-bonn.de writes:

 >NetBSD rpi 9.3 NetBSD 9.3 (RPI) #0: Thu Aug  4 15:30:37 UTC 2022  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/evbarm/compile/RPI evbarm

 >sed(1) has a problem processing multibyte unicode characters properly.

 >     echo "abcÄÖÜxyz" | sed 's/./& /g'
 >I expect the following output format for further processing:
 >     "a b c Ä Ö Ü x y z "


 It's not actually about sed failing but what the underlying regexp
 library can do.

 Wide char support ("NLS") from FreeBSD was integrated in 2021 and
 will be in NetBSD-10.
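
The distinction drawn above, between sed(1) itself and the way the regexp code matches ".", can be made visible with a small sketch. It is an illustration only and not taken from this PR: sample, space_bytes and space_chars are hypothetical helper names, UTF8_LOCALE is a hypothetical variable, and de_DE.UTF-8 is simply the locale named in the report, which may not be installed on a given system. The sketch also assumes a sed(1) that matches one byte at a time in the C locale.

```shell
# "aÄb" followed by a newline, written as UTF-8 bytes via octal escapes.
sample() {
    printf 'a\303\204b\n'
}

# Byte-oriented matching: in the C locale "." is expected to match one
# byte at a time, so the two bytes of the umlaut are split apart.
space_bytes() {
    sample | LC_ALL=C sed 's/./& /g'
}

# Character-oriented matching: the umlaut only stays intact if the locale
# exists AND the regexp code used by sed is multibyte-aware.
space_chars() {
    sample | LC_ALL="${UTF8_LOCALE:-de_DE.UTF-8}" sed 's/./& /g'
}

space_bytes | wc -c    # size in bytes of the byte-wise result
space_chars | wc -c    # size in bytes of the character-wise result
```

If matching is byte-wise, each of the four input bytes gains a space and the result is 9 bytes long including the newline. If matching is character-wise, each of the three characters gains a space and the result is 8 bytes long. A second count other than 8 suggests that the locale is missing or that "." is still being matched per byte.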

From: "Fege, Marc Daniel" <marc.fege@uni-bonn.de>
To: gnats-bugs@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: bin/57616: sed(1) is unable to process multibyte unicode
 characters properly
Date: Mon, 11 Sep 2023 17:40:15 +0200

 Hello Michael,


 thanks a lot for your quick reply.


 > Wide char support ("NLS") from FreeBSD was integrated in 2021 and
 > will be in NetBSD-10.

 That's fantastic news. So it seems I'm a little late with my bug
 report, even though 9.3 is the most recent stable release and the
 issue is at least valid for the 9.x branch.

 However, are there plans to backport this to a possible NetBSD 9.4,
 or do we actually have to wait for a possible 10 release?


 > It's not actually about sed failing but what the underlying regexp
 > library can do.

 Since I'm just an ordinary user, not a developer in any way, I was
 unable to give details beyond a surface-level diagnosis. As a user,
 all I see in front of all that underlying machinery is a program
 called sed(1). That's why I referred to it in this particular use
 case.

 Thanks a lot!

