NetBSD Problem Report #57544

From www@netbsd.org  Wed Jul 26 17:00:38 2023
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id ECFAA1A923A
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 26 Jul 2023 17:00:37 +0000 (UTC)
Message-Id: <20230726170036.742011A923B@mollari.NetBSD.org>
Date: Wed, 26 Jul 2023 17:00:36 +0000 (UTC)
From: tlaronde@polynum.com
Reply-To: tlaronde@polynum.com
To: gnats-bugs@NetBSD.org
Subject: sed(1) and regex(3) problem with encoding
X-Send-Pr-Version: www-1.0

>Number:         57544
>Category:       bin
>Synopsis:       sed(1) and regex(3) problem with encoding
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jul 26 17:05:00 +0000 2023
>Last-Modified:  Tue Aug 01 08:25:01 +0000 2023
>Originator:     Thierry LARONDE
>Release:        NetBSD 10.0_BETA
>Organization:
>Environment:
NetBSD cauchy.polynum.local 10.0_BETA NetBSD 10.0_BETA (cauchy) #0: Mon Feb 27 11:28:34 CET 2023  tlaronde@cauchy.polynum.local:/usr/obj/polynum.NODECONF-cauchy.polynum.local_netbsd-9.3-amd64_netbsd-amd64/netbsd/obj/sys/arch/amd64/compile/cauchy amd64

>Description:
$ export LC_CTYPE=fr_FR.ISO8859-15

and then:

$ echo "éé" | sed 's/é/\&eacute;/g'
sed: 1: "s/é/\&eacute;/g": RE error: trailing backslash (\)

$ export LC_CTYPE=POSIX.ISO8859-15 # incorrect setting but...
$ echo "éé" | sed 's/é/\&eacute;/g'
&eacute;&eacute;

From a test by Martin HUSEMANN, the problem occurs on architectures where
plain char is signed. (On Apple POWERMAC_G5.MP, sed behaves as expected.)

Note: this is a regression from 9.3. It can be masked, though not solved,
by:

-   (void) setlocale(LC_ALL, "");
+   (void) setlocale(LC_ALL, "POSIX");

probably in every text utility using regex(3). 


>How-To-Repeat:
$ export LC_CTYPE=fr_FR.ISO8859-15
$ echo "éé" | sed 's/é/\&eacute;/g'
sed: 1: "s/é/\&eacute;/g": RE error: trailing backslash (\)

(On architectures where plain char is signed, such as amd64.)
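
Since the error message comes from the libc regex engine, the same failure can presumably be triggered without sed(1) by calling regcomp(3) directly. The following is a minimal sketch and not part of the original report: the helper name try_compile() is hypothetical, and the byte 0xE9 is assumed to be the ISO8859-15 encoding of é.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * try_compile() is a hypothetical helper, not part of sed(1) or libc.
 * It selects the given LC_CTYPE locale, compiles `pattern` as a basic
 * RE (cflags 0, the kind of RE sed uses by default) and returns the
 * regcomp(3) status.  It returns -1 when the locale cannot be set.
 * On a regcomp(3) failure, the regerror(3) text is stored in errbuf.
 */
static int
try_compile(const char *locale_name, const char *pattern,
    char *errbuf, size_t errlen)
{
	regex_t re;
	int rc;

	assert(errbuf != NULL && errlen > 0);
	errbuf[0] = '\0';

	if (setlocale(LC_CTYPE, locale_name) == NULL)
		return -1;		/* locale not installed */

	rc = regcomp(&re, pattern, 0);
	if (rc != 0)
		(void)regerror(rc, &re, errbuf, errlen);
	else
		regfree(&re);
	return rc;
}
```

Going by the report, on an affected system try_compile("fr_FR.ISO8859-15", "\xe9", buf, sizeof buf) would be expected to return REG_EESCAPE and fill buf with the "trailing backslash" text, while the same call with the "C" locale would return 0.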
>Fix:
Not a fix, since the underlying problem remains. A workaround:

-   (void) setlocale(LC_ALL, "");
+   (void) setlocale(LC_ALL, "POSIX");

>Audit-Trail:
From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/57544: sed(1) and regex(3) problem with encoding
Date: Wed, 26 Jul 2023 17:19:44 +0000 (UTC)

 On Wed, 26 Jul 2023, tlaronde@polynum.com wrote:

 > $ export LC_CTYPE=fr_FR.ISO8859-15
 >
 > and then:
 >
 > $ echo "éé" | sed 's/é/\&eacute;/g'
 > sed: 1: "s/é/\&eacute;/g": RE error: trailing backslash (\)
 >

 Not running NetBSD right now, but, FreeBSD 13.2 has the same issue which
 can be seen even with a plain grep(1)--as it relies on the libc regexp
 engine.

 Can you try the patch below (it is for NetBSD):

 ```
 diff -urN src/lib/libc/regex.orig/regcomp.c src/lib/libc/regex/regcomp.c
 --- src/lib/libc/regex.orig/regcomp.c	2022-12-21 17:44:15.000000000 +0000
 +++ src/lib/libc/regex/regcomp.c	2023-07-26 17:05:50.832242252 +0000
 @@ -898,7 +898,7 @@
   	handled = false;

   	assert(MORE());		/* caller should have ensured this */
 -	c = GETNEXT();
 +	c = (unsigned char)GETNEXT();
   	if (c == '\\') {
   		(void)REQUIRE(MORE(), REG_EESCAPE);
   		cc = GETNEXT();
 ```

 -RVP

From: tlaronde@polynum.com
To: gnats-bugs@netbsd.org
Cc: RVP <rvp@SDF.ORG>, Martin Husemann <martin@duskware.de>,
        Taylor R Campbell <campbell+netbsd-tech-userlevel@mumble.net>
Subject: Re: bin/57544: sed(1) and regex(3) problem with encoding
Date: Mon, 31 Jul 2023 10:52:07 +0200

 RVP has indeed found the culprit, so the following diff:

 Index: regcomp.c
 ===================================================================
 RCS file: /pub/NetBSD-CVS/src/lib/libc/regex/regcomp.c,v
 retrieving revision 1.46
 diff -u -r1.46 regcomp.c
 --- regcomp.c	11 Mar 2021 15:00:29 -0000	1.46
 +++ regcomp.c	31 Jul 2023 08:32:56 -0000
 @@ -900,10 +900,10 @@
  	handled = false;

  	assert(MORE());		/* caller should have ensured this */
 -	c = GETNEXT();
 +	c = (unsigned char)GETNEXT();
  	if (c == '\\') {
  		(void)REQUIRE(MORE(), REG_EESCAPE);
 -		cc = GETNEXT();
 +		cc = (unsigned char)GETNEXT();
  		c = BACKSL | cc;
  #ifdef REGEX_GNU_EXTENSIONS
  		if (p->gnuext) {

 solves the problem.

 Explanation: regex(3) decorates a char, or the handling of a sequence,
 by storing it in an int. In p_simp_re(), the bit immediately to the
 left of the bits needed for a char is set to 1 in that int:

 #       define  BACKSL  (1<<CHAR_BIT)

 when the char was escaped (preceded by a backslash). The later
 processing then tested for this flag.

 On a machine with signed chars and two's complement, where the sign bit
 is "extended", every negative char was then tested as if it were an
 escaped sequence.
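
The sign-extension effect described above can be checked in isolation. This is a minimal sketch under stated assumptions: the helper names widen_signed(), widen_cast() and looks_escaped() are hypothetical and do not exist in regcomp.c, and signed char is used explicitly so that the result does not depend on whether plain char is signed on the build machine.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BACKSL	(1 << CHAR_BIT)	/* same definition as quoted above */

/*
 * Hypothetical helpers.  widen_signed() mimics what happens when a
 * signed plain char is assigned to an int; widen_cast() mimics the
 * patched assignment, c = (unsigned char)GETNEXT().
 */
static int
widen_signed(signed char ch)
{
	return ch;			/* sign-extended: -23 stays -23 */
}

static int
widen_cast(signed char ch)
{
	return (unsigned char)ch;	/* 0..UCHAR_MAX: -23 becomes 0xE9 */
}

/* True when the "escaped" flag bit appears to be set in c. */
static bool
looks_escaped(int c)
{
	return (c & BACKSL) != 0;
}
```

With the byte 0xE9 held as the signed char value -23, looks_escaped(widen_signed(-23)) is true on a two's complement machine even though no backslash was seen, while looks_escaped(widen_cast(-23)) is false.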

 From a cursory look, the difference between setting LC_CTYPE=C (no
 problem) or LC_CTYPE=fr_FR.ISO8859-15 (just as an example) is perhaps
 that in the first case extended RE are assumed, while in the latter case
 legacy is used, hence not following the same path (legacy using
 p_simp_re() while ERE uses p_ere_exp()). 

 But the whole code should be reviewed by someone who knows the
 intricacies of locales and ctype, and the signed/unsigned problem
 (and, on top of that, two's complement) also needs a more thorough
 review.

 Ironically, WHATSNEW (dating back to 4.4BSD...) contains this:

 Most uses of "uchar" are gone; it's all chars now.  Char/uchar
 parameters are now written int/unsigned, to avoid possible portability
 problems with unpromoted parameters.  Some unsigned casts have been
 introduced to minimize portability problems with shifting into sign
 bits.

 So signed/unsigned and portability problems are not new...
 -- 
         Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
                     http://kertex.kergis.com/
 Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc: tlaronde@polynum.com
Subject: Re: bin/57544: sed(1) and regex(3) problem with encoding
Date: Mon, 31 Jul 2023 16:45:00 +0000 (UTC)

 On Mon, 31 Jul 2023, tlaronde@polynum.com wrote:

 > From a cursory look, the difference between setting LC_CTYPE=C (no
 > problem) or LC_CTYPE=fr_FR.ISO8859-15 (just as an example) is perhaps
 > that in the first case extended RE are assumed, while in the latter case
 > legacy is used, hence not following the same path (legacy using
 > p_simp_re() while ERE uses p_ere_exp()).
 >

 No, it's the other half of the same test on line 1030 returning true/false.
 In the fr_FR.ISO8859-1 locale, may_escape() returns false for `0xE9' because
 it _is_ an alpha char. In the C/POSIX locale, may_escape() returns true
 as `0xE9' is _not_ an alpha char there.
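
The two halves of that test can be sketched as follows. This is an assumption-laden illustration, not the code in regcomp.c: may_escape_sketch() keeps only the alphabetic test from the may_escape() excerpt quoted in this thread (the PFLAG_LEGACY_ESC branch is left out), and accepted_as_ordinary() is a hypothetical name for the assumed shape of the caller's condition.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>

#define BACKSL	(1 << CHAR_BIT)

/*
 * Simplified stand-in for may_escape(): escaping a letter, a quote or
 * a backquote is refused.  isalpha() depends on the LC_CTYPE locale.
 */
static bool
may_escape_sketch(int ch)
{
	if (isalpha((unsigned char)ch) || ch == '\'' || ch == '`')
		return false;
	return true;
}

/*
 * Assumed shape of the caller's test: c is the decorated int, ch the
 * character value.  A character is accepted as ordinary when it
 * carries no BACKSL flag (first half) or when escaping it is allowed
 * (second half).  Otherwise the caller would report REG_EESCAPE.
 */
static bool
accepted_as_ordinary(int c, int ch)
{
	return (c & BACKSL) == 0 || may_escape_sketch(ch);
}
```

In the "C" locale, a sign-extended 0xE9 (c == -23) fails the first half but passes the second, because 0xE9 is not alphabetic there, so the bug stays hidden. In a locale where 0xE9 is alphabetic, both halves fail and the error surfaces.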

 Incidentally, that isalpha() test in may_escape() should really use iswalpha()
 because `ch' is of type `wint_t':

 ```
 diff -urN regex.orig/regcomp.c regex/regcomp.c
 --- regex.orig/regcomp.c	2022-12-21 17:44:15.000000000 +0000
 +++ regex/regcomp.c	2023-07-31 16:25:38.458547000 +0000
 @@ -1422,7 +1422,7 @@

   	if ((p->pflags & PFLAG_LEGACY_ESC) != 0)
   		return (true);
 -	if (isalpha(ch) || ch == '\'' || ch == '`')
 +	if (iswalpha(ch) || ch == '\'' || ch == '`')
   		return (false);
   	return (true);
   #ifdef NOTYET
 ```
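
The distinction behind this change is that isalpha() only accepts values representable as unsigned char (or EOF), whereas iswalpha() is the classifier for wint_t values. A minimal sketch of the two cases, with hypothetical helper names byte_is_alpha() and wide_is_alpha() that do not appear in regcomp.c:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Classify a byte taken from a char buffer. */
static bool
byte_is_alpha(char ch)
{
	/* Cast first: passing a negative char to isalpha() is undefined. */
	return isalpha((unsigned char)ch) != 0;
}

/* Classify an already decoded wide character. */
static bool
wide_is_alpha(wint_t wc)
{
	return iswalpha(wc) != 0;
}
```

Both functions consult the current LC_CTYPE locale, so their answers for bytes such as 0xE9 vary with the locale in effect.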

 As you said, this code ought to be carefully audited. :)

 -RVP

From: tlaronde@polynum.com
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/57544: sed(1) and regex(3) problem with encoding
Date: Tue, 1 Aug 2023 10:22:11 +0200

 On Mon, Jul 31, 2023 at 04:50:02PM +0000, RVP wrote:

 I stand corrected. As RVP wrote, and we agree, the whole code (_with_ the
 locale dependencies) should be audited. As I am too involved in other
 work right now, I will be of no help on this. The proof is my "guess"
 about the C vs. French locale difference: it was a poor attempt to find
 a reason for the variation in behavior without looking carefully at the
 code, so that I could go back to what I was doing...
 -- 
         Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
                     http://kertex.kergis.com/
 Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C
