NetBSD Problem Report #59041

From ym3by-nb@yahoo.com  Thu Jan 30 18:59:22 2025
Return-Path: <ym3by-nb@yahoo.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id D136F1A923A
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 30 Jan 2025 18:59:22 +0000 (UTC)
Message-Id: <1133394373.3530611.1738261675367@mail.yahoo.com>
Date: Thu, 30 Jan 2025 18:27:55 +0000 (UTC)
From: "ym3by-nb@yahoo.com" <ym3by-nb@yahoo.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>
Subject: /usr/bin/grep includes '[' in named character classes for UTF-8;
 also affects grep -w
References: <1133394373.3530611.1738261675367.ref@mail.yahoo.com>

>Number:         59041
>Category:       bin
>Synopsis:       /usr/bin/grep includes '[' in named character classes for UTF-8; also affects grep -w
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    gnats-admin
>State:          open
>Class:          sw-bug
>Submitter-Id:   unknown
>Arrival-Date:   Thu Jan 30 19:00:01 +0000 2025
>Last-Modified:  Wed Mar 05 20:14:09 +0000 2025
>Originator:     Mike Burrows
>Release:        NetBSD 10.1
>Organization:
>Environment:
System: NetBSD wombat 10.1 NetBSD 10.1 (GENERIC) #0: Mon Dec 16 13:08:11 UTC 2024 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
/usr/bin/grep mis-parses named character classes (such as [:alnum:])
when using a multibyte character encoding, such as UTF-8.

The mis-parsing leads to it adding the '[' to the character class.
Thus, when using en_US.UTF-8, the character class [:digit:] will
include '[' in addition to '0','1','2','3','4','5','6','7','8','9'.

The problem affects "grep -w" because "-w" causes grep to surround the
user's pattern with
	\(^\|[^[:alnum:]_]\)\( and \)\([^[:alnum:]_]\|$\)
This means that, when using a UTF-8 character set, "grep -w foo" won't
find instances of " foo[", because the '[' is treated incorrectly as a
character that is part of a word.

(I first noticed the problem because grep was failing to find instances
of array names in .c files when using "-w".)

The problem affects egrep also, because it uses the same code.

It does not affect "fgrep -w", even though it's the same binary,
because fgrep does not use the DFA code that contains the problem.
>How-To-Repeat:
	echo ' foo[' | env -i LC_CTYPE=en_US.UTF-8 /usr/bin/grep -w foo
This should output the line ' foo[', but does not.

To see that the problem is to do with named character classes:
	echo '[' | env -i LC_CTYPE=en_US.UTF-8 /usr/bin/grep '[[:digit:]]'
which outputs the line '[', even though the line contains no digits.

To see that this affects only multibyte characters:
	echo '[' | env -i LC_CTYPE=C /usr/bin/grep '[[:digit:]]'
which outputs nothing, correctly.
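
For comparison, the expected results can be checked independently of
grep. The following is a minimal sketch that uses the C library's POSIX
regcomp()/regexec(), not grep's DFA matcher, so it only illustrates what
the commands above ought to report. The helper name ere_matches and the
macro WORD_FOO are hypothetical, and WORD_FOO is an extended-regex
spelling of the "-w" wrapper quoted in the description, which is an
assumption about equivalence rather than grep's own code. It runs in
whatever locale the caller has set (the C locale by default).

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical ERE spelling of the "-w" wrapper around the pattern foo. */
#define WORD_FOO "(^|[^[:alnum:]_])foo([^[:alnum:]_]|$)"

/*
 * Hypothetical helper: returns 1 if `text` matches the POSIX extended
 * regular expression `pattern`, 0 if it does not, and -1 if the pattern
 * fails to compile.
 */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, ere_matches("[[:digit:]]", "[") should return 0 and
ere_matches(WORD_FOO, " foo[") should return 1, which is the behaviour
the report expects from grep in a UTF-8 locale as well.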
>Fix:
I believe that /usr/bin/grep is built from the sources in
	/usr/src/external/gpl2/grep/dist/src
and that the error is in the routine
	parse_bracket_exp_mb()

Line 508: the code notices the start of a character class:
	if (wc == L'[' && ...
Line 512: wc ('[') is copied into wc1:
	wc1 = wc;
Line 516: start of parse of named character class:
	if (cur_mb_len == 1 && (wc == L':' || wc == L'.' || wc == L'='))
Line 592: wc is set to -1, but wc1 continues to hold '[':
	wc = -1;
Lines 593-648: uses of wc1 here do not modify it
Line 649: the '[' in wc1 is copied into wc, for the next iteration:
	while ((wc = wc1) != L']');

And so the '[' that started the named character class is effectively
appended to it on the next iteration of the do-while loop.

This does not affect single-byte character sets, because their named
character classes are handled separately, starting at line 1021.

I believe that if wc1 were set to -1 at line 592, it would fix the
problem; e.g., make line 592 be:
	wc1 = wc = -1;

That change seems to work for the cases given above, but I have not
done enough testing to be certain that there are no unwanted
side-effects.
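
The control flow described above can be modelled in isolation. The
following is a simplified, hypothetical sketch and not the grep source:
the names collect_members, fetch and NO_CHAR are invented for
illustration, named classes are only skipped rather than expanded,
ranges ("a-z") and the cur_mb_len check are omitted, and the "Line NNN"
comments refer to the steps listed in this report. The apply_fix flag
selects between the current behaviour and the proposed change.

```c
#include <stddef.h>
#include <wchar.h>

#define NO_CHAR ((wint_t)-1)   /* stand-in for the -1 sentinel in the report */

/* Return the next character, without advancing past the terminator. */
static wint_t fetch(const wchar_t **p)
{
    wint_t c = (wint_t)**p;
    if (c != L'\0')
        (*p)++;
    return c;
}

/*
 * Hypothetical model of the do-while loop.  `body` is the text after the
 * opening '[' of a bracket expression, e.g. L"[:digit:]]".  Literal
 * members are stored in `out` (at most `cap`); the count is returned.
 */
size_t collect_members(const wchar_t *body, int apply_fix,
                       wchar_t *out, size_t cap)
{
    size_t n = 0;
    wint_t wc, wc1;

    wc = fetch(&body);
    do {
        wc1 = NO_CHAR;
        if (wc == L'[') {                       /* "Line 508" */
            wc1 = wc;                           /* "Line 512" */
            wc = fetch(&body);
            if (wc == L':' || wc == L'.' || wc == L'=') {  /* "Line 516" */
                wint_t delim = wc;
                /* Skip the class name up to the closing delim + ']'. */
                while (*body != L'\0' &&
                       !((wint_t)body[0] == delim && body[1] == L']'))
                    body++;
                if (*body == L'\0')
                    return n;                   /* unbalanced input */
                body += 2;
                wc = NO_CHAR;                   /* "Line 592" */
                if (apply_fix)
                    wc1 = NO_CHAR;              /* proposed: wc1 = wc = -1 */
            } else {
                /* A plain '[' is an ordinary member; swap wc and wc1. */
                wint_t tmp = wc1;
                wc1 = wc;
                wc = tmp;
            }
        }
        if (wc1 == NO_CHAR)
            wc1 = fetch(&body);
        if (wc != NO_CHAR && n < cap)
            out[n++] = (wchar_t)wc;             /* ordinary member */
    } while ((wc = wc1) != L']' && wc != L'\0'); /* "Line 649" */
    return n;
}
```

In this model, calling collect_members(L"[:digit:]]", 0, out, 8) stores
a single '[' in out, mirroring the reported symptom, while the same call
with apply_fix set to 1 stores nothing. It says nothing about
side-effects elsewhere in the real routine.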

>Release-Note:

>Audit-Trail:

>Unformatted:
