NetBSD Problem Report #58612
From www@netbsd.org Sat Aug 17 19:36:16 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
client-signature RSA-PSS (2048 bits) client-digest SHA256)
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 2A91E1A9243
for <gnats-bugs@gnats.NetBSD.org>; Sat, 17 Aug 2024 19:36:16 +0000 (UTC)
Message-Id: <20240817193614.717111A9244@mollari.NetBSD.org>
Date: Sat, 17 Aug 2024 19:36:14 +0000 (UTC)
From: campbell+netbsd@mumble.net
Reply-To: campbell+netbsd@mumble.net
To: gnats-bugs@NetBSD.org
Subject: c8rtomb/c16rtomb/c32rtomb yield suboptimal shift sequences
X-Send-Pr-Version: www-1.0
>Number: 58612
>Category: lib
>Synopsis: c8rtomb/c16rtomb/c32rtomb yield suboptimal shift sequences
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: riastradh
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Aug 17 19:40:00 +0000 2024
>Last-Modified: Tue Aug 20 17:45:05 +0000 2024
>Originator: Taylor R Campbell
>Release: current
>Organization:
The NetBSD Shift Sequence tomb
>Environment:
>Description:
The new c8rtomb/c16rtomb/c32rtomb functions in libc, introduced in C11
(and C23 for c8rtomb), use _citrus_iconv_convert to convert a single
Unicode scalar value (specifically, a four-byte UTF-32LE byte
sequence) to the locale-dependent multibyte character encoding.
Other than buffering the UTF-8/16 decoding in the cases of c8rtomb and
c16rtomb, this conversion is stateless: every call starts from the
initial shift state. So whenever it produces a shift sequence to a
non-initial state, such as ESC ( J in ISO-2022-JP to switch from
US-ASCII to ISO/IEC 646:JP, it also produces a shift sequence back to
the initial state.
Although this output may be correct, it is suboptimal -- and may not
fit in the output buffer of MB_CUR_MAX bytes.
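To make the byte counts concrete, here is a toy model of the two
behaviours. It is not the Citrus code: the names toy_state,
toy_encode_stateful and toy_encode_stateless are made up for
illustration, the model knows only US-ASCII and the YEN SIGN of
ISO/IEC 646:JP, and the caller is assumed to supply at least seven
bytes of output space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: two of the character sets used by ISO-2022-JP. */
enum toy_charset { TOY_ASCII, TOY_JIS_ROMAN };

/* Hypothetical conversion state; a real mbstate_t is opaque. */
struct toy_state { enum toy_charset cur; };

/* Write ESC ( B (US-ASCII) or ESC ( J (ISO/IEC 646:JP); 3 bytes. */
static size_t
toy_shift(char *out, enum toy_charset to)
{
	out[0] = 0x1b;
	out[1] = '(';
	out[2] = (to == TOY_ASCII) ? 'B' : 'J';
	return 3;
}

/* Stateful: emit a shift only if the needed set differs from st->cur. */
size_t
toy_encode_stateful(char *out, uint32_t c, struct toy_state *st)
{
	enum toy_charset need;
	unsigned char byte;
	size_t n = 0;

	assert(out != NULL && st != NULL);
	if (c == 0x00a5) {		/* YEN SIGN is 0x5c in 646:JP */
		need = TOY_JIS_ROMAN;
		byte = 0x5c;
	} else if (c < 0x80) {		/* plain US-ASCII */
		need = TOY_ASCII;
		byte = (unsigned char)c;
	} else {
		return (size_t)-1;	/* outside this toy model */
	}
	if (st->cur != need) {
		n += toy_shift(out + n, need);
		st->cur = need;
	}
	out[n++] = (char)byte;
	return n;
}

/* Stateless: always start from, and return to, the initial state. */
size_t
toy_encode_stateless(char *out, uint32_t c)
{
	struct toy_state st = { TOY_ASCII };
	size_t n = toy_encode_stateful(out, c, &st);

	if (n == (size_t)-1)
		return n;
	if (st.cur != TOY_ASCII)
		n += toy_shift(out + n, TOY_ASCII);
	return n;
}
```

Starting from a state in US-ASCII, toy_encode_stateful yields 1 byte
for 'A', 4 bytes for the first YEN SIGN and 1 byte for the next one,
whereas toy_encode_stateless yields 7 bytes for every YEN SIGN.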
>How-To-Repeat:
char buf[128];
/* LC_CTYPE is assumed to be set to an ISO-2022-JP locale */
c16rtomb(&buf[0], u'A', NULL); /* LATIN CAPITAL LETTER A */
c16rtomb(&buf[1], 0x00a5, NULL); /* YEN SIGN */
The second call should produce four bytes of output (three bytes to
shift from US-ASCII to ISO/IEC 646:JP, one byte for YEN SIGN in
ISO/IEC 646:JP), but instead it produces seven bytes (shift, YEN SIGN,
shift back). A subsequent c16rtomb with U+00A5 (YEN SIGN) should
produce only one more byte of output because the output has already
shifted to ISO/IEC 646:JP, but instead it produces another seven
bytes.
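A slightly fuller harness for measuring the per-call output lengths
might look like the following. The helper name encode_utf16 is made
up here, and unlike the snippet above it passes an explicit mbstate_t
rather than NULL; it is a sketch for experimenting, not one of the
tests in the tree.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Encode n UTF-16 code units with c16rtomb, sharing one conversion
 * state across the calls.  lens[i] receives the number of bytes
 * produced by the i-th call.  Returns the total number of bytes
 * stored in out, or (size_t)-1 on conversion error or lack of space.
 */
size_t
encode_utf16(char *out, size_t outsize, const char16_t *in, size_t n,
    size_t *lens)
{
	mbstate_t ps;
	char tmp[MB_LEN_MAX];	/* scratch space for a single call */
	size_t total = 0;

	assert(out != NULL && in != NULL && lens != NULL);
	memset(&ps, 0, sizeof(ps));	/* initial conversion state */
	for (size_t i = 0; i < n; i++) {
		size_t len = c16rtomb(tmp, in[i], &ps);

		if (len == (size_t)-1)
			return (size_t)-1;
		if (len > outsize - total)
			return (size_t)-1;
		memcpy(out + total, tmp, len);
		lens[i] = len;
		total += len;
	}
	return total;
}
```

After setlocale(LC_CTYPE, ...) selects an ISO-2022-JP locale (the
exact locale name depends on the system), encoding the code units
u'A', 0x00a5, 0x00a5 and inspecting lens shows how many bytes each
call produced, which can be compared against the counts described
above.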
>Fix:
Figure out how to use the internal Citrus API to convert a Unicode
scalar value to locale-dependent wchar_t, and use wcrtomb instead of
_citrus_iconv_convert in c32rtomb in order to produce the output with
state. (Since c8rtomb and c16rtomb are defined in terms of c32rtomb,
nothing else is needed for them.)
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: lib-bug-people->riastradh
Responsible-Changed-By: riastradh@NetBSD.org
Responsible-Changed-When: Sat, 17 Aug 2024 23:43:18 +0000
Responsible-Changed-Why:
I have tests and a fix pending
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/58612 CVS commit: src/tests/lib/libc/locale
Date: Sun, 18 Aug 2024 04:51:16 +0000
Module Name: src
Committed By: riastradh
Date: Sun Aug 18 04:51:16 UTC 2024
Modified Files:
src/tests/lib/libc/locale: t_c16rtomb.c t_c8rtomb.c
Log Message:
c8rtomb(3), c16rtomb(3), c32rtomb(3): Test stateful shift sequences.
PR lib/58612: c8rtomb/c16rtomb/c32rtomb yield suboptimal shift
sequences
To generate a diff of this commit:
cvs rdiff -u -r1.3 -r1.4 src/tests/lib/libc/locale/t_c16rtomb.c \
src/tests/lib/libc/locale/t_c8rtomb.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/58612 CVS commit: src/tests/lib/libc/locale
Date: Sun, 18 Aug 2024 05:00:20 +0000
Module Name: src
Committed By: riastradh
Date: Sun Aug 18 05:00:20 UTC 2024
Modified Files:
src/tests/lib/libc/locale: t_c8rtomb.c
Log Message:
c8rtomb(3): Fix digit error in shift sequence test.
PR lib/58612: c8rtomb/c16rtomb/c32rtomb yield suboptimal shift
sequences
To generate a diff of this commit:
cvs rdiff -u -r1.4 -r1.5 src/tests/lib/libc/locale/t_c8rtomb.c
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/58612 CVS commit: src/tests/lib/libc/locale
Date: Mon, 19 Aug 2024 16:21:47 +0000
Module Name: src
Committed By: riastradh
Date: Mon Aug 19 16:21:47 UTC 2024
Modified Files:
src/tests/lib/libc/locale: t_c16rtomb.c t_c8rtomb.c
Log Message:
t_c8rtomb, t_c16rtomb: Simplify comment.
ESC $ B is technically rather the JIS X 0208-1983 shift sequence, but
since I don't see any way to provoke the JIS X 0208-1978 shift
sequence to come flying out of this conversion (ESC $ @), and I'm not
sure there's any difference in the interpretation, let's just say JIS
X 0208.
PR lib/58612: c8rtomb/c16rtomb/c32rtomb yield suboptimal shift
sequences
To generate a diff of this commit:
cvs rdiff -u -r1.4 -r1.5 src/tests/lib/libc/locale/t_c16rtomb.c
cvs rdiff -u -r1.5 -r1.6 src/tests/lib/libc/locale/t_c8rtomb.c
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/58612 CVS commit: src
Date: Mon, 19 Aug 2024 16:22:10 +0000
Module Name: src
Committed By: riastradh
Date: Mon Aug 19 16:22:10 UTC 2024
Modified Files:
src/lib/libc/locale: c32rtomb.c c32rtomb.h
src/tests/lib/libc/locale: t_c16rtomb.c t_c8rtomb.c
Log Message:
c32rtomb(3): Use conversion state to handle shift sequences.
For conversion of Unicode scalar values to coding systems requiring
shift sequences, such as ISO-2022-JP, _citrus_iconv_convert will
always produce:
1. a shift sequence from the initial state to some nondefault state,
like from US-ASCII to JIS X 0208
2. the encoding of the desired character
3. a shift sequence restoring the initial state
This is unnecessary if the output is already in the state needed to
encode the desired character. For example, this method produces
seven bytes to encode each YEN SIGN in ISO-2022-JP -- and fourteen,
to encode two consecutive ones -- even though the shift sequence is
only three bytes long and once shifted YEN SIGN takes only one byte.
Instead, convert the Unicode scalar value to a locale-dependent wide
character and encode that, by composing
- _citrus_iconv_convert
=> gives us a multibyte encoding of the character from the initial
state (and restoring the initial state afterward)
- mbrtowc with initial conversion state
=> gives us the single wide character representation
XXX If combining characters are possible here, this may fail.
- wcrtomb with caller's conversion state
=> gives us a state-dependent multibyte encoding of the character
XXX Is there a cheaper way to convert from Unicode scalar value to
locale-dependent wide character? It is not obvious to me from the
largely undocumented Citrus machinery, but it would obviously be
better than this somewhat circuitous Rube Goldberg contraption of
chained multibyte APIs.
PR lib/58612: c8rtomb/c16rtomb/c32rtomb yield suboptimal shift
sequences
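The three-step composition described in this log message can be
approximated with public interfaces only. In the sketch below, the
function name sketch_c32rtomb is made up, and iconv(3) with
nl_langinfo(CODESET) stands in for the internal _citrus_iconv_convert;
that substitution is an assumption for illustration and is not what
c32rtomb.c does. The caller is assumed to supply MB_LEN_MAX bytes of
output space.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <limits.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Returns the number of bytes stored in s, or (size_t)-1 on failure. */
size_t
sketch_c32rtomb(char *s, char32_t c32, mbstate_t *ps)
{
	char in[4], mb[2 * MB_LEN_MAX];
	char *inp = in, *outp = mb;
	size_t inleft = sizeof(in), outleft = sizeof(mb);
	size_t mb_len;
	mbstate_t initial;
	wchar_t wc;
	iconv_t cd;

	assert(s != NULL && ps != NULL);

	/* Serialize the scalar value as a UTF-32LE byte sequence. */
	in[0] = (char)(c32 & 0xff);
	in[1] = (char)((c32 >> 8) & 0xff);
	in[2] = (char)((c32 >> 16) & 0xff);
	in[3] = (char)((c32 >> 24) & 0xff);

	/*
	 * Step 1: stateless conversion from the initial state.  The
	 * second iconv call flushes any shift back to the initial state.
	 */
	cd = iconv_open(nl_langinfo(CODESET), "UTF-32LE");
	if (cd == (iconv_t)-1)
		return (size_t)-1;
	if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1 ||
	    iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1) {
		iconv_close(cd);
		return (size_t)-1;
	}
	iconv_close(cd);

	/* Step 2: decode to one wide character from the initial state. */
	memset(&initial, 0, sizeof(initial));
	mb_len = mbrtowc(&wc, mb, (size_t)(outp - mb), &initial);
	if (mb_len == (size_t)-1 || mb_len == (size_t)-2)
		return (size_t)-1;

	/* Step 3: encode using the caller's conversion state. */
	return wcrtomb(s, wc, ps);
}
```

Because only step 3 touches the caller's state, a shift sequence is
emitted only when wcrtomb decides the current state requires one. The
caveat about combining characters noted above applies to this sketch
as well, since step 2 keeps only a single wide character.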
To generate a diff of this commit:
cvs rdiff -u -r1.3 -r1.4 src/lib/libc/locale/c32rtomb.c
cvs rdiff -u -r1.1 -r1.2 src/lib/libc/locale/c32rtomb.h
cvs rdiff -u -r1.5 -r1.6 src/tests/lib/libc/locale/t_c16rtomb.c
cvs rdiff -u -r1.6 -r1.7 src/tests/lib/libc/locale/t_c8rtomb.c
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/58612 CVS commit: src/lib/libc/locale
Date: Tue, 20 Aug 2024 17:43:41 +0000
Module Name: src
Committed By: riastradh
Date: Tue Aug 20 17:43:41 UTC 2024
Modified Files:
src/lib/libc/locale: c32rtomb.c
Log Message:
c32rtomb(3): Fix type of wcrtomb_l return value.
This was from `int wctomb_l(...)' in an earlier draft and I didn't
update it to size_t when I changed the draft to wcrtomb_l. Caught by
lint.
`wc_len' mirrors `mb_len' in the complementary code in mbrtoc32(3) to
avoid clash with standard C function mblen(3).
PR lib/58612: c8rtomb/c16rtomb/c32rtomb yield suboptimal shift
sequences
To generate a diff of this commit:
cvs rdiff -u -r1.4 -r1.5 src/lib/libc/locale/c32rtomb.c
>Unformatted:
Copyright © 1994-2024
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.