NetBSD Problem Report #57798

From ryo@tetera.org  Wed Dec 27 12:44:36 2023
Return-Path: <ryo@tetera.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 7332A1A9238
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 27 Dec 2023 12:44:36 +0000 (UTC)
Message-Id: <4e184a7f8c1ccb4adb263909604ea124.ryo@tetera.org>
Date: Wed, 27 Dec 2023 21:44:29 +0900
From: ryo@tetera.org
Reply-To: ryo@tetera.org
To: gnats-bugs@NetBSD.org
Subject: With src/share/locale/ctype/en_US.UTF-8.src, wcwidth() returns 3 when ja_JP.UTF-8 locale is used
X-Send-Pr-Version: 3.95

>Number:         57798
>Category:       lib
>Synopsis:       With src/share/locale/ctype/en_US.UTF-8.src, wcwidth() returns 3 when ja_JP.UTF-8 locale is used
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    lib-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Dec 27 12:45:00 +0000 2023
>Closed-Date:    Sun Jan 14 16:48:06 +0000 2024
>Last-Modified:  Sun Jan 14 16:48:06 +0000 2024
>Originator:     Ryo ONODERA
>Release:        NetBSD 10.99.10
>Organization:
Ryo ONODERA // ryo@tetera.org
PGP fingerprint = 82A2 DC91 76E0 A10A 8ABB  FD1B F404 27FA C7D1 15F3
>Environment:


System: NetBSD castella 10.99.10 NetBSD 10.99.10 (DTRACE9) #0: Mon Dec 25 05:18:50 JST 2023 ryoon@castella:/usr/world/10.99/amd64/obj/sys/arch/amd64/compile/DTRACE9 amd64
Architecture: x86_64
Machine: amd64
>Description:
The following test program prints 3 for the 兆 character.

=== === === === === === === === === === === === === === === ===
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>

int
main(void)
{
        const char *str = "兆";// -> 3, should be 2

        setlocale(LC_ALL, "ja_JP.UTF-8");
        //setlocale(LC_ALL, "C");

        wchar_t cp;

        int len = mbtowc(&cp, str, 5);
        if (len < 1)
                return -1;

        int w = wcwidth(cp);
        printf("w = %d\n", w);

        return 0;
}
=== === === === === === === === === === === === === === === ===

TODIGIT   < 0x5146 1000000000000 >

in

src/share/locale/ctype/en_US.UTF-8.src

was introduced in r1.8, and it seems to make wcwidth() return 3 for the 兆 character.

>How-To-Repeat:

>Fix:
I have no idea.

>Release-Note:

>Audit-Trail:
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@netbsd.org, lib-bug-people@netbsd.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Cc: 
Subject: Re: lib/57798: With src/share/locale/ctype/en_US.UTF-8.src, wcwidth()
 returns 3 when ja_JP.UTF-8 locale is used
Date: Wed, 27 Dec 2023 22:52:04 +0900

 兆 (U+5146) is a Chinese symbol for 1,000,000,000,000.

 I've confirmed that wcwidth(3) wrongly returns 3 for U+5146,
 while it correctly returns 2 for 億 (U+5104: 100,000,000).

 wcwidth(3) also returns 3 for U+16B60 (a Pahawh Hmong
 character for 10,000,000,000).

 These failures are likely due to broken TODIGIT support in
 mklocale(1). Its man page says:

  > TODIGIT    Defines a map from runes to their digit value.
  >            (snip) Only values up to 255 are allowed.

 OpenBSD has already dropped support for TODIGIT from mklocale(1):
 https://github.com/OpenBSD/src/commit/4efe9bdeb34

 If this commit is mechanically applied to netbsd-10,
 wcwidth(3) correctly reports 2 for U+5146.

 I will commit it and send a pullup request to netbsd-10,
 if there are no objections.

 Thanks,
 rin

From: Valery Ushakov <uwe@stderr.spb.ru>
To: Rin Okuyama <rokuyama.rk@gmail.com>
Cc: gnats-bugs@netbsd.org
Subject: Re: lib/57798: With src/share/locale/ctype/en_US.UTF-8.src,
 wcwidth() returns 3 when ja_JP.UTF-8 locale is used
Date: Wed, 27 Dec 2023 16:52:28 +0300

 On Wed, Dec 27, 2023 at 22:52:04 +0900, Rin Okuyama wrote:

 > https://github.com/OpenBSD/src/commit/4efe9bdeb34
 > 
 > If this commit is mechanically applied to netbsd-10,
 > wcwidth(3) correctly reports 2 for U+5146.

 Please, can you leave a comment somewhere around the new "DIGITMAP
 mapignore" rule that mentions that we now ignore this information, so
 that it's obvious without consulting the version history?

 And of course s/Ox/Nx/ in the man page diff.

 Thanks!

 -uwe

From: "Rin Okuyama" <rin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57798 CVS commit: src/usr.bin/mklocale
Date: Thu, 28 Dec 2023 03:49:35 +0000

 Module Name:	src
 Committed By:	rin
 Date:		Thu Dec 28 03:49:35 UTC 2023

 Modified Files:
 	src/usr.bin/mklocale: mklocale.1 yacc.y

 Log Message:
 mklocale: XXX: Neglect TODIGIT at the moment

 PR lib/57798

 It was implemented with an assumption that all digit characters
 can be mapped to numerical values <= 255.

 This is no longer true for Unicode, and results in, e.g., wrong
 return values of wcwidth(3) for U+5146 or U+16B60.

 As a workaround, neglect TODIGIT for now, as done for OpenBSD:
 https://github.com/OpenBSD/src/commit/4efe9bdeb34

 XXX
 At least netbsd-10 should be fixed, but it requires some tests.


 To generate a diff of this commit:
 cvs rdiff -u -r1.17 -r1.18 src/usr.bin/mklocale/mklocale.1
 cvs rdiff -u -r1.34 -r1.35 src/usr.bin/mklocale/yacc.y

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Rin Okuyama <rokuyama.rk@gmail.com>
To: Valery Ushakov <uwe@stderr.spb.ru>, ryo@tetera.org
Cc: gnats-bugs@netbsd.org, netbsd-bugs@NetBSD.org
Subject: Re: lib/57798: With src/share/locale/ctype/en_US.UTF-8.src, wcwidth()
 returns 3 when ja_JP.UTF-8 locale is used
Date: Thu, 28 Dec 2023 12:57:08 +0900

 On 2023/12/27 22:52, Valery Ushakov wrote:
 > On Wed, Dec 27, 2023 at 22:52:04 +0900, Rin Okuyama wrote:
 > 
 >> https://github.com/OpenBSD/src/commit/4efe9bdeb34
 >>
 >> If this commit is mechanically applied to netbsd-10,
 >> wcwidth(3) correctly reports 2 for U+5146.
 > 
 > Please, can you leave a comment somewhere around the new "DIGITMAP
 > mapignore" rule that mentions that we now ignore this information, so
 > that it's obvious without consulting the version history?
 > 
 > And of course s/Ox/Nx/ in the man page diff.

 Thank you uwe@ for careful review!

 I've committed it to -current. Let us see what happens for a while.

 It would be nice if we could pull this up (in some form) to netbsd-10,
 but I'm afraid there's not enough time left before the 10.0 release...

 ryoon@, do you have some ideas to test this change? For what
 application have you encountered this problem? Is it fixed now
 without regression?

 Thanks,
 rin

From: Ryo ONODERA <ryo@tetera.org>
To: Rin Okuyama <rokuyama.rk@gmail.com>, Valery Ushakov <uwe@stderr.spb.ru>
Cc: gnats-bugs@netbsd.org
Subject: Re: lib/57798: With src/share/locale/ctype/en_US.UTF-8.src,
 wcwidth() returns 3 when ja_JP.UTF-8 locale is used
Date: Thu, 28 Dec 2023 13:27:53 +0900

 Hi,

 Rin Okuyama <rokuyama.rk@gmail.com> writes:

 > On 2023/12/27 22:52, Valery Ushakov wrote:
 >> On Wed, Dec 27, 2023 at 22:52:04 +0900, Rin Okuyama wrote:
 >> 
 >>> https://github.com/OpenBSD/src/commit/4efe9bdeb34
 >>>
 >>> If this commit is mechanically applied to netbsd-10,
 >>> wcwidth(3) correctly reports 2 for U+5146.
 >> 
 >> Please, can you leave a comment somewhere around the new "DIGITMAP
 >> mapignore" rule that mentions that we now ignore this information, so
 >> that it's obvious without consulting the version history?
 >> 
 >> And of course s/Ox/Nx/ in the man page diff.
 >
 > Thank you uwe@ for careful review!
 >
 > I've committed it to -current. Let us see what happens for a while.
 >
 > It would be nice if we could pull this up (in some form) to netbsd-10,
 > but I'm afraid there's not enough time left before the 10.0 release...
 >
 > ryoon@, do you have some ideas to test this change? For what
 > application have you encountered this problem? Is it fixed now
 > without regression?

 TODIGIT for U+5146 and similar Kanji characters has always been useless
 for me.

 I forgot to mention in the original PR that tmux in the NetBSD base
 system (/usr/bin/tmux) is affected by this problem.
 If you set LANG=ja_JP.UTF-8, start tmux, and display the 兆 character
 in the tmux window, tmux exits with the error message
 '[server exited unexpectedly]', and all tmux sessions are lost.
 Recent tmux assumes that wcwidth() always returns a value <= 2.

 See: 'ud->width > 2' in utf8_from_data() in src/external/bsd/tmux/dist/utf8.c

 /* Get UTF-8 character from data. */
 enum utf8_state
 utf8_from_data(const struct utf8_data *ud, utf8_char *uc)
 {
         u_int   index;

         if (ud->width > 2)
                 fatalx("invalid UTF-8 width: %u", ud->width);


 Thank you.

 > Thanks,
 > rin

 -- 
 Ryo ONODERA // ryo@tetera.org
 PGP fingerprint = 82A2 DC91 76E0 A10A 8ABB  FD1B F404 27FA C7D1 15F3

From: Brett Lymn <blymn@internode.on.net>
To: gnats-bugs@netbsd.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
        ryo@tetera.org
Subject: Re: lib/57798: With src/share/locale/ctype/en_US.UTF-8.src,
 wcwidth() returns 3 when ja_JP.UTF-8 locale is used
Date: Fri, 29 Dec 2023 07:20:32 +1030

 On Thu, Dec 28, 2023 at 04:00:03AM +0000, Rin Okuyama wrote:
 >  
 >  ryoon@, do you have some ideas to test this change? For what
 >  application have you encountered this problem? Is it fixed now
 >  without regression?
 >  

 Our wide curses relies on wcwidth to determine cursor positioning and call widths, so any
 curses based application attempting to display these characters will have a corrupted
 display.

 -- 
 Brett Lymn
 --
 Sent from my NetBSD device.

 "We are were wolves",
 "You mean werewolves?",
 "No we were wolves, now we are something else entirely",
 "Oh"

From: Valery Ushakov <uwe@stderr.spb.ru>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: PR/57798 CVS commit: src/usr.bin/mklocale
Date: Fri, 29 Dec 2023 00:16:12 +0300

 On Thu, Dec 28, 2023 at 03:50:01 +0000, Rin Okuyama wrote:

 >  It was implemented with an assumption that all digit characters
 >  can be mapped to numerical values <= 255.

 Unicode has three different "numeric" values for a character

 Unicode Character Database
 https://unicode.org/reports/tr44/

   Numeric_Value is extracted based on the actual numeric value of the
   data in field 8 of UnicodeData.txt or the values of the
   kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric tags, for
   characters listed in the Unihan data files.

   Numeric_Type is extracted as follows.  If fields 6, 7, and 8 in
   UnicodeData.txt are all non-empty, then Numeric_Type=Decimal.
   Otherwise, if fields 7 and 8 are both non-empty, then
   Numeric_Type=Digit.  Otherwise, if field 8 is non-empty, then
   Numeric_Type=Numeric.  For characters listed in the Unihan data
   files, Numeric_Type=Numeric for characters that have
   kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric tags.  The
   default value is Numeric_Type=None.
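
The Numeric_Type rule quoted above can be written out as a short C sketch. The enum and function names here are hypothetical, the three arguments are assumed to be fields 6, 7 and 8 of one already-split UnicodeData.txt record, and the Unihan-based part of the rule is deliberately left out.

```c
#include <stddef.h>

/* Hypothetical names; these are not taken from any NetBSD source file. */
enum numeric_type { NT_NONE, NT_NUMERIC, NT_DIGIT, NT_DECIMAL };

/* True if a UnicodeData.txt field is present and non-empty. */
static int
nonempty(const char *field)
{
	return field != NULL && field[0] != '\0';
}

/*
 * Classify one record from its fields 6, 7 and 8, following the
 * rule quoted above.  Characters that are numeric only through the
 * Unihan data files are not handled by this sketch.
 */
static enum numeric_type
numeric_type(const char *f6, const char *f7, const char *f8)
{
	if (nonempty(f6) && nonempty(f7) && nonempty(f8))
		return NT_DECIMAL;
	if (nonempty(f7) && nonempty(f8))
		return NT_DIGIT;
	if (nonempty(f8))
		return NT_NUMERIC;
	return NT_NONE;
}
```

For a record whose only non-empty numeric field is field 8, such as a fraction, this yields NT_NUMERIC.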

 The intention of TODIGIT is likely to be able to eventually provide
 support for something like LC_TIME's alt_digits or glibc printf(3)
 extension that provides 'I' modifier for %d and friends - that use
 locale-specific digits, say u+0f20..u+0f29 for Tibetan/Dzongkha
 locales.
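
As an illustration of what locale-specific digits would mean in practice, here is a minimal sketch that formats a number with the ten code points starting at a given "zero" digit, such as U+0F20. The function name format_alt_digits is hypothetical and is not an existing libc or glibc interface; the sketch assumes the ten digits are consecutive code points that each encode to three UTF-8 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Format `value` in decimal using the ten consecutive code points
 * starting at `zero` (for example 0x0F20), encoded as UTF-8.
 * Only zero digits in U+0800..U+D7F6 are accepted, so every digit
 * needs exactly three bytes.  Returns the number of bytes written
 * (not counting the terminating NUL), or 0 if `zero` is out of
 * range or the buffer is too small.
 */
static size_t
format_alt_digits(uint32_t value, uint32_t zero, char *buf, size_t len)
{
	unsigned digits[10];	/* a 32-bit value has at most 10 digits */
	size_t n = 0, out = 0;

	if (zero < 0x0800 || zero + 9 > 0xD7FF)
		return 0;

	/* Collect decimal digits, least significant first. */
	do {
		digits[n++] = value % 10;
		value /= 10;
	} while (value != 0);

	if (len < n * 3 + 1)
		return 0;

	/* Emit most significant digit first, three bytes per digit. */
	while (n > 0) {
		uint32_t cp = zero + digits[--n];

		buf[out++] = (char)(0xE0 | (cp >> 12));
		buf[out++] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[out++] = (char)(0x80 | (cp & 0x3F));
	}
	buf[out] = '\0';
	return out;
}
```

Calling format_alt_digits(42, 0x0F20, buf, sizeof(buf)) with a 16-byte buffer writes the UTF-8 encodings of U+0F24 and U+0F22, six bytes in total.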

 But I don't really know much about those areas of locales...

 -uwe

From: "Rin Okuyama" <rin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57798 CVS commit: src/usr.bin/mklocale
Date: Fri, 5 Jan 2024 02:38:07 +0000

 Module Name:	src
 Committed By:	rin
 Date:		Fri Jan  5 02:38:06 UTC 2024

 Modified Files:
 	src/usr.bin/mklocale: mklocale.1 yacc.y

 Log Message:
 mklocale(1): Add range check for TODIGIT, rather than disabling it

 PR lib/57798

 Digit value specified by TODIGIT is stored in the lowest 8 bits of
 _RuneType, see lib/libc/locale/runetype_file.h:

 https://nxr.netbsd.org/xref/src/lib/libc/locale/runetype_file.h#56

 The symptom reported in the PR is due to missing range check for
 this value; values of 256 and above were mistakenly treated as
 other flag bits in _RuneType.

 For example, U+5146 has numerical value 1,000,000,000,000 ==
 0xe8d4a51000 where __BITS(30, 31) == _RUNETYPE_SW3 are turned on.
 This is why wcwidth(3) returned 3 for this character.

 This apparently affected not only character width, but also other
 attributes stored in _RuneType.

 IIUC, digit value attributes in _RuneType have never been utilized
 until now, but preserve these if digit fits within (0, 256). This
 should be safer for pulling this up into netbsd-10. Also, these
 attributes may be useful to implement some I18N features as
 suggested by uwe@ in the PR.

 netbsd-[98] is not affected as these use old UTF-8 ctype definitions.
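
The arithmetic described in this log message can be checked with a small C sketch. The macro and function names below are hypothetical stand-ins that only follow the description above (digit value in the lowest 8 bits, screen width in bits 30..31); they are not copied from runetype_file.h, and the checked variant illustrates the idea of a range check rather than the actual yacc.y change.

```c
#include <stdint.h>

/* Hypothetical layout, following the description in the log message. */
#define DIGIT_MASK	UINT32_C(0x000000ff)	/* lowest 8 bits: digit value */
#define WIDTH_SHIFT	30			/* bits 30..31: screen width */
#define WIDTH_MASK	(UINT32_C(3) << WIDTH_SHIFT)

/* Width field that results when a digit value is OR-ed in unchecked. */
static unsigned
width_after_unchecked_or(uint32_t type, uint64_t digit)
{
	type |= (uint32_t)digit;	/* truncated to 32 bits, no range check */
	return (unsigned)((type & WIDTH_MASK) >> WIDTH_SHIFT);
}

/* Range-checked variant: digit values that need more than 8 bits are ignored. */
static unsigned
width_after_checked_or(uint32_t type, uint64_t digit)
{
	if (digit <= DIGIT_MASK)
		type |= (uint32_t)digit;
	return (unsigned)((type & WIDTH_MASK) >> WIDTH_SHIFT);
}
```

Starting from a type word whose width field is 2, OR-ing in 1000000000000 (0xe8d4a51000, truncated to 0xd4a51000) sets both width bits, so the unchecked variant reports 3 while the checked variant still reports 2.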


 To generate a diff of this commit:
 cvs rdiff -u -r1.18 -r1.19 src/usr.bin/mklocale/mklocale.1
 cvs rdiff -u -r1.35 -r1.36 src/usr.bin/mklocale/yacc.y

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@netbsd.org, lib-bug-people@netbsd.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org, ryo@tetera.org
Cc: ryo@tetera.org, Brett Lymn <blymn@internode.on.net>,
 Valery Ushakov <uwe@stderr.spb.ru>, Martin Husemann <martin@duskware.de>
Subject: Re: PR/57798 CVS commit: src/usr.bin/mklocale
Date: Fri, 5 Jan 2024 12:05:53 +0900

 I've committed a revised fix to -current.

 As written in the commit log, the original problem affected not
 only wcwidth(3), but also other attributes stored in _RuneType.

 I'll send a pullup request to netbsd-10 tomorrow, if there are no
 objections.

 On 2023/12/29 6:40, Valery Ushakov wrote:
 >   Unicode has three different "numeric" values for a character
 >   
 >   Unicode Character Database
 >   https://unicode.org/reports/tr44/
 >   
 >     Numeric_Value is extracted based on the actual numeric value of the
 >     data in field 8 of UnicodeData.txt or the values of the
 >     kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric tags, for
 >     characters listed in the Unihan data files.
 >   
 >     Numeric_Type is extracted as follows.  If fields 6, 7, and 8 in
 >     UnicodeData.txt are all non-empty, then Numeric_Type=Decimal.
 >     Otherwise, if fields 7 and 8 are both non-empty, then
 >     Numeric_Type=Digit.  Otherwise, if field 8 is non-empty, then
 >     Numeric_Type=Numeric.  For characters listed in the Unihan data
 >     files, Numeric_Type=Numeric for characters that have
 >     kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric tags.  The
 >     default value is Numeric_Type=None.
 >   
 >   The intention of TODIGIT is likely to be able to eventually provide
 >   support for something like LC_TIME's alt_digits or glibc printf(3)
 >   extension that provides 'I' modifier for %d and friends - that use
 >   locale-specific digits, say u+0f20..u+0f29 for Tibetan/Dzongkha
 >   locales.
 >   
 >   But I don't really know much about those areas of locales...

 Thank you for the info. As far as I can see, most of the characters in
 question are categorized as Numeric_Type=Numeric, and it seems
 difficult to distinguish these with, e.g., [0-9a-f].

 On 2023/12/29 5:50, Brett Lymn wrote:
  > Our wide curses relies on wcwidth to determine cursor positioning and
  > call widths, so any curses based application attempting to display
  > these characters will have a corrupted display.

 Yeah, /usr/bin/vi actually gets confused when editing a message that
 contains U+5146 ;)

 I've roughly checked output from -d option of mklocale(1).
 Width (and other attribute fields) seems fixed now, as far as
 I can see.

 On 2023/12/28 13:30, Ryo ONODERA wrote:
  >   See: 'ud->width > 2' in utf8_from_data() in
  >   src/external/bsd/tmux/dist/utf8.c
  >
  >   /* Get UTF-8 character from data. */
  >   enum utf8_state
  >   utf8_from_data(const struct utf8_data *ud, utf8_char *uc)
  >   {
  >           u_int   index;
  >
  >           if (ud->width > 2)
  >                   fatalx("invalid UTF-8 width: %u", ud->width);

 Oops. I'm not quite sure whether this is good programming
 practice, but it was actually useful for finding the problem ;)
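
For comparison, a non-fatal way to handle an unexpected width might look like the sketch below. The helper name checked_cell_width is hypothetical and this is not what tmux does; it only shows a caller rejecting a wcwidth()-style result outside 0..2 instead of aborting.

```c
/*
 * Hypothetical helper: accept a wcwidth()-style result only if it is
 * a plausible terminal cell count (0, 1 or 2).  Anything else,
 * including the 3 reported in this PR, is mapped to -1 so that the
 * caller can treat the character as unprintable.
 */
static int
checked_cell_width(int w)
{
	if (w < 0 || w > 2)
		return -1;
	return w;
}
```

The trade-off is that a silent fallback like this would have hidden the locale data bug that the fatal check exposed.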

 Thanks,
 rin

State-Changed-From-To: open->pending-pullups
State-Changed-By: rin@NetBSD.org
State-Changed-When: Sat, 06 Jan 2024 15:46:22 +0000
State-Changed-Why:
[pullup-10 #538] [SERIOUS] Fix mklocale(1) for lib/57798
https://releng.netbsd.org/cgi-bin/req-10.cgi?show=538

netbsd-[98] are not affected.


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57798 CVS commit: [netbsd-10] src/usr.bin/mklocale
Date: Sun, 14 Jan 2024 15:15:00 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Sun Jan 14 15:15:00 UTC 2024

 Modified Files:
 	src/usr.bin/mklocale [netbsd-10]: mklocale.1 yacc.y

 Log Message:
 Pull up following revision(s) (requested by rin in ticket #538):

 	usr.bin/mklocale/yacc.y: revision 1.35
 	usr.bin/mklocale/yacc.y: revision 1.36
 	usr.bin/mklocale/mklocale.1: revision 1.18
 	usr.bin/mklocale/mklocale.1: revision 1.19

 mklocale: XXX: Neglect TODIGIT at the moment
 PR lib/57798

 It was implemented with an assumption that all digit characters
 can be mapped to numerical values <= 255.
 This is no longer true for Unicode, and results in, e.g., wrong
 return values of wcwidth(3) for U+5146 or U+16B60.

 As a workaround, neglect TODIGIT for now, as done for OpenBSD:
 https://github.com/OpenBSD/src/commit/4efe9bdeb34
 XXX

 At least netbsd-10 should be fixed, but it requires some tests.

 mklocale(1): Add range check for TODIGIT, rather than disabling it
 PR lib/57798

 Digit value specified by TODIGIT is stored in the lowest 8 bits of
 _RuneType, see lib/libc/locale/runetype_file.h:
 https://nxr.netbsd.org/xref/src/lib/libc/locale/runetype_file.h#56

 The symptom reported in the PR is due to missing range check for
 this value; values of 256 and above were mistakenly treated as
 other flag bits in _RuneType.

 For example, U+5146 has numerical value 1,000,000,000,000 ==
 0xe8d4a51000 where __BITS(30, 31) == _RUNETYPE_SW3 are turned on.

 This is why wcwidth(3) returned 3 for this character.

 This apparently affected not only character width, but also other
 attributes stored in _RuneType.

 IIUC, digit value attributes in _RuneType have never been utilized
 until now, but preserve these if digit fits within (0, 256). This
 should be safer for pulling this up into netbsd-10. Also, these
 attributes may be useful to implement some I18N features as
 suggested by uwe@ in the PR.

 netbsd-[98] is not affected as these use old UTF-8 ctype definitions.


 To generate a diff of this commit:
 cvs rdiff -u -r1.17 -r1.17.16.1 src/usr.bin/mklocale/mklocale.1
 cvs rdiff -u -r1.34 -r1.34.8.1 src/usr.bin/mklocale/yacc.y

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Sun, 14 Jan 2024 16:48:06 +0000
State-Changed-Why:
fixed and pulled up to 10, not needed for 9 or 8


>Unformatted:

Copyright © 1994-2024 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.