NetBSD Problem Report #48427

From www@NetBSD.org  Fri Dec  6 21:11:50 2013
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 46234A642D
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  6 Dec 2013 21:11:50 +0000 (UTC)
Message-Id: <20131206211149.12016A6451@mollari.NetBSD.org>
Date: Fri,  6 Dec 2013 21:11:48 +0000 (UTC)
From: yuri@rawbw.com
Reply-To: yuri@rawbw.com
To: gnats-bugs@NetBSD.org
Subject: libedit shouldn't require ISO 10646
X-Send-Pr-Version: www-1.0

>Number:         48427
>Category:       lib
>Synopsis:       libedit shouldn't require ISO 10646
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    lib-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Dec 06 21:15:00 +0000 2013
>Last-Modified:  Sun Dec 08 00:15:00 +0000 2013
>Originator:     Yuri
>Release:        current
>Organization:
n/a
>Environment:
>Description:
While porting lib/libedit to FreeBSD I noticed these lines in chartype.h:

#ifndef __STDC_ISO_10646__
/* In many places it is assumed that the first 127 code points are ASCII
 * compatible, so ensure wchar_t indeed does ISO 10646 and not some other
 * funky encoding that could break us in weird and wonderful ways. */
        #error wchar_t must store ISO 10646 characters
#endif

You limit the character set to UCS (ISO 10646) in order to make sure that the first 128 code points (0 through 127) are ASCII. Many character sets satisfy this condition, and UCS is just one of them. Other practical examples are KOI8-U and KOI8-R for Cyrillic, ISO/IEC 8859-15, and others for other languages.

FreeBSD, for example, doesn't define __STDC_ISO_10646__ because the user can select any other character set through the environment.

I am not sure what the right solution is, but requiring ISO 10646 isn't right, and it breaks compilation on any platform that doesn't define the macro.
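
Whether the #error above fires depends only on whether the compiler and libc define the macro. A minimal sketch for probing this on a given platform follows. It is not part of libedit, and the names iso10646_revision and report_iso10646 are made up for illustration. When the macro is defined, the C standard specifies that it expands to an integer constant of the form yyyymmL.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical probe, not part of libedit: return the value of
 * __STDC_ISO_10646__ if the implementation defines it, or 0 if not.
 */
static long
iso10646_revision(void)
{
#ifdef __STDC_ISO_10646__
	return (long)__STDC_ISO_10646__;
#else
	return 0L;
#endif
}

/* Print a one-line summary of the probe result to the given stream. */
static void
report_iso10646(FILE *out)
{
	long rev = iso10646_revision();

	assert(out != NULL);
	if (rev != 0)
		fprintf(out, "__STDC_ISO_10646__ = %ldL\n", rev);
	else
		fprintf(out, "__STDC_ISO_10646__ is not defined\n");
}
```

Calling report_iso10646(stdout) from main() on the target system shows which branch of the #ifndef in chartype.h would be taken there.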

>How-To-Repeat:

>Fix:

>Audit-Trail:
From: Thomas Klausner <wiz@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/48427: libedit shouldn't require ISO 10646
Date: Sat, 7 Dec 2013 01:03:50 +0100

 On Fri, Dec 06, 2013 at 09:15:00PM +0000, yuri@rawbw.com wrote:
 > While porting lib/libedit to FreeBSD I noticed these lines in chartype.h:
 > 
 > #ifndef __STDC_ISO_10646__
 > /* In many places it is assumed that the first 127 code points are ASCII
 >  * compatible, so ensure wchar_t indeed does ISO 10646 and not some other
 >  * funky encoding that could break us in weird and wonderful ways. */
 >         #error wchar_t must store ISO 10646 characters
 > #endif
 > 
 > You limit the character set to UCS (ISO 10646) in order to make sure that the first 128 code points (0 through 127) are ASCII. Many character sets satisfy this condition, and UCS is just one of them. Other practical examples are KOI8-U and KOI8-R for Cyrillic, ISO/IEC 8859-15, and others for other languages.
 >
 > FreeBSD, for example, doesn't define __STDC_ISO_10646__ because the user can select any other character set through the environment.
 >
 > I am not sure what the right solution is, but requiring ISO 10646 isn't right, and it breaks compilation on any platform that doesn't define the macro.

 Do I understand correctly that you're saying that what the user
 defines in the environment changes how wchar_t is defined?
  Thomas

From: Yuri <yuri@rawbw.com>
To: gnats-bugs@NetBSD.org, lib-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
        netbsd-bugs@NetBSD.org
Cc: 
Subject: Re: lib/48427: libedit shouldn't require ISO 10646
Date: Fri, 06 Dec 2013 16:15:04 -0800

 On 12/06/2013 16:05, Thomas Klausner wrote:
 >   Do I understand correctly that you're saying that what the user
 >   defines in the environment changes how wchar_t is defined?

 No, wchar_t by definition holds the numeric values of character code
 points wider than 8 bits, for various character sets.
 The particular character set represented by wchar_t may vary depending
 on the choice of the user.

 Your compile-time restriction of the character set to UCS is too narrow.
 The character set should be allowed to vary at runtime, and any
 character set restrictions should be enforced at runtime too.

 Yuri
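
A runtime check along the lines suggested above might look like the following. This is only a sketch under stated assumptions, not code from libedit: wchar_is_ascii_compatible and require_ascii_compatible_wchar are hypothetical names, and btowc() only covers single-byte characters in the initial shift state, which is sufficient for the ASCII range.

```c
#include <assert.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of libedit: return 1 if every ASCII
 * character (0..127) converts to a wchar_t with the same numeric value
 * in the current LC_CTYPE locale, and 0 otherwise.
 */
static int
wchar_is_ascii_compatible(void)
{
	int c;

	for (c = 0; c < 128; c++) {
		if (btowc(c) != (wint_t)c)
			return 0;
	}
	return 1;
}

/*
 * Runtime counterpart of the compile-time #error: abort if the
 * assumption does not hold. Note that assert() is compiled out
 * when NDEBUG is defined, so real code would need its own error path.
 */
static void
require_ascii_compatible_wchar(void)
{
	assert(wchar_is_ascii_compatible());
}
```

A caller would normally run setlocale(LC_CTYPE, "") first, so that the check applies to the locale the user selected rather than to the default "C" locale.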

From: Thomas Klausner <wiz@NetBSD.org>
To: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: lib/48427: libedit shouldn't require ISO 10646
Date: Sat, 7 Dec 2013 10:42:31 +0100

 On Fri, Dec 06, 2013 at 04:15:04PM -0800, Yuri wrote:
 > On 12/06/2013 16:05, Thomas Klausner wrote:
 > >  Do I understand correctly that you're saying that what the user
 > >  defines in the environment changes how wchar_t is defined?
 > 
 > No, wchar_t by definition holds the numeric values of character code
 > points wider than 8 bits, for various character sets.
 > The particular character set represented by wchar_t may vary depending
 > on the choice of the user.
 >
 > Your compile-time restriction of the character set to UCS is too narrow.
 > The character set should be allowed to vary at runtime, and any
 > character set restrictions should be enforced at runtime too.

 I don't know much about this stuff, but I can easily imagine that
 wchar_t always contains UCS and that there is a translation layer
 between the user-visible encoding and the one in wchar_t.

 I'll let someone who knows more about this stuff continue this
 conversation.
  Thomas

From: Yuri <yuri@rawbw.com>
To: gnats-bugs@NetBSD.org, lib-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
        netbsd-bugs@NetBSD.org
Cc: 
Subject: Re: lib/48427: libedit shouldn't require ISO 10646
Date: Sat, 07 Dec 2013 01:51:57 -0800

 On 12/07/2013 01:45, Thomas Klausner wrote:
 >   I don't know much about this stuff, but I can easily imagine that
 >   wchar_t always contains UCS and that there is a translation layer
 >   between the user-visible encoding and the one in wchar_t.

 Please read the Wikipedia article:
 https://en.wikipedia.org/wiki/Wide_character
 It discusses the definition of wchar_t in detail. There is no direct
 relationship between wchar_t and UCS.

 Yuri

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/48427: libedit shouldn't require ISO 10646
Date: Sat, 7 Dec 2013 18:59:55 +0100

 On Fri, Dec 06, 2013 at 09:15:00PM +0000, yuri@rawbw.com wrote:
 > You limit the character set to UCS (ISO 10646) in order to make sure
 > that the first 128 code points (0 through 127) are ASCII. Many
 > character sets satisfy this condition, and UCS is just one of them.
 > Other practical examples are KOI8-U and KOI8-R for Cyrillic,
 > ISO/IEC 8859-15, and others for other languages.

 Just like Thomas, I don't understand your point. The encoding used for
 wchar_t is a fixed implementation detail and does not rely on any user
 environment settings in NetBSD.

 You may use any of the encodings you list above for multibyte character
 sequences, but you will always get full 32-bit Unicode for wchar_t.
 At least on NetBSD.

 Martin


From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, yuri@rawbw.com
Cc: 
Subject: Re: lib/48427: libedit shouldn't require ISO 10646
Date: Sat, 7 Dec 2013 15:26:49 -0500

 On Dec 7, 12:05am, wiz@NetBSD.org (Thomas Klausner) wrote:
 -- Subject: Re: lib/48427: libedit shouldn't require ISO 10646

 |  Do I understand correctly that you're saying that what the user
 |  defines in the environment changes how wchar_t is defined?

 Even so, you can add __FreeBSD__ to the list of OSes that skip the
 check and call it a day...

 christos
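
The workaround suggested above could be sketched roughly as follows. The existing exemption list in chartype.h is not shown in this report, so the platforms named here are illustrative assumptions, and EL_ISO10646_CHECK_SKIPPED and iso10646_check_skipped are hypothetical names added only so the result can be inspected. On a platform that is not listed and does not define the macro, the #error still fires by design.

```c
/*
 * Sketch only: the platform list is an assumption, not the actual
 * contents of chartype.h. Listed platforms bypass the ISO 10646 check.
 */
#if defined(__NetBSD__) || defined(__FreeBSD__)
# define EL_ISO10646_CHECK_SKIPPED 1
#else
# define EL_ISO10646_CHECK_SKIPPED 0
# ifndef __STDC_ISO_10646__
#  error wchar_t must store ISO 10646 characters
# endif
#endif

/* Report whether the compile-time check was bypassed on this platform. */
static int
iso10646_check_skipped(void)
{
	return EL_ISO10646_CHECK_SKIPPED;
}
```

This keeps the safety net for unknown platforms while accepting, for the listed ones, that the ASCII assumption has been verified by other means.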

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/48427: libedit shouldn't require ISO 10646
Date: Sun, 8 Dec 2013 01:13:34 +0100

 On Sat, Dec 07, 2013 at 06:00:01PM +0000, Martin Husemann wrote:
 >  You may use any of the encodings you list above for multibyte character
 >  sequences, but you will always get full 32-bit Unicode for wchar_t.
 >  At least on NetBSD.

 No, you won't necessarily get that. That's why we don't set the macro
 either. The internal encoding of wchar_t for a given locale is exactly
 that -- an encoding detail. We do provide a few basic promises, but
 that's about it.

 Joerg
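
One way to observe what a given platform actually stores in wchar_t is to decode a multibyte character under the current locale and look at the resulting value. The following is a minimal sketch; first_wchar_value is a hypothetical helper name. It uses only the standard mbrtowc() interface and makes no claim about what any particular system returns for non-ASCII input.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode the first multibyte character of `s` in
 * the current LC_CTYPE locale and return its wchar_t value as a long.
 * Returns -1 if `s` is empty or does not start with a complete, valid
 * multibyte character.
 */
static long
first_wchar_value(const char *s)
{
	mbstate_t st;
	wchar_t wc;
	size_t n;

	assert(s != NULL);
	memset(&st, 0, sizeof(st));	/* start in the initial shift state */
	n = mbrtowc(&wc, s, strlen(s), &st);
	if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
		return -1;
	return (long)wc;
}
```

Comparing the returned values across locales (after calling setlocale(LC_CTYPE, "")) for the same non-ASCII character shows whether the wchar_t encoding on that system is locale-dependent.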
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.