NetBSD Problem Report #47602
From christos@zoulas.com Thu Feb 28 21:15:46 2013
Return-Path: <christos@zoulas.com>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id DC28363E704
for <gnats-bugs@gnats.NetBSD.org>; Thu, 28 Feb 2013 21:15:45 +0000 (UTC)
Message-Id: <20130228211541.847F397129@rebar.astron.com>
Date: Thu, 28 Feb 2013 21:15:41 +0000 (UTC)
From: christos@netbsd.org
Reply-To: christos@netbsd.org
To: gnats-bugs@gnats.NetBSD.org
Subject: getwc() modifies input instead of returning EILSEQ.
X-Send-Pr-Version: 3.95
>Number: 47602
>Category: lib
>Synopsis: getwc() modifies input instead of returning EILSEQ.
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: lib-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Feb 28 21:20:00 +0000 2013
>Closed-Date:
>Last-Modified: Fri Sep 30 08:17:39 +0000 2016
>Originator: Christos Zoulas
>Release: NetBSD 5.99.59
>Organization:
Entropy Busters LLC.
>Environment:
System: NetBSD rebar.astron.com 5.99.59 NetBSD 5.99.59 (GENERIC) #7: Fri Dec 30 15:19:49 EST 2011 christos@rebar.astron.com:/usr/src/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
getwc() should not modify what it reads in order to make it compliant with
the charset it is supposed to be using. It should return an error instead.
Thanks to J.R. Oldroyd for submitting both the test program and the patch.
>How-To-Repeat:
#include <stdio.h>
#include <wchar.h>
#include <errno.h>
#include <locale.h>
main() {
char *fname = ",getwc-test";
wchar_t obuf[16];
wint_t c, c1;
FILE *fp;
char *locales[] = {
"C",
"en_US.UTF-8",
"ja_JP.eucJP",
0,
}, **locale;
int i;
obuf[0] = 0xa440;
obuf[1] = 0xac4f;
obuf[2] = 0xb36f;
obuf[3] = 0xcf40;
obuf[4] = L'\0';
setlocale(LC_CTYPE, "zh_TW.Big5");
fp = fopen(fname, "w");
i = fwprintf(fp, L"%ls\n", obuf);
fclose(fp);
for (locale = locales; *locale; locale++) {
setlocale(LC_CTYPE, *locale);
fp = fopen(fname, "r");
for (i = 0; i < wcslen(obuf); i++) {
c = getwc(fp);
if (c == WEOF && errno == EILSEQ)
c = (wint_t) getc(fp);
if ((c & 0xff00) == 0) {
c1 = getc(fp);
c = (c << 8) | c1;
}
if (c == obuf[i])
wprintf(L"%-12.12s %d wrote 0x%x read 0x%x ok\n", *locale, i, obuf[i], c);
else
wprintf(L"%-12.12s %d wrote 0x%x read 0x%x getwc error\n", *locale, i, obuf[i], c);
}
fclose(fp);
}
unlink(fname);
}
also see /usr/src/tests/lib/libc/locale/t_io.c for more errors.
>Fix:
--- lib/libc/citrus/modules/citrus_euc.c.orig 2013-02-28 15:50:49.000000000 -0500
+++ lib/libc/citrus/modules/citrus_euc.c 2013-02-28 15:53:59.000000000 -0500
@@ -271,7 +271,8 @@
wchar = 0;
while (len-- > 0)
wchar = (wchar << 8) | (*s1++ & 0xff);
- wchar = (wchar & ~ei->mask) | ei->bits[cs];
+ if (wchar != (wchar & ~ei->mask) | ei->bits[cs])
+ goto encoding_error;
psenc->chlen = 0;
if (pwc)
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: lib-bug-people->tnozaki
Responsible-Changed-By: christos@NetBSD.org
Responsible-Changed-When: Thu, 28 Feb 2013 17:04:53 -0500
Responsible-Changed-Why:
appropriate
Responsible-Changed-From-To: tnozaki->lib-bug-people
Responsible-Changed-By: wiz@NetBSD.org
Responsible-Changed-When: Mon, 23 Dec 2013 11:41:32 +0000
Responsible-Changed-Why:
resigned, back to role account
From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/47602 CVS commit: src/lib/libc/citrus/modules
Date: Thu, 16 Jan 2014 15:28:51 -0500
Module Name: src
Committed By: christos
Date: Thu Jan 16 20:28:51 UTC 2014
Modified Files:
src/lib/libc/citrus/modules: citrus_euc.c
Log Message:
PR/47602: Christos Zoulas: getwc() modifies input instead of returning EILSEQ.
Waited for almost a year for feedback and there was none.
To generate a diff of this commit:
cvs rdiff -u -r1.15 -r1.16 src/lib/libc/citrus/modules/citrus_euc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Ryo ONODERA <ryo_on@yk.rim.or.jp>
To: gnats-bugs@NetBSD.org, christos@netbsd.org
Cc:
Subject: Re: PR/47602 CVS commit: src/lib/libc/citrus/modules
Date: Sat, 18 Jan 2014 16:49:09 +0900 (JST)
Hi,
From: "Christos Zoulas" <christos@netbsd.org>, Date: Thu, 16 Jan 2014 20:30:01 +0000 (UTC)
> The following reply was made to PR lib/47602; it has been noted by GNATS.
>
> From: "Christos Zoulas" <christos@netbsd.org>
> To: gnats-bugs@gnats.NetBSD.org
> Cc:
> Subject: PR/47602 CVS commit: src/lib/libc/citrus/modules
> Date: Thu, 16 Jan 2014 15:28:51 -0500
>
> Module Name: src
> Committed By: christos
> Date: Thu Jan 16 20:28:51 UTC 2014
>
> Modified Files:
> src/lib/libc/citrus/modules: citrus_euc.c
>
> Log Message:
> PR/47602: Christos Zoulas: getwc() modifies input instead of returning EILSEQ.
> Waited for almost a year for feedback and there was none.
>
>
> To generate a diff of this commit:
> cvs rdiff -u -r1.15 -r1.16 src/lib/libc/citrus/modules/citrus_euc.c
>
> Please note that diffs are not public domain; they are subject to the
> copyright notices on the relevant files.
This commit kills Japanese character detection/input completely.
I cannot input Japanese characters with inputmethod/{uim,ibus} anymore.
And gettext misdetects Japanese characters as invalid string,
and I cannot build x11/mlterm etc.
Please revert or fix in another way.
Thank you.
--
Ryo ONODERA // ryo_on@yk.rim.or.jp
PGP fingerprint = 82A2 DC91 76E0 A10A 8ABB FD1B F404 27FA C7D1 15F3
State-Changed-From-To: open->feedback
State-Changed-By: yamt@NetBSD.org
State-Changed-When: Mon, 20 Jan 2014 12:40:54 +0000
State-Changed-Why:
can you explain what the test program is trying to do?
it looks bogus to me.
From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org,
netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, yamt@NetBSD.org
Cc:
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Mon, 20 Jan 2014 09:51:08 -0500
On Jan 20, 12:40pm, yamt@NetBSD.org (yamt@NetBSD.org) wrote:
-- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ
| Synopsis: getwc() modifies input instead of returning EILSEQ.
|
| State-Changed-From-To: open->feedback
| State-Changed-By: yamt@NetBSD.org
| State-Changed-When: Mon, 20 Jan 2014 12:40:54 +0000
| State-Changed-Why:
| can you explain what the test program is trying to do?
| it looks bogus to me.
1. wprintf fails on bad input where swprintf succeeds on the same input.
2. getwc example does not return WEOF on bad input in EUC but other junk.
Basically it tests behavior on bad input. The same tests on FreeBSD work
as expected.
christos
From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: christos@zoulas.com
Cc: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
gnats-admin@netbsd.org, yamt@NetBSD.org
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Tue, 21 Jan 2014 00:29:51 +0000 (UTC)
> On Jan 20, 12:40pm, yamt@NetBSD.org (yamt@NetBSD.org) wrote:
> -- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ
>
> | Synopsis: getwc() modifies input instead of returning EILSEQ.
> |
> | State-Changed-From-To: open->feedback
> | State-Changed-By: yamt@NetBSD.org
> | State-Changed-When: Mon, 20 Jan 2014 12:40:54 +0000
> | State-Changed-Why:
> | can you explain what the test program is trying to do?
> | it looks bogus to me.
>
> 1. wprintf fails on bad input where swprintf succeeds on the same input.
> 2. getwc example does not return WEOF on bad input in EUC but other junk.
>
> Basically it tests behavior on bad input. The same tests on FreeBSD work
> as expected.
my question was about the test program in this PR, not t_io.
YAMAMOTO Takashi
>
> christos
From: christos@zoulas.com (Christos Zoulas)
To: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
Cc: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org,
netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, yamt@NetBSD.org
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Mon, 20 Jan 2014 20:19:19 -0500
On Jan 21, 12:29am, yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) wrote:
-- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ
| my question was about the test program in this PR, not t_io.
The program expects to be able to read the data it wrote to the file.
I.e. if you getwc() from a file, the character you read should match
the sequence of bytes you read. This assumes that there is a 1<->1 mapping
from wchar_t to wint_t. Isn't that true?
christos
From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: christos@zoulas.com
Cc: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
gnats-admin@netbsd.org, yamt@NetBSD.org
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Tue, 21 Jan 2014 02:07:18 +0000 (UTC)
> On Jan 21, 12:29am, yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) wrote:
> -- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ
>
> | my question was about the test program in this PR, not t_io.
>
>
> The program expects to be able to read the data it wrote to the file.
> I.e. if you getwc() from a file, the character you read should match
> the sequence of bytes you read. This assumes that there is a 1<->1 mapping
> from wchar_t to wint_t. Isn't that true?
do you mean to expect the "c == obuf[i]" test always true?
you should read the same byte sequences but wchar_t values corresponding
to the byte sequence is locale dependent.
while obuf[] is wchar_t for big5 locale (right?), what getwc returns
is wchar_t for another locale. comparing them with "==" doesn't make
much sense.
(i'm not sure about the intention of the mixed use of getwc and getc
in the program, though.)
YAMAMOTO Takashi
>
> christos
From: christos@zoulas.com (Christos Zoulas)
To: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
Cc: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org,
netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, yamt@NetBSD.org
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Mon, 20 Jan 2014 23:06:06 -0500
On Jan 21, 2:07am, yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) wrote:
-- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ
| do you mean to expect the "c == obuf[i]" test always true?
If that's not true, many things turn quickly very complicated, because
you cannot read and write a file using getwc() and write it using putwc()
and expect it to work. If I know that wchar_t is 2 bytes and I call 2
getc()'s I can expect to get the next two bytes from the file I think.
| while obuf[] is wchar_t for big5 locale (right?), what getwc returns
| is wchar_t for another locale. comparing them with "==" doesn't make
| much sense.
Right, but I am not sure if this is such a great idea.
| (i'm not sure about the intention of the mixed use of getwc and getc
| in the program, though.)
I did not write this, but I think the logic is, read with getwc() and
if that fails, skip the next 2 bytes with getc() and keep going. How
else do you recover? Bail for the rest of the file, because two bytes
got corrupted?
christos
From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: christos@zoulas.com
Cc: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
gnats-admin@netbsd.org, yamt@NetBSD.org
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Tue, 21 Jan 2014 06:12:40 +0000 (UTC)
> On Jan 21, 2:07am, yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) wrote:
> -- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ
>
> | do you mean to expect the "c == obuf[i]" test always true?
>
> If that's not true, many things turn quickly very complicated, because
> you cannot read and write a file using getwc() and write it using putwc()
> and expect it to work. If I know that wchar_t is 2 bytes and I call 2
> getc()'s I can expect to get the next two bytes from the file I think.
it's expected to work if you do putwc and getwc within the same locale.
>
> | while obuf[] is wchar_t for big5 locale (right?), what getwc returns
> | is wchar_t for another locale. comparing them with "==" doesn't make
> | much sense.
>
> Right, but I am not sure if this is such a great idea.
>
> | (i'm not sure about the intention of the mixed use of getwc and getc
> | in the program, though.)
>
> I did not write this, but I think the logic is, read with getwc() and
> if that fails, skip the next 2 bytes with getc() and keep going. How
> else do you recover? Bail for the rest of the file, because two bytes
> got corrupted?
thanks for explanation.
you can probably recover in that way except that it's in general
difficult to make a correct guess how many bytes to skip.
besides that, probably it's safer to use fgetpos/fsetpos to recover
the position.
C locale is not rich enough for programs like editors.
YAMAMOTO Takashi
>
> christos
State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Fri, 30 Sep 2016 08:17:39 +0000
State-Changed-Why:
feedback happened in january 2014
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.