NetBSD Problem Report #47602

From christos@zoulas.com  Thu Feb 28 21:15:46 2013
Return-Path: <christos@zoulas.com>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id DC28363E704
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 28 Feb 2013 21:15:45 +0000 (UTC)
Message-Id: <20130228211541.847F397129@rebar.astron.com>
Date: Thu, 28 Feb 2013 21:15:41 +0000 (UTC)
From: christos@netbsd.org
Reply-To: christos@netbsd.org
To: gnats-bugs@gnats.NetBSD.org
Subject: getwc() modifies input instead of returning EILSEQ.
X-Send-Pr-Version: 3.95

>Number:         47602
>Category:       lib
>Synopsis:       getwc() modifies input instead of returning EILSEQ.
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    lib-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Feb 28 21:20:00 +0000 2013
>Closed-Date:    
>Last-Modified:  Fri Sep 30 08:17:39 +0000 2016
>Originator:     Christos Zoulas
>Release:        NetBSD 5.99.59
>Organization:
	Entropy Busters LLC.
>Environment:
System: NetBSD rebar.astron.com 5.99.59 NetBSD 5.99.59 (GENERIC) #7: Fri Dec 30 15:19:49 EST 2011 christos@rebar.astron.com:/usr/src/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:

getwc() should not modify what it reads in order to make it compliant with
the charset it is supposed to be using. It should return an error instead.
Thanks to J.R. Oldroyd for submitting both the test program and the patch.

>How-To-Repeat:

#include <stdio.h>
#include <wchar.h>
#include <errno.h>
#include <locale.h>

main() {
	char *fname = ",getwc-test";
	wchar_t obuf[16];
	wint_t c, c1;
	FILE *fp;
	char *locales[] = {
		"C",
		"en_US.UTF-8",
		"ja_JP.eucJP",
		0,
	}, **locale;
	int i;

	obuf[0] = 0xa440;
	obuf[1] = 0xac4f;
	obuf[2] = 0xb36f;
	obuf[3] = 0xcf40;
	obuf[4] = L'\0';

	setlocale(LC_CTYPE, "zh_TW.Big5");
	fp = fopen(fname, "w");
	i = fwprintf(fp, L"%ls\n", obuf);
	fclose(fp);

	for (locale = locales; *locale; locale++) {
		setlocale(LC_CTYPE, *locale);
		fp = fopen(fname, "r");
		for (i = 0; i < wcslen(obuf); i++) {
			c = getwc(fp);
			if (c == WEOF && errno == EILSEQ)
				c = (wint_t) getc(fp);
			if ((c & 0xff00) == 0) {
				c1 = getc(fp);
				c = (c << 8) | c1;
			}
			if (c == obuf[i])
				wprintf(L"%-12.12s %d wrote 0x%x read 0x%x ok\n", *locale, i, obuf[i], c);
			else
				wprintf(L"%-12.12s %d wrote 0x%x read 0x%x getwc error\n", *locale, i, obuf[i], c);
		}
		fclose(fp);
	}

	unlink(fname);
}

also see /usr/src/tests/lib/libc/locale/t_io.c for more errors.

>Fix:

--- lib/libc/citrus/modules/citrus_euc.c.orig   2013-02-28 15:50:49.000000000 -0500
+++ lib/libc/citrus/modules/citrus_euc.c        2013-02-28 15:53:59.000000000 -0500
@@ -271,7 +271,8 @@
      wchar = 0;
      while (len-- > 0)
              wchar = (wchar << 8) | (*s1++ & 0xff);
-     wchar = (wchar & ~ei->mask) | ei->bits[cs];
+     if (wchar != (wchar & ~ei->mask) | ei->bits[cs])
+             goto encoding_error;

      psenc->chlen = 0;
      if (pwc)

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: lib-bug-people->tnozaki
Responsible-Changed-By: christos@NetBSD.org
Responsible-Changed-When: Thu, 28 Feb 2013 17:04:53 -0500
Responsible-Changed-Why:
appropriate


Responsible-Changed-From-To: tnozaki->lib-bug-people
Responsible-Changed-By: wiz@NetBSD.org
Responsible-Changed-When: Mon, 23 Dec 2013 11:41:32 +0000
Responsible-Changed-Why:
resigned, back to role account


From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/47602 CVS commit: src/lib/libc/citrus/modules
Date: Thu, 16 Jan 2014 15:28:51 -0500

 Module Name:	src
 Committed By:	christos
 Date:		Thu Jan 16 20:28:51 UTC 2014

 Modified Files:
 	src/lib/libc/citrus/modules: citrus_euc.c

 Log Message:
 PR/47602: Christos Zoulas: getwc() modifies input instead of returning EILSEQ.
 Waited for almost a year for feedback and there was none.


 To generate a diff of this commit:
 cvs rdiff -u -r1.15 -r1.16 src/lib/libc/citrus/modules/citrus_euc.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Ryo ONODERA <ryo_on@yk.rim.or.jp>
To: gnats-bugs@NetBSD.org, christos@netbsd.org
Cc: 
Subject: Re: PR/47602 CVS commit: src/lib/libc/citrus/modules
Date: Sat, 18 Jan 2014 16:49:09 +0900 (JST)

 Hi,

 From: "Christos Zoulas" <christos@netbsd.org>, Date: Thu, 16 Jan 2014 20:30:01 +0000 (UTC)

 > The following reply was made to PR lib/47602; it has been noted by GNATS.
 > 
 > From: "Christos Zoulas" <christos@netbsd.org>
 > To: gnats-bugs@gnats.NetBSD.org
 > Cc: 
 > Subject: PR/47602 CVS commit: src/lib/libc/citrus/modules
 > Date: Thu, 16 Jan 2014 15:28:51 -0500
 > 
 >  Module Name:	src
 >  Committed By:	christos
 >  Date:		Thu Jan 16 20:28:51 UTC 2014
 >  
 >  Modified Files:
 >  	src/lib/libc/citrus/modules: citrus_euc.c
 >  
 >  Log Message:
 >  PR/47602: Christos Zoulas: getwc() modifies input instead of returning EILSEQ.
 >  Waited for almost a year for feedback and there was none.
 >  
 >  
 >  To generate a diff of this commit:
 >  cvs rdiff -u -r1.15 -r1.16 src/lib/libc/citrus/modules/citrus_euc.c
 >  
 >  Please note that diffs are not public domain; they are subject to the
 >  copyright notices on the relevant files.

 This commit kills Japanese character detection/input completely.
 I cannot input Japanese characters with inputmethod/{uim,ibus} anymore.
 And gettext misdetects Japanese characters as invalid string,
 and I cannot build x11/mlterm etc.

 Please revert or fix in another way.

 Thank you.

 --
 Ryo ONODERA // ryo_on@yk.rim.or.jp
 PGP fingerprint = 82A2 DC91 76E0 A10A 8ABB  FD1B F404 27FA C7D1 15F3

State-Changed-From-To: open->feedback
State-Changed-By: yamt@NetBSD.org
State-Changed-When: Mon, 20 Jan 2014 12:40:54 +0000
State-Changed-Why:
can you explain what the test program is trying to do?
it looks bogus to me.


From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org, 
	netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, yamt@NetBSD.org
Cc: 
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Mon, 20 Jan 2014 09:51:08 -0500

 On Jan 20, 12:40pm, yamt@NetBSD.org (yamt@NetBSD.org) wrote:
 -- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ

 | Synopsis: getwc() modifies input instead of returning EILSEQ.
 | 
 | State-Changed-From-To: open->feedback
 | State-Changed-By: yamt@NetBSD.org
 | State-Changed-When: Mon, 20 Jan 2014 12:40:54 +0000
 | State-Changed-Why:
 | can you explain what the test program is trying to do?
 | it looks bogus to me.

 1. wprintf fails on bad input where swprintf succeeds on the same input.
 2. getwc example does not return WEOF on bad input in EUC but other junk.

 Basically it tests behavior on bad input. The same tests on FreeBSD work
 as expected.

 christos

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: christos@zoulas.com
Cc: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
	gnats-admin@netbsd.org, yamt@NetBSD.org
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Tue, 21 Jan 2014 00:29:51 +0000 (UTC)

 > On Jan 20, 12:40pm, yamt@NetBSD.org (yamt@NetBSD.org) wrote:
 > -- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ
 > 
 > | Synopsis: getwc() modifies input instead of returning EILSEQ.
 > | 
 > | State-Changed-From-To: open->feedback
 > | State-Changed-By: yamt@NetBSD.org
 > | State-Changed-When: Mon, 20 Jan 2014 12:40:54 +0000
 > | State-Changed-Why:
 > | can you explain what the test program is trying to do?
 > | it looks bogus to me.
 > 
 > 1. wprintf fails on bad input where swprintf succeeds on the same input.
 > 2. getwc example does not return WEOF on bad input in EUC but other junk.
 > 
 > Basically it tests behavior on bad input. The same tests on FreeBSD work
 > as expected.

 my question was about the test program in this PR, not t_io.

 YAMAMOTO Takashi

 > 
 > christos

From: christos@zoulas.com (Christos Zoulas)
To: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
Cc: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org, 
	netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, yamt@NetBSD.org
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Mon, 20 Jan 2014 20:19:19 -0500

 On Jan 21, 12:29am, yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) wrote:
 -- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ

 | my question was about the test program in this PR, not t_io.


 The program expects to be able to read the data it wrote to the file.
 I.e. if you getwc() from a file, the character you read should match
 the sequence of bytes you read. This assumes that there is a 1<->1 mapping
 from wchar_t to wint_t. Isn't that true?

 christos

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: christos@zoulas.com
Cc: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
	gnats-admin@netbsd.org, yamt@NetBSD.org
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Tue, 21 Jan 2014 02:07:18 +0000 (UTC)

 > On Jan 21, 12:29am, yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) wrote:
 > -- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ
 > 
 > | my question was about the test program in this PR, not t_io.
 > 
 > 
 > The program expects to be able to read the data it wrote to the file.
 > I.e. if you getwc() from a file, the character you read should match
 > the sequence of bytes you read. This assumes that there is a 1<->1 mapping
 > from wchar_t to wint_t. Isn't that true?

 do you mean to expect the "c == obuf[i]" test always true?

 you should read the same byte sequences but wchar_t values corresponding
 to the byte sequence is locale dependent.

 while obuf[] is wchar_t for big5 locale (right?), what getwc returns
 is wchar_t for another locale.  comparing them with "==" doesn't make
 much sense.

 (i'm not sure about the intention of the mixed use of getwc and getc
 in the program, though.)

 YAMAMOTO Takashi

 > 
 > christos

From: christos@zoulas.com (Christos Zoulas)
To: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
Cc: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org, 
	netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, yamt@NetBSD.org
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Mon, 20 Jan 2014 23:06:06 -0500

 On Jan 21,  2:07am, yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) wrote:
 -- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ

 | do you mean to expect the "c == obuf[i]" test always true?

 If that's not true, many things turn quickly very complicated, because
 you cannot read and write a file using getwc() and write it using putwc()
 and expect it to work. If I know that wchar_t is 2 bytes and I call 2
 getc()'s I can expect to get the next two bytes from the file I think.

 | while obuf[] is wchar_t for big5 locale (right?), what getwc returns
 | is wchar_t for another locale.  comparing them with "==" doesn't make
 | much sense.

 Right, but I am not sure if this is such a great idea.

 | (i'm not sure about the intention of the mixed use of getwc and getc
 | in the program, though.)

 I did not write this, but I think the logic is, read with getwc() and
 if that fails, skip the next 2 bytes with getc() and keep going. How
 else do you recover? Bail for the rest of the file, because two bytes
 got corrupted?

 christos

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: christos@zoulas.com
Cc: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
	gnats-admin@netbsd.org, yamt@NetBSD.org
Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ.)
Date: Tue, 21 Jan 2014 06:12:40 +0000 (UTC)

 > On Jan 21,  2:07am, yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) wrote:
 > -- Subject: Re: lib/47602 (getwc() modifies input instead of returning EILSEQ
 > 
 > | do you mean to expect the "c == obuf[i]" test always true?
 > 
 > If that's not true, many things turn quickly very complicated, because
 > you cannot read and write a file using getwc() and write it using putwc()
 > and expect it to work. If I know that wchar_t is 2 bytes and I call 2
 > getc()'s I can expect to get the next two bytes from the file I think.

 it's expected to work if you do putwc and getwc within the same locale.

 > 
 > | while obuf[] is wchar_t for big5 locale (right?), what getwc returns
 > | is wchar_t for another locale.  comparing them with "==" doesn't make
 > | much sense.
 > 
 > Right, but I am not sure if this is such a great idea.
 > 
 > | (i'm not sure about the intention of the mixed use of getwc and getc
 > | in the program, though.)
 > 
 > I did not write this, but I think the logic is, read with getwc() and
 > if that fails, skip the next 2 bytes with getc() and keep going. How
 > else do you recover? Bail for the rest of the file, because two bytes
 > got corrupted?

 thanks for explanation.

 you can probably recover in that way except that it's in general
 difficult to make a correct guess how many bytes to skip.
 besides that, probably it's safer to use fgetpos/fsetpos to recover
 the position.

 C locale is not rich enough for programs like editors.

 YAMAMOTO Takashi

 > 
 > christos

State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Fri, 30 Sep 2016 08:17:39 +0000
State-Changed-Why:
feedback happened in january 2014


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.