NetBSD Problem Report #47983

From mm_lists@pulsar-zone.net  Mon Jul  1 22:02:07 2013
Return-Path: <mm_lists@pulsar-zone.net>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 7E046716C6
	for <gnats-bugs@gnats.NetBSD.org>; Mon,  1 Jul 2013 22:02:07 +0000 (UTC)
Message-Id: <201307012057.r61KvUwi021393@ginseng.pulsar-zone.net>
Date: Mon, 1 Jul 2013 16:57:30 -0400
From: Matthew Mondor <mm_lists@pulsar-zone.net>
To: gnats-bugs@gnats.NetBSD.org
Subject: libedit segfault at character decoding error (sh autocomplete)

>Number:         47983
>Category:       lib
>Synopsis:       libedit segfault at character decoding error (sh autocomplete)
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    lib-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Jul 01 22:05:00 +0000 2013
>Last-Modified:  Wed Jul 03 05:40:00 +0000 2013
>Originator:     Matthew Mondor
>Release:        NetBSD-6 (branch)
>Organization:
>Environment:
System: NetBSD ninja.xisop 6.1_RC3 NetBSD 6.1_RC3 (GENERIC_MM) #2: Mon Apr 22 10:06:12 EDT 2013 root@ninja.xisop:/usr/obj/sys/arch/amd64/compile/GENERIC_MM amd64
Architecture: x86_64
Machine: amd64
>Description:
I am using an en_US.UTF-8 locale for non-superuser accounts, and files
whose names use a French ISO-8859-1[5] encoding cause /bin/sh to
segfault when attempting to auto-complete them.

Using gdb I could see that el_insertstr() is called unconditionally
with the result of ct_decode_string(), which may return NULL, causing
el_insertstr() to segfault.

>How-To-Repeat:

To reproduce:

$ LANG="en_US.UTF-8" /bin/sh
$ cd /tmp/
$ touch $(printf "z\xE9")
$ ls -l z[TAB]

>Fix:

The following diff fixes the problem for me, with the following result:

$ ls z\U+00E9 
z?
$ 

Index: lib/libedit/chared.c
===================================================================
RCS file: /data/rsync/netbsd-cvs/src/lib/libedit/chared.c,v
retrieving revision 1.36
diff -u -r1.36 chared.c
--- lib/libedit/chared.c	23 Oct 2011 17:37:55 -0000	1.36
+++ lib/libedit/chared.c	1 Jul 2013 20:18:23 -0000
@@ -612,6 +612,10 @@
 {
 	size_t len;

+	/* String may be NULL, as in the case of a character decoding error
+	 */
+	if (s == NULL)
+		return -1;
 	if ((len = Strlen(s)) == 0)
 		return -1;
 	if (el->el_line.lastchar + len >= el->el_line.limit) {

>Audit-Trail:
From: Matthew Mondor <mm_lists@pulsar-zone.net>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/47983: libedit segfault at character decoding error (sh
 autocomplete)
Date: Wed, 3 Jul 2013 01:39:12 -0400

 On Mon,  1 Jul 2013 22:05:00 +0000 (UTC)
 Matthew Mondor <mm_lists@pulsar-zone.net> wrote:

 > >Fix:
 >
 > The following diff fixes the problem for me, with the following result:
 >
 > $ ls z\U+00E9
 > z?
 > $
 >
 > Index: lib/libedit/chared.c
 > ===================================================================
 > RCS file: /data/rsync/netbsd-cvs/src/lib/libedit/chared.c,v
 > retrieving revision 1.36
 > diff -u -r1.36 chared.c
 > --- lib/libedit/chared.c	23 Oct 2011 17:37:55 -0000	1.36
 > +++ lib/libedit/chared.c	1 Jul 2013 20:18:23 -0000
 > @@ -612,6 +612,10 @@
 >  {
 >  	size_t len;
 >
 > +	/* String may be NULL, as in the case of a character decoding error
 > +	 */
 > +	if (s == NULL)
 > +		return -1;
 >  	if ((len = Strlen(s)) == 0)
 >  		return -1;
 >  	if (el->el_line.lastchar + len >= el->el_line.limit) {

 Actually, this doesn't work as well as intended.  Interestingly, when
 testing with /rescue/sh the patch works fine, but with /bin/sh it
 results in no characters being inserted (although at least it no
 longer crashes).

 I have the impression that the proper way to solve this would be to
 support UTF-8B, a variant of UTF-8 where invalid octets are decoded
 into part of the UTF-16 surrogate range (D800–DBFF, DC00–DFFF), such
 as DC80-DCFF.  The encoder would in turn convert that special range
 back to the original octets.  This would allow non-destructive
 decoding+encoding cycles and prevent fatal decoding errors, providing a
 more transparent and reliable interface.
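
To make the UTF-8B idea concrete, here is a minimal sketch, not libedit or
libc code.  The names utf8_decode_one(), utf8b_decode() and utf8b_encode()
are hypothetical, and uint32_t code points are used instead of wchar_t so
that the example does not depend on the locale.  It assumes the DC80-DCFF
mapping described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Strictly decode one UTF-8 sequence from s[0..n).  Returns its length
 * (1 to 4) and stores the code point in *cp, or returns 0 if the octets
 * are invalid (bad lead octet, truncated or overlong sequence, encoded
 * surrogate, or value above U+10FFFF).
 */
static size_t
utf8_decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
	uint32_t c, min;
	size_t len, i;

	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; c = s[0] & 0x07; min = 0x10000;
	} else
		return 0;
	if (len > n)
		return 0;
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cp = c;
	return len;
}

/* Decode n octets; each invalid octet becomes U+DC80..U+DCFF.  Returns
 * the number of code points stored, or (size_t)-1 if dst is too small. */
size_t
utf8b_decode(const unsigned char *src, size_t n, uint32_t *dst, size_t cap)
{
	size_t i = 0, out = 0;

	assert(src != NULL && dst != NULL);
	while (i < n) {
		uint32_t cp;
		size_t len = utf8_decode_one(src + i, n - i, &cp);

		if (len == 0) {		/* invalid octet, always >= 0x80 here */
			cp = 0xDC00 + src[i];
			len = 1;
		}
		if (out == cap)
			return (size_t)-1;
		dst[out++] = cp;
		i += len;
	}
	return out;
}

/* Encode n code points; U+DC80..U+DCFF turn back into the raw octets.
 * Returns the number of octets stored, or (size_t)-1 on error. */
size_t
utf8b_encode(const uint32_t *src, size_t n, unsigned char *dst, size_t cap)
{
	size_t i, k, len, out = 0;
	unsigned char t[4];

	assert(src != NULL && dst != NULL);
	for (i = 0; i < n; i++) {
		uint32_t c = src[i];

		if (c >= 0xDC80 && c <= 0xDCFF) {
			t[0] = (unsigned char)(c - 0xDC00); len = 1;
		} else if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF) {
			return (size_t)-1;	/* not representable */
		} else if (c < 0x80) {
			t[0] = (unsigned char)c; len = 1;
		} else if (c < 0x800) {
			t[0] = (unsigned char)(0xC0 | (c >> 6));
			t[1] = (unsigned char)(0x80 | (c & 0x3F)); len = 2;
		} else if (c < 0x10000) {
			t[0] = (unsigned char)(0xE0 | (c >> 12));
			t[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
			t[2] = (unsigned char)(0x80 | (c & 0x3F)); len = 3;
		} else {
			t[0] = (unsigned char)(0xF0 | (c >> 18));
			t[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
			t[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
			t[3] = (unsigned char)(0x80 | (c & 0x3F)); len = 4;
		}
		if (cap - out < len)
			return (size_t)-1;
		for (k = 0; k < len; k++)
			dst[out++] = t[k];
	}
	return out;
}
```

With this sketch the file name from the report, the two octets 'z' 0xE9,
decodes to U+007A U+DCE9 and encodes back to the same two octets.  Because
the strict decoder rejects encoded surrogates, input that already contains
such octets is escaped octet by octet as well, so the cycle stays lossless.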

 As a quicker solution, I'm tempted to convert invalid UTF-8 sequence
 octets to wchar_t implicitly by assuming they are LATIN-*.
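
A rough sketch of that quicker approach, using only standard mbrtowc().
The function name decode_latin_fallback() is hypothetical.  It assumes
that wchar_t holds Unicode code points and treats invalid octets as
ISO-8859-1 specifically, whose octets map directly to U+0080..U+00FF.

```c
#include <assert.h>
#include <locale.h>	/* the caller selects the locale with setlocale() */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Decode the NUL-terminated multibyte string src into dst (cap wide
 * characters including the terminator) using the current LC_CTYPE.
 * Octets that mbrtowc() rejects are taken as ISO-8859-1 instead of
 * failing.  Returns the number of wide characters stored, or
 * (size_t)-1 if dst is too small.
 */
size_t
decode_latin_fallback(const char *src, wchar_t *dst, size_t cap)
{
	mbstate_t st;
	size_t n, i = 0, out = 0;

	assert(src != NULL && dst != NULL);
	if (cap == 0)
		return (size_t)-1;
	n = strlen(src);
	memset(&st, 0, sizeof(st));
	while (i < n) {
		wchar_t wc;
		size_t r = mbrtowc(&wc, src + i, n - i, &st);

		if (r == (size_t)-1 || r == (size_t)-2) {
			/* Invalid or truncated sequence: take one octet as is. */
			wc = (wchar_t)(unsigned char)src[i];
			r = 1;
			/* The conversion state is undefined after an error. */
			memset(&st, 0, sizeof(st));
		}
		if (out + 1 >= cap)
			return (size_t)-1;
		dst[out++] = wc;
		i += r;
	}
	dst[out] = L'\0';
	return out;
}
```

Unlike UTF-8B this is lossy: once decoded, an octet that was coerced from
0xE9 can no longer be told apart from a properly encoded U+00E9.
ISO-8859-15 also differs from ISO-8859-1 at a handful of positions (0xA4
is the euro sign, for example), which this sketch ignores.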

 It also remains to be decided whether that should be done in libedit
 or in the C library...  Being able to error on invalid sequences at
 decoding time can be considered a feature, but it should ideally not
 be the only option.  Unfortunately, I think that the current
 wchar-related interface does not allow passing a flag for such an
 option?  Perhaps the flag can be part of the locale definition
 though...

 Another possibility would be to provide a separate function to decode
 strings from char to wchar_t using UTF-8B or implicit LATIN-* coercion,
 and have libedit call it if ct_decode_string() returns NULL.  After
 all, it's input routines which commonly have to deal with potentially
 invalid octet sequences, and that's what libedit does...
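
The call structure could look roughly like the sketch below.  All three
functions are hypothetical stand-ins and not libedit code: strict_decode()
plays the role of a decoder that returns NULL on error (here it simply
rejects any octet >= 0x80), and lenient_decode() plays the role of the
additional function, here a plain octet-to-wchar_t coercion.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Stand-in strict decoder: returns NULL on any octet >= 0x80 or if dst
 * (cap wide characters including the terminator) is too small. */
wchar_t *
strict_decode(const char *src, wchar_t *dst, size_t cap)
{
	size_t i;

	assert(src != NULL && dst != NULL);
	if (cap == 0)
		return NULL;
	for (i = 0; src[i] != '\0'; i++) {
		if (i + 1 >= cap || (unsigned char)src[i] >= 0x80)
			return NULL;
		dst[i] = (wchar_t)src[i];
	}
	dst[i] = L'\0';
	return dst;
}

/* Stand-in lenient decoder: coerces every octet to the wide character
 * with the same value.  Returns NULL only if dst is too small. */
wchar_t *
lenient_decode(const char *src, wchar_t *dst, size_t cap)
{
	size_t i;

	assert(src != NULL && dst != NULL);
	if (cap == 0)
		return NULL;
	for (i = 0; src[i] != '\0'; i++) {
		if (i + 1 >= cap)
			return NULL;
		dst[i] = (wchar_t)(unsigned char)src[i];
	}
	dst[i] = L'\0';
	return dst;
}

/* Try the strict decoder first and fall back to the lenient one, so
 * the caller only sees NULL when both fail. */
wchar_t *
decode_with_fallback(const char *src, wchar_t *dst, size_t cap)
{
	wchar_t *r = strict_decode(src, dst, cap);

	if (r == NULL)
		r = lenient_decode(src, dst, cap);
	return r;
}
```

The caller still has to check for NULL before inserting the result, since
the fallback can fail too (in this sketch, when the buffer is too small).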
 --
 Matt
