NetBSD Problem Report #40387

From www@NetBSD.org  Tue Jan 13 14:08:27 2009
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 15A7463BF57
	for <gnats-bugs@gnats.netbsd.org>; Tue, 13 Jan 2009 14:08:27 +0000 (UTC)
Message-Id: <20090113140817.BC7F963BF61@narn.NetBSD.org>
Date: Tue, 13 Jan 2009 14:08:17 +0000 (UTC)
From: netbsd-bugreport@mightyreason.com
Reply-To: netbsd-bugreport@mightyreason.com
To: gnats-bugs@NetBSD.org
Subject: Serious bug in regular expression library (regex) affected sed
X-Send-Pr-Version: www-1.0

>Number:         40387
>Category:       misc
>Synopsis:       Serious bug in regular expression library (regex) affected sed
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    misc-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jan 13 14:10:02 +0000 2009
>Originator:     Chris Kuklewicz
>Release:        4.0 and 4.99.59
>Organization:
Not Applicable
>Environment:
I have no output to paste here
>Description:
I have tracked the source of a bug originally seen on OS X back to the same bug in FreeBSD and NetBSD.

The bug is in the libc implementation of regular expressions, used via the API in "regex.h" specifically regexec.

Programs which use this library are affected, including "sed" which I will use to illustrate the bug below.

The bug effects both the whole text matched and the choice of text reported as matching parenthesized subexpressions.  The bug violates the leftmost longest matching rule for POSIX regular expressions.  The bug violates the principle that the order of subexpressions around the "|" operator should not affect the whole text matched.

A terminal sessions using three sed commands with extended regular expressions:
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
a[b]
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[ab]
[chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'
[XababaYababaZ] => [X][ab][aba][Y][b][Z]

The first two commands differ only in the order of "()" and "." around the "|" operator.  According to the POSIX specification this must not affect the whole match, which is report by "&" in sed's replacement specification.  The leftmost longest match is "ab" and is correctly reported by the second command.  The first command incorrect does not match starting with the "a" and matches just the "b".

This establishes that the bug violates the POSIX specification and can affect a program that only cares about the whole text matched (ignoring captured parenthesized subexpressions).

The third command shows an internal inconsistency in the output.  The (aba|b|ab)* operator should capture the text matched by the last repetition of the (aba|b|ab) subexpression and report that in \6. In particular \6 should match "aba" just like \3 does.

As shown above the FreeBSD output is [XababaYababaZ] => [X][ab][aba][Y][b][Z]
The correct output should have been  [XababaYababaZ] => [X][ab][aba][Y][abc][Z]

Could there be a way to explain the FreeBSD output? Consider the following questions:

If the last repetition of (aba|b|ab)* matched \6 to "b" then how is the text between that "b" skipped so that "(Z)" matched \7 against the final "Z" in the text?
If the "b" reported by \6 is the last "b" in the text then there is no part of the pattern that can match the next "a".
If the "b" reported by \6 is the next-to-last "b" in the text then there is no part of the pattern that can match the preceding "a".
Checking the indices returned by regex shows that \6 was matched against the last "b" in the text.
Thus reporting \6 as matching "b" is contradictory, there is no consistent interpretation of what regexec returned.

This establishes that the bug is can violate not only the POSIX specification but also violate any rational and consistent alternate specification.

I found the first example by using Haskell's QuickCheck to randomly generate test cases.  The second example was found by using Glenn Fowler's "AT&T Labs Research regex(3) regression tests" from http://www.research.att.com/~gsf/testregex/
I "tested" this bug on NetBSB by asking on the #netbsd channel of irc.freenode.net and ASau and sjamaan answered.
I tested FreeBSD 6.3-PRERELEASE thanks to the free shell account at m-net.arbornet.org and FreeBSD 8.0-CURRENT was tested by "methi" on the #freebsd channel at irc.freenode.net

>How-To-Repeat:
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)

If there is no bug then the first two both output "[ab]" and the last output "[XababaYababaZ] => [X][ab][aba][Y][abc][Z]".

Currently the bug causes the first the output "a[b]" and the last to output "[XababaYababaZ] => [X][ab][aba][Y][b][Z]".

>Fix:
I do not have the time to be able to fix the regex.h part of libc.

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.