NetBSD Problem Report #38108

From cheusov@tut.by  Tue Feb 26 20:28:28 2008
Return-Path: <cheusov@tut.by>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 0276963B853
	for <gnats-bugs@gnats.netbsd.org>; Tue, 26 Feb 2008 20:28:27 +0000 (UTC)
Message-Id: <s93ve4bjwda.fsf@chen.chizhovka.net>
Date: Tue, 26 Feb 2008 22:28:17 +0200
From: cheusov@tut.by
Reply-To:
To: gnats-bugs@gnats.NetBSD.org
Subject: single regexp implementation for NetBSD base system
X-Send-Pr-Version: 3.95

>Number:         38108
>Category:       bin
>Synopsis:       single regexp implementation for NetBSD base system
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    bin-bug-people
>State:          suspended
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Tue Feb 26 20:30:00 +0000 2008
>Closed-Date:    
>Last-Modified:  Sat Sep 19 21:05:57 +0000 2009
>Originator:     cheusov@tut.by
>Release:        NetBSD 4.0_STABLE
>Organization:
>Environment:
System: NetBSD chen.chizhovka.net 4.0_STABLE NetBSD 4.0_STABLE (GENERIC) #2: Tue Dec 25 17:42:38 EET 2007 cheusov@chen.chizhovka.net:/srv/obj/sys/arch/i386/compile/GENERIC i386
Architecture: i386
Machine: i386
>Description:
It would be nice to have AWK, SED and GREP use the same regexp engine
from NetBSD libc. Or at least AWK should be buildable with an external
regexp engine that supports UTF-8. The same goes for usr.bin/grep and sed.

SUN did this for their AWK years ago, see wip/heirloom-awk.
I did this for MAWK (wip/mawk-uxre) too.

I think the following function can convert an AWK regexp to an ERE.

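/*
 * Convert an AWK regexp to a POSIX ERE in place.  The result is never
 * longer than the input, so no extra buffer is needed.
 */
void prepare_regexp (char *regexp)
{
   int bs = 0;            /* 1 if the previous character was a backslash */
   char *tail = regexp;   /* write position, never ahead of the read position */
   char ch;

   while (ch = *regexp++, ch != 0){
      if (bs){
         /* expand the C-like escapes that AWK knows but ERE does not */
         switch (ch){
         case 'n': *tail++ = '\n';   break;
         case 't': *tail++ = '\t';   break;
         case 'f': *tail++ = '\f';   break;
         case 'b': *tail++ = '\b';   break;
         case 'r': *tail++ = '\r';   break;
         case 'a': *tail++ = '\07';  break;
         case 'v': *tail++ = '\013'; break;
         /* anything else is passed to the ERE engine unchanged */
         default:  *tail++ = '\\'; *tail++ = ch;
         }

         bs = 0;
      }else{
         if (ch == '\\'){
            bs = 1;
         }else{
            *tail++ = ch;
         }
      }
   }

   /* keep a trailing backslash so that regcomp(3) can reject it */
   if (bs)
      *tail++ = '\\';

   *tail = 0;
}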
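
To illustrate how such a wrapper could sit in front of the libc engine,
here is a minimal sketch, assuming regcomp(3) with REG_EXTENDED is the
target. It carries its own copy of prepare_regexp() (with a check that
preserves a trailing backslash), and awk_regexp_matches() is a
hypothetical helper name used only for this example, not part of nawk
or any NetBSD tool.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Copy of prepare_regexp() from this report: converts an AWK regexp to
 * an ERE in place.  The trailing-backslash check is an addition. */
static void prepare_regexp(char *regexp)
{
    int bs = 0;
    char *tail = regexp;
    char ch;

    while ((ch = *regexp++) != 0) {
        if (bs) {
            switch (ch) {
            case 'n': *tail++ = '\n';   break;
            case 't': *tail++ = '\t';   break;
            case 'f': *tail++ = '\f';   break;
            case 'b': *tail++ = '\b';   break;
            case 'r': *tail++ = '\r';   break;
            case 'a': *tail++ = '\07';  break;
            case 'v': *tail++ = '\013'; break;
            default:  *tail++ = '\\'; *tail++ = ch;
            }
            bs = 0;
        } else if (ch == '\\') {
            bs = 1;
        } else {
            *tail++ = ch;
        }
    }
    if (bs)
        *tail++ = '\\';
    *tail = 0;
}

/* Hypothetical helper: returns 1 if text matches the AWK regexp,
 * 0 if it does not, -1 on allocation or regex errors. */
int awk_regexp_matches(const char *awk_re, const char *text)
{
    regex_t re;
    int rc;
    size_t len = strlen(awk_re) + 1;
    char *ere = malloc(len);

    if (ere == NULL)
        return -1;
    memcpy(ere, awk_re, len);       /* work on a private, writable copy */
    prepare_regexp(ere);            /* AWK regexp -> ERE */

    rc = regcomp(&re, ere, REG_EXTENDED | REG_NOSUB);
    free(ere);                      /* regcomp keeps its own compiled form */
    if (rc != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : (rc == REG_NOMATCH ? 0 : -1);
}
```

With this sketch, awk_regexp_matches("a\\tb", "a\tb") returns 1: the
two-character sequence backslash-t is turned into a real tab before the
pattern reaches regcomp(3).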

usr.bin/grep can also be adapted to use BRE/ERE without hacks for -f.

>How-To-Repeat:

>Fix:

Unknown
>Release-Note:

>Audit-Trail:
From: "Greg A. Woods; Planix, Inc." <woods@planix.ca>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/38108: single regexp implementation for NetBSD base system
Date: Tue, 26 Feb 2008 15:55:57 -0500

 On 26-Feb-08, at 3:30 PM, cheusov@tut.by wrote:

 > It would be nice to have AWK, SED and GREP use the same regexp engine
 > from NetBSD libc. Or at least AWK should be buildable with an external
 > regexp engine that supports UTF-8. The same goes for usr.bin/grep and sed.

 Ideally I'd like to be able to replace lib/libc/regex with Henry  
 Spencer's latest regex implementation as it is found, for example, in  
 the TCL sources; and then of course have this new implementation be  
 used universally in all the common RE-capable tools on the system.

 > SUN did this for their AWK years ago, see wip/heirloom-awk.

 NetBSD uses the one true version of AWK from its author and current  
 maintainer.  See the doc/3RDPARTY entry for "nawk".

 Beware though that AWK as a language definition includes much, if not  
 all, of the RE syntax and semantics too and so arbitrarily switching  
 to a different RE implementation in the AWK interpreter is not  
 necessarily a good thing.  It would, for example, lead to the  
 possibility of many common portable scripts, including those used on  
 NetBSD through pkgsrc, to fail in strange and mysterious ways.   
 Sometimes it really is good to have a given tool provide its own  
 standardized version of a feature.

 -- 
 					Greg A. Woods; Planix, Inc.
 					<woods@planix.ca>



From: Aleksey Cheusov <cheusov@tut.by>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/38108: single regexp implementation for NetBSD base system
Date: Wed, 27 Feb 2008 00:33:30 +0200

  >> It would be nice to have AWK, SED and GREP use the same regexp engine
  >> from NetBSD libc. Or at least AWK should be buildable with an external
  >> regexp engine that supports UTF-8. The same goes for usr.bin/grep and sed.

 >  Ideally I'd like to be able to replace lib/libc/regex with Henry  
 >  Spencer's latest regex implementation as it is found, for example, in  
 >  the TCL sources; and then of course have this new implementation be  
 >  used universally in all the common RE-capable tools on the system.
 I don't know exactly what is included in TCL.
 But I've packaged a patched version of Henry Spencer's regexp engine
 from http://arglist.com/regex/.

 See devel/librxspencer package.

  >> SUN did this for their AWK years ago, see wip/heirloom-awk.
 >  
 >  NetBSD uses the one true version of AWK from its author and current  
 >  maintainer.  See the doc/3RDPARTY entry for "nawk".
 I know this. Note that wip/heirloom-awk, open sourced
 by Caldera and SUN years ago, is based on the same original
 nawk sources. Many years ago they separated the regexp engine
 into a library, which awk, grep and sed all use.
 See the wip/heirloom-grep, wip/heirloom-sed and wip/libuxre packages.
 libuxre is POSIX compatible and UTF-8 aware.

 >  Beware though that AWK as a language definition includes much, if not  
 >  all, of the RE syntax and semantics too and so arbitrarily switching  
 >  to a different RE implementation in the AWK interpreter is not  
 >  necessarily a good thing.
 According to SUS, AWK's regexps should conform to ERE, which is also
 defined in SUS. There are some exceptions, and the wrapper function I
 provided does everything needed. Additional checks are welcome, of course.

 >  It would, for example, lead to the possibility of many common
 >  portable scripts, including those used on NetBSD through pkgsrc, to
 >  fail in strange and mysterious ways.
 I don't think so, provided that these scripts are really "portable".
 A good example is curly braces, which, according to SUS, are special
 characters ( {n,m} notation ) and MUST NOT be used as ordinary
 characters in really portable scripts.
 Anyway, _unconditional_ inclusion of nawk in the pkgsrc bootstrap
 was a mistake IMHO ;) I guess Solaris's native /usr/xpg4/bin/awk is
 good enough. Also see /usr/pkg/heirloom/bin/posix2001/nawk
 from the wip/heirloom-awk package, which supports the _standard_ {n,m}
 notation while NetBSD's nawk doesn't.
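
 As a sketch of what the {n,m} notation means to a POSIX ERE engine,
 assuming regcomp(3) from libc is called with REG_EXTENDED
 (ere_matches() is a hypothetical helper name, not an existing API):

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text matches the ERE pattern,
 * 0 if it does not, -1 if the pattern fails to compile or on error. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : (rc == REG_NOMATCH ? 0 : -1);
}
```

 Here ere_matches("^a{2,3}$", "aa") returns 1 and
 ere_matches("^a{2,3}$", "aaaa") returns 0, because the braces are read
 as an interval expression. A script that relies on braces being
 ordinary characters behaves differently.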

 Also note that the wrapper function I provided leaves \x alone if x
 is not n, r, t etc., so it still matches a plain x, i.e. it keeps awk's
 regexps (backslashed characters) backward compatible with nawk and
 many other awk implementations.

 >  Sometimes it really is good to have a given tool provide its own
 >  standardized version of a feature.
 oawk is already dead. It is 2008. POSIX has existed for many years...

 -- 
 Best regards, Aleksey Cheusov.

State-Changed-From-To: open->suspended
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 19 Sep 2009 21:05:57 +0000
State-Changed-Why:
Unifying all the regexp code is probably not desirable, but in any event
needs to wait on developments regarding unicode regexps.


>Unformatted:
