NetBSD Problem Report #46155

From www@NetBSD.org  Thu Mar  8 22:05:20 2012
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id 9041863E290
	for <gnats-bugs@gnats.NetBSD.org>; Thu,  8 Mar 2012 22:05:20 +0000 (UTC)
Message-Id: <20120308220517.5CE6D63B8C0@www.NetBSD.org>
Date: Thu,  8 Mar 2012 22:05:17 +0000 (UTC)
From: mipnix@yahoo.com
Reply-To: mipnix@yahoo.com
To: gnats-bugs@NetBSD.org
Subject: awk: NetBSD-current record splitting is broken
X-Send-Pr-Version: www-1.0

>Number:         46155
>Category:       bin
>Synopsis:       awk: NetBSD-current record splitting is broken
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    bin-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Mar 08 22:10:01 +0000 2012
>Closed-Date:    Sun Dec 30 08:07:24 +0000 2012
>Last-Modified:  Sun Dec 30 08:07:24 +0000 2012
>Originator:     Miguel Piņeiro Jr
>Release:        NetBSD 5.1 but affected code is from -current
>Organization:
>Environment:
NetBSD 5.1 NetBSD 5.1 (GENERIC) #0: Sun Nov 7 14:39:56 UTC 2010  builds@b6.netbsd.org:/home/builds/ab/netbsd-5-1-RELEASE/i386/201011061943Z-obj/home/builds/ab/netbsd-5-1-RELEASE/src/sys/arch/i386/compile/GENERIC i386

To be a bit more specific about the environment, NetBSD 5.1 is a guest in VMWare Fusion 2.0.2 on an Intel Mac running OS X 10.4.11.
>Description:
Hello, NetBSD community:

NetBSD-current's AWK is fundamentally broken and has been since the attempt to add support for RS regular expressions. As it stands, it cannot be trusted to generate correct records for all valid values of RS. If you're lucky, it'll error out. If you're not, it will silently do the wrong thing.

Note that not only is the regular expression case itself incorrectly implemented, but the once-functioning multiline-record handling (empty RS) is now also broken. Only the case of a single character RS -- no more, no less -- works correctly.

The readrec() changes in the following diff to lib.c are to blame:

http://cvsweb.netbsd.org/bsdweb.cgi/src/external/historical/nawk/dist/lib.c.diff?r1=1.1&r2=1.2&only_with_tag=MAIN

When I stumbled across this issue, while curiously surveying the code of several AWK implementations, I hadn't used NetBSD in over 10 years. Safe to say that I'm a newbie again. Following what seemed to me the most expedient path, I installed the latest release iso and checked out -current's nawk. This is almost certainly not a supported configuration, but I doubt it has affected my analysis.

My testing was conducted using the code in cvs HEAD of src/external/historical/nawk, compiled from that directory with `make MKINFO=no`.

Line numbers herein refer to http://cvsweb.netbsd.org/bsdweb.cgi/src/external/historical/nawk/dist/lib.c?annotate=1.4&only_with_tag=MAIN




CASE 1:  RS IS EMPTY (multiline records)
readrec(): lines 205-210


When RS is empty, readrec() will do nothing but skip leading newlines.  Once those newlines have been read (if indeed they were present), subsequent calls to readrec() result in a practical no-op, getc->ungetc.  nawk will be stuck in place, never reaching the end of the stream, never terminating, unless it implodes.

Also, when RS is empty, the automatic variable rr is never initialized.  It is declared on line 192 and then used in an adjbuf() call on line 268.

Within adjbuf(), the result of 1+rr-buf is eventually passed to realloc.  In Test 1.2 below, this led to realloc() failure.  Should realloc() not lead to abnormal termination of the getc-ungetc stalemate (as in Test 1.4), dereferencing and assigning to rr may.

$ # Test 1.1
$ # Correct behavior with default RS
$ echo foo | ./nbawk-current 1
foo
$ # Test 1.2
$ # Implosion with null RS
$ echo foo | ./nbawk-current 1 RS=

./nbawk-current: out of memory in readrec 3
 input record number 1, file 
 source line number 1
$ # Test 1.3
$ # Correct behavior with default RS
$ echo foo | ./nbawk-current 'END {print "END"}'
END
$ # Test  1.4
$ # Should give a result identical to Test 1.3, but
$ # with null RS, perpetually increased cpu utilization warms the room.
$ echo foo | ./nbawk-current 'END {print "END"}' RS=
^C
$ # End of transcript




CASE 2:  RS IS A SINGLE CHARACTER
readrec(), lines 246-267


The common, single-character RS case performs as expected, but harbors unreachable code.

It's impossible to reach this statement block if **RS and sep are unequal, since that inequality can only arise within a mutually exclusive block (line 206). Therefore, the test at line 256 always succeeds; the break at line 257 is unconditional; and lines 258-265 are unreachable.

This unreachable code was a critical component of multiline record handling. sep and **RS are always equal except when RS is empty, in which case sep is set to newline. This equality was used to distinguish single and multiline modes when their handling was intertwined.




CASE 3:  RS IS MULTICHARACTER (regular expression)
readrec(), lines 211-245.


There are a lot of mistakes in this section.

If adjbuf (line 220) relocates buf, it adjusts rr, but rrr and brr are left pointing to the buffer's previous location. The next call to nematch() would be looking in the wrong place for a string to match.  And, any attempt to break the stream-reading while-loop before the EOF would clobber rr, the pointer to the current character location in the buffer, with the invalid location in rrr (228-232). This would make it impossible to properly null-terminate the record (line 270), and would result in more memory management errors via adjbuf() (line 269).

Further, all of the code within the block following the while-loop is unreachable (lines 240-244).  There are only two conditions which exit the while loop and they are OR'd with each other on line 238.  That test cannot fail, hence the break (line 239) is unconditional.  Fortunately, this code doesn't belong here anyway; it is only required when handling multiline records (consecutive newlines delimiting a record).

These bugs notwithstanding, the entire approach's logic is deeply flawed.

For example, when nematch() finds a match and sets patbeg, line 235, brr = rrr = rr = patbeg, will lead to the next iteration's *rr++ = c, line 224, clobbering the first character of the matching pattern. This makes no sense whatsoever. 

Imagine what's happening when trying to match the regular expression /ab?/ against the text "1ab2". The "1" is read from the stream and null terminated. nematch then fails to match /ab?/ against "1\0".  The "a" is then read in and nematch succeeds in matching /ab?/ against "1a\0". patbeg, rr, and brr all now point to "a". The next iteration reads "b" into rr, clobbering "a", resulting in buf containing "1b\0".  nematch() is then passed brr which now points to "b\0".  Notice that nematch() never sees the original data's sequence of "ab", hence it is incapable of successfully matching "ab" against /ab?/. nematch() then correctly fails to match "1b" against /ab?/.  This failure following an earlier success triggers the ungetc'ing of "b" (which will show up as part of the second record instead of part of a record separator) and breaks the reading while-loop.

The critical problem with that algorithm is that nematch() never sees the original data's sequence of "ab", hence it never even has the opportunity to do the right thing when working with /ab?/; the data contains "1ab2" yet nematch() only sees "1", "1a", and "1b".  That is obviously incorrect.

$ # Test 3.1
$ # Correct behavior with typical, single-char RS
$ printf 1a2aa3aaa4 | ./nbawk-current 1 RS=a
1
2

3


4
$ # Test 3.2
$ # Erroneous behavior with a regular expression
$ # The following's output should be identical to RS=a
$ # There should be empty records.  Why is it matching multiple characters?
$ # Because each successive "a" is overwriting the previous "a" when patbeg
$ # is stored in rr.
$ printf 1a2aa3aaa4 | ./nbawk-current 1 RS='[a]'
1
2
3
4
$ # Test 3.3
$ # Again, erroneous behavior with a regular expression
$ # The following's output should be identical to Test 3.1.
$ # There should be empty records.  Why is it matching multiple characters?
$ # For the same reason as Test 3.2.
$ printf 1a2ab3aba4 | ./nbawk-current 1 RS='[ab]'
1
2
3
4
$ # Test 3.4
$ # Correct result because there is only one successful nematch.
$ # The number which overwrites the "a" is ungetc'd.
$ # Lucky break.
$ printf 1ab2ab3ab4 | ./nbawk-current 1 RS='ab'
1
2
3
4
$ # Test 3.5
$ # Correct behavior when attempting a greedy match
$ # A slightly different type of lucky break.
$ printf 1a2aa3aaa4 | ./nbawk-current 1 RS='aa*'
1
2
3
4
$ # Test 3.6
$ # Erroneous behavior when attempting a greedy match
$ # The following's output should be identical to the preceding example.
$ # Note that the second "a" in "aa" clobbers the first in the buffer, but
$ # since it matches the regular expression, it is not returned to the
$ # stream like each instance of "b", which winds up as part of the
$ # record when it should have been part of the separator.
$ printf 1a2ab3aab4 | ./nbawk-current 1 RS='aa*b*'
1
2
b3
b4
$ # End of transcript

There are more shortcomings than the simple cases shown above.  For example, any correct approach must be capable of backing up multiple characters, not just one.  For example, "abcdefh" ~ /a(bcdefz)*/ would require consuming through "h" to learn that the correct match is just "a".

In my opinion, regular expression support for RS cannot be correctly implemented using any of the matching functions in b.c.

match() finds only the shortest match, which is unacceptable for the case of determining the record separator.

pmatch() attempts a longest match but will accept an empty match in lieu of failure, which is also unacceptable.

nematch() will not accept empty matches and will return the earliest, longest possible match. It seems suitable upon casual inspection, but it's not.

To properly split data into records, we need to know when the automaton is incapable of finding a longer match. The possibility of a longer match exists so long as the stream has not reached the EOF and the current state contains following leaves. The automaton must be fed until one of these conditions is no longer true.

The way that the buffer-oriented nematch() works, we cannot tell when hope should be abandoned.  nematch() will return a true result if it finds a match before reaching the end of the buffer, an outcome characterized by the absence of further state transitions. This is what we're looking for.  However, nematch() will also return true if the buffer's end forces it to settle for a shorter match even though the end of the stream has not been reached and there are possible state transitions available.  Since the caller cannot distinguish the two -- the latter requires further data from the stream, the former does not -- nematch() in its present incarnation is inadequate.

If there was some way to determine the state of the automaton from outside nematch() -- perhaps by extending nematch()'s interface or by modifying the finite automaton's definition to retain state -- then nematch() could be used, although clumsily, to achieve a correct result.  However, I believe that an interface that directly interacts with the stream would be simpler.  The diffs that I've included below are my attempt at such an approach.


IMPLEMENTATION INCOMPATIBILITIES

There's at least two situations in which gawk and mawk disagree on how to handle a regular expression RS.  I've included two diffs, one for each behavior.

gawk: gawk-3.1.8nb1       GNU awk
mawk: mawk-1.3.4.20100625 AWK clone by Mike Brennan
nbawk-current: NetBSD-current's broken AWK implementation.
nbawk-fixed-mawk: NetBSD-current's AWK implementation with my mawk-behavior diff applied.
nbawk-fixed-gawk: NetBSD-current's AWK implementation with my gawk-behavior diff applied.

Within the scope of the tested scenarios, my diffs do not introduce any new area of incompatiblity. Disagreement with gawk or mawk only occurs when it is inevitable because gawk and mawk already disagree with each other. And in those cases, there is agreement with one of the implementations, a third alternative result is not introduced.

In the test results below, "SUCCESS" indicates that all AWK implementations yield identical output, "FAILURE" that they do not.  Failures are followed by each implementation's stdout and stderr.

First, let's take a look at gawk versus NetBSD's current awk implementation. It's not pretty. Incorrect results abound and there are a few instances of memory allocation errors.

SUCCESS: 000: printf 1a2aa3aaa4 | "$awk" 1 RS=a		# Single character RS
FAILURE: 001: printf 1a2aa3aaa4 | "$awk" 1 RS='[a]'		# Regular expression RS
gawk:
--> 1
--> 2
--> 
--> 3
--> 
--> 
--> 4
./nbawk-current:
--> 1
--> 2
--> 3
--> 4
FAILURE: 002: printf 1a2ab3aba4 | "$awk" 1 RS='[ab]'
gawk:
--> 1
--> 2
--> 
--> 3
--> 
--> 
--> 4
./nbawk-current:
--> 1
--> 2
--> 3
--> 4
SUCCESS: 003: printf 1ab2ab3ab4 | "$awk" 1 RS=ab
FAILURE: 004: printf 1a2aa3aaa4 | "$awk" 1 RS='a*'
gawk:
--> 
--> 
--> 
--> 
--> 
--> 
./nbawk-current:
--> 1
--> 2
--> 3
--> 4
SUCCESS: 005: printf 1a2aa3aaa4 | "$awk" 1 RS='aa*'
FAILURE: 006: printf 1a2ab3aab4 | "$awk" 1 RS='aa*b*'
gawk:
--> 1
--> 2
--> 3
--> 4
./nbawk-current:
--> 1
--> 2
--> b3
--> b4
SUCCESS: 007: printf 1abc | "$awk" 1 RS='abcde|b'
SUCCESS: 008: printf 1abcdf2 | "$awk" 1 RS='abcde|b'
FAILURE: 009: printf 1abcdebf2 | "$awk" 1 RS='abcde|b'
gawk:
--> 1
--> 
--> f2
./nbawk-current:
--> 1a
--> cde
--> f2
SUCCESS: 010: printf 1abcdf2 | "$awk" 1 RS='abcde|a'
SUCCESS: 011: printf a1a2a3a | "$awk" 1 RS='^a'		# Anchors
FAILURE: 012: printf aaa1a2a | "$awk" 1 RS='^a'
gawk:
--> 
--> aa1a2a
./nbawk-current:
--> 
--> 
--> 
--> 1a2a
FAILURE: 013: printf a1a2a3a | "$awk" 1 RS='a$'
gawk:
--> a1a2a3
./nbawk-current:
--> 
--> 1
--> 2
--> 3
FAILURE: 014: printf a1a2aaa | "$awk" 1 RS='a$'
gawk:
--> a1a2aa
./nbawk-current:
--> 
--> 1
--> 2
FAILURE: 015: jot -s a 10000 | "$awk" 'NR>1' RS='999[0-9]' 	# Trigger adjbuf relocation
gawk:
--> a
--> a
--> a
--> a
--> a
--> a
--> a
--> a
--> a
--> a10000
--> 
./nbawk-current:
FAILURE: 016: printf foo | "$awk" 1 RS=			# Empty RS (multiline records)
gawk:
--> foo
./nbawk-current:
--> 
--> 
--> 
--> 
--> ./nbawk-current: out of memory in readrec 3
-->  input record number 4, file 
-->  source line number 1
FAILURE: 017: printf '\n\n\nr1f1\nr1f2\n\nr2f1\nr2f2\n\n\n' | "$awk" '{$1=$1}1' RS= OFS=:
gawk:
--> r1f1:r1f2
--> r2f1:r2f2
./nbawk-current:
--> 
--> 
--> 
--> 
--> ./nbawk-current: out of memory in readrec 3
-->  input record number 4, file 
-->  source line number 1




And now, a comparison of gawk, nbawk-fixed-gawk, mawk, nbawk-fixed-mawk.
Of the 18 scenarios tested, disagreement arises in two cases, 004 and 012.


SUCCESS: 000: printf 1a2aa3aaa4 | "$awk" 1 RS=a		# Single character RS
SUCCESS: 001: printf 1a2aa3aaa4 | "$awk" 1 RS='[a]'		# Regular expression RS
SUCCESS: 002: printf 1a2ab3aba4 | "$awk" 1 RS='[ab]'
SUCCESS: 003: printf 1ab2ab3ab4 | "$awk" 1 RS=ab
FAILURE: 004: printf 1a2aa3aaa4 | "$awk" 1 RS='a*'
gawk:
--> 
--> 
--> 
--> 
--> 
--> 
./nbawk-fixed-gawk:
--> 1
--> 2
--> 3
--> 4
mawk:
--> 1
--> 2
--> 3
--> 4
./nbawk-fixed-mawk:
--> 1
--> 2
--> 3
--> 4
SUCCESS: 005: printf 1a2aa3aaa4 | "$awk" 1 RS='aa*'
SUCCESS: 006: printf 1a2ab3aab4 | "$awk" 1 RS='aa*b*'
SUCCESS: 007: printf 1abc | "$awk" 1 RS='abcde|b'
SUCCESS: 008: printf 1abcdf2 | "$awk" 1 RS='abcde|b'
SUCCESS: 009: printf 1abcdebf2 | "$awk" 1 RS='abcde|b'
SUCCESS: 010: printf 1abcdf2 | "$awk" 1 RS='abcde|a'
SUCCESS: 011: printf a1a2a3a | "$awk" 1 RS='^a'		# Anchors
FAILURE: 012: printf aaa1a2a | "$awk" 1 RS='^a'
gawk:
--> 
--> aa1a2a
./nbawk-fixed-gawk:
--> 
--> aa1a2a
mawk:
--> 
--> 
--> 
--> 1a2a
./nbawk-fixed-mawk:
--> 
--> 
--> 
--> 1a2a
SUCCESS: 013: printf a1a2a3a | "$awk" 1 RS='a$'
SUCCESS: 014: printf a1a2aaa | "$awk" 1 RS='a$'
SUCCESS: 015: jot -s a 10000 | "$awk" 'NR>1' RS='999[0-9]' 	# Trigger adjbuf relocation
SUCCESS: 016: printf foo | "$awk" 1 RS=			# Empty RS (multiline records)
SUCCESS: 017: printf '\n\n\nr1f1\nr1f2\n\nr2f1\nr2f2\n\n\n' | "$awk" '{$1=$1}1' RS= OFS=:


In 004, I don't know what's going on there with gawk. It emits the same number of empty lines as there are possible non-empty, shortest matches to the RS regular expression. Somehow, characters which cannot match RS and should appear in an output record are consumed. This gawk behavior appears to be useless and a bug. Neither diff behaves this way.

Case 012 is more interesting. Here we see that gawk and mawk behave differently when RS is anchored. RS=^a in gawk matches only once and only at the very beginning of a stream. In mawk, RS=^a can match repeated instances of "a" at the beginning of the stream, yielding an empty record for each match.

It seems gawk will only anchor at the start of a stream, while in mawk each record search starting point is anchorable. This difference is only evident when there are multiple consecutive instances of RS at the start of a stream. Otherwise, they behave identically.

The gawk behavior seems more sensible and less surprising. gawk is also the more widely-deployed implementation.

The mawk behavior has the advantage of being simpler to implement -- since the automaton's initial state does not need to be manipulated to enable/disable the hat (^) anchor.

It's likely that the decision of which implementation to emulate will prove to be of little importance, since the utility of a hat-anchored regular expression record separator is limited at best. Personally, I have a slight preference for gawk's behavior.

Before I conclude, I'd like to take a moment to thank Russ Cox. His exposition of how finite automatons are used to implement regular expressions in "Regular Expression Matching Can Be Simple And Fast" at http://swtch.com/~rsc/regexp/regexp1.html helped me, someone with no formal schooling in computer science, get through nawk/dist/b.c a lot faster than I otherwise would have. Thank you, Russ.

And, finally, my diffs and test script. Hopefully they'll be of some use. First the simpler mawk-like diff, then the gawk-like diff, and then the sh script and testcases.


Kind regards to all,
Miguel Piņeiro Jr.
>How-To-Repeat:
See description above.
>Fix:
MAWK-LIKE DIFF


Index: b.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/b.c,v
retrieving revision 1.2
diff -u -p -r1.2 b.c
--- b.c	26 Aug 2010 14:55:19 -0000	1.2
+++ b.c	4 Mar 2012 22:11:38 -0000
@@ -624,6 +624,96 @@ int nematch(fa *f, const char *p0)	/* no
 	return (0);
 }

+
+/*
+ * NAME
+ *     fnematch
+ *
+ * DESCRIPTION
+ *     A stream-fed version of nematch which transfers characters to a
+ *     null-terminated buffer. All characters up to and including the last
+ *     character of the matching text or EOF are placed in the buffer. If
+ *     a match is found, patbeg and patlen are set appropriately.
+ *
+ * RETURN VALUES
+ *     0    No match found.
+ *     1    Match found.
+ */  
+
+int fnematch(fa *pfa, FILE *f, uschar **pbuf, int *pbufsize, int quantum)	
+{
+	uschar *buf = *pbuf;
+	int bufsize = *pbufsize;
+	int c, i, j, k, ns, s;
+
+	s = pfa->initstat;
+	assert(s < pfa->state_count);
+	patlen = 0;
+
+	/*
+	 * All indices relative to buf.
+	 * i <= j <= k <= bufsize
+	 *
+	 * i: origin of active substring
+	 * j: current character
+	 * k: destination of next getc()
+	 */
+	i = -1, k = 0;
+        do {
+		j = i++;
+		do {
+			if (++j == k) {
+				if (k == bufsize)
+					if (!adjbuf(&buf, &bufsize, bufsize+1, quantum, 0, "fnematch"))
+						FATAL("stream '%.30s...' too long", buf);	
+				buf[k++] = (c = getc(f)) != EOF ? c : 0;
+			}
+			c = buf[j];
+			/* assert(c < NCHARS); */
+
+			if ((ns = pfa->gototab[s][c]) != 0)
+				s = ns;
+			else
+				s = cgoto(pfa, s, c);
+			assert(s < pfa->state_count);
+
+			if (pfa->out[s]) {	/* final state */
+				patlen = j - i + 1;
+				if (c == 0)	/* don't count $ */
+					patlen--;
+			}
+		} while (buf[j] && s != 1);
+		s = 2;
+	} while (buf[i] && !patlen);
+
+	/* adjbuf() may have relocated a resized buffer. Inform the world. */
+	*pbuf = buf;
+	*pbufsize = bufsize;
+
+	if (patlen) {
+		patbeg = buf + i;
+		/*
+		 * Under no circumstances is the last character fed to
+		 * the automaton part of the match. It is EOF's nullbyte,
+		 * or it sent the automaton into a state with no further
+		 * transitions available (s==1), or both. Room for a
+		 * terminating nullbyte is guaranteed.
+		 *
+		 * ungetc any chars after the end of matching text
+		 * (except for EOF's nullbyte, if present) and null
+		 * terminate the buffer.
+		 */
+		do
+			if (buf[--k] && ungetc(buf[k], f) == EOF)
+				FATAL("unable to ungetc '%c'", buf[k]);	
+		while (k > i + patlen);
+		buf[k] = 0;
+		return 1;
+	}
+	else
+		return 0;
+}
+
 Node *reparse(const char *p)	/* parses regular expression pointed to by p */
 {			/* uses relex() to scan regular expression */
 	Node *np;
Index: lib.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/lib.c,v
retrieving revision 1.4
diff -u -p -r1.4 lib.c
--- lib.c	20 Jan 2011 21:23:11 -0000	1.4
+++ lib.c	4 Mar 2012 22:11:38 -0000
@@ -188,7 +188,7 @@ void nextfile(void)

 int readrec(uschar **pbuf, int *pbufsize, FILE *inf)	/* read one record into buf */
 {
-	int sep, c;
+	int sep, c, isrec;
 	uschar *rr, *buf = *pbuf;
 	int bufsize = *pbufsize;
 	size_t len;
@@ -202,48 +202,18 @@ int readrec(uschar **pbuf, int *pbufsize
 			FATAL("field separator %.10s... is too long", *FS);
 		memcpy(inputFS, *FS, len_inputFS);
 	}
-	if ((sep = **RS) == 0) {
-		sep = '\n';
-		while ((c=getc(inf)) == '\n' && c != EOF)	/* skip leading \n's */
-			;
-		if (c != EOF)
-			ungetc(c, inf);
-	} else if ((*RS)[1]) {
+	if (**RS && (*RS)[1]) {
 		fa *pfa = makedfa(*RS, 1);
-		int tempstat = pfa->initstat;
-		char *brr = buf;
-		char *rrr = NULL;
-		int x;
-		for (rr = buf; ; ) {
-			while ((c = getc(inf)) != EOF) {
-				if (rr-buf+3 > bufsize)
-					if (!adjbuf(&buf, &bufsize, 3+rr-buf,
-					    recsize, &rr, "readrec 2"))
-						FATAL("input record `%.30s...'"
-						    " too long", buf);
-				*rr++ = c;
-				*rr = '\0';
-				if (!(x = nematch(pfa, brr))) {
-					pfa->initstat = tempstat;
-					if (rrr) {
-						rr = rrr;
-						ungetc(c, inf);
-						break;
-					}
-				} else {
-					pfa->initstat = 2;
-					brr = rrr = rr = patbeg;
-				}
-			}
-			if (rrr || c == EOF)
-				break;
-			if ((c = getc(inf)) == '\n' || c == EOF)
-				/* 2 in a row */
-				break;
-			*rr++ = '\n';
-			*rr++ = c;
-		}
+		if (fnematch(pfa, inf, &buf, &bufsize, recsize))
+			*patbeg = 0;
 	} else {
+		if ((sep = **RS) == 0) {
+			sep = '\n';
+			while ((c=getc(inf)) == '\n' && c != EOF)	/* skip leading \n's */
+				;
+			if (c != EOF)
+				ungetc(c, inf);
+		}
 		for (rr = buf; ; ) {
 			for (; (c=getc(inf)) != sep && c != EOF; ) {
 				if (rr-buf+1 > bufsize)
@@ -264,14 +234,15 @@ int readrec(uschar **pbuf, int *pbufsize
 			*rr++ = '\n';
 			*rr++ = c;
 		}
+		if (!adjbuf(&buf, &bufsize, 1+rr-buf, recsize, &rr, "readrec 3"))
+			FATAL("input record `%.30s...' too long", buf);
+		*rr = 0;
 	}
-	if (!adjbuf(&buf, &bufsize, 1+rr-buf, recsize, &rr, "readrec 3"))
-		FATAL("input record `%.30s...' too long", buf);
-	*rr = 0;
-	   dprintf( ("readrec saw <%s>, returns %d\n", buf, c == EOF && rr == buf ? 0 : 1) );
 	*pbuf = buf;
 	*pbufsize = bufsize;
-	return c == EOF && rr == buf ? 0 : 1;
+	isrec = *buf || !feof(inf);
+	   dprintf( ("readrec saw <%s>, returns %d\n", buf, isrec) );
+	return isrec;
 }

 char *getargv(int n)	/* get ARGV[n] */
Index: proto.h
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/proto.h,v
retrieving revision 1.5
diff -u -p -r1.5 proto.h
--- proto.h	16 Sep 2011 16:09:46 -0000	1.5
+++ proto.h	4 Mar 2012 22:11:38 -0000
@@ -54,6 +54,7 @@ extern	int	member(int, const char *);
 extern	int	match(fa *, const char *);
 extern	int	pmatch(fa *, const char *);
 extern	int	nematch(fa *, const char *);
+extern	int	fnematch(fa *, FILE *, uschar **, int *, int);
 extern	Node	*reparse(const char *);
 extern	Node	*regexp(void);
 extern	Node	*primary(void);




GAWK-LIKE DIFF


Index: b.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/b.c,v
retrieving revision 1.2
diff -u -p -r1.2 b.c
--- b.c	26 Aug 2010 14:55:19 -0000	1.2
+++ b.c	6 Mar 2012 17:36:15 -0000
@@ -624,6 +624,96 @@ int nematch(fa *f, const char *p0)	/* no
 	return (0);
 }

+
+/*
+ * NAME
+ *     fnematch
+ *
+ * DESCRIPTION
+ *     A stream-fed version of nematch which transfers characters to a
+ *     null-terminated buffer. All characters up to and including the last
+ *     character of the matching text or EOF are placed in the buffer. If
+ *     a match is found, patbeg and patlen are set appropriately.
+ *
+ * RETURN VALUES
+ *     0    No match found.
+ *     1    Match found.
+ */  
+
+int fnematch(fa *pfa, FILE *f, uschar **pbuf, int *pbufsize, int quantum)	
+{
+	uschar *buf = *pbuf;
+	int bufsize = *pbufsize;
+	int c, i, j, k, ns, s;
+
+	s = pfa->initstat;
+	assert(s < pfa->state_count);
+	patlen = 0;
+
+	/*
+	 * All indices relative to buf.
+	 * i <= j <= k <= bufsize
+	 *
+	 * i: origin of active substring
+	 * j: current character
+	 * k: destination of next getc()
+	 */
+	i = -1, k = 0;
+        do {
+		j = i++;
+		do {
+			if (++j == k) {
+				if (k == bufsize)
+					if (!adjbuf(&buf, &bufsize, bufsize+1, quantum, 0, "fnematch"))
+						FATAL("stream '%.30s...' too long", buf);	
+				buf[k++] = (c = getc(f)) != EOF ? c : 0;
+			}
+			c = buf[j];
+			/* assert(c < NCHARS); */
+
+			if ((ns = pfa->gototab[s][c]) != 0)
+				s = ns;
+			else
+				s = cgoto(pfa, s, c);
+			assert(s < pfa->state_count);
+
+			if (pfa->out[s]) {	/* final state */
+				patlen = j - i + 1;
+				if (c == 0)	/* don't count $ */
+					patlen--;
+			}
+		} while (buf[j] && s != 1);
+		s = 2;
+	} while (buf[i] && !patlen);
+
+	/* adjbuf() may have relocated a resized buffer. Inform the world. */
+	*pbuf = buf;
+	*pbufsize = bufsize;
+
+	if (patlen) {
+		patbeg = buf + i;
+		/*
+		 * Under no circumstances is the last character fed to
+		 * the automaton part of the match. It is EOF's nullbyte,
+		 * or it sent the automaton into a state with no further
+		 * transitions available (s==1), or both. Room for a
+		 * terminating nullbyte is guaranteed.
+		 *
+		 * ungetc any chars after the end of matching text
+		 * (except for EOF's nullbyte, if present) and null
+		 * terminate the buffer.
+		 */
+		do
+			if (buf[--k] && ungetc(buf[k], f) == EOF)
+				FATAL("unable to ungetc '%c'", buf[k]);	
+		while (k > i + patlen);
+		buf[k] = 0;
+		return 1;
+	}
+	else
+		return 0;
+}
+
 Node *reparse(const char *p)	/* parses regular expression pointed to by p */
 {			/* uses relex() to scan regular expression */
 	Node *np;
Index: lib.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/lib.c,v
retrieving revision 1.4
diff -u -p -r1.4 lib.c
--- lib.c	20 Jan 2011 21:23:11 -0000	1.4
+++ lib.c	6 Mar 2012 17:36:15 -0000
@@ -38,6 +38,7 @@ THIS SOFTWARE.

 char	EMPTY[] = { '\0' };
 FILE	*infile	= NULL;
+int	innew;		/* 1 = infile has not been read by readrec */
 char	*file	= EMPTY;
 uschar	*record;
 int	recsize	= RECSIZE;
@@ -104,6 +105,7 @@ void initgetrec(void)
 		argno++;
 	}
 	infile = stdin;		/* no filenames, so use stdin */
+	innew = 1;
 }

 static int firsttime = 1;
@@ -146,9 +148,12 @@ int getrec(uschar **pbuf, int *pbufsize,
 				infile = stdin;
 			else if ((infile = fopen(file, "r")) == NULL)
 				FATAL("can't open file %s", file);
+			innew = 1;
 			setfval(fnrloc, 0.0);
 		}
-		c = readrec(&buf, &bufsize, infile);
+		c = readrec(&buf, &bufsize, infile, innew);
+		if (innew)
+			innew = 0;
 		if (c != 0 || buf[0] != '\0') {	/* normal record */
 			if (isrecord) {
 				if (freeable(fldtab[0]))
@@ -186,9 +191,9 @@ void nextfile(void)
 	argno++;
 }

-int readrec(uschar **pbuf, int *pbufsize, FILE *inf)	/* read one record into buf */
+int readrec(uschar **pbuf, int *pbufsize, FILE *inf, int newflag)	/* read one record into buf */
 {
-	int sep, c;
+	int sep, c, isrec, found, tempstat;
 	uschar *rr, *buf = *pbuf;
 	int bufsize = *pbufsize;
 	size_t len;
@@ -202,48 +207,26 @@ int readrec(uschar **pbuf, int *pbufsize
 			FATAL("field separator %.10s... is too long", *FS);
 		memcpy(inputFS, *FS, len_inputFS);
 	}
-	if ((sep = **RS) == 0) {
-		sep = '\n';
-		while ((c=getc(inf)) == '\n' && c != EOF)	/* skip leading \n's */
-			;
-		if (c != EOF)
-			ungetc(c, inf);
-	} else if ((*RS)[1]) {
+	if (**RS && (*RS)[1]) {
 		fa *pfa = makedfa(*RS, 1);
-		int tempstat = pfa->initstat;
-		char *brr = buf;
-		char *rrr = NULL;
-		int x;
-		for (rr = buf; ; ) {
-			while ((c = getc(inf)) != EOF) {
-				if (rr-buf+3 > bufsize)
-					if (!adjbuf(&buf, &bufsize, 3+rr-buf,
-					    recsize, &rr, "readrec 2"))
-						FATAL("input record `%.30s...'"
-						    " too long", buf);
-				*rr++ = c;
-				*rr = '\0';
-				if (!(x = nematch(pfa, brr))) {
-					pfa->initstat = tempstat;
-					if (rrr) {
-						rr = rrr;
-						ungetc(c, inf);
-						break;
-					}
-				} else {
-					pfa->initstat = 2;
-					brr = rrr = rr = patbeg;
-				}
-			}
-			if (rrr || c == EOF)
-				break;
-			if ((c = getc(inf)) == '\n' || c == EOF)
-				/* 2 in a row */
-				break;
-			*rr++ = '\n';
-			*rr++ = c;
+		if (newflag)
+			found = fnematch(pfa, inf, &buf, &bufsize, recsize);
+		else {
+			tempstat = pfa->initstat;
+			pfa->initstat = 2;
+			found = fnematch(pfa, inf, &buf, &bufsize, recsize);
+			pfa->initstat = tempstat;
 		}
+		if (found)
+			*patbeg = 0;
 	} else {
+		if ((sep = **RS) == 0) {
+			sep = '\n';
+			while ((c=getc(inf)) == '\n' && c != EOF)	/* skip leading \n's */
+				;
+			if (c != EOF)
+				ungetc(c, inf);
+		}
 		for (rr = buf; ; ) {
 			for (; (c=getc(inf)) != sep && c != EOF; ) {
 				if (rr-buf+1 > bufsize)
@@ -264,14 +247,15 @@ int readrec(uschar **pbuf, int *pbufsize
 			*rr++ = '\n';
 			*rr++ = c;
 		}
+		if (!adjbuf(&buf, &bufsize, 1+rr-buf, recsize, &rr, "readrec 3"))
+			FATAL("input record `%.30s...' too long", buf);
+		*rr = 0;
 	}
-	if (!adjbuf(&buf, &bufsize, 1+rr-buf, recsize, &rr, "readrec 3"))
-		FATAL("input record `%.30s...' too long", buf);
-	*rr = 0;
-	   dprintf( ("readrec saw <%s>, returns %d\n", buf, c == EOF && rr == buf ? 0 : 1) );
 	*pbuf = buf;
 	*pbufsize = bufsize;
-	return c == EOF && rr == buf ? 0 : 1;
+	isrec = *buf || !feof(inf);
+	   dprintf( ("readrec saw <%s>, returns %d\n", buf, isrec) );
+	return isrec;
 }

 char *getargv(int n)	/* get ARGV[n] */
Index: proto.h
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/proto.h,v
retrieving revision 1.5
diff -u -p -r1.5 proto.h
--- proto.h	16 Sep 2011 16:09:46 -0000	1.5
+++ proto.h	6 Mar 2012 17:36:15 -0000
@@ -54,6 +54,7 @@ extern	int	member(int, const char *);
 extern	int	match(fa *, const char *);
 extern	int	pmatch(fa *, const char *);
 extern	int	nematch(fa *, const char *);
+extern	int	fnematch(fa *, FILE *, uschar **, int *, int);
 extern	Node	*reparse(const char *);
 extern	Node	*regexp(void);
 extern	Node	*primary(void);
@@ -122,7 +123,7 @@ extern	void	makefields(int, int);
 extern	void	growfldtab(int n);
 extern	int	getrec(uschar **, int *, int);
 extern	void	nextfile(void);
-extern	int	readrec(uschar **buf, int *bufsize, FILE *inf);
+extern	int	readrec(uschar **buf, int *bufsize, FILE *inf, int newflag);
 extern	char	*getargv(int);
 extern	void	setclvar(char *);
 extern	void	fldbld(void);
@@ -191,7 +192,7 @@ extern	Cell	*bltin(Node **, int);
 extern	Cell	*printstat(Node **, int);
 extern	Cell	*nullproc(Node **, int);
 extern	FILE	*redirect(int, Node *);
-extern	FILE	*openfile(int, const char *);
+extern	FILE	*openfile(int, const char *, int *);
 extern	const char	*filename(FILE *);
 extern	Cell	*closefile(Node **, int);
 extern	void	closeall(void);
Index: run.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/run.c,v
retrieving revision 1.4
diff -u -p -r1.4 run.c
--- run.c	22 Nov 2011 22:30:22 -0000	1.4
+++ run.c	6 Mar 2012 17:36:15 -0000
@@ -406,7 +406,7 @@ Cell *awkgetline(Node **a, int n)	/* get
 	FILE *fp;
 	uschar *buf;
 	int bufsize = recsize;
-	int mode;
+	int mode, newflag;

 	if ((buf = malloc(bufsize)) == NULL)
 		FATAL("out of memory in getline");
@@ -418,12 +418,12 @@ Cell *awkgetline(Node **a, int n)	/* get
 		mode = ptoi(a[1]);
 		if (mode == '|')		/* input pipe */
 			mode = LE;	/* arbitrary flag */
-		fp = openfile(mode, getsval(x));
+		fp = openfile(mode, getsval(x), &newflag);
 		tempfree(x);
 		if (fp == NULL)
 			n = -1;
 		else
-			n = readrec(&buf, &bufsize, fp);
+			n = readrec(&buf, &bufsize, fp, newflag);
 		if (n <= 0) {
 			;
 		} else if (a[0] != NULL) {	/* getline var <file */
@@ -1623,7 +1623,7 @@ Cell *bltin(Node **a, int n)	/* builtin 
 		if (isrec(x) || strlen(getsval(x)) == 0) {
 			flush_all();	/* fflush() or fflush("") -> all */
 			u = 0;
-		} else if ((fp = openfile(FFLUSH, getsval(x))) == NULL)
+		} else if ((fp = openfile(FFLUSH, getsval(x), NULL)) == NULL)
 			u = -1;
 		else
 			u = fflush(fp);
@@ -1715,7 +1715,7 @@ FILE *redirect(int a, Node *b)	/* set up

 	x = execute(b);
 	fname = getsval(x);
-	fp = openfile(a, fname);
+	fp = openfile(a, fname, NULL);
 	if (fp == NULL)
 		FATAL("can't open file %s", fname);
 	tempfree(x);
@@ -1746,7 +1746,7 @@ void stdinit(void)	/* in case stdin, etc
 	files[2].mode = GT;
 }

-FILE *openfile(int a, const char *us)
+FILE *openfile(int a, const char *us, int *pnewflag)
 {
 	const char *s = us;
 	size_t i;
@@ -1756,11 +1756,12 @@ FILE *openfile(int a, const char *us)
 	if (*s == '\0')
 		FATAL("null file name in print or getline");
 	for (i = 0; i < nfiles; i++)
-		if (files[i].fname && strcmp(s, files[i].fname) == 0) {
-			if (a == files[i].mode || (a==APPEND && files[i].mode==GT))
-				return files[i].fp;
-			if (a == FFLUSH)
-				return files[i].fp;
+		if (files[i].fname && strcmp(s, files[i].fname) == 0 &&
+		    (a == files[i].mode || (a==APPEND && files[i].mode==GT) ||
+		     a == FFLUSH)) {
+			if (pnewflag)
+				*pnewflag = 0;
+			return files[i].fp;
 		}
 	if (a == FFLUSH)	/* didn't find it, so don't create it! */
 		return NULL;
@@ -1797,6 +1798,8 @@ FILE *openfile(int a, const char *us)
 		files[i].fname = tostring(s);
 		files[i].fp = fp;
 		files[i].mode = m;
+		if (pnewflag)
+			*pnewflag = 1;
 	}
 	return fp;
 }




====================================


The testscript takes any number of awk implementations as arguments and reads test cases (pipelines to eval) from standard input. For example:
./awktest.sh gawk mawk nawk < testcases > results 2>&1

$ cat awktest.sh
#/bin/sh

if [ $# -eq 0 ]; then
	printf '%s\n' 'Where'\''s the AWK?' 1>&2
	exit 2
fi

# Pair each AWK implementation with a tempfile
for awk; do
	f=$(mktemp -t awktests)
	if [ $? -ne 0 ]; then 
		rm $flist 2> /dev/null
		exit 2
	fi
	flist="$flist $f"
	set "$@" "$awk:$f"
	shift
done

e=0	# Exit status.
i=0	# Test case loop index
# CAUTION: eval ahead.
while read -r testcase; do
	oldf=
	result=SUCCESS
	for awk; do
		f=${awk##*:}
		awk=${awk%:*}
		eval "$testcase" > $f 2>&1 || result=FAILURE
		if [ "$oldf" ]; then
			cmp -s $oldf $f || result=FAILURE
		fi
		oldf=$f
	done
	printf '%s: %s: %s\n' $result $(printf %03d $i) "$testcase"
	if [ $result = FAILURE ]; then
		e=1
		for awk; do
			f=${awk##*:}
			awk=${awk%:*}
			printf '%s:\n' "$awk"
			sed 's/^/--> /' $f
		done
	fi
	i=$((i+1))
done

rm $flist 2> /dev/null
exit $e


$ cat testcases
printf 1a2aa3aaa4 | "$awk" 1 RS=a		# Single character RS
printf 1a2aa3aaa4 | "$awk" 1 RS='[a]'		# Regular expression RS
printf 1a2ab3aba4 | "$awk" 1 RS='[ab]'
printf 1ab2ab3ab4 | "$awk" 1 RS=ab
printf 1a2aa3aaa4 | "$awk" 1 RS='a*'
printf 1a2aa3aaa4 | "$awk" 1 RS='aa*'
printf 1a2ab3aab4 | "$awk" 1 RS='aa*b*'
printf 1abc | "$awk" 1 RS='abcde|b'
printf 1abcdf2 | "$awk" 1 RS='abcde|b'
printf 1abcdebf2 | "$awk" 1 RS='abcde|b'
printf 1abcdf2 | "$awk" 1 RS='abcde|a'
printf a1a2a3a | "$awk" 1 RS='^a'		# Anchors
printf aaa1a2a | "$awk" 1 RS='^a'
printf a1a2a3a | "$awk" 1 RS='a$'
printf a1a2aaa | "$awk" 1 RS='a$'
jot -s a 10000 | "$awk" 'NR>1' RS='999[0-9]' 	# Trigger adjbuf relocation
printf foo | "$awk" 1 RS=			# Empty RS (multiline records)
printf '\n\n\nr1f1\nr1f2\n\nr2f1\nr2f2\n\n\n' | "$awk" '{$1=$1}1' RS= OFS=:

>Release-Note:

>Audit-Trail:
From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46155 CVS commit: src/tests/usr.bin
Date: Sat, 10 Mar 2012 14:08:57 -0500

 Module Name:	src
 Committed By:	christos
 Date:		Sat Mar 10 19:08:57 UTC 2012

 Modified Files:
 	src/tests/usr.bin: Makefile
 Added Files:
 	src/tests/usr.bin/awk: Makefile t_awk.sh

 Log Message:
 PR/46155: Miguel Piņeiro Jr: awk RS handling is broken. Add the test cases
 from the PR.


 To generate a diff of this commit:
 cvs rdiff -u -r1.6 -r1.7 src/tests/usr.bin/Makefile
 cvs rdiff -u -r0 -r1.1 src/tests/usr.bin/awk/Makefile \
     src/tests/usr.bin/awk/t_awk.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46155 CVS commit: src/external/historical/nawk/dist
Date: Sat, 10 Mar 2012 14:18:49 -0500

 Module Name:	src
 Committed By:	christos
 Date:		Sat Mar 10 19:18:48 UTC 2012

 Modified Files:
 	src/external/historical/nawk/dist: b.c lib.c proto.h run.c

 Log Message:
 PR/46155: Miguel Piņeiro Jr: Fix RS processing. Apply the gawk-like patch
 from the excellent PR. Many thanks for all the work you put on this,
 explanation, tests, and patch!


 To generate a diff of this commit:
 cvs rdiff -u -r1.2 -r1.3 src/external/historical/nawk/dist/b.c
 cvs rdiff -u -r1.4 -r1.5 src/external/historical/nawk/dist/lib.c \
     src/external/historical/nawk/dist/run.c
 cvs rdiff -u -r1.5 -r1.6 src/external/historical/nawk/dist/proto.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: wiz@NetBSD.org
State-Changed-When: Sat, 10 Mar 2012 19:42:40 +0000
State-Changed-Why:
christos committed the changes, everything fine now?


From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46155 CVS commit: [netbsd-6] src/external/historical/nawk/dist
Date: Thu, 12 Jul 2012 19:24:22 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Thu Jul 12 19:24:21 UTC 2012

 Modified Files:
 	src/external/historical/nawk/dist [netbsd-6]: b.c lib.c main.c proto.h
 	    run.c

 Log Message:
 Pull up following revision(s) (requested by christos in ticket #414):

 external/historical/nawk/dist/main.c		patch
 external/historical/nawk/dist/b.c		patch
 external/historical/nawk/dist/lib.c		patch
 external/historical/nawk/dist/proto.h		patch
 external/historical/nawk/dist/run.c		patch

 	Fix segfault in awk, and address PR#46155.
 	[christos, ticket #414]


 To generate a diff of this commit:
 cvs rdiff -u -r1.2 -r1.2.8.1 src/external/historical/nawk/dist/b.c
 cvs rdiff -u -r1.4 -r1.4.6.1 src/external/historical/nawk/dist/lib.c
 cvs rdiff -u -r1.6 -r1.6.4.1 src/external/historical/nawk/dist/main.c
 cvs rdiff -u -r1.5 -r1.5.4.1 src/external/historical/nawk/dist/proto.h
 cvs rdiff -u -r1.4 -r1.4.2.1 src/external/historical/nawk/dist/run.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: feedback->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 30 Dec 2012 08:07:24 +0000
State-Changed-Why:
fixed


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.