NetBSD Problem Report #42463
From dholland@eecs.harvard.edu Wed Dec 16 21:05:33 2009
Return-Path: <dholland@eecs.harvard.edu>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id 69BE763C3A9
for <gnats-bugs@gnats.NetBSD.org>; Wed, 16 Dec 2009 21:05:33 +0000 (UTC)
Message-Id: <20091216210407.56048F9BC@tanaqui.eecs.harvard.edu>
Date: Wed, 16 Dec 2009 16:04:06 -0500 (EST)
From: dholland@eecs.harvard.edu
Reply-To: dholland@eecs.harvard.edu
To: gnats-bugs@gnats.NetBSD.org
Subject: Bizarre behavior in awk with invalid numeric constants
X-Send-Pr-Version: 3.95
>Number: 42463
>Category: bin
>Synopsis: Bizarre behavior in awk with invalid numeric constants
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Dec 16 21:10:00 +0000 2009
>Last-Modified: Sun Jan 01 00:45:01 +0000 2017
>Originator: David A. Holland
>Release: NetBSD 5.99.22 (20091208)
>Organization:
>Environment:
System: NetBSD tanaqui 5.99.22 NetBSD 5.99.22 (TANAQUI) #31: Tue Dec 8 22:53:35 EST 2009 dholland@tanaqui:/usr/src/sys/arch/i386/compile/TANAQUI i386
Architecture: i386
Machine: i386
>Description:
awk does bizarrely random things when you write invalid numbers in the
program text.
This is not so surprising, although one would expect it to generate a
syntax error (recall that awk doesn't handle hex integer constants...)
% awk </dev/null 'END { printf "%d\n", 0xblegh }'
0
This, however, is very strange:
% awk </dev/null 'END { printf "%c\n", 0xblegh }'
0
If 0xblegh is a number, that should print a NUL, not a literal zero.
So ok, maybe it's being treated as a string constant, so let's try %s:
% awk </dev/null 'END { printf "%s\n", 0xblegh }'
0
...nope. But wait, it gets weirder. Let's try forcing a conversion to
a number:
% awk </dev/null 'END { printf "%s\n", (0xblegh + 0) }'
00
% awk </dev/null 'END { printf "%s\n", (0xblegh + 3) }'
03
% awk </dev/null 'END { printf "%s\n", (0xblegh - 5) }'
0-5
Huh?
gawk also behaves in a similar way:
% gawk < /dev/null 'END { printf "%d\n", 0xblegh }'
11
% gawk < /dev/null 'END { printf "%c\n", 0xblegh }'
1
% gawk < /dev/null 'END { printf "%s\n", 0xblegh }'
11
% gawk < /dev/null 'END { printf "%s\n", (0xblegh + 0) }'
110
In fact, modulo gawk treating the number as 11 (0xb) because it
accepts hex constants, the behavior is identical. Furthermore, this
whole thing came to light because of this bug filed on mawk:
http://www.mail-archive.com/ubuntu-bugs@lists.ubuntu.com/msg1266528.html
I find this disturbing, especially the way in which + mystically turns
into string concatenation. Is there some strange way in which this
behavior is mandated by the awk specification?
>How-To-Repeat:
as above.
>Fix:
reject invalid numbers up front?
>Audit-Trail:
From: "Greg A. Woods" <woods@planix.ca>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc:
Subject: Re: bin/42463: Bizarre behavior in awk with invalid numeric constants
Date: Sat, 19 Dec 2009 00:01:11 -0500
At Wed, 16 Dec 2009 21:10:00 +0000 (UTC), dholland@eecs.harvard.edu wrote:
Subject: bin/42463: Bizarre behavior in awk with invalid numeric constants
>
> This is not so surprising, although one would expect it to generate a
> syntax error (recall that awk doesn't handle hex integer constants...)
Well, a syntax error would not really be correct so far as I can tell,
maybe not even for constants in the program text.
As you know an awk scalar variable has both a string and a number value
at the same time; and expressions take on string or numeric values as
appropriate.
As far as I can find very little is said in the awk book or the awk(1)
manual about numerical constants. Both do say though that string
constants are quoted with double quote characters (and can contain
C-like character escapes), and regular expression constants are quoted
with slash characters.
The awk book does say, (appendix A, p.192) "The numeric value of an
arbitrary string is the numeric value of its numeric prefix."
The mawk(1) manual says this about numeric constants:
Numeric constants can be integer like -2, decimal like 1.08,
or in scientific notation like -1.1e4 or .28E-3.
So, in all your examples the numeric value of the unnamed constant you
give as "0xblegh" should probably be zero, IIUC, at least for awk and
mawk.
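To make the quoted rule concrete, here is a minimal Python sketch of "the numeric value of its numeric prefix", restricted to the decimal forms listed in the mawk(1) excerpt. The name numeric_prefix_value and the regular expression are illustrative assumptions, not code from any awk implementation; a real implementation may accept other forms.

```python
import re

# Optional sign, then digits with an optional fraction (or a bare fraction),
# then an optional exponent. Hex prefixes are deliberately not recognised.
_NUMERIC_PREFIX = re.compile(r"\s*[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def numeric_prefix_value(s):
    """Return the value of the longest decimal numeric prefix of s, or 0.0."""
    m = _NUMERIC_PREFIX.match(s)
    return float(m.group(0)) if m else 0.0


print(numeric_prefix_value("0xblegh"))   # 0.0 (prefix is just "0")
print(numeric_prefix_value("3 apples"))  # 3.0
print(numeric_prefix_value("-1.1e4"))    # -11000.0
print(numeric_prefix_value("blegh"))     # 0.0 (no numeric prefix at all)
```

Under this reading, "0xblegh" has the numeric prefix "0" and therefore the value zero.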
However as you've shown it doesn't seem as though things actually work
the way _I_ would expect when it comes to expressions containing
un-quoted non-numeric constants with numeric prefixes.
Interestingly to me awk and mawk behave in exactly the same bizarre way:
$ awk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
0-5
$ mawk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
0-5
$ gawk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
11-5
Those examples really do floor me. What an amazing side effect, and
identically in two different implementations! I can only guess without
looking at the code that mawk tries very hard to mimic awk's behaviour
here. Gawk is almost being even more bizarre, but at least it might be
getting the interpretation of the first constant correct, for some
meaning of correct as per its own documentation.
Given the following as well it looks as if the parser is sticking the
numeric value of the first term into the number part of the variable,
and then sticking the numeric part of the second "term" into the string
part of the variable, but just as if it parsed a number, not as the
operator and second value:
$ awk 'BEGIN{v = 0xblegh + 9; printf("%s\n", v) }'
09
Gawk does just print the correct result if the hex number is indeed a
proper hex number, and if the variable is printed as a string:
$ gawk 'BEGIN{v = 0x11 - 5; printf "%s\n", v }'
12
but I guess that's not really a surprise of any kind.
To avoid any possible code parser issues we can feed the value in as
input, and indeed that does then seem to have a better result, though
still not entirely an expected result since, IIUC, neither awk nor mawk
should interpret hex for input values, but apparently they do:
23:07 [602] $ echo "0xblegh" | awk '{v = $1} END{printf("%s\n", v)}'
0xblegh
23:07 [603] $ echo "0xblegh" | awk '{v = $1} END{printf("%s\n", v + 0)}'
11
23:07 [604] $ echo "0xblegh" | mawk '{v = $1} END{printf("%s\n", v + 0)}'
11
23:07 [605] $ echo "0xblegh" | gawk '{v = $1} END{printf("%s\n", v + 0)}'
0
23:07 [606] $ echo "0xblegh" | gawk '{v = $1} END{printf("%d\n", v + 0)}'
0
23:08 [607] $ echo "0xblegh" | mawk '{v = $1} END{printf("%d\n", v + 0)}'
11
23:08 [608] $ echo "0xblegh" | awk '{v = $1} END{printf("%d\n", v + 0)}'
11
23:08 [609] $ echo "0xblegh" | mawk '{v = $1} END{printf("%d\n", v "")}'
11
23:09 [610] $ echo "0xblegh" | awk '{v = $1} END{printf("%d\n", v "")}'
11
23:09 [611] $ echo "0xblegh" | gawk '{v = $1} END{printf("%d\n", v "")}'
0
23:09 [612] $ echo "0xblegh" | gawk '{v = $1} END{printf("%s\n", v "")}'
0xblegh
23:09 [613] $ echo "0xblegh" | mawk '{v = $1} END{printf("%s\n", v "")}'
0xblegh
23:09 [614] $ echo "0xblegh" | awk '{v = $1} END{printf("%s\n", v "")}'
0xblegh
--
Greg A. Woods
Planix, Inc.
<woods@planix.com> +1 416 218 0099 http://www.planix.com/
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/42463: Bizarre behavior in awk with invalid numeric constants
Date: Sat, 23 Jan 2010 19:55:52 +0000
On Sat, Dec 19, 2009 at 05:05:03AM +0000, Greg A. Woods wrote:
>> This is not so surprising, although one would expect it to generate a
>> syntax error (recall that awk doesn't handle hex integer constants...)
>
> Well, a syntax error would not really be correct so far as I can tell,
> maybe not even for constants in the program text.
For arbitrary values, yes, but for constants in the program text?
Surely those should be rejected if they aren't actually numbers.
> However as you've shown it doesn't seem as though things actually work
> the way _I_ would expect when it comes to expressions containing
> un-quoted non-numeric constants with numeric prefixes.
Right. Whatever is going on is deeper than just running strtol() on
some bogus strings gotten out of the program text.
> Interestingly to me awk and mawk behave in exactly the same bizarre way:
>
> $ awk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
> 0-5
> $ mawk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
> 0-5
> $ gawk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
> 11-5
>
> Those examples really do floor me. What an amazing side effect, and
> identically in two different implementations!
Yeah...
> To avoid any possible code parser issues we can feed the value in as
> input, and indeed that does then seem to have a better result, though
> still not entirely an expected result since, IIUC, neither awk nor mawk
> should interpret hex for input values, but apparently they do:
>
> [...]
> 23:07 [603] $ echo "0xblegh" | awk '{v = $1} END{printf("%s\n", v + 0)}'
> 11
That seems broken, yes...
--
David A. Holland
dholland@netbsd.org
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/42463: Bizarre behavior in awk with invalid numeric constants
Date: Sun, 1 Jan 2017 00:44:38 +0000
On Wed, Dec 16, 2009 at 09:10:00PM +0000, dholland@eecs.harvard.edu wrote:
> % awk </dev/null 'END { printf "%s\n", (0xblegh + 0) }'
> 00
> % awk </dev/null 'END { printf "%s\n", (0xblegh + 3) }'
> 03
> % awk </dev/null 'END { printf "%s\n", (0xblegh - 5) }'
> 0-5
>
> Huh?
I figured out what is going on here: the lexer isn't a C lexer where
trailing letters are potentially part of a number, so it matches only
the acceptable digits, calls those a number, and goes on. That means
it gets the expression
0 xblegh - 5
which is to say, the string concatenation of 0 with (xblegh - 5), and
since xblegh (a perfectly fine variable name) hasn't been assigned a
value, it's 0, so we get a string concatenation of two numbers, which
converts them to strings and pastes them together.
This can be detected by assigning xblegh a value:
% awk < /dev/null 'END { printf "%s\n", (0xblegh - 5) }' xblegh=8
03
In the case of gawk, because it accepts hex constants, it takes 0xb
and what's left is the also unassigned variable "legh".
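The explanation above can be checked against a toy model. The following Python sketch is not the awk lexer or grammar; tokenize, evaluate and the hex_constants flag are hypothetical names, and only integers, variable names, + and - are handled. It assumes what the analysis describes: a number token ends at the first character that cannot belong to it, adjacent terms are concatenated, and an unset variable is 0 in arithmetic and empty as a string.

```python
import re


def tokenize(src, hex_constants=False):
    """Split src into (kind, text) tokens: 'num', 'name' or 'op'.

    Nothing forces a separator between a number and a following name.
    hex_constants=True models an awk that accepts 0x... constants.
    """
    number = r"0[xX][0-9a-fA-F]+|\d+" if hex_constants else r"\d+"
    pattern = re.compile(
        r"\s*(?:(?P<num>%s)|(?P<name>[A-Za-z_]\w*)|(?P<op>[-+]))" % number)
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = pattern.match(src, pos)
        if m is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def evaluate(src, variables=None, hex_constants=False):
    """Evaluate src as a concatenation of +/- terms; return the string."""
    variables = variables or {}
    tokens = tokenize(src, hex_constants)
    pos = 0

    def atom():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] == "op":
            raise SyntaxError("operand expected")
        kind, text = tokens[pos]
        pos += 1
        if kind == "num":
            return int(text, 16) if text[:2].lower() == "0x" else int(text)
        return variables.get(text, "")  # unset variable: empty string

    def num(value):
        return 0 if value == "" else value  # empty string counts as 0

    def additive():
        nonlocal pos
        total = atom()
        while pos < len(tokens) and tokens[pos][0] == "op":
            op = tokens[pos][1]
            pos += 1
            rhs = num(atom())
            total = num(total) + rhs if op == "+" else num(total) - rhs
        return total

    parts = []
    while pos < len(tokens):  # adjacent terms are concatenated
        parts.append(str(additive()))
    return "".join(parts)


print(evaluate("0xblegh - 5"))                      # 0-5
print(evaluate("0xblegh - 5", {"xblegh": 8}))       # 03
print(evaluate("0xblegh - 5", hex_constants=True))  # 11-5
```

The three outputs match the awk results, the xblegh=8 experiment, and the gawk result reported in this PR.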
This behavior certainly produces mystifying results and it might be
good to have awk print a warning, like "Warning: no space between
number and variable name".
untested patch for that:
Index: lex.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/lex.c,v
retrieving revision 1.2
diff -u -p -r1.2 lex.c
--- lex.c 26 Aug 2010 14:55:19 -0000 1.2
+++ lex.c 1 Jan 2017 00:44:00 -0000
@@ -151,6 +151,8 @@ int gettok(char **pbuf, int *psz) /* get
|| c == '.' || c == '+' || c == '-')
*bp++ = c;
else {
+ if (isalpha(c))
+ WARNING( "no space between number and variable name" );
unput(c);
break;
}
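As a rough illustration of the idea behind the patch (warn when a letter immediately follows a number), here is a small Python model. It is not a port of lex.c: the function name adjacency_warnings and the number pattern are assumptions, and it scans raw text, so it would also flag such sequences inside string literals or comments.

```python
import re

# Decimal number with optional fraction and exponent (an assumed shape).
_NUMBER = re.compile(r"\d+(?:\.\d*)?(?:[eE][-+]?\d+)?")
_IDENT = re.compile(r"\w+")


def adjacency_warnings(src):
    """Return one warning per number that is directly followed by a letter."""
    warnings = []
    pos = 0
    while pos < len(src):
        ch = src[pos]
        if ch.isalpha() or ch == "_":
            # Skip a whole identifier so digits inside names are ignored.
            pos = _IDENT.match(src, pos).end()
        elif ch.isdigit():
            pos = _NUMBER.match(src, pos).end()
            if pos < len(src) and src[pos].isalpha():
                warnings.append(
                    "offset %d: no space between number and variable name"
                    % pos)
        else:
            pos += 1
    return warnings


print(adjacency_warnings("0xblegh - 5"))
print(adjacency_warnings("0 xblegh - 5"))
```

The first call reports one warning at offset 1; the second, with the space written out, reports none.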
--
David A. Holland
dholland@netbsd.org
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.