NetBSD Problem Report #42463

From dholland@eecs.harvard.edu  Wed Dec 16 21:05:33 2009
Return-Path: <dholland@eecs.harvard.edu>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 69BE763C3A9
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 16 Dec 2009 21:05:33 +0000 (UTC)
Message-Id: <20091216210407.56048F9BC@tanaqui.eecs.harvard.edu>
Date: Wed, 16 Dec 2009 16:04:06 -0500 (EST)
From: dholland@eecs.harvard.edu
Reply-To: dholland@eecs.harvard.edu
To: gnats-bugs@gnats.NetBSD.org
Subject: Bizarre behavior in awk with invalid numeric constants
X-Send-Pr-Version: 3.95

>Number:         42463
>Category:       bin
>Synopsis:       Bizarre behavior in awk with invalid numeric constants
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Dec 16 21:10:00 +0000 2009
>Last-Modified:  Sun Jan 01 00:45:01 +0000 2017
>Originator:     David A. Holland
>Release:        NetBSD 5.99.22 (20091208)
>Organization:
>Environment:
System: NetBSD tanaqui 5.99.22 NetBSD 5.99.22 (TANAQUI) #31: Tue Dec 8 22:53:35 EST 2009 dholland@tanaqui:/usr/src/sys/arch/i386/compile/TANAQUI i386
Architecture: i386
Machine: i386
>Description:

awk does bizarrely random things when you write invalid numbers in the
program text.

This is not so surprising, although one would expect it to generate a
syntax error (recall that awk doesn't handle hex integer constants...)

   % awk </dev/null 'END { printf "%d\n", 0xblegh }'
   0

This, however, is very strange:

   % awk </dev/null 'END { printf "%c\n", 0xblegh }'
   0

If 0xblegh is a number, that should print a NUL, not a literal zero.
So ok, maybe it's being treated as a string constant, so let's try %s:

   % awk </dev/null 'END { printf "%s\n", 0xblegh }'
   0

...nope. But wait, it gets weirder. Let's try forcing a conversion to
a number:

   % awk </dev/null 'END { printf "%s\n", (0xblegh + 0) }'
   00
   % awk </dev/null 'END { printf "%s\n", (0xblegh + 3) }'
   03
   % awk </dev/null 'END { printf "%s\n", (0xblegh - 5) }'
   0-5

Huh?

gawk also behaves in a similar way:

   % gawk < /dev/null 'END { printf "%d\n", 0xblegh }'
   11
   % gawk < /dev/null 'END { printf "%c\n", 0xblegh }'
   1
   % gawk < /dev/null 'END { printf "%s\n", 0xblegh }'
   11
   % gawk < /dev/null 'END { printf "%s\n", (0xblegh + 0) }'
   110

In fact, modulo gawk treating the number as 11 (0xb) because it
accepts hex constants, the behavior is identical. Furthermore, this
whole thing came to light because of this bug filed on mawk:

   http://www.mail-archive.com/ubuntu-bugs@lists.ubuntu.com/msg1266528.html

I find this disturbing, especially the way in which + mystically turns
into string concatenation. Is there some strange way in which this
behavior is mandated by the awk specification?

>How-To-Repeat:

as above.

>Fix:

reject invalid numbers up front?

>Audit-Trail:
From: "Greg A. Woods" <woods@planix.ca>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: bin/42463: Bizarre behavior in awk with invalid numeric constants
Date: Sat, 19 Dec 2009 00:01:11 -0500


 At Wed, 16 Dec 2009 21:10:00 +0000 (UTC), dholland@eecs.harvard.edu wrote:
 Subject: bin/42463: Bizarre behavior in awk with invalid numeric constants
 >
 > This is not so surprising, although one would expect it to generate a
 > syntax error (recall that awk doesn't handle hex integer constants...)

 Well, a syntax error would not really be correct so far as I can tell,
 maybe not even for constants in the program text.

 As you know an awk scalar variable has both a string and a number value
 at the same time; and expressions take on string or numeric values as
 appropriate.

 As far as I can find very little is said in the awk book or the awk(1)
 manual about numerical constants.  Both do say though that string
 constants are quoted with double quote characters (and can contain
 C-like character escapes), and regular expression constants are quoted
 with slash characters.

 The awk book does say, (appendix A, p.192) "The numeric value of an
 arbitrary string is the numeric value of its numeric prefix."
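
 As a quick illustration of that rule (a sketch, not taken from the
 book): coercing a string to a number by adding 0 should yield just its
 decimal prefix. The shell function numeric_value is a hypothetical
 helper introduced only for this example. Hex-looking strings are
 deliberately left out, since the implementations disagree on them.

```shell
# Hypothetical helper (not part of awk): print the number that awk
# derives from the string passed as the first argument.
numeric_value() {
    awk -v s="$1" 'BEGIN { print s + 0 }'
}

numeric_value "12abc"   # prints 12
numeric_value "-4x"     # prints -4
numeric_value "abc"     # prints 0 (no numeric prefix)
```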

 The mawk(1) manual says this about numeric constants:

 	Numeric constants can be integer like -2, decimal like 1.08,
 	or in scientific notation like -1.1e4 or .28E-3.

 So, in all your examples the numeric value of the unnamed constant you
 give as "0xblegh" should probably be zero, IIUC, at least for awk and
 mawk.

 However as you've shown it doesn't seem as though things actually work
 the way _I_ would expect when it comes to expressions containing
 un-quoted non-numeric constants with numeric prefixes.

 Interestingly to me awk and mawk behave in exactly the same bizarre way:

 $ awk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
 0-5
 $ mawk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
 0-5
 $ gawk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
 11-5

 Those examples really do floor me.  What an amazing side effect, and
 identically in two different implementations!  I can only guess without
 looking at the code that mawk tries very hard to mimic awk's behaviour
 here.  Gawk is almost being even more bizarre, but at least it might be
 getting the interpretation of the first constant correct, for some
 meaning of correct as per its own documentation.

 Given the following as well it looks as if the parser is sticking the
 numeric value of the first term into the number part of the variable,
 and then sticking the numeric part of the second "term" into the string
 part of the variable, but just as if it had parsed a number, not as the
 operator and second value:

 $ awk 'BEGIN{v = 0xblegh + 9; printf("%s\n", v) }'  
 09


 Gawk does just print the correct result if the hex number is indeed a
 proper hex number, and if the variable is printed as a string:

 $ gawk 'BEGIN{v = 0x11 - 5; printf "%s\n", v }'
 12

 but I guess that's not really a surprise of any kind.


 To avoid any possible code parser issues we can feed the value in as
 input, and indeed that does then seem to have a better result, though
 still not entirely an expected result since, IIUC, neither awk nor mawk
 should interpret hex for input values, but apparently they do:

 23:07 [602] $ echo "0xblegh" | awk '{v = $1} END{printf("%s\n", v)}'
 0xblegh
 23:07 [603] $ echo "0xblegh" | awk '{v = $1} END{printf("%s\n", v + 0)}'
 11
 23:07 [604] $ echo "0xblegh" | mawk '{v = $1} END{printf("%s\n", v + 0)}'
 11
 23:07 [605] $ echo "0xblegh" | gawk '{v = $1} END{printf("%s\n", v + 0)}'
 0
 23:07 [606] $ echo "0xblegh" | gawk '{v = $1} END{printf("%d\n", v + 0)}'
 0
 23:08 [607] $ echo "0xblegh" | mawk '{v = $1} END{printf("%d\n", v + 0)}'
 11
 23:08 [608] $ echo "0xblegh" | awk '{v = $1} END{printf("%d\n", v + 0)}'
 11
 23:08 [609] $ echo "0xblegh" | mawk '{v = $1} END{printf("%d\n", v "")}'
 11
 23:09 [610] $ echo "0xblegh" | awk '{v = $1} END{printf("%d\n", v "")}'
 11
 23:09 [611] $ echo "0xblegh" | gawk '{v = $1} END{printf("%d\n", v "")}'
 0
 23:09 [612] $ echo "0xblegh" | gawk '{v = $1} END{printf("%s\n", v "")}'
 0xblegh
 23:09 [613] $ echo "0xblegh" | mawk '{v = $1} END{printf("%s\n", v "")}'
 0xblegh
 23:09 [614] $ echo "0xblegh" | awk '{v = $1} END{printf("%s\n", v "")}'
 0xblegh



 -- 
 						Greg A. Woods
 						Planix, Inc.

 <woods@planix.com>       +1 416 218 0099        http://www.planix.com/


From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/42463: Bizarre behavior in awk with invalid numeric
	constants
Date: Sat, 23 Jan 2010 19:55:52 +0000

 On Sat, Dec 19, 2009 at 05:05:03AM +0000, Greg A. Woods wrote:
  >> This is not so surprising, although one would expect it to generate a
  >> syntax error (recall that awk doesn't handle hex integer constants...)
  >  
  >  Well, a syntax error would not really be correct so far as I can tell,
  >  maybe not even for constants in the program text.

 For arbitrary values, yes, but for constants in the program text?
 Surely those should be rejected if they aren't actually numbers.

  >  However as you've shown it doesn't seem as though things actually work
  >  the way _I_ would expect when it comes to expressions containing
  >  un-quoted non-numeric constants with numeric prefixes.

 Right. Whatever is going on is deeper than just running strtol() on
 some bogus strings gotten out of the program text.

  >  Interestingly to me awk and mawk behave in exactly the same bizarre way:
  >  
  >  $ awk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
  >  0-5
  >  $ mawk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
  >  0-5
  >  $ gawk 'BEGIN{v = 0xblegh - 5; printf("%s\n", v) }'
  >  11-5
  >  
  >  Those examples really do floor me.  What an amazing side effect, and
  >  identically in two different implementations!

 Yeah...

  >  To avoid any possible code parser issues we can feed the value in as
  >  input, and indeed that does then seem to have a better result, though
  >  still not entirely an expected result since, IIUC, neither awk nor mawk
  >  should interpret hex for input values, but apparently they do:
  >  
  >  [...]
  >  23:07 [603] $ echo "0xblegh" | awk '{v = $1} END{printf("%s\n", v + 0)}'
  >  11

 That seems broken, yes...

 -- 
 David A. Holland
 dholland@netbsd.org

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/42463: Bizarre behavior in awk with invalid numeric constants
Date: Sun, 1 Jan 2017 00:44:38 +0000

 On Wed, Dec 16, 2009 at 09:10:00PM +0000, dholland@eecs.harvard.edu wrote:
  >    % awk </dev/null 'END { printf "%s\n", (0xblegh + 0) }'
  >    00
  >    % awk </dev/null 'END { printf "%s\n", (0xblegh + 3) }'
  >    03
  >    % awk </dev/null 'END { printf "%s\n", (0xblegh - 5) }'
  >    0-5
  > 
  > Huh?

 I figured out what's going on here: the lexer isn't a C lexer where
 trailing letters are potentially part of a number, so it matches only
 the acceptable digits, calls those a number, and goes on. That means
 it gets the expression

    0 xblegh - 5

 which is to say, the string concatenation of 0 with (xblegh - 5), and
 since xblegh (a perfectly fine variable name) hasn't been assigned a
 value, it's 0, so we get a string concatenation of two numbers, which
 converts them to strings and pastes them together.

 This can be detected by assigning xblegh a value:

    % awk < /dev/null 'END { printf "%s\n", (0xblegh - 5) }' xblegh=8
    03
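
 The same reading can be checked without relying on the lexer to split
 the token, by writing the space explicitly. This is only a minimal
 sketch; the shell variable names unset_result and set_result are made
 up for this example.

```shell
# With xblegh unassigned it evaluates to 0, so the expression is the
# concatenation of "0" with the string form of (0 - 5).
unset_result=$(awk 'BEGIN { v = 0 xblegh - 5; printf("%s\n", v) }')

# With xblegh set to 8 the right-hand operand becomes (8 - 5).
set_result=$(awk 'BEGIN { xblegh = 8; v = 0 xblegh - 5; printf("%s\n", v) }')

echo "$unset_result"   # prints 0-5
echo "$set_result"     # prints 03
```

 With the space written out there is no 0x prefix left for any
 implementation to interpret, so the output should not depend on
 whether hex constants are accepted.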

 In the case of gawk, because it accepts hex constants, it takes 0xb
 and what's left is the also unassigned variable "legh".

 This behavior certainly produces mystifying results and it might be
 good to have awk print a warning, like "Warning: no space between
 number and variable name".

 untested patch for that:

 Index: lex.c
 ===================================================================
 RCS file: /cvsroot/src/external/historical/nawk/dist/lex.c,v
 retrieving revision 1.2
 diff -u -p -r1.2 lex.c
 --- lex.c	26 Aug 2010 14:55:19 -0000	1.2
 +++ lex.c	1 Jan 2017 00:44:00 -0000
 @@ -151,6 +151,8 @@ int gettok(char **pbuf, int *psz)	/* get
  			  || c == '.' || c == '+' || c == '-')
  				*bp++ = c;
  			else {
 +				if (isalpha(c))
 +					WARNING( "no space between number and variable name" );
  				unput(c);
  				break;
  			}


 -- 
 David A. Holland
 dholland@netbsd.org
