NetBSD Problem Report #58014

From www@netbsd.org  Sat Mar  9 07:46:05 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 012A71A923F
	for <gnats-bugs@gnats.NetBSD.org>; Sat,  9 Mar 2024 07:46:05 +0000 (UTC)
Message-Id: <20240309074534.076391A9242@mollari.NetBSD.org>
Date: Sat,  9 Mar 2024 07:45:33 +0000 (UTC)
From: michael.cheponis@gmail.com
Reply-To: michael.cheponis@gmail.com
To: gnats-bugs@NetBSD.org
Subject: wc no longer works with binary files
X-Send-Pr-Version: www-1.0

>Number:         58014
>Category:       bin
>Synopsis:       wc no longer works with binary files
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 09 07:50:00 +0000 2024
>Last-Modified:  Sun Mar 10 11:35:01 +0000 2024
>Originator:     Mike Cheponis
>Release:        10.0_RC5
>Organization:
self
>Environment:
NetBSD SS.Culver.Net 10.0_RC5 NetBSD 10.0_RC5 (GENERIC) #0: Tue Feb 27 05:27:39 UTC 2024  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
When 'wc' is given a binary file as input, it now gives the error:

wc: hello: invalid byte sequence

(assuming 'hello' is a binary file).

On

NetBSD arm64 10.99.7 NetBSD 10.99.7 (GENERIC64) #0: Fri Aug 11 08:15:30 UTC 2023 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/evbarm/compile/GENERIC64 evbarm

wc works as one would expect.  The error shows up only on amd64.

There is no mention of a "-b" switch, say, for binary files, nor does the "man wc" page explain this behavior.
>How-To-Repeat:
wc <any binary file>
>Fix:

>Audit-Trail:
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sat, 09 Mar 2024 16:50:02 +0700

     Date:        Sat,  9 Mar 2024 07:50:00 +0000 (UTC)
     From:        michael.cheponis@gmail.com
     Message-ID:  <20240309075000.E456A1A9241@mollari.NetBSD.org>

   | when 'wc' is given input from a binary file, it now gives the error:
   |
   | wc: hello: invalid byte sequence

   | (Assuming 'hello' is a binary file)

 wc without flags needs to count characters.   What is a character depends
 upon your locale settings.  Do

 	LC_ALL=C wc hello

 (or prefix that with "env" if you're a csh user) and it will work.

   | wc works as one would expect on arm64.  This error only shows up on amd64

 More likely your default locale (LANG, LC_CTYPE or LC_ALL) is different in
 the two cases.

 I am not sure that it makes sense to attempt to count characters, lines, or
 words, in a binary file - what would the answers mean?    If you were looking
 to get the size of the file, wc is not the right tool.

 I see no bug here, nor any real need to explain that a "word count" program
 isn't intended to be sane on non word/character containing files in the
 manual page.
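
 The locale point above can be made concrete with a small sketch. This is
 not wc's code and does not call mbrtowc(); it is a hand-rolled, simplified
 UTF-8 check (the names utf8_seq_len and count_utf8_chars are made up for
 illustration) that shows why arbitrary binary bytes stop being "characters"
 once the encoding is UTF-8. On an invalid byte it skips one byte and moves
 on, similar in spirit to the "XXX skip 1 byte" comment in wc.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Length of the UTF-8 sequence starting at p, or 0 if the bytes do not
 * form one.  Simplified on purpose: it checks lead and continuation
 * bytes only, so some overlong 3/4-byte forms and surrogates slip through.
 */
static size_t
utf8_seq_len(const unsigned char *p, size_t avail)
{
	size_t need, i;

	if (avail == 0)
		return 0;
	if (p[0] < 0x80)
		return 1;		/* plain ASCII */
	else if (p[0] >= 0xC2 && p[0] <= 0xDF)
		need = 2;
	else if (p[0] >= 0xE0 && p[0] <= 0xEF)
		need = 3;
	else if (p[0] >= 0xF0 && p[0] <= 0xF4)
		need = 4;
	else
		return 0;		/* stray continuation or bad lead byte */
	if (avail < need)
		return 0;		/* truncated sequence */
	for (i = 1; i < need; i++)
		if ((p[i] & 0xC0) != 0x80)
			return 0;
	return need;
}

/*
 * Count characters in buf, treating it as UTF-8.  Each byte that cannot
 * start a valid sequence is skipped and counted in *invalid.
 */
size_t
count_utf8_chars(const unsigned char *buf, size_t len, size_t *invalid)
{
	size_t chars = 0, bad = 0, n;

	assert(buf != NULL && invalid != NULL);
	while (len > 0) {
		n = utf8_seq_len(buf, len);
		if (n == 0) {
			bad++;
			n = 1;
		} else
			chars++;
		buf += n;
		len -= n;
	}
	*invalid = bad;
	return chars;
}
```

 Fed the six bytes 0x7f 'E' 'L' 'F' 0x80 0xff, this reports four characters
 and two invalid bytes. In a single-byte locale such as C, every byte would
 count as one character and nothing would be flagged.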


From: Michael Cheponis <michael.cheponis@gmail.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sat, 9 Mar 2024 12:05:40 -0800

 It's indeed the case that on my arm64 test of 'wc' that 'worked' on binary
 files, the environment variable "LC_ALL=C" was set.

 I think the man page for wc needs updating, at least, to explain its
 interaction with that environment variable.   There *is* a discussion on
 that man page about needing to use the POSIX iswspace() function, but when I
 followed that page, there was no detail about the LC_ALL environment
 variable.

 Also, historically, wc was something like this:

 #include <stdio.h>

 int main(int argc, char *argv[]) {
     int character, lineCount = 0, wordCount = 0, byteCount = 0, inWord = 0;

     while ((character = getchar()) != EOF) {
         ++byteCount;
         if (character == '\n')
             ++lineCount;
         if (character == ' ' || character == '\n' || character == '\t')
             inWord = 0;
         else if (inWord == 0) {
             inWord = 1;
             ++wordCount;
         }
     }

     printf("%d %d %d\n", lineCount, wordCount, byteCount);
     return 0;
 }

 That is, because unix 'files' are simply strings-of-bytes, it may be
 meaningless to count 'words' and 'lines' -- but yes, characters (file size)
 is useful.

 Generally, I use this when I want to know source size, and the program's
 executable is in the source directory as an artifact - I do "wc *"

 Anyway, I'm asking for a documentation change.

 Thank you,
 Mike
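
 Where only the file size is wanted, one common alternative to wc is to ask
 the filesystem via fstat(2). The sketch below is illustrative only; the
 helper names fd_size and path_size are made up here and come from neither
 wc nor this report.

```c
#include <sys/types.h>
#include <sys/stat.h>

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Size in bytes of an already-open file, or -1 on error. */
long long
fd_size(int fd)
{
	struct stat st;

	if (fstat(fd, &st) == -1)
		return -1;
	return (long long)st.st_size;
}

/* Size in bytes of the file at path, or -1 if it cannot be opened. */
long long
path_size(const char *path)
{
	long long size;
	int fd;

	assert(path != NULL);
	if ((fd = open(path, O_RDONLY)) == -1)
		return -1;
	size = fd_size(fd);
	(void)close(fd);
	return size;
}
```

 No byte of the file is read or decoded, so the locale plays no part in the
 result. Going through a descriptor also covers input that is already open,
 such as stdin redirected from a regular file.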

From: Michael Cheponis <michael.cheponis@gmail.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sat, 9 Mar 2024 12:16:32 -0800

 Crap.  I see what the problem is:  I want my "ll" alias to give me commas
 in the file length reported.

 This requires setting env vars like this:
 LANG=en_US.UTF-8
 LC_ALL=""
 LC_NUMERIC=en_US.UTF-8 locale -k thousands_sep

 Specifically, note that LC_ALL must be set to ""

 producing output like this
 -rwxr-xr-x  1 mac  users  17,096 Mar  8 23:31 n*
 -rw-r--r--  1 mac  users     560 Mar  8 23:31 n.c
 -rw-r--r--  1 mac  users     155 Mar  8 23:01 n.c~

 Now, if I set LC_ALL=C   (to make 'wc' count ok on binary files), then I
 get from my "ll" :
 -rwxr-xr-x  1 mac  users   17096 Mar  8 23:31 n*
 -rw-r--r--  1 mac  users     560 Mar  8 23:31 n.c
 -rw-r--r--  1 mac  users     155 Mar  8 23:01 n.c~

 Catch 22 -- I have to use an alias for wc that changes the local
 environment variable when running wc

 alias wc="LC_ALL=C wc"

 Again, I'm not sure this is sufficiently documented.   I'd be happy to make
 suggested changes to the man page(s).

 Thanks again,
 Mike
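
 The interaction between LC_ALL, the per-category variables and LANG follows
 the POSIX precedence rule: a non-empty LC_ALL wins, then the category's own
 variable, then LANG. A minimal sketch of that lookup follows. The function
 name effective_locale is made up for illustration, and the final fallback
 to "C" is an assumption, since POSIX leaves that default to the
 implementation.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Name of the locale that the environment selects for one category,
 * e.g. effective_locale("LC_CTYPE").  Empty values count as unset.
 */
const char *
effective_locale(const char *category)
{
	const char *v;

	assert(category != NULL);
	if ((v = getenv("LC_ALL")) != NULL && *v != '\0')
		return v;		/* overrides every category */
	if ((v = getenv(category)) != NULL && *v != '\0')
		return v;		/* per-category setting */
	if ((v = getenv("LANG")) != NULL && *v != '\0')
		return v;		/* general default */
	return "C";			/* assumed implementation default */
}
```

 With LC_ALL=C every category resolves to C, which is why the thousands
 separator disappears. With LC_ALL empty, LC_NUMERIC and LC_CTYPE resolve
 independently. Under that reading, setting LC_NUMERIC alone while leaving
 LANG and LC_CTYPE unset might keep the commas without the wc alias, but
 that has not been tried on the systems in this report.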

From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sun, 10 Mar 2024 11:34:31 +0000 (UTC)

 On Sat, 9 Mar 2024, michael.cheponis@gmail.com wrote:

 >> Description:
 > when 'wc' is given input from a binary file, it now gives the error:
 >
 > wc: hello: invalid byte sequence
 >
 > (Assuming 'hello' is a binary file)
 >


 It's pretty inefficient to use mbrtowc() when a `-m' wasn't supplied (no
 matter what the locale), but, you can at least stop wc from spamming and
 confusing users so...:

 ```
 --- wc.c.orig	2024-01-14 17:39:19.000000000 +0000
 +++ wc.c	2024-03-10 11:06:10.228327632 +0000
 @@ -73,7 +73,7 @@
   #endif

   static wc_count_t	tlinect, twordct, tcharct, tlongest;
 -static bool		doline, doword, dobyte, dochar, dolongest;
 +static bool		doline, doword, dobyte, dochar, dolongest, warned;
   static int 		rval = 0;

   static void	cnt(const char *);
 @@ -148,8 +148,11 @@
   	do {
   		r = mbrtowc(wc, p, len, st);
   		if (r == (size_t)-1) {
 -			warnx("%s: invalid byte sequence", file);
 -			rval = 1;
 +			if (!warned && dochar) {
 +				warnx("%s: invalid byte sequence", file);
 +				rval = 1;
 +				warned = true;
 +			}

   			/* XXX skip 1 byte */
   			len--;
 @@ -187,6 +190,7 @@
   	int fd, len = 0;

   	linect = wordct = charct = longest = 0;
 +	warned = false;
   	if (file != NULL) {
   		if ((fd = open(file, O_RDONLY, 0)) < 0) {
   			warn("%s", file);
 ```

 > wc works as one would expect on arm64.  This error only shows up on amd64
 >

 Are the counts between the multi-byte locale vs. C locale actually wrong, or
 is it just the spew that worried you?

 > There is no mention of a "-b" switch, say, for binary files; nor is there any explanation on the "man wc" page explaining this.
 >

 There's a `-c' (byte-count) switch which is the default.

 -RVP
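
 The efficiency remark can be sketched as follows: when only line and byte
 counts are wanted, no multibyte decoding is needed at all. This is a sketch
 under assumptions and not the wc.c code; the struct and function names are
 made up, and it relies on the newline byte never occurring inside a
 multibyte sequence, which holds for UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct byte_counts {
	size_t lines;
	size_t bytes;
};

/* Count newlines and bytes in buf without decoding any characters. */
struct byte_counts
count_lines_bytes(const void *buf, size_t len)
{
	struct byte_counts c = { 0, len };
	const unsigned char *p = buf, *end, *nl;

	assert(buf != NULL);
	end = p + len;
	/* memchr() scans raw bytes, so invalid sequences are harmless. */
	while (p < end &&
	    (nl = memchr(p, '\n', (size_t)(end - p))) != NULL) {
		c.lines++;
		p = nl + 1;
	}
	return c;
}
```

 Because nothing is decoded, binary input cannot produce an "invalid byte
 sequence" condition on this path. Word counts are a different matter, since
 deciding what is whitespace can depend on the locale.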

Copyright © 1994-2024 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.