NetBSD Problem Report #58014
From www@netbsd.org Sat Mar 9 07:46:05 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 012A71A923F
for <gnats-bugs@gnats.NetBSD.org>; Sat, 9 Mar 2024 07:46:05 +0000 (UTC)
Message-Id: <20240309074534.076391A9242@mollari.NetBSD.org>
Date: Sat, 9 Mar 2024 07:45:33 +0000 (UTC)
From: michael.cheponis@gmail.com
Reply-To: michael.cheponis@gmail.com
To: gnats-bugs@NetBSD.org
Subject: wc no longer works with binary files
X-Send-Pr-Version: www-1.0
>Number: 58014
>Category: bin
>Synopsis: wc no longer works with binary files
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Mar 09 07:50:00 +0000 2024
>Last-Modified: Sun Mar 10 11:35:01 +0000 2024
>Originator: Mike Cheponis
>Release: 10.0_RC5
>Organization:
self
>Environment:
NetBSD SS.Culver.Net 10.0_RC5 NetBSD 10.0_RC5 (GENERIC) #0: Tue Feb 27 05:27:39 UTC 2024 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
when 'wc' is given input from a binary file, it now gives the error:
wc: hello: invalid byte sequence
(Assuming 'hello' is a binary file)
On
NetBSD arm64 10.99.7 NetBSD 10.99.7 (GENERIC64) #0: Fri Aug 11 08:15:30 UTC 2023 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/evbarm/compile/GENERIC64 evbarm
wc works as one would expect on arm64. This error only shows up on amd64
There is no mention of a "-b" switch, say, for binary files; nor is there any explanation on the "man wc" page explaining this.
>How-To-Repeat:
wc <any binary file>
>Fix:
>Audit-Trail:
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sat, 09 Mar 2024 16:50:02 +0700
Date: Sat, 9 Mar 2024 07:50:00 +0000 (UTC)
From: michael.cheponis@gmail.com
Message-ID: <20240309075000.E456A1A9241@mollari.NetBSD.org>
| when 'wc' is given input from a binary file, it now gives the error:
|
| wc: hello: invalid byte sequence
| (Assuming 'hello' is a binary file)
wc without flags needs to count characters. What is a character depends
upon your locale settings. Do
LC_ALL=C wc hello
(or prefix that with "env" if you're a csh user) and it will work.
| wc works as one would expect on arm64. This error only shows up on amd64
More likely your default locale (LANG, LC_CTYPE or LC_ALL) is different in
the two cases.
I am not sure that it makes sense to attempt to count characters, lines, or
words in a binary file - what would the answers mean? If you were looking
to get the size of the file, wc is not the right tool.
I see no bug here, nor any real need to explain in the manual page that a
"word count" program isn't intended to behave sanely on files that contain
no words or characters.
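For the file-size use case mentioned above, the size can be read from the file's metadata instead of its contents. A minimal sketch, assuming a POSIX system; the helper name file_size() is hypothetical and is not part of wc(1):

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of wc(1): return the size in bytes of
 * the file at `path', or -1 if stat(2) fails (errno is left set).
 * The contents are never read or decoded, so the locale variables
 * (LANG, LC_CTYPE, LC_ALL) play no part in the result.
 */
off_t
file_size(const char *path)
{
	struct stat sb;

	if (stat(path, &sb) == -1)
		return -1;
	return sb.st_size;
}
```

A caller could print the result with something like printf("%jd\n", (intmax_t)file_size("hello")), where "hello" is the binary from the report.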
From: Michael Cheponis <michael.cheponis@gmail.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sat, 9 Mar 2024 12:05:40 -0800
It's indeed the case that on my arm64 test of 'wc' that 'worked' on binary
files, the environment variable "LC_ALL=C" was set.
I think the man page for wc needs updating, at least, to explain its
interaction with that environment variable. There *is* a discussion on
that man page about needing to use the POSIX iswspace() function, but when I
followed that page, there was no detail about the LC_ALL environment
variable.
Also, historically, wc was something like this:
#include <stdio.h>
int main(int argc, char *argv[]) {
    int character, lineCount = 0, wordCount = 0, byteCount = 0, inWord = 0;
    /* Read stdin one byte at a time; no locale or multibyte decoding. */
    while ((character = getchar()) != EOF) {
        ++byteCount;
        if (character == '\n')
            ++lineCount;
        /* A word is a run of bytes other than space, newline, or tab. */
        if (character == ' ' || character == '\n' || character == '\t')
            inWord = 0;
        else if (inWord == 0) {
            inWord = 1;
            ++wordCount;
        }
    }
    printf("%d %d %d\n", lineCount, wordCount, byteCount);
    return 0;
}
That is, because Unix 'files' are simply strings of bytes, it may be
meaningless to count 'words' and 'lines' -- but yes, the character count
(file size) is useful.
Generally, I use this when I want to know source size, and the program's
executable is in the source directory as an artifact - I do "wc *"
Anyway, I'm asking for a documentation change.
Thank you,
Mike
From: Michael Cheponis <michael.cheponis@gmail.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sat, 9 Mar 2024 12:16:32 -0800
Crap. I see what the problem is: I want my "ll" alias to give me commas
in the file length reported.
This requires setting env vars like this:
LANG=en_US.UTF-8
LC_ALL=""
LC_NUMERIC=en_US.UTF-8 locale -k thousands_sep
Specifically, note that LC_ALL must be set to ""
producing output like this
-rwxr-xr-x  1 mac  users  17,096 Mar  8 23:31 n*
-rw-r--r--  1 mac  users     560 Mar  8 23:31 n.c
-rw-r--r--  1 mac  users     155 Mar  8 23:01 n.c~
Now, if I set LC_ALL=C (to make 'wc' count ok on binary files), then I
get from my "ll" :
-rwxr-xr-x  1 mac  users   17096 Mar  8 23:31 n*
-rw-r--r--  1 mac  users     560 Mar  8 23:31 n.c
-rw-r--r--  1 mac  users     155 Mar  8 23:01 n.c~
Catch 22 -- I have to use an alias for wc that changes the local
environment variable when running wc
alias wc="LC_ALL=C wc"
Again, I'm not sure this is sufficiently documented. I'd be happy to make
suggested changes to the man page(s).
Thanks again,
Mike
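The interaction described above follows from the order in which the locale variables are consulted. A rough sketch of that lookup as POSIX describes it (LC_ALL first, then the per-category variable, then LANG); effective_locale() is a hypothetical name used only for illustration, and the final "C" fallback is an assumption standing in for the implementation-defined default:

```c
#include <stdlib.h>

/*
 * Sketch of the POSIX lookup order for one locale category:
 * LC_ALL, then the category's own variable, then LANG.
 * A variable that is unset or set to the empty string is skipped,
 * which is why LC_ALL="" lets LC_NUMERIC take effect.
 * Hypothetical function; real programs call setlocale(category, "").
 */
const char *
effective_locale(const char *category_var)
{
	const char *names[3];
	const char *v;
	int i;

	names[0] = "LC_ALL";
	names[1] = category_var;	/* e.g. "LC_CTYPE" or "LC_NUMERIC" */
	names[2] = "LANG";

	for (i = 0; i < 3; i++) {
		v = getenv(names[i]);
		if (v != NULL && v[0] != '\0')
			return v;
	}
	return "C";	/* assumed default when nothing is set */
}
```

Under this model, LC_ALL=C wins for every category at once, so it changes character decoding for wc and also drops the thousands separator taken from LC_NUMERIC; prefixing it to a single command, as in the alias above, limits the override to that command.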
From: RVP <rvp@SDF.ORG>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sun, 10 Mar 2024 11:34:31 +0000 (UTC)
On Sat, 9 Mar 2024, michael.cheponis@gmail.com wrote:
>> Description:
> when 'wc' is given input from a binary file, it now gives the error:
>
> wc: hello: invalid byte sequence
>
> (Assuming 'hello' is a binary file)
>
It's pretty inefficient to use mbrtowc() when `-m' wasn't supplied (no
matter what the locale), but you can at least stop wc from spamming and
confusing users, so:
```
--- wc.c.orig 2024-01-14 17:39:19.000000000 +0000
+++ wc.c 2024-03-10 11:06:10.228327632 +0000
@@ -73,7 +73,7 @@
 #endif
 
 static wc_count_t tlinect, twordct, tcharct, tlongest;
-static bool doline, doword, dobyte, dochar, dolongest;
+static bool doline, doword, dobyte, dochar, dolongest, warned;
 static int rval = 0;
 
 static void cnt(const char *);
@@ -148,8 +148,11 @@
 	do {
 		r = mbrtowc(wc, p, len, st);
 		if (r == (size_t)-1) {
-			warnx("%s: invalid byte sequence", file);
-			rval = 1;
+			if (!warned && dochar) {
+				warnx("%s: invalid byte sequence", file);
+				rval = 1;
+				warned = true;
+			}
 
 			/* XXX skip 1 byte */
 			len--;
@@ -187,6 +190,7 @@
 	int fd, len = 0;
 
 	linect = wordct = charct = longest = 0;
+	warned = false;
 	if (file != NULL) {
 		if ((fd = open(file, O_RDONLY, 0)) < 0) {
 			warn("%s", file);
```
> wc works as one would expect on arm64. This error only shows up on amd64
>
Are the counts between the multi-byte locale vs. C locale actually wrong, or
is it just the spew that worried you?
> There is no mention of a "-b" switch, say, for binary files; nor is there any explanation on the "man wc" page explaining this.
>
There's a `-c' (byte-count) switch, which is part of the default output.
-RVP
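The idea in the patch above, warning once per file instead of once per bad byte, can be sketched on its own. This is an illustrative decoding loop around mbrtowc(), not the code in wc.c; count_chars() and its `invalid' flag are hypothetical names, and counting each skipped byte as one character is a choice made for this sketch:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters in buf[0..len) under the current LC_CTYPE locale.
 * On an invalid sequence, skip one byte, reset the conversion state, and
 * set *invalid so the caller can warn once per file rather than per byte.
 * Hypothetical helper for illustration; it is not the code in wc.c.
 */
size_t
count_chars(const char *buf, size_t len, bool *invalid)
{
	mbstate_t st;
	wchar_t wc;
	size_t n = 0, r;

	memset(&st, 0, sizeof(st));
	*invalid = false;
	while (len > 0) {
		r = mbrtowc(&wc, buf, len, &st);
		if (r == (size_t)-1) {
			/* Undecodable byte: record it and resynchronise. */
			*invalid = true;
			memset(&st, 0, sizeof(st));
			r = 1;
		} else if (r == (size_t)-2) {
			/* Truncated sequence at the end of the buffer. */
			*invalid = true;
			break;
		} else if (r == 0) {
			/* An embedded NUL decodes to L'\0' and uses one byte. */
			r = 1;
		}
		buf += r;
		len -= r;
		n++;
	}
	return n;
}
```

A caller would normally run setlocale(LC_CTYPE, "") first and, after processing a file, emit a single "invalid byte sequence" warning if the flag was set. For binary input the count returned may differ between a multibyte locale and the C locale, which is the question raised below the patch.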
Copyright © 1994-2024
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.