NetBSD Problem Report #58209
From www@netbsd.org Sun Apr 28 14:59:20 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id B3F611A9238
for <gnats-bugs@gnats.NetBSD.org>; Sun, 28 Apr 2024 14:59:20 +0000 (UTC)
Message-Id: <20240428145918.DB0171A923A@mollari.NetBSD.org>
Date: Sun, 28 Apr 2024 14:59:18 +0000 (UTC)
From: campbell+netbsd@mumble.net
Reply-To: campbell+netbsd@mumble.net
To: gnats-bugs@NetBSD.org
Subject: <cctype> lacks compile-time diagnostics for char abuse
X-Send-Pr-Version: www-1.0
>Number: 58209
>Category: lib
>Synopsis: <cctype> lacks compile-time diagnostics for char abuse
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: lib-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Apr 28 15:00:02 +0000 2024
>Originator: Taylor R Campbell
>Release: current, 10, 9, ...
>Organization:
The NetBSD std::isfoundation
>Environment:
>Description:
The <cctype> functions, such as std::isprint/isdigit/isalpha and std::toupper/tolower, have a singularly troublesome specification: Their argument has type int, but they are only defined on inputs that are either (a) the value of the EOF macro (which on NetBSD is -1), or (b) representable by unsigned char. In other words, there are exactly 257 allowed inputs: {-1, 0, 1, 2, 3, ..., 255}. Any other inputs lead to undefined behaviour.
This is because they are meant for use with I/O functions like std::istream::peek:
    int ch;
    while ((ch = std::cin.peek()) != EOF) {
        if (std::isspace(ch))
            ...
    }
Using them to process arbitrary contents of, e.g., std::string requires explicit conversion to unsigned char:
    std::string s = ...;
    for (std::size_t i = 0; i < s.size(); i++) {
        if (std::isspace(static_cast<unsigned char>(s[i])))
            ...
    }
Without this conversion, on machines where char is signed, such as x86, passing a char value outside the 7-bit US-ASCII range either (a) is undefined behaviour, or (b) in the case of the all-bits-set octet, is conflated with EOF.
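The domain rule and the effect of a signed char can be sketched as follows. This is only an illustration: in_ctype_domain, as_signed_char_int and is_space are hypothetical names, not part of any library, and the sketch models a signed-char platform with an explicit signed char (assuming 8-bit, two's-complement chars) so that it behaves the same everywhere.

```cpp
#include <cctype>
#include <climits>
#include <cstdio>

// Hypothetical helper: true if c is one of the allowed <cctype> inputs,
// i.e. EOF or a value representable by unsigned char.
inline bool in_ctype_domain(int c) {
    return c == EOF || (c >= 0 && c <= UCHAR_MAX);
}

// Hypothetical helper: the int that an implicit char -> int conversion
// would produce for this octet on a platform where char is signed.
// Assumption: 8-bit two's-complement chars, so 0xb5 becomes -75.
inline int as_signed_char_int(unsigned char octet) {
    return static_cast<int>(static_cast<signed char>(octet));
}

// Hypothetical wrapper: always go through unsigned char before calling
// the <cctype> function, so every char value lands in the domain.
inline bool is_space(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

Under these assumptions, as_signed_char_int(0xb5) falls outside the domain, and as_signed_char_int(0xff) is -1, which is indistinguishable from EOF on NetBSD.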
Our standard C <ctype.h> definitions are crafted to trigger the -Wchar-subscripts compiler warning, by defining, e.g., isspace(c) as a macro that expands into ((_ctype_tab_ + 1)[c] & bits). But that doesn't work with C++; we can't expand `std::isspace(c)' into `std::((_ctype_tab_ + 1)[c] & bits)'. So C++ code with ctype abuse (like https://github.com/ledger/ledger/issues/2340) gets no compile-time feedback, and bad runtime feedback (https://gnats.netbsd.org/58208) leading to simply confusing behaviour (like https://github.com/ledger/ledger/issues/2338).
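For illustration, the table-lookup shape of the C macro might look roughly like the sketch below. DemoTable, demo_tab, DEMO_SPACE and DEMO_ISSPACE are hypothetical stand-ins, not the actual NetBSD _ctype_tab_ definitions. The point is only that the argument ends up as an array subscript, which is what lets a compiler with -Wchar-subscripts enabled complain when it has type char.

```cpp
#include <cstddef>

enum { DEMO_SPACE = 0x01 };  // hypothetical classification bit

// Hypothetical table: slot 0 is for EOF (-1), slots 1..256 for 0..255.
struct DemoTable {
    unsigned char bits[1 + 256];
    DemoTable() : bits() {  // zero-fill, then mark the "C" locale spaces
        const unsigned char ws[] = {' ', '\t', '\n', '\v', '\f', '\r'};
        for (std::size_t i = 0; i < sizeof ws; i++)
            bits[1 + ws[i]] = DEMO_SPACE;
    }
};

static const DemoTable demo_tab;

// The argument is used directly as a subscript. A plain char argument
// is what -Wchar-subscripts is meant to flag; int and unsigned char
// arguments are not subject to that warning.
#define DEMO_ISSPACE(c) \
    (static_cast<int>((demo_tab.bits + 1)[c] & DEMO_SPACE))
```

A macro like this cannot sit behind a qualified name such as std::isspace, which is the limitation described above.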
>How-To-Repeat:
#include <cctype>
#include <iostream>
#include <string>
int main() {
    std::string s = {static_cast<char>(0xb5), 0};
    std::cout << std::isspace(s[0]) << std::endl;
}
>Fix:
Maybe we can teach <cctype> to overload isspace &c., or find some template magic, that will trigger a warning at compile-time.
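One possible shape for the overload idea, as a rough sketch only: the namespace checked and the trait accepts are hypothetical, and this is not a proposed patch to the real <cctype>. Deleted overloads turn the mistake into a hard error rather than a warning; marking the char overloads [[deprecated]] instead (C++14) might be one way to get a warning, but that is untested here.

```cpp
#include <cctype>
#include <type_traits>
#include <utility>

namespace checked {  // hypothetical namespace standing in for std

// The normal entry point: int arguments, including promoted
// unsigned char values and EOF.
inline int isspace(int c) { return std::isspace(c); }

// Plain char and signed char are exact matches for these overloads,
// so overload resolution picks them and the call fails to compile.
int isspace(char) = delete;
int isspace(signed char) = delete;

}  // namespace checked

// Hypothetical detection trait: true if checked::isspace(T) compiles.
template <class T, class = void>
struct accepts : std::false_type {};

template <class T>
struct accepts<T, decltype(void(checked::isspace(std::declval<T>())))>
    : std::true_type {};

static_assert(accepts<int>::value, "int is accepted");
static_assert(accepts<unsigned char>::value, "unsigned char is accepted");
static_assert(!accepts<char>::value, "plain char is rejected");
static_assert(!accepts<signed char>::value, "signed char is rejected");
```

With this sketch, checked::isspace(s[0]) on a std::string s is rejected at compile time, while checked::isspace(static_cast<unsigned char>(s[0])) still compiles, because unsigned char promotes to int and promotion outranks conversion to char.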
Copyright © 1994-2024
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.