NetBSD Problem Report #58209

From www@netbsd.org  Sun Apr 28 14:59:20 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id B3F611A9238
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 28 Apr 2024 14:59:20 +0000 (UTC)
Message-Id: <20240428145918.DB0171A923A@mollari.NetBSD.org>
Date: Sun, 28 Apr 2024 14:59:18 +0000 (UTC)
From: campbell+netbsd@mumble.net
Reply-To: campbell+netbsd@mumble.net
To: gnats-bugs@NetBSD.org
Subject: <cctype> lacks compile-time diagnostics for char abuse
X-Send-Pr-Version: www-1.0

>Number:         58209
>Category:       lib
>Synopsis:       <cctype> lacks compile-time diagnostics for char abuse
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    lib-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Apr 28 15:00:02 +0000 2024
>Originator:     Taylor R Campbell
>Release:        current, 10, 9, ...
>Organization:
The NetBSD std::isfoundation
>Environment:
>Description:
The <cctype> functions, such as std::isprint/isdigit/isalpha and std::toupper/tolower, have a singularly troublesome specification: their argument has type int, but they are only defined on inputs that are either (a) the value of the EOF macro (which on NetBSD is -1), or (b) representable by unsigned char.  In other words, there are exactly 257 allowed inputs: {-1, 0, 1, 2, 3, ..., 255}.  Any other input leads to undefined behaviour.
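
As a rough illustration of that domain, here is a minimal sketch of a predicate that accepts exactly the allowed inputs.  The name ctype_arg_ok is hypothetical and not part of <cctype>; the count of 257 assumes 8-bit bytes, as on NetBSD.

```cpp
#include <climits>
#include <cstdio>

// Hypothetical helper, not part of <cctype>: true iff c is a value the
// <cctype> functions are defined on, i.e. EOF or anything representable
// by unsigned char.
constexpr bool
ctype_arg_ok(int c)
{
        return c == EOF || (c >= 0 && c <= UCHAR_MAX);
}

static_assert(ctype_arg_ok(EOF), "EOF is allowed");
static_assert(ctype_arg_ok(0) && ctype_arg_ok(UCHAR_MAX),
    "the whole unsigned char range is allowed");
static_assert(!ctype_arg_ok(UCHAR_MAX + 1), "one past unsigned char is not");
static_assert(!ctype_arg_ok(EOF - 1), "negative values other than EOF are not");
```

A check like this only documents the contract; it cannot tell a legitimately passed EOF apart from a sign-extended all-bits-set char.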

This is because they are meant for use with I/O functions like std::istream::peek:

int ch;
while ((ch = std::cin.peek()) != EOF) {
        if (std::isspace(ch))
                ...
}

Using them to process arbitrary contents of, e.g., std::string requires explicit conversion to unsigned char:

std::string s = ...;
for (std::size_t i = 0; i < s.size(); i++) {
        if (std::isspace(static_cast<unsigned char>(s[i])))
                ...
}

Without this conversion, on machines where char is signed, such as x86, a char value outside the 7-bit US-ASCII range is negative, so passing it directly either (a) is undefined behaviour, or (b) in the case of the all-bits-set octet 0xff, is conflated with EOF.
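
The arithmetic can be sketched as follows.  The function names are hypothetical, and signed char stands in for plain char as it behaves on x86; this is an illustration of the value conversions, not of any library internals.

```cpp
// What a <cctype> function receives when handed a signed char directly:
// the value is sign-extended to int.
constexpr int
as_passed_unconverted(signed char c)
{
        return c;
}

// What it receives after the recommended conversion to unsigned char.
constexpr int
as_passed_converted(signed char c)
{
        return static_cast<unsigned char>(c);
}

// The octet 0xb5 is -75 as a signed char: outside the allowed inputs
// unless converted first.
static_assert(as_passed_unconverted(-75) == -75, "sign-extended");
static_assert(as_passed_converted(-75) == 0xb5, "converted to 181");

// The octet 0xff is -1 as a signed char, which is the value of EOF on
// NetBSD; converted, it is 255.
static_assert(as_passed_unconverted(-1) == -1, "looks like EOF");
static_assert(as_passed_converted(-1) == 0xff, "converted to 255");
```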

Our standard C <ctype.h> definitions are crafted to trigger the -Wchar-subscripts compiler warning, by defining, e.g., isspace(c) as a macro that expands into ((_ctype_tab_ + 1)[c] & bits).  But that doesn't work with C++; we can't expand `std::isspace(c)' into `std::((_ctype_tab_ + 1)[c] & bits)'.  So C++ code with ctype abuse (like https://github.com/ledger/ledger/issues/2340) gets no compile-time feedback, and bad runtime feedback (https://gnats.netbsd.org/58208) leading to simply confusing behaviour (like https://github.com/ledger/ledger/issues/2338).
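
For illustration, the macro trick looks roughly like this.  This is a sketch with hypothetical names (my_ctype_tab, MY_SPACE, my_isspace, my_ctype_init) and a table that only knows the C-locale whitespace characters; it is not the real <ctype.h> definition, and it assumes EOF is -1 as on NetBSD.

```cpp
#include <climits>

#define MY_SPACE 0x01

// Slot 0 serves index -1 (EOF); slots 1..UCHAR_MAX+1 serve the
// unsigned char values 0..UCHAR_MAX.
static unsigned char my_ctype_tab[1 + UCHAR_MAX + 1];

// Mark the C-locale whitespace characters in the table.
static void
my_ctype_init()
{
        for (const char *p = " \t\n\v\f\r"; *p != '\0'; p++)
                my_ctype_tab[1 + static_cast<unsigned char>(*p)] |= MY_SPACE;
}

// Because the argument is used as an array subscript, passing a plain
// char is expected to draw a -Wchar-subscripts warning from the compiler.
#define my_isspace(c) ((my_ctype_tab + 1)[c] & MY_SPACE)
```

After my_ctype_init(), my_isspace(ch) with an int ch compiles quietly, while my_isspace(s[i]) on a std::string element should be flagged when -Wchar-subscripts (part of -Wall in GCC) is enabled.  The trick depends on the name being a macro, which is exactly what a qualified call like std::isspace(c) rules out.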
>How-To-Repeat:
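#include <cctype>
#include <iostream>
#include <string>

int main()
{
        // Where char is signed (e.g. x86), s[0] is negative: undefined behaviour below.
        std::string s = {static_cast<char>(0xb5), 0};
        std::cout << std::isspace(s[0]) << std::endl;
}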
>Fix:
Maybe we can teach <cctype> to overload isspace &c., or find some template magic, that will trigger a warning at compile-time.
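
One possible shape of the overload idea, as a sketch only: a [[deprecated]] overload taking char, so that overload resolution picks it for plain char arguments and the compiler warns.  The namespace checked and the message text are hypothetical; whether something like this can be done for the std:: names themselves without conflicting with the standard's declarations is an open question.

```cpp
#include <cctype>

namespace checked {     // hypothetical namespace, not an existing header

// The ordinary entry point: int arguments (including EOF and values
// already converted to unsigned char) land here without any diagnostic.
inline int
isspace(int c)
{
        return std::isspace(c);
}

// Exact match for plain char, so it is preferred over the int overload;
// any call that selects it triggers a deprecation warning.
[[deprecated("convert to unsigned char before calling isspace")]]
inline int
isspace(char c)
{
        return std::isspace(static_cast<unsigned char>(c));
}

}       // namespace checked
```

With this, checked::isspace(static_cast<unsigned char>(s[i])) is silent, and checked::isspace(s[i]) should produce a compile-time warning.  Here the char overload also performs the conversion itself, which is a design choice of this sketch rather than something the report asks for.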
