NetBSD Problem Report #57657

From www@netbsd.org  Sat Oct 14 18:55:02 2023
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id B16D81A923A
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 14 Oct 2023 18:55:02 +0000 (UTC)
Message-Id: <20231014185501.168A51A923C@mollari.NetBSD.org>
Date: Sat, 14 Oct 2023 18:55:01 +0000 (UTC)
From: logix@foobar.franken.de
Reply-To: logix@foobar.franken.de
To: gnats-bugs@NetBSD.org
Subject: NetBSD crashes if the number of CPUs is not of the form N*[1..8]
X-Send-Pr-Version: www-1.0

>Number:         57657
>Category:       kern
>Synopsis:       NetBSD crashes if the number of CPUs is not of the form N*[1..8]
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Oct 14 19:00:01 +0000 2023
>Last-Modified:  Sun Oct 15 10:35:01 +0000 2023
>Originator:     Harold Gutch
>Release:        NetBSD current
>Organization:
>Environment:
NetBSD  10.99.10 NetBSD 10.99.10 (GENERIC) #0: Thu Oct 12 23:51:05 UTC 2023  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
Booting NetBSD in KVM (on a RHEL 9.2 host) with 1..40 virtual CPUs succeeds if and only if the number of CPUs is of the form N*[1..8], i.e., only for 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 32, 40 CPUs.  For any other number the following panic happens:

Stopped in pid 0.0 (system) at  netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x173
kern_assert() at netbsd:kern_assert+0x4b
uvm_pagealloc_pgb() at netbsd:uvm_pagealloc_pgb+0x2e
uvm_pagealloc_pgfl() at netbsd:uvm_pagealloc_pgfl+0x63
uvm_pagealloc_strat() at netbsd:uvm_pagealloc_strat+0x130
uvm_km_alloc() at netbsd:uvm_km_alloc+0x17a
cpu_uarea_alloc() at netbsd:cpu_uarea_alloc+0x26
uarea_system_poolpage_alloc() at netbsd:uarea_system_poolpage_alloc+0x16
pool_grow() at netbsd:pool_grow+0x34c
pool_get() at netbsd:pool_get+0xa8
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x263
kthread_create() at netbsd:kthread_create+0x4d
config_create_interruptthreads() at netbsd:config_create_interruptthreads+0x33
main() at netbsd:main+0x3be

oster@ could reproduce this and mentioned that for him Fedora 37 does *not* crash with 9 CPUs, so it does not seem to be a bug in KVM.

The systematic test of all CPU counts from 1 to 40 was run with an Ivy Bridge host CPU. For every other emulated CPU model I tested, NetBSD booted with 8 CPUs but crashed in exactly the same way with 9 CPUs.

# objdump --disassemble=uvm_pagealloc_pgb /netbsd
ffffffff80d8591b <uvm_pagealloc_pgb>:
[...]
ffffffff80d85949:       49 83 3a 00             cmpq   $0x0,(%r10)
ffffffff80d8594d:       4c 89 55 b8             mov    %r10,-0x48(%rbp)
ffffffff80d85951:       0f 84 80 01 00 00       je     ffffffff80d85ad7 <uvm_pagealloc_pgb+0x1bc>

This seems to be line 1017 in uvm_page.c 1.254.
>How-To-Repeat:
Boot NetBSD on a VM where the number of CPUs is not of the form N*[1..8].
>Fix:

>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57657: NetBSD crashes if the number of CPUs is not of the form N*[1..8]
Date: Sun, 15 Oct 2023 10:05:32 -0000 (UTC)

 logix@foobar.franken.de writes:

 >Booting NetBSD in KVM (on a RHEL 9.2 host) with 1..40 virtual CPUs succeeds if and only if the number of CPUs is of the form N*[1..8], i.e., only for 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 32, 40 CPUs.  For any other number the following panic happens:

 >Stopped in pid 0.0 (system) at  netbsd:breakpoint+0x5:  leave
 >breakpoint() at netbsd:breakpoint+0x5
 >vpanic() at netbsd:vpanic+0x173
 >kern_assert() at netbsd:kern_assert+0x4b
 >uvm_pagealloc_pgb() at netbsd:uvm_pagealloc_pgb+0x2e
 >uvm_pagealloc_pgfl() at netbsd:uvm_pagealloc_pgfl+0x63


 That's a bug in uvm_page_rebucket(). This patch helps (the
 line numbers are a bit off):

 Index: uvm_page.c
 ===================================================================
 RCS file: /cvsroot/src/sys/uvm/uvm_page.c,v
 retrieving revision 1.254
 diff -p -u -r1.254 uvm_page.c
 --- uvm_page.c	23 Sep 2023 18:20:20 -0000	1.254
 +++ uvm_page.c	14 Oct 2023 21:37:39 -0000
 @@ -868,7 +883,7 @@ uvm_page_recolor(int newncolors)
  void
  uvm_page_rebucket(void)
  {
 -	u_int min_numa, max_numa, npackage, shift;
 +	u_int min_numa, max_numa, npackage, div;
  	struct cpu_info *ci, *ci2, *ci3;
  	CPU_INFO_ITERATOR cii;

 @@ -906,12 +921,11 @@ uvm_page_rebucket(void)

  	/*
  	 * Figure out how to arrange the packages & buckets, and the total
 -	 * number of buckets we need.  XXX 2 may not be the best factor.
 +	 * number of buckets we need.
  	 */
 -	for (shift = 0; npackage > PGFL_MAX_BUCKETS; shift++) {
 -		npackage >>= 1;
 -	}
 - 	uvm_page_redim(uvmexp.ncolors, npackage);
 +
 +	div = howmany(npackage, PGFL_MAX_BUCKETS);
 + 	uvm_page_redim(uvmexp.ncolors, howmany(npackage, div));

   	/*
   	 * Now tell each CPU which bucket to use.  In the outer loop, scroll
 @@ -927,7 +941,7 @@ uvm_page_rebucket(void)
  		 */
  		ci3 = ci2;
  		do {
 -			ci3->ci_data.cpu_uvm->pgflbucket = npackage >> shift;
 +			ci3->ci_data.cpu_uvm->pgflbucket = npackage / div;
  			ci3 = ci3->ci_sibling[CPUREL_PACKAGE];
  		} while (ci3 != ci2);
  		npackage++;
 @@ -935,7 +949,7 @@ uvm_page_rebucket(void)
  	} while (ci2 != ci->ci_sibling[CPUREL_PACKAGE1ST]);

  	aprint_debug("UVM: using package allocation scheme, "
 -	    "%d package(s) per bucket\n", 1 << shift);
 +	    "%d package(s) per bucket\n", div);
  }

  /*
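
The difference between the old shift-based computation and the patched division-based one can be sketched in userland C. This is only an illustration under stated assumptions, not the kernel code: it assumes the bucket limit is 8 (as the "[1..8]" pattern in the report suggests), that packages are numbered from 0 upward (as the `npackage++` in the loop suggests), that every virtual CPU shows up as its own package, and that the panic corresponds to a bucket index at or beyond the bucket count. `MAX_BUCKETS`, `HOWMANY`, `old_scheme_in_range` and `new_scheme_in_range` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bucket limit; the report's "[1..8]" pattern suggests 8. */
#define MAX_BUCKETS 8u

/* Local stand-in for the kernel's howmany(): division rounding up. */
#define HOWMANY(x, y) (((x) + ((y) - 1)) / (y))

/*
 * Old scheme: halve the package count until it fits, then map
 * package i to bucket (i >> shift).  Returns true if the highest
 * index handed out is still below the bucket count.
 */
bool
old_scheme_in_range(unsigned npackage)
{
	unsigned nbuckets = npackage, shift;

	assert(npackage > 0);
	for (shift = 0; nbuckets > MAX_BUCKETS; shift++)
		nbuckets >>= 1;		/* rounds down, may drop a bit */
	return ((npackage - 1) >> shift) < nbuckets;
}

/*
 * Patched scheme: pick a divisor first, then round the bucket
 * count up so the last package always has a bucket.
 */
bool
new_scheme_in_range(unsigned npackage)
{
	unsigned div, nbuckets;

	assert(npackage > 0);
	div = HOWMANY(npackage, MAX_BUCKETS);
	nbuckets = HOWMANY(npackage, div);
	return ((npackage - 1) / div) < nbuckets;
}
```

With 9 packages the old scheme ends up with 4 buckets and shift 1, so the last package (index 8) maps to bucket 4, one past the end. The patched scheme picks div 2 and 5 buckets, so the same package maps to bucket 4, which is in range.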

From: Harold Gutch <logix@foobar.franken.de>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/57657: NetBSD crashes if the number of CPUs is not of the form N*[1..8]
Date: Sun, 15 Oct 2023 12:29:48 +0200

 Hi,

 thanks. From what I can see, your patch does effectively the same
 thing, just with slightly different rounding and bucket distribution
 when repeatedly dividing by 2 would require rounding before reaching
 a number in [1, 8] (i.e., the correct condition is not N*[1..8], but
 2^N*[1..8]).

 For a "bad" number of CPUs it now behaves better for me.  I didn't
 test all numbers, but for the few that I tried I successfully booted.

 So:  looks good, thanks!


   Harold
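
The corrected condition, 2^N*[1..8], can be checked directly. The following is a minimal sketch, assuming the condition means "n = m * 2^k with 1 <= m <= 8"; the function name `is_pow2_multiple_of_1_to_8` is hypothetical. Stripping all factors of two from n leaves its odd part, and n has the required form exactly when that odd part is 1, 3, 5 or 7.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True if n can be written as m * 2^k with 1 <= m <= 8 and k >= 0.
 * Equivalent test: remove every factor of two; the odd remainder
 * must not exceed 7.
 */
bool
is_pow2_multiple_of_1_to_8(unsigned n)
{
	assert(n > 0);
	while ((n & 1u) == 0)
		n >>= 1;
	return n <= 7;
}
```

For n from 1 to 40 this accepts exactly the CPU counts listed in the report as booting successfully (1 to 8, 10, 12, 14, 16, 20, 24, 28, 32, 40) and rejects the rest, for example 9, 18 and 36.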
