NetBSD Problem Report #56442

From gson@gson.org  Wed Oct  6 06:44:38 2021
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 548F11A921F
	for <gnats-bugs@gnats.NetBSD.org>; Wed,  6 Oct 2021 06:44:38 +0000 (UTC)
Message-Id: <20211006064431.11E4B25417E@guava.gson.org>
Date: Wed,  6 Oct 2021 09:44:31 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Tests hang under NVMM with options DEBUG
X-Send-Pr-Version: 3.95

>Number:         56442
>Category:       kern
>Synopsis:       Tests hang under NVMM with options DEBUG
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Oct 06 06:45:00 +0000 2021
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2019.12.01.13.20.42
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

If I enable "options DEBUG" in the NetBSD-current/amd64 GENERIC kernel
configuration, build a release, boot it in "qemu -accel nvmm", and run
the kernel/t_trapsignal test, the test hangs during the fpe_ignore
test case and is not killed by the ATF timeout mechanism, nor can it
be killed manually by hitting control-c or control-\.  Console
keystrokes are still echoed and I am able to enter ddb by sending a
serial break:

  # cd /usr/tests/kernel
  # atf-run t_trapsignal | atf-report
  Tests root: /usr/tests/kernel

  t_trapsignal (1/1): 20 test cases
      bus_handle: [0.213148s] Passed.
      bus_handle_recurse: [0.209931s] Passed.
      bus_ignore: [0.199009s] Passed.
      bus_mask: [0.200927s] Passed.
      bus_simple: [0.200246s] Passed.
      fpe_handle: [0.219451s] Passed.
      fpe_handle_recurse: [598.294510s] Failed: Test case timed out after 300 seconds
      fpe_ignore:^Z^C^C
  ^C^C^C^C
  ^\^\
  ^C^C[ 115781.8123080] fatal breakpoint trap in supervisor mode
  [ 115781.8123080] trap type 1 code 0 rip 0xffffffff8021dd8d cs 0x8 rflags 0x202 cr2 0x72bf48be8fe0 ilevel 0x8 rsp 0xffffcb002a623e88
  [ 115781.8123080] curlwp 0xffffeb5ed06676c0 pid 881.1 lowest kstack 0xffffcb002a6202c0
  Stopped in pid 881.1 (h_segv) at        netbsd:breakpoint+0x5:  leave
  db{0}> bt
  breakpoint() at netbsd:breakpoint+0x5
  comintr() at netbsd:comintr+0x8e5
  db{0}>

A bisection identified the following commit as the point where the
problem started:

  2019.12.01.13.20.42 ad src/sys/kern/kern_runq.c 1.51
  2019.12.01.13.20.42 ad src/sys/kern/sched_4bsd.c 1.39
  2019.12.01.13.20.42 ad src/sys/kern/sched_m2.c 1.35

The problem does not occur:
 - without "options DEBUG"
 - in qemu without "-accel nvmm"
 - in qemu with "-accel nvmm -smp 2"
 - on real amd64 multiprocessor hardware

I also tried to test on real amd64 uniprocessor hardware, but was
thwarted by the unrelated problem reported in PR 51531.

>How-To-Repeat:

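Build a NetBSD-current/amd64 release with "options DEBUG" enabled in
the GENERIC kernel configuration, boot it in "qemu -accel nvmm", and
run the kernel/t_trapsignal test as shown in the description above.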
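
The steps from the description can be sketched as a small dry-run
script. This is only an illustration, not the exact procedure used for
the report: it prints the commands instead of running them. The source
path (SRC), the disk image name (IMG), the build.sh flags, and the qemu
options other than "-accel nvmm" (-smp 1, -m, -nographic, -drive) are
assumptions, and installing the built release onto the image is left
out.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps. Nothing is executed; the
# commands are only printed. SRC and IMG are hypothetical placeholders.
SRC=${SRC:-/usr/src}
IMG=${IMG:-netbsd-debug.img}

# Step 1: line to add to sys/arch/amd64/conf/GENERIC.
config_line() {
    printf 'options DEBUG\n'
}

# Step 2: build a release from the modified tree (assumed build.sh flags).
build_cmd() {
    printf 'cd %s && ./build.sh -U -m amd64 release\n' "$SRC"
}

# Step 3: boot under NVMM with one virtual CPU, serial console on stdio.
# The report states the hang was not seen with "-smp 2"; "-smp 1" is an
# assumption that makes the single-CPU default explicit.
qemu_cmd() {
    printf 'qemu-system-x86_64 -accel nvmm -smp 1 -m 1024 -nographic -drive file=%s,format=raw\n' "$1"
}

# Step 4: run the test inside the guest.
guest_cmd() {
    printf 'cd /usr/tests/kernel && atf-run t_trapsignal | atf-report\n'
}

config_line
build_cmd
qemu_cmd "$IMG"
guest_cmd
```

Running the script with SRC and IMG set in the environment prints the
four steps in order; the qemu line can then be adapted to the local
image format and memory size.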

>Fix:
