NetBSD Problem Report #51623
From paul@whooppee.com Sat Nov 12 09:22:58 2016
Return-Path: <paul@whooppee.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id B30A87A279
for <gnats-bugs@gnats.NetBSD.org>; Sat, 12 Nov 2016 09:22:58 +0000 (UTC)
Message-Id: <20161112092255.A39E316E60@speedy.whooppee.com>
Date: Sat, 12 Nov 2016 17:22:55 +0800 (PHT)
From: paul@whooppee.com
Reply-To: paul@whooppee.com
To: gnats-bugs@NetBSD.org
Subject: Non-0 CPUs don't properly "start" under qemu
X-Send-Pr-Version: 3.95
>Number: 51623
>Category: kern
>Synopsis: running qemu-x86_64 with -smp 4 - the additional CPUs don't start
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Nov 12 09:25:00 +0000 2016
>Closed-Date: Thu Jan 11 02:16:30 +0000 2018
>Last-Modified: Thu Jan 11 02:16:30 +0000 2018
>Originator: Paul Goyette
>Release: NetBSD 7.99.42
>Organization:
+------------------+--------------------------+------------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+
>Environment:
System: NetBSD speedy.whooppee.com 7.99.42 NetBSD 7.99.42 (TEST 2016-11-11 01:03:42) #0: Fri Nov 11 10:12:17 PHT 2016 paul@speedy.whooppee.com:/build/netbsd-local/obj/amd64/sys/arch/amd64/compile/TEST amd64
Architecture: x86_64
Machine: amd64
>Description:
Running qemu 2.7.0 from pkgsrc...
When attempting to emulate an SMP environment using "-smp 4", a NetBSD
kernel finds the additional CPUs but fails to completely "start" them.
# qemu-system-x86_64 -nographic -m 1024 -smp 4 \
-drive file=./work/wd0.img,index=0,media=disk,format=raw,snapshot=on
### boot menu
Choose an option; RETURN for default; SPACE to stop countdown.
Option 1 will be chosen in 0 seconds.
type "?" or "help" for help.
> boot -x
### kernel boots
ACPI: 1 ACPI AML tables successfully acquired and loaded
efi: missing or invalid systbl
ioapic0 at mainbus0 apid 0
cpu0 at mainbus0 apid 0
cpu0: 8 page colors
cpu0: calibrating local timer
cpu0: apic clock running at 1000 MHz
cpu0: QEMU Virtual CPU version 2.5+, id 0x663
cpu0: PAT enabled
cpu1 at mainbus0 apid 1
cpu1: 2 page colors
cpu1: QEMU Virtual CPU version 2.5+, id 0x663
cpu1: PAT enabled
cpu2 at mainbus0 apid 2
cpu2: 2 page colors
cpu2: QEMU Virtual CPU version 2.5+, id 0x663
cpu2: PAT enabled
cpu3 at mainbus0 apid 3
cpu3: 2 page colors
cpu3: QEMU Virtual CPU version 2.5+, id 0x663
cpu3: PAT enabled
### much other config stuff snipped
acpicpu0 at cpu0: ACPI CPU
acpicpu0: id 0, lapic id 0, cap 0x0000, flags 0x00000c21
vmt0 at cpu0: Unknown
vmware: open failed, eax=564d5868, ecx=0000001e, edx=00005658
vmt0: failed to open backdoor RPC channel (TCLO protocol)
acpicpu1 at cpu1: ACPI CPU
acpicpu1: id 1, lapic id 1, cap 0x0000, flags 0x00000c21
acpicpu2 at cpu2: ACPI CPU
acpicpu2: id 2, lapic id 2, cap 0x0000, flags 0x00000c21
acpicpu3 at cpu3: ACPI CPU
acpicpu3: id 3, lapic id 3, cap 0x0000, flags 0x00000c21
Initializing SSP: 8c28adf057c2aaf6 39b9028b76aa0870 d20cf25d97217579 1ae37818d4f5c757 8a35495040477d50 f4488137e57414de 94a94f41b14c7db9 5c761e3d70b78301
cpu1: failed to start
cpu2: failed to start
cpu3: failed to start
### system hangs here
Some on-the-fly analysis on IRC suggests that this is likely a qemu
bug, with the secondary CPU(s) stalled in cpu_init().
The same behavior is seen with NetBSD kernels from today's -current,
today's netbsd-7 branch, and today's netbsd-6-0 branch.
>How-To-Repeat:
See above
>Fix:
Unknown
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: pkg-manager->kern-bug-people
Responsible-Changed-By: maya@NetBSD.org
Responsible-Changed-When: Sat, 17 Dec 2016 15:50:20 +0000
Responsible-Changed-Why:
move to category where this bug has a chance to be resolved
From: coypu@SDF.ORG
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/51623
Date: Mon, 20 Mar 2017 16:45:42 +0000
The problem is in x86_patch().
From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: pkg/51623: Non-0 CPUs don't properly "start" under qemu
Date: Tue, 4 Apr 2017 11:17:35 +0800 (+08)
I wonder if the following link is related to this issue? The mention of
disabling certain CPU features to prevent x86_patch() from installing
some patches could be significant...
https://www.mail-archive.com/qemu-devel@nongnu.org/msg438799.html
Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Tue, 04 Apr 2017 05:29:20 +0000
Responsible-Changed-Why:
I've stumbled upon this and it complicates my testing; I'll look at
what can be done.
Responsible-Changed-From-To: jdolecek->kern-bug-people
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Tue, 04 Apr 2017 17:10:03 +0000
Responsible-Changed-Why:
On my machine (running Mac OS X), using the -cx8,-sse2 flags doesn't help.
Maybe they only help when using kvm, so I can't really even test a workaround.
From: "David H. Gutteridge" <dhgutteridge@sympatico.ca>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/51623 (Non-0 CPUs don't properly "start" under qemu)
Date: Fri, 07 Apr 2017 17:39:55 -0400
In my Linux host environment (Fedora 25, running QEMU 2.7.0 and 2.7.1
over the past five months), I haven't encountered this problem with
various amd64 kernels (7.99.40, 7.99.63, 7.99.65, 7.99.67). An SMP
configuration of two or four CPUs works fine. I mention this as an
additional data point.
Having said that, I've been encountering significant stability
problems of late with NetBSD kernels in this same environment, where
seemingly at random they'll max out all available CPUs and I have to
power cycle the VM, as it becomes completely unresponsive. There's no
particular trigger in terms of user input. The VM can basically be
idling when this spike suddenly occurs, and there's no particular
association to uptime or load I can determine. (E.g., I can run a
complete series of ATF tests, wait an hour, begin typing a command,
and suddenly a spike will occur and max out all the CPUs.)
Dave
From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/51623: running qemu-x86_64 with -smp 4 - the additional
CPUs don't start
Date: Thu, 25 May 2017 12:04:13 +0000
Switching x86_pause to use 'nop' instead of 'pause' gets me booting.
As a quicker workaround, choosing another CPU type to emulate works too,
for example -cpu phenom worked, but -cpu Broadwell didn't.
From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: pkg/51623: Non-0 CPUs don't properly "start" under qemu
Date: Fri, 26 May 2017 08:45:33 +0800 (+08)
I have filed a bug report upstream:
https://bugs.launchpad.net/qemu/+bug/1693649
From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: pkg/51623: Non-0 CPUs don't properly "start" under qemu
Date: Fri, 26 May 2017 11:56:40 +0800 (+08)
Also note that the following upstream bug was filed. It is because of
this bug (missing MONITOR) that the previous bug (funky pause
instruction) occurs!
https://bugs.launchpad.net/qemu/+bug/1693667
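
As a rough illustration of what "missing MONITOR" means on the guest side:
MONITOR/MWAIT support is advertised in CPUID leaf 01H, ECX bit 3. The
helper below only decodes a register value handed to it. Its names are
hypothetical, and it is not how NetBSD or qemu implement the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 01H, ECX bit 3: MONITOR/MWAIT supported. */
#define CPUID_01_ECX_MONITOR (UINT32_C(1) << 3)

/*
 * Hypothetical helper: 'ecx' is the ECX value returned by CPUID leaf 01H.
 * Returns true when the (virtual) CPU advertises MONITOR/MWAIT.
 */
static bool cpu_has_monitor(uint32_t ecx)
{
    return (ecx & CPUID_01_ECX_MONITOR) != 0;
}
```

A guest that sees this bit clear cannot use MONITOR/MWAIT and has to fall
back to some other way of waiting.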
From: "Maya Rashish" <maya@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/51623 CVS commit: src/sys/arch/x86/x86
Date: Wed, 31 May 2017 00:19:17 +0000
Module Name: src
Committed By: maya
Date: Wed May 31 00:19:17 UTC 2017
Modified Files:
src/sys/arch/x86/x86: cpu.c
Log Message:
Do not pause many times between testing if the CPU can go.
This only impacts QEMU as QEMU's implementation of pause is
significantly slower than its implementation of nop.
PR kern/51623: running qemu-x86_64 with -smp 4 - the additional
CPUs don't start.
To generate a diff of this commit:
cvs rdiff -u -r1.125 -r1.126 src/sys/arch/x86/x86/cpu.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: pgoyette@NetBSD.org
State-Changed-When: Thu, 11 Jan 2018 02:16:30 +0000
State-Changed-Why:
This is working now. Root-cause is an upstream bug in qemu, but they
haven't done anything with my bug report.
>Unformatted:
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.