NetBSD Problem Report #51623

From paul@whooppee.com  Sat Nov 12 09:22:58 2016
Return-Path: <paul@whooppee.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id B30A87A279
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 12 Nov 2016 09:22:58 +0000 (UTC)
Message-Id: <20161112092255.A39E316E60@speedy.whooppee.com>
Date: Sat, 12 Nov 2016 17:22:55 +0800 (PHT)
From: paul@whooppee.com
Reply-To: paul@whooppee.com
To: gnats-bugs@NetBSD.org
Subject: Non-0 CPUs don't properly "start" under qemu
X-Send-Pr-Version: 3.95

>Number:         51623
>Category:       kern
>Synopsis:       running qemu-x86_64 with -smp 4 - the additional CPUs don't start
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Nov 12 09:25:00 +0000 2016
>Closed-Date:    Thu Jan 11 02:16:30 +0000 2018
>Last-Modified:  Thu Jan 11 02:16:30 +0000 2018
>Originator:     Paul Goyette
>Release:        NetBSD 7.99.42
>Organization:
+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+
>Environment:


System: NetBSD speedy.whooppee.com 7.99.42 NetBSD 7.99.42 (TEST 2016-11-11 01:03:42) #0: Fri Nov 11 10:12:17 PHT 2016 paul@speedy.whooppee.com:/build/netbsd-local/obj/amd64/sys/arch/amd64/compile/TEST amd64
Architecture: x86_64
Machine: amd64
>Description:
Running qemu 2.7.0 from pkgsrc...

When attempting to emulate an SMP environment using "-smp 4", a NetBSD
kernel finds the additional CPUs but fails to completely "start" them.

	# qemu-system-x86_64 -nographic -m 1024 -smp 4 \
	-drive file=./work/wd0.img,index=0,media=disk,format=raw,snapshot=on

	### boot menu

	Choose an option; RETURN for default; SPACE to stop countdown.
	Option 1 will be chosen in 0 seconds.     
	type "?" or "help" for help.
	> boot -x

	### kernel boots

	ACPI: 1 ACPI AML tables successfully acquired and loaded
	efi: missing or invalid systbl
	ioapic0 at mainbus0 apid 0
	cpu0 at mainbus0 apid 0
	cpu0: 8 page colors
	cpu0: calibrating local timer
	cpu0: apic clock running at 1000 MHz
	cpu0: QEMU Virtual CPU version 2.5+, id 0x663
	cpu0: PAT enabled
	cpu1 at mainbus0 apid 1
	cpu1: 2 page colors
	cpu1: QEMU Virtual CPU version 2.5+, id 0x663
	cpu1: PAT enabled
	cpu2 at mainbus0 apid 2
	cpu2: 2 page colors
	cpu2: QEMU Virtual CPU version 2.5+, id 0x663
	cpu2: PAT enabled
	cpu3 at mainbus0 apid 3
	cpu3: 2 page colors
	cpu3: QEMU Virtual CPU version 2.5+, id 0x663
	cpu3: PAT enabled

	### much other config stuff snipped

	acpicpu0 at cpu0: ACPI CPU
	acpicpu0: id 0, lapic id 0, cap 0x0000, flags 0x00000c21
	vmt0 at cpu0: Unknown
	vmware: open failed, eax=564d5868, ecx=0000001e, edx=00005658
	vmt0: failed to open backdoor RPC channel (TCLO protocol)
	acpicpu1 at cpu1: ACPI CPU
	acpicpu1: id 1, lapic id 1, cap 0x0000, flags 0x00000c21
	acpicpu2 at cpu2: ACPI CPU
	acpicpu2: id 2, lapic id 2, cap 0x0000, flags 0x00000c21
	acpicpu3 at cpu3: ACPI CPU
	acpicpu3: id 3, lapic id 3, cap 0x0000, flags 0x00000c21
	Initializing SSP: 8c28adf057c2aaf6 39b9028b76aa0870 d20cf25d97217579 1ae37818d4f5c757 8a35495040477d50 f4488137e57414de 94a94f41b14c7db9 5c761e3d70b78301 
	cpu1: failed to start
	cpu2: failed to start
	cpu3: failed to start

	### system hangs here




	Some on-the-fly analysis on IRC suggests that this is likely a
	qemu bug, with the secondary CPU(s) stalled in cpu_init().

	The same behavior is seen with NetBSD kernels from today's -current,
	today's netbsd-7 branch, and today's netbsd-6-0 branch.
>How-To-Repeat:
	See above

>Fix:
	Unknown


>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: pkg-manager->kern-bug-people
Responsible-Changed-By: maya@NetBSD.org
Responsible-Changed-When: Sat, 17 Dec 2016 15:50:20 +0000
Responsible-Changed-Why:
move to category where this bug has a chance to be resolved


From: coypu@SDF.ORG
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/51623
Date: Mon, 20 Mar 2017 16:45:42 +0000

 it's x86_patch

From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: pkg/51623: Non-0 CPUs don't properly "start" under qemu
Date: Tue, 4 Apr 2017 11:17:35 +0800 (+08)

 I wonder if the following link is related to this issue?  The mention of 
 disabling certain CPU features to prevent x86_patch() from installing 
 some patches could be significant...

 https://www.mail-archive.com/qemu-devel@nongnu.org/msg438799.html


 +------------------+--------------------------+----------------------------+
 | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
 | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
 +------------------+--------------------------+----------------------------+

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Tue, 04 Apr 2017 05:29:20 +0000
Responsible-Changed-Why:
I've stumbled upon this and it complicates my testing; I'll look into
what can be done.


Responsible-Changed-From-To: jdolecek->kern-bug-people
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Tue, 04 Apr 2017 17:10:03 +0000
Responsible-Changed-Why:
On my machine (running Mac OS X), using the -cx8,-sse2 flags doesn't help;
maybe they only help when using kvm, so I can't really even test a workaround.


From: "David H. Gutteridge" <dhgutteridge@sympatico.ca>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/51623 (Non-0 CPUs don't properly "start" under qemu)
Date: Fri, 07 Apr 2017 17:39:55 -0400

 In my Linux host environment (Fedora 25, running QEMU 2.7.0 and 2.7.1
 over the past five months), I haven't encountered this problem with
 various amd64 kernels (7.99.40, 7.99.63, 7.99.65, 7.99.67). An SMP
 configuration of two or four CPUs works fine. I mention this as an
 additional data point.

 Having said that, I've been encountering significant stability
 problems of late with NetBSD kernels in this same environment, where
 seemingly at random they'll max out all available CPUs and I have to
 power cycle the VM, as it becomes completely unresponsive. There's no
 particular trigger in terms of user input. The VM can basically be
 idling when this spike suddenly occurs, and there's no particular
 association to uptime or load I can determine. (E.g., I can run a
 complete series of ATF tests, wait an hour, begin typing a command,
 and suddenly a spike will occur and max out all the CPUs.)

 Dave

From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/51623: running qemu-x86_64 with -smp 4 - the additional
 CPUs don't start
Date: Thu, 25 May 2017 12:04:13 +0000

 Switching x86_pause to use 'nop' instead of 'pause' gets me booting.

 As a quicker workaround, choosing another CPU type to emulate works too,
 for example -cpu phenom worked, but -cpu Broadwell didn't.

From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: pkg/51623: Non-0 CPUs don't properly "start" under qemu
Date: Fri, 26 May 2017 08:45:33 +0800 (+08)

 I have filed a bug report upstream:

  	https://bugs.launchpad.net/qemu/+bug/1693649



From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: pkg/51623: Non-0 CPUs don't properly "start" under qemu
Date: Fri, 26 May 2017 11:56:40 +0800 (+08)

 Also note that the following upstream bug was filed.  It is because of
 this bug (missing MONITOR) that the previous bug (funky pause
 instruction) occurs!

  	https://bugs.launchpad.net/qemu/+bug/1693667



From: "Maya Rashish" <maya@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/51623 CVS commit: src/sys/arch/x86/x86
Date: Wed, 31 May 2017 00:19:17 +0000

 Module Name:	src
 Committed By:	maya
 Date:		Wed May 31 00:19:17 UTC 2017

 Modified Files:
 	src/sys/arch/x86/x86: cpu.c

 Log Message:
 Do not pause many times between testing if the CPU can go.

 This only impacts QEMU as QEMU's implementation of pause is
 significantly slower than its implementation of nop.

 PR kern/51623: running qemu-x86_64 with -smp 4 - the additional
 CPUs don't start.


 To generate a diff of this commit:
 cvs rdiff -u -r1.125 -r1.126 src/sys/arch/x86/x86/cpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
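
The log message above says the fix is to stop pausing many times between
checks of whether the CPU may proceed. The following is a minimal user-space
sketch of why that matters when each pause is slow. It is not the actual diff
to cpu.c. The names wait_for_go, sim_relax and struct sim are hypothetical, and
the "relax hint" is simulated by a callback that stands in for a spin-wait
instruction such as pause.

```c
#include <stdbool.h>

/* Hypothetical callback standing in for one spin-wait hint (e.g. pause). */
typedef void (*relax_fn)(void *ctx);

/*
 * Poll *go until it becomes true, issuing hints_per_check relax hints
 * between polls.  Gives up once max_hints hints have been issued.
 * Returns the total number of hints issued before the loop exited.
 */
static unsigned long
wait_for_go(const volatile bool *go, unsigned long hints_per_check,
    unsigned long max_hints, relax_fn relax, void *ctx)
{
	unsigned long issued = 0;

	while (!*go && issued < max_hints) {
		/* The flag is not re-read until the whole batch is done. */
		for (unsigned long i = 0; i < hints_per_check; i++) {
			relax(ctx);
			issued++;
		}
	}
	return issued;
}

/* Simulated environment: the flag is raised after release_at hints. */
struct sim {
	volatile bool go;
	unsigned long ticks;
	unsigned long release_at;
};

/* Simulated relax hint: advances "time" and raises the flag when due. */
static void
sim_relax(void *ctx)
{
	struct sim *s = ctx;

	if (++s->ticks >= s->release_at)
		s->go = true;
}
```

In this simulation, if the flag is raised after 3 hints, a waiter that issues
10000 hints per check only notices after all 10000 have run, while a waiter
that issues 1 hint per check notices after 3. If every hint is expensive, as
the log message reports for QEMU's pause, the larger batch multiplies the
delay before the secondary CPU sees that it may go.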

State-Changed-From-To: open->closed
State-Changed-By: pgoyette@NetBSD.org
State-Changed-When: Thu, 11 Jan 2018 02:16:30 +0000
State-Changed-Why:
This is working now.  Root-cause is an upstream bug in qemu, but they
haven't done anything with my bug report.


>Unformatted:

Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.