NetBSD Problem Report #52966

From mlelstv@serpens.de  Tue Jan 30 17:51:11 2018
Return-Path: <mlelstv@serpens.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id D5D567A16F
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 30 Jan 2018 17:51:11 +0000 (UTC)
Message-Id: <201801301750.w0UHoox5012906@serpens.de>
Date: Tue, 30 Jan 2018 18:50:52 +0100 (MET)
From: mlelstv@serpens.de
Reply-To: mlelstv@serpens.de
To: gnats-bugs@NetBSD.org
Subject: amd64 FPU handling broken on AMD
X-Send-Pr-Version: 3.95

>Number:         52966
>Category:       port-amd64
>Synopsis:       amd64 FPU handling broken on AMD
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    port-amd64-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jan 30 17:55:00 +0000 2018
>Closed-Date:    Tue Oct 29 12:44:38 +0000 2019
>Last-Modified:  Tue Oct 29 12:44:38 +0000 2019
>Originator:     Michael van Elst
>Release:        NetBSD 8.99.12
>Organization:
--
                                Michael van Elst
Internet: mlelstv@serpens.de
                                "A potential Snark may lurk in every tree."
>Environment:


System: NetBSD slowpoke 8.99.12 NetBSD 8.99.12 (SLOWPOKE) #19: Tue Jan 30 13:57:16 CET 2018 mlelstv@gossam:/home/netbsd-current/obj.amd64/home/netbsd-current/src/sys/arch/amd64/compile/SLOWPOKE amd64
Architecture: x86_64
Machine: amd64
>Description:
A version of the stream benchmark fails on AMD Ryzen CPUs. The benchmark
does multi-threaded floating point operations (using OpenMP) to measure
memory bandwidth and also validates the result by comparing it with a scalar
computation. While the benchmark itself runs fine, the validation fails if
multiple threads are used. With more than 4 threads it fails almost always,
with fewer threads it sometimes succeeds, with a single thread it succeeds.
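
For readers unfamiliar with the benchmark, the validation step described
above can be sketched roughly as follows. This is a simplified illustration
modeled on the well-known STREAM kernels (copy, scale, add, triad), not the
stream.c used in this report. The names run_kernels, validate, N_ELEMS,
NTIMES_SKETCH and the arrays va/vb/vc are made up for this sketch, and the
array size is tiny. The point is that every element is computed with FPU/SIMD
registers in worker threads, so FPU state that is saved or restored
incorrectly across a context switch would show up as a mismatch against the
scalar recurrence.

```c
#include <stddef.h>

#define N_ELEMS 1000		/* hypothetical; far smaller than the report's N */
#define NTIMES_SKETCH 10	/* hypothetical iteration count */

/* Only emit the OpenMP pragma when compiling with -fopenmp. */
#ifdef _OPENMP
#define PARALLEL_FOR _Pragma("omp parallel for")
#else
#define PARALLEL_FOR
#endif

double va[N_ELEMS], vb[N_ELEMS], vc[N_ELEMS];

/* Vector part: each loop may be split across threads. */
void
run_kernels(double scalar)
{
	for (int j = 0; j < N_ELEMS; j++) {
		va[j] = 1.0;
		vb[j] = 2.0;
		vc[j] = 0.0;
	}
	for (int k = 0; k < NTIMES_SKETCH; k++) {
		PARALLEL_FOR
		for (int j = 0; j < N_ELEMS; j++)	/* copy */
			vc[j] = va[j];
		PARALLEL_FOR
		for (int j = 0; j < N_ELEMS; j++)	/* scale */
			vb[j] = scalar * vc[j];
		PARALLEL_FOR
		for (int j = 0; j < N_ELEMS; j++)	/* add */
			vc[j] = va[j] + vb[j];
		PARALLEL_FOR
		for (int j = 0; j < N_ELEMS; j++)	/* triad */
			va[j] = vb[j] + scalar * vc[j];
	}
}

/* Relative error; expect is assumed to be positive. */
static double
rel_err(double expect, double got)
{
	double d = expect - got;

	return (d < 0 ? -d : d) / expect;
}

/* Scalar part: repeat the same recurrence on single values and compare
 * them with the array averages. Returns 1 if the solution validates. */
int
validate(double scalar)
{
	double aj = 1.0, bj = 2.0, cj = 0.0;
	double asum = 0.0, bsum = 0.0, csum = 0.0;
	const double eps = 1e-13;	/* tolerance chosen for this sketch */

	for (int k = 0; k < NTIMES_SKETCH; k++) {
		cj = aj;
		bj = scalar * cj;
		cj = aj + bj;
		aj = bj + scalar * cj;
	}
	for (int j = 0; j < N_ELEMS; j++) {
		asum += va[j];
		bsum += vb[j];
		csum += vc[j];
	}
	return rel_err(aj, asum / N_ELEMS) <= eps &&
	    rel_err(bj, bsum / N_ELEMS) <= eps &&
	    rel_err(cj, csum / N_ELEMS) <= eps;
}
```

Calling run_kernels(3.0) followed by validate(3.0) corresponds to one
benchmark run; a single corrupted array element is enough to make the
validation fail.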

>How-To-Repeat:

Get source from

http://ftp.netbsd.org/pub/NetBSD/misc/mlelstv/stream.c

which has been slightly adjusted from the original to compile without -lnuma,
and compile with:

gcc -O3 -std=c99 -fopenmp -DNON_NUMA -DN=80000000 -DNTIMES=100 stream.c -o stream

and let it run. With that value of N you need about 1.8GB RAM.
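
The memory figure can be cross-checked with a short calculation. The sketch
below assumes that the benchmark allocates three arrays of N doubles, as the
standard STREAM benchmark does; the function name stream_bytes is
hypothetical.

```c
#include <stdint.h>

/* Bytes needed for `arrays` arrays holding `n` doubles each. */
uint64_t
stream_bytes(uint64_t n, unsigned arrays)
{
	return n * arrays * (uint64_t)sizeof(double);
}
```

With N=80000000, three arrays and 8-byte doubles this gives 1920000000
bytes, or roughly 1.79 GiB, which agrees with the "about 1.8GB" above.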

When the validation succeeds the program reports "Solution validates",
otherwise it reports the error. On the Ryzen system the errors are somewhat
random.

The same machine runs the benchmark fine with the latest netbsd-8 kernel,
since that kernel predates XSAVEOPT support.

>Fix:

A workaround suggested by maxv@ is to disable the use of XSAVEOPT by
commenting out:

        if (descs[0] & CPUID_PES1_XSAVEOPT)
                x86_fpu_save = FPU_SAVE_XSAVEOPT;

in sys/arch/x86/x86/identcpu.c. The kernel then falls back to using XSAVE
to save and restore the FPU registers.
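
The effect of the workaround can be modeled in a few lines. This is a
stand-alone sketch, not the code in identcpu.c: the enum, the macro and the
function choose_fpu_save are hypothetical names. It assumes that descs[0]
holds EAX from CPUID leaf 0DH, sub-leaf 1, where the Intel SDM documents
bit 0 as the XSAVEOPT feature flag, and that XSAVE is the default once
XSAVEOPT is not selected.

```c
/* Hypothetical stand-ins for the kernel's save-method constants. */
enum fpu_save_method {
	SAVE_XSAVE,
	SAVE_XSAVEOPT
};

/* CPUID.(EAX=0DH,ECX=1):EAX bit 0 reports XSAVEOPT support. */
#define LEAF_D1_XSAVEOPT 0x1u

/*
 * Pick the save method. allow_xsaveopt == 0 models the workaround
 * (the XSAVEOPT test commented out); the result is then always XSAVE.
 */
enum fpu_save_method
choose_fpu_save(unsigned leaf_d1_eax, int allow_xsaveopt)
{
	if (allow_xsaveopt && (leaf_d1_eax & LEAF_D1_XSAVEOPT))
		return SAVE_XSAVEOPT;
	return SAVE_XSAVE;
}
```

On a CPU that advertises the feature, choose_fpu_save(0x1, 1) selects
XSAVEOPT, while choose_fpu_save(0x1, 0) stays with plain XSAVE as described
above.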

>Release-Note:

>Audit-Trail:
From: "Maya Rashish" <maya@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52966 CVS commit: src/sys/arch/x86/x86
Date: Wed, 7 Feb 2018 22:49:32 +0000

 Module Name:	src
 Committed By:	maya
 Date:		Wed Feb  7 22:49:32 UTC 2018

 Modified Files:
 	src/sys/arch/x86/x86: identcpu.c

 Log Message:
 stopgap fix: restrict XSAVEOPT to Intel CPUs

 The current code causes floating point miscalculations on AMD Ryzen.
 PR port-amd64/52966: amd64 FPU handling broken on AMD


 To generate a diff of this commit:
 cvs rdiff -u -r1.67 -r1.68 src/sys/arch/x86/x86/identcpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Maxime Villard" <maxv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52966 CVS commit: src/sys/arch/x86/x86
Date: Fri, 9 Feb 2018 18:45:55 +0000

 Module Name:	src
 Committed By:	maxv
 Date:		Fri Feb  9 18:45:55 UTC 2018

 Modified Files:
 	src/sys/arch/x86/x86: identcpu.c

 Log Message:
 Disable XSAVEOPT, until it is clear what's wrong with it (PR/52966).


 To generate a diff of this commit:
 cvs rdiff -u -r1.68 -r1.69 src/sys/arch/x86/x86/identcpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Maxime Villard" <maxv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52966 CVS commit: src/sys/arch
Date: Thu, 14 Jun 2018 14:36:46 +0000

 Module Name:	src
 Committed By:	maxv
 Date:		Thu Jun 14 14:36:46 UTC 2018

 Modified Files:
 	src/sys/arch/amd64/amd64: locore.S
 	src/sys/arch/x86/include: cpu.h fpu.h
 	src/sys/arch/x86/x86: fpu.c x86_machdep.c

 Log Message:
 Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
 FPU state of the lwp right away during context switches. This guarantees
 that when the CPU executes in userland, the FPU doesn't contain secrets.

 Maybe we also need to clear the FPU in setregs(), not sure about this one.

 Can be enabled/disabled via:

 	machdep.fpu_eager = {0/1}

 Not yet turned on automatically on affected CPUs (Intel Family 6).

 More generally it would be good to turn it on automatically when XSAVEOPT
 is supported, because in this case there is probably a non-negligible
 performance gain; but we need to fix PR/52966.


 To generate a diff of this commit:
 cvs rdiff -u -r1.165 -r1.166 src/sys/arch/amd64/amd64/locore.S
 cvs rdiff -u -r1.91 -r1.92 src/sys/arch/x86/include/cpu.h
 cvs rdiff -u -r1.8 -r1.9 src/sys/arch/x86/include/fpu.h
 cvs rdiff -u -r1.32 -r1.33 src/sys/arch/x86/x86/fpu.c
 cvs rdiff -u -r1.115 -r1.116 src/sys/arch/x86/x86/x86_machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
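
The difference between lazy and eager FPU switching mentioned in the commit
above can be illustrated with a deliberately simplified toy model. All
struct and function names here are hypothetical and do not correspond to the
NetBSD sources. A single long stands in for the whole FPU register file, and
the trap that real hardware raises on first FPU use in lazy mode (CR0.TS
set, device-not-available exception) is represented by an explicit call to
toy_fpu_trap().

```c
#include <stddef.h>

struct toy_lwp {
	int id;
	long saved_fpu;			/* the lwp's FPU save area */
};

struct toy_cpu {
	long fpu_regs;			/* what the hardware registers hold */
	struct toy_lwp *fpu_owner;	/* lwp whose state is in fpu_regs */
	struct toy_lwp *curlwp;		/* lwp currently running */
};

/* Eager: save the outgoing state and load the incoming one right away. */
void
toy_switch_eager(struct toy_cpu *c, struct toy_lwp *next)
{
	if (c->fpu_owner != NULL)
		c->fpu_owner->saved_fpu = c->fpu_regs;
	c->fpu_regs = next->saved_fpu;
	c->fpu_owner = next;
	c->curlwp = next;
}

/* Lazy: only switch the lwp; the registers keep the old owner's state. */
void
toy_switch_lazy(struct toy_cpu *c, struct toy_lwp *next)
{
	c->curlwp = next;
}

/* Lazy: the first FPU use by the new lwp triggers the real switch. */
void
toy_fpu_trap(struct toy_cpu *c)
{
	if (c->fpu_owner == c->curlwp)
		return;
	if (c->fpu_owner != NULL)
		c->fpu_owner->saved_fpu = c->fpu_regs;
	c->fpu_regs = c->curlwp->saved_fpu;
	c->fpu_owner = c->curlwp;
}
```

In this model, after toy_switch_lazy() the registers still contain the
previous lwp's values until toy_fpu_trap() runs, which is the window the
commit message refers to. After toy_switch_eager() they already hold the
new lwp's state.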

State-Changed-From-To: open->feedback
State-Changed-By: maxv@NetBSD.org
State-Changed-When: Mon, 06 Aug 2018 13:06:58 +0000
State-Changed-Why:
I don't see where the problem is. I had added XSAVEOPT after following the
Intel spec, and everything worked fine back then. Today it still works
fine on my Intel hardware, with and without EagerFPU (I have tested your
"stream" program, repeatedly, I always get "solution validates").

It may be a Ryzen CPU bug. So can you:

 * Update the BIOS of your motherboard. Make sure you have the latest
   version. Then re-test. If there's still a problem:

 * Disable SMT in your BIOS, and re-test. Maybe it is an SMT-related
   problem.

Once this is done, and if there is still an issue, we'll have to dig
deeper.


From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: PR/52966 CVS commit: src/sys/arch/x86/x86
Date: Sat, 26 Oct 2019 07:51:00 +0000

 with:
 AMD Ryzen 7 2700X Eight-Core Processor
 NetBSD 9.99.17
 the solution validates.

 But I don't know if it failed before in the same setup. It'd be
 interesting to hear whether mlelstv's machine still fails.

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: PR/52966 CVS commit: src/sys/arch/x86/x86
Date: Sat, 26 Oct 2019 08:27:02 -0000 (UTC)

 coypu@sdf.org writes:

 > with:
 > AMD Ryzen 7 2700X Eight-Core Processor
 > NetBSD 9.99.17
 > the solution validates.
 > 
 > But I don't know if it failed before in the same setup. It'd be
 > interesting to hear whether mlelstv's machine still fails.
 > 

 -current as of today works. But that's probably because using
 XSAVEOPT is still #ifdef'd out in identcpu.c.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Lars Reichardt <lars@paradoxon.info>
To: gnats-bugs@netbsd.org, port-amd64-maintainer@netbsd.org,
 netbsd-bugs@netbsd.org, mlelstv@serpens.de, coypu@sdf.org
Cc: 
Subject: Re: PR/52966 CVS commit: src/sys/arch/x86/x86
Date: Sat, 26 Oct 2019 15:37:26 +0200

 I've checked both variants XSAVEOPT and XSAVE on my AMD Ryzen 2700 month
 ago and both worked correctly.

 I can recheck.

State-Changed-From-To: feedback->closed
State-Changed-By: maxv@NetBSD.org
State-Changed-When: Tue, 29 Oct 2019 12:44:38 +0000
State-Changed-Why:
Close this PR. The code has changed a lot, we've now dropped LazyFPU, the
code is a lot simpler, and I've re-enabled XSAVEOPT after several days of
testing with NVMM and also your stream.c code.


>Unformatted:
