NetBSD Problem Report #58372
From www@netbsd.org Thu Jun 27 06:47:05 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
client-signature RSA-PSS (2048 bits) client-digest SHA256)
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id E28D51A923A
for <gnats-bugs@gnats.NetBSD.org>; Thu, 27 Jun 2024 06:47:05 +0000 (UTC)
Message-Id: <20240627064704.A2D501A923C@mollari.NetBSD.org>
Date: Thu, 27 Jun 2024 06:47:04 +0000 (UTC)
From: sotiris@lamprinidis.com
Reply-To: sotiris@lamprinidis.com
To: gnats-bugs@NetBSD.org
Subject: Temperature reading broken on Intel Core 2 Duo E6300 (patch provided)
X-Send-Pr-Version: www-1.0
>Number: 58372
>Category: kern
>Synopsis: Temperature reading broken on Intel Core 2 Duo E6300 (patch provided)
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: gutteridge
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Jun 27 06:50:00 +0000 2024
>Last-Modified: Wed Sep 04 09:05:01 +0000 2024
>Originator: Sotiris Lamprinidis
>Release: netbsd-10
>Organization:
>Environment:
NetBSD 10.0_STABLE amd64
>Description:
On a core2duo E6300 machine the Tjmax cannot be detected and set to 0, consequently displaying negative temperatures in envstat. Using the table from [1] for processor signature 000006Fxh and MSR_IA32_PLATFORM_ID(0x17)[28] == 0 this is a non-mobile processor and according to [2,pp.77] Tjmax is 74.1 at its TDP.
References
[1]: mobile intel core2 processor detection table (https://web.archive.org/web/20110608131711/http://software.intel.com/en-us/articles/mobile-intel-core2-processor-detection-table/)
[2]: intel pentium dual-core processor e6000 and e5000 series datasheet (https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/pentium-dual-core-e6000-e5000-datasheet.pdf)
>How-To-Repeat:
>Fix:
Index: sys/arch/x86/x86/coretemp.c
===================================================================
RCS file: /cvsroot/src/sys/arch/x86/x86/coretemp.c,v
retrieving revision 1.38.4.1
diff -u -r1.38.4.1 coretemp.c
--- sys/arch/x86/x86/coretemp.c 29 Jul 2023 10:58:02 -0000 1.38.4.1
+++ sys/arch/x86/x86/coretemp.c 27 Jun 2024 06:39:49 -0000
@@ -285,6 +285,14 @@
*/
if (rdmsr_safe(MSR_IA32_PLATFORM_ID, &msr) != 0)
goto notee;
+
+ if (((ci->ci_signature & __BITS(4, 32)) == 0x006f0) &&
+ ((msr & __BIT(28)) == 0)) {
+ /* core duo 65nm */
+ sc->sc_tjmax = 74;
+ return;
+ }
+
if ((model < 0x17) && ((msr & __BIT(28)) == 0))
goto notee;
>Release-Note:
>Audit-Trail:
From: "David H. Gutteridge" <david@gutteridge.ca>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58372: Temperature reading broken on Intel Core 2 Duo E6300
(patch provided)
Date: Thu, 27 Jun 2024 12:18:31 -0400
A couple of things. The zero value assignment itself was a regression
that has since been addressed differently and pulled up to the 9 and 10
branches. It still doesn't address that we can't correctly read the
value in some cases, it just moves the default back to 100 as it was
previously.
74 seems oddly low. I realize that's what's in the table you referenced
but I would expect it to be a bit higher. That table lists T case max,
which is not the same as T junction max, as I read Intel's definitions.
If you look at page 72 in the datasheet, you'll see under "THERMTRIP#"
it says "the processor will automatically shut down when the silicon
has reached a temperature approximately 20 °C above the maximum TC.
Assertion of THERMTRIP# (Thermal Trip) indicates the processor junction
temperature has reached a level beyond where permanent silicon damage
may occur." So my interpretation of that is TjMax is somewhere between
74 and 94, but not clearly defined. (I could be quite wrong, this is
not an area I've delved into much.)
Regards,
Dave
PS my reference for T junction vs. T case is:
https://www.intel.com/content/www/us/en/support/articles/000005597/processors.html
From: Sotiris Lamprinidis <sotiris@lamprinidis.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58372: Temperature reading broken on Intel Core 2 Duo E6300
(patch provided)
Date: Fri, 28 Jun 2024 17:41:17 +0200
I see a fix was pulled - my bad.
Also thanks for the input. Indeed tjmax=3D94 would make the reading 72 =
(-22) under load
which seems right.
There is also this (model < 0x17) check which looks weird since we're =
already inside
the ((model =3D=3D 0x0f) || (model =3D=3D 0x0e)) block. Since we're =
here, maybe
something like this would be reasonable: =20
Index: sys/arch/x86/x86/coretemp.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
RCS file: /cvsroot/src/sys/arch/x86/x86/coretemp.c,v
retrieving revision 1.38.4.2
diff -u -u -r1.38.4.2 coretemp.c
--- sys/arch/x86/x86/coretemp.c 22 Jun 2024 09:57:21 -0000 1.38.4.2
+++ sys/arch/x86/x86/coretemp.c 28 Jun 2024 15:34:23 -0000
@@ -275,6 +275,8 @@
sc->sc_tjmax =3D TJMAX_DEFAULT;
if ((model =3D=3D 0x0f && stepping >=3D 2) || (model =3D=3D =
0x0e)) {
+ /* Default for 65mm Core 2s */
+ sc->sc_tjmax =3D 94;
/*
* Check MSR_IA32_PLATFORM_ID(0x17) bit 28. It's not =
documented
* in the datasheet, but the following page describes =
the
@@ -284,9 +286,15 @@
* Was: =
http://softwarecommunity.intel.com/Wiki/Mobility/
* 720.htm
*/
- if (rdmsr_safe(MSR_IA32_PLATFORM_ID, &msr) !=3D 0)
- goto notee;
- if ((model < 0x17) && ((msr & __BIT(28)) =3D=3D 0))
+ if (rdmsr_safe(MSR_IA32_PLATFORM_ID, &msr) =3D=3D =
EFAULT) {
+ aprint_normal("\n");
+ aprint_error_dev(sc->sc_dev,
+ "Failed to read MSR_IA32_PLATFORM_ID. "
+ "Using default (%d)\n", sc->sc_tjmax);
+ return 1;
+ }
+
+ if ((msr & __BIT(28)) =3D=3D 0)
goto notee;
if (rdmsr_safe(MSR_IA32_EXT_CONFIG, &msr) =3D=3D EFAULT) =
{
=
=
=
=
=
=
=
=20
Thanks,
S.
> A couple of things. The zero value assignment itself was a regression
> that has since been addressed differently and pulled up to the 9 and =
10
> branches. It still doesn't address that we can't correctly read the
> value in some cases, it just moves the default back to 100 as it was
> previously.
>=20
> 74 seems oddly low. I realize that's what's in the table you =
referenced
> but I would expect it to be a bit higher. That table lists T case max,
> which is not the same as T junction max, as I read Intel's =
definitions.
> If you look at page 72 in the datasheet, you'll see under "THERMTRIP#"
> it says "the processor will automatically shut down when the silicon
> has reached a temperature approximately 20 =C2=B0C above the maximum =
TC.
> Assertion of THERMTRIP# (Thermal Trip) indicates the processor =
junction
> temperature has reached a level beyond where permanent silicon damage
> may occur." So my interpretation of that is TjMax is somewhere between
> 74 and 94, but not clearly defined. (I could be quite wrong, this is
> not an area I've delved into much.)
>=20
> Regards,
>=20
> Dave
>=20
> PS my reference for T junction vs. T case is:
> =
https://www.intel.com/content/www/us/en/support/articles/000005597/process=
ors.html
>=20
From: "David H. Gutteridge" <gutteridge@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/58372 CVS commit: src/sys/arch/x86/x86
Date: Mon, 15 Jul 2024 01:57:24 +0000
Module Name: src
Committed By: gutteridge
Date: Mon Jul 15 01:57:23 UTC 2024
Modified Files:
src/sys/arch/x86/x86: coretemp.c
Log Message:
coretemp.c: drop redundant condition (NFCI)
Checking for a processor model upper limit has no point inside a block
that is already limited further. Noted from code inspection by Sotiris
Lamprinidis in PR kern/58372.
While here, also update to a cached version of an URL for processor
references, as both original URLs have now been removed by Intel.
To generate a diff of this commit:
cvs rdiff -u -r1.41 -r1.42 src/sys/arch/x86/x86/coretemp.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
Responsible-Changed-From-To: kern-bug-people->gutteridge
Responsible-Changed-By: gutteridge@NetBSD.org
Responsible-Changed-When: Mon, 15 Jul 2024 03:04:24 +0000
Responsible-Changed-Why:
Take this, since I've committed something and am looking at it.
From: "David H. Gutteridge" <david@gutteridge.ca>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58372: Temperature reading broken on Intel Core 2 Duo E6300
(patch provided)
Date: Sun, 14 Jul 2024 23:01:37 -0400
The history of the relevant changes to this driver is a bit hard to
follow, as things have sometimes been committed over time without much
in the way of explanation, e.g., what exact hardware was involved,
whether there was a PR, etc. Stuff like this is very hard to support
without entirely adequate documentation. (This is all rather old by
now, as well.)
I don't think we could apply your second proposed patch as-is. It would
undo one of the intentions of a previous change. I did just remove the
redundant model check you pointed out.
The original expectation for any hardware that matches the first model
check is that:
- All mobile processors are 85
- All desktop processors are 100
(This is still how it is with the FreeBSD version of this driver.)
Later, further checks were added for more means of distinguishing
mobile vs. desktop. The intent was to jump to an entirely different
TjMax sourcing method if not able to definitively identify mobile vs.
desktop. We shouldn't be dropping either goto. (When I added more
detailed reporting of errors, I didn't add such where you're proposing
on purpose, as it seems perhaps too verbose to report a failure there,
given there's another try still to occur.)
Yes, the check for < 017 is redundant inside that block. It's unclear
why it was put there (i.e. was it meant to be "stepping" not "model",
or it was intended to be placed elsewhere, or just a copy and paste
mistake -- I don't think there could be such a value possible for
"stepping" here anyway).
Also, we can't universally change the default to 94 inside that block
before further tries, as the original expectation is 100. We'd need to
have very specific checks that narrow to your case. In other words,
something like your first proposed patch. But, the idea across more
than one OS is that what falls into your processor's generation would
either be 85 or 100. I believe Linux also does this (I would have to
look). So I'm a bit hesitant to touch that.
We certainly do appreciate your work here, and I'm not suggesting we
can't do more. I'd like to get feedback from others before making
further adjustments.
Thanks,
Dave
From: Sotiris Lamprinidis <sotiris@lamprinidis.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58372: Temperature reading broken on Intel Core 2 Duo E6300
(patch provided)
Date: Wed, 4 Sep 2024 11:00:30 +0200
Hi, thanks for your time. I took a look at the Linux source and ran it
on my machine and it seems indeed that it returns the fall-back 100 C.
I'm happy with the situation right now, I may or may not come back to
this at some point :).
Best,
S.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2024
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.