NetBSD Problem Report #58372

From www@netbsd.org  Thu Jun 27 06:47:05 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id E28D51A923A
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 27 Jun 2024 06:47:05 +0000 (UTC)
Message-Id: <20240627064704.A2D501A923C@mollari.NetBSD.org>
Date: Thu, 27 Jun 2024 06:47:04 +0000 (UTC)
From: sotiris@lamprinidis.com
Reply-To: sotiris@lamprinidis.com
To: gnats-bugs@NetBSD.org
Subject: Temperature reading broken on Intel Core 2 Duo E6300 (patch provided)
X-Send-Pr-Version: www-1.0

>Number:         58372
>Category:       kern
>Synopsis:       Temperature reading broken on Intel Core 2 Duo E6300 (patch provided)
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    gutteridge
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Jun 27 06:50:00 +0000 2024
>Last-Modified:  Wed Sep 04 09:05:01 +0000 2024
>Originator:     Sotiris Lamprinidis
>Release:        netbsd-10
>Organization:
>Environment:
NetBSD 10.0_STABLE amd64
>Description:
On a core2duo E6300 machine the Tjmax cannot be detected and set to 0, consequently displaying negative temperatures in envstat. Using the table from [1] for processor signature 000006Fxh and MSR_IA32_PLATFORM_ID(0x17)[28] == 0 this is a non-mobile processor and according to [2,pp.77] Tjmax is 74.1 at its TDP. 

References
[1]: mobile intel core2 processor detection table (https://web.archive.org/web/20110608131711/http://software.intel.com/en-us/articles/mobile-intel-core2-processor-detection-table/)
[2]: intel pentium dual-core processor e6000 and e5000 series datasheet (https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/pentium-dual-core-e6000-e5000-datasheet.pdf)
>How-To-Repeat:

>Fix:
Index: sys/arch/x86/x86/coretemp.c
===================================================================
RCS file: /cvsroot/src/sys/arch/x86/x86/coretemp.c,v
retrieving revision 1.38.4.1
diff -u -r1.38.4.1 coretemp.c
--- sys/arch/x86/x86/coretemp.c 29 Jul 2023 10:58:02 -0000      1.38.4.1
+++ sys/arch/x86/x86/coretemp.c 27 Jun 2024 06:39:49 -0000
@@ -285,6 +285,14 @@
                 */
                if (rdmsr_safe(MSR_IA32_PLATFORM_ID, &msr) != 0)
                        goto notee;
+
+               if (((ci->ci_signature & __BITS(4, 32)) == 0x006f0) &&
+                               ((msr & __BIT(28)) == 0)) {
+                       /* core duo 65nm */
+                       sc->sc_tjmax = 74;
+                       return;
+               }
+
                if ((model < 0x17) && ((msr & __BIT(28)) == 0))
                        goto notee;


>Release-Note:

>Audit-Trail:
From: "David H. Gutteridge" <david@gutteridge.ca>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58372: Temperature reading broken on Intel Core 2 Duo E6300
 (patch provided)
Date: Thu, 27 Jun 2024 12:18:31 -0400

 A couple of things. The zero value assignment itself was a regression
 that has since been addressed differently and pulled up to the 9 and 10
 branches. It still doesn't address that we can't correctly read the
 value in some cases, it just moves the default back to 100 as it was
 previously.

 74 seems oddly low. I realize that's what's in the table you referenced
 but I would expect it to be a bit higher. That table lists T case max,
 which is not the same as T junction max, as I read Intel's definitions.
 If you look at page 72 in the datasheet, you'll see under "THERMTRIP#"
 it says "the processor will automatically shut down when the silicon
 has reached a temperature approximately 20 °C above the maximum TC.
 Assertion of THERMTRIP# (Thermal Trip) indicates the processor junction
 temperature has reached a level beyond where permanent silicon damage
 may occur." So my interpretation of that is TjMax is somewhere between
 74 and 94, but not clearly defined. (I could be quite wrong, this is
 not an area I've delved into much.)

 Regards,

 Dave

 PS my reference for T junction vs. T case is:
 https://www.intel.com/content/www/us/en/support/articles/000005597/processors.html

From: Sotiris Lamprinidis <sotiris@lamprinidis.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58372: Temperature reading broken on Intel Core 2 Duo E6300
 (patch provided)
Date: Fri, 28 Jun 2024 17:41:17 +0200

 I see a fix was pulled - my bad.

 Also thanks for the input. Indeed tjmax=3D94 would make the reading 72 =
 (-22) under load
 which seems right.

 There is also this (model < 0x17) check which looks weird since we're =
 already inside
 the ((model =3D=3D 0x0f) || (model =3D=3D 0x0e)) block. Since we're =
 here, maybe
 something like this would be reasonable: =20

 Index: sys/arch/x86/x86/coretemp.c
 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
 RCS file: /cvsroot/src/sys/arch/x86/x86/coretemp.c,v
 retrieving revision 1.38.4.2
 diff -u -u -r1.38.4.2 coretemp.c
 --- sys/arch/x86/x86/coretemp.c 22 Jun 2024 09:57:21 -0000      1.38.4.2
 +++ sys/arch/x86/x86/coretemp.c 28 Jun 2024 15:34:23 -0000
 @@ -275,6 +275,8 @@
         sc->sc_tjmax =3D TJMAX_DEFAULT;

         if ((model =3D=3D 0x0f && stepping >=3D 2) || (model =3D=3D =
 0x0e)) {
 +               /* Default for 65mm Core 2s */
 +               sc->sc_tjmax =3D 94;
                 /*
                  * Check MSR_IA32_PLATFORM_ID(0x17) bit 28. It's not =
 documented
                  * in the datasheet, but the following page describes =
 the
 @@ -284,9 +286,15 @@
                  *   Was: =
 http://softwarecommunity.intel.com/Wiki/Mobility/
                  *     720.htm
                  */
 -               if (rdmsr_safe(MSR_IA32_PLATFORM_ID, &msr) !=3D 0)
 -                       goto notee;
 -               if ((model < 0x17) && ((msr & __BIT(28)) =3D=3D 0))
 +               if (rdmsr_safe(MSR_IA32_PLATFORM_ID, &msr) =3D=3D =
 EFAULT) {
 +                       aprint_normal("\n");
 +                       aprint_error_dev(sc->sc_dev,
 +                           "Failed to read MSR_IA32_PLATFORM_ID. "
 +                           "Using default (%d)\n", sc->sc_tjmax);
 +                       return 1;
 +               }
 +
 +               if ((msr & __BIT(28)) =3D=3D 0)
                         goto notee;

                 if (rdmsr_safe(MSR_IA32_EXT_CONFIG, &msr) =3D=3D EFAULT) =
 {
                                                                          =
                                                                          =
                                                                          =
                                                                          =
                                                                          =
                                                                          =
                                                                          =
                                                                   =20
 Thanks,
 S.

 > A couple of things. The zero value assignment itself was a regression
 > that has since been addressed differently and pulled up to the 9 and =
 10
 > branches. It still doesn't address that we can't correctly read the
 > value in some cases, it just moves the default back to 100 as it was
 > previously.
 >=20
 > 74 seems oddly low. I realize that's what's in the table you =
 referenced
 > but I would expect it to be a bit higher. That table lists T case max,
 > which is not the same as T junction max, as I read Intel's =
 definitions.
 > If you look at page 72 in the datasheet, you'll see under "THERMTRIP#"
 > it says "the processor will automatically shut down when the silicon
 > has reached a temperature approximately 20 =C2=B0C above the maximum =
 TC.
 > Assertion of THERMTRIP# (Thermal Trip) indicates the processor =
 junction
 > temperature has reached a level beyond where permanent silicon damage
 > may occur." So my interpretation of that is TjMax is somewhere between
 > 74 and 94, but not clearly defined. (I could be quite wrong, this is
 > not an area I've delved into much.)
 >=20
 > Regards,
 >=20
 > Dave
 >=20
 > PS my reference for T junction vs. T case is:
 > =
 https://www.intel.com/content/www/us/en/support/articles/000005597/process=
 ors.html
 >=20

From: "David H. Gutteridge" <gutteridge@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/58372 CVS commit: src/sys/arch/x86/x86
Date: Mon, 15 Jul 2024 01:57:24 +0000

 Module Name:	src
 Committed By:	gutteridge
 Date:		Mon Jul 15 01:57:23 UTC 2024

 Modified Files:
 	src/sys/arch/x86/x86: coretemp.c

 Log Message:
 coretemp.c: drop redundant condition (NFCI)

 Checking for a processor model upper limit has no point inside a block
 that is already limited further. Noted from code inspection by Sotiris
 Lamprinidis in PR kern/58372.

 While here, also update to a cached version of an URL for processor
 references, as both original URLs have now been removed by Intel.


 To generate a diff of this commit:
 cvs rdiff -u -r1.41 -r1.42 src/sys/arch/x86/x86/coretemp.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: kern-bug-people->gutteridge
Responsible-Changed-By: gutteridge@NetBSD.org
Responsible-Changed-When: Mon, 15 Jul 2024 03:04:24 +0000
Responsible-Changed-Why:
Take this, since I've committed something and am looking at it.

From: "David H. Gutteridge" <david@gutteridge.ca>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58372: Temperature reading broken on Intel Core 2 Duo E6300
 (patch provided)
Date: Sun, 14 Jul 2024 23:01:37 -0400

 The history of the relevant changes to this driver is a bit hard to
 follow, as things have sometimes been committed over time without much
 in the way of explanation, e.g., what exact hardware was involved,
 whether there was a PR, etc. Stuff like this is very hard to support
 without entirely adequate documentation. (This is all rather old by
 now, as well.)

 I don't think we could apply your second proposed patch as-is. It would
 undo one of the intentions of a previous change. I did just remove the
 redundant model check you pointed out.

 The original expectation for any hardware that matches the first model
 check is that:

 - All mobile processors are 85
 - All desktop processors are 100
 (This is still how it is with the FreeBSD version of this driver.)

 Later, further checks were added for more means of distinguishing
 mobile vs. desktop. The intent was to jump to an entirely different
 TjMax sourcing method if not able to definitively identify mobile vs.
 desktop. We shouldn't be dropping either goto. (When I added more
 detailed reporting of errors, I didn't add such where you're proposing
 on purpose, as it seems perhaps too verbose to report a failure there,
 given there's another try still to occur.)

 Yes, the check for < 017 is redundant inside that block. It's unclear
 why it was put there (i.e. was it meant to be "stepping" not "model",
 or it was intended to be placed elsewhere, or just a copy and paste
 mistake -- I don't think there could be such a value possible for
 "stepping" here anyway).

 Also, we can't universally change the default to 94 inside that block
 before further tries, as the original expectation is 100. We'd need to
 have very specific checks that narrow to your case. In other words,
 something like your first proposed patch. But, the idea across more
 than one OS is that what falls into your processor's generation would
 either be 85 or 100. I believe Linux also does this (I would have to
 look). So I'm a bit hesitant to touch that.

 We certainly do appreciate your work here, and I'm not suggesting we
 can't do more. I'd like to get feedback from others before making
 further adjustments.

 Thanks,

 Dave

From: Sotiris Lamprinidis <sotiris@lamprinidis.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58372: Temperature reading broken on Intel Core 2 Duo E6300
 (patch provided)
Date: Wed, 4 Sep 2024 11:00:30 +0200

 Hi, thanks for your time. I took a look at the Linux source and ran it
 on my machine and it seems indeed that it returns the fall-back 100 C.

 I'm happy with the situation right now, I may or may not come back to
 this at some point :).

 Best,
 S.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2024 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.