NetBSD Problem Report #53993

From www@NetBSD.org  Tue Feb 19 17:13:29 2019
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id F0A787A177
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 19 Feb 2019 17:13:28 +0000 (UTC)
Message-Id: <20190219171328.081DF7A1D7@mollari.NetBSD.org>
Date: Tue, 19 Feb 2019 17:13:28 +0000 (UTC)
From: fstd.lkml@gmail.com
Reply-To: fstd.lkml@gmail.com
To: gnats-bugs@NetBSD.org
Subject: panic on wrmsr in cpu_switchto
X-Send-Pr-Version: www-1.0

>Number:         53993
>Category:       port-amd64
>Synopsis:       panic on wrmsr in cpu_switchto
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    maxv
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Feb 19 17:15:00 +0000 2019
>Closed-Date:    Sat Apr 13 06:23:18 +0000 2019
>Last-Modified:  Sat Apr 13 06:23:18 +0000 2019
>Originator:     Timo Buhrmester
>Release:        8-stable
>Organization:
>Environment:
NetBSD lemon.pr0.tips 8.0_STABLE NetBSD 8.0_STABLE (LEMONKERND) #0: Mon Feb 18 00:16:18 CET 2019  build@kiwi.pr0.tips:/stor/netbsd/foreign/lemon-apu/obj/sys/arch/amd64/compile/LEMONKERND amd64
>Description:
I'm running 8-stable on this (mostly open) hardware: https://www.pcengines.ch/apu2c4.htm.

Every few days, I get this panic:

| fatal protection fault in supervisor mode
| trap type 4 code 0 rip 0xffffffff80208653 cs 0x8 rflags 0x10246 cr2 0x7f7fff97c328 ilevel 0x8 rsp 0xffff800068235a20
| curlwp 0xfffffe810e4df8c0 pid 13063.1 lowest kstack 0xffff8000682322c0
| kernel: protection fault trap, code=0
| Stopped in pid 13063.1 (grep) at        netbsd:cpu_switchto+0x153:      wrmsr
| db{0}> bt
| cpu_switchto() at netbsd:cpu_switchto+0x153
| kpreempt() at netbsd:kpreempt+0xe0
| DDB lost frame for netbsd:Xresume_preempt+0x28, trying 0xffff800068235af0
| Xresume_preempt() at netbsd:Xresume_preempt+0x28
| --- interrupt ---
| ?() at 206
| allproc() at ffffffff814c4240
| ?() at fffffe81079b92d0
| Bad frame pointer: 0xfffffe8107dcf2d8
| db{0}> show registers
| ds          92d0
| es          babb
| fs          59f0
| gs          babb
| rdi         fffffe81079b92d0
| rsi         fffffe810e4df8c0
| rbp         ffff800068235ab0
| rbx         fffffe81079ae1e0
| rdx         f8cff33d
| rcx         c0000102
| rax         9044ffff
| r8          ffffffff81400600    cpu_info_primary
| r9          fffffe811ea1a044
| r10         400
| r11         fffffe8107401048
| r12         fffffe810e4df8c0
| r13         fffffe81079ae1e0
| r14         ffff800068232000
| r15         fffffe811ea1ef40
| rip         ffffffff80208653    cpu_switchto+0x153
| cs          8 
| rflags      10246
| rsp         ffff800068235a20
| ss          0 
| netbsd:cpu_switchto+0x153:      wrmsr

Does this ring a bell for anybody?  Does it look like it might rather be a hardware issue?

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:
From: fstd <fstd.lkml@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53993: panic on wrmsr in cpu_switchto
Date: Sun, 3 Mar 2019 11:52:21 +0100

 After getting the same panic with another, identical machine, I think I can now rule out the hardware being faulty.

Responsible-Changed-From-To: kern-bug-people->maxv
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Sun, 03 Mar 2019 11:08:13 +0000
Responsible-Changed-Why:
Maxime, could you please have a look?


From: fstd <fstd.lkml@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: maxv@NetBSD.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
	gnats-admin@netbsd.org, martin@NetBSD.org
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Sun, 3 Mar 2019 15:41:39 +0100

 > Maxime, could you please have a look?
 To Maxime, since I currently have two of these boards: If it helps, I could set you up with SSH access to a machine that's hooked up to the serial console of one of these boards, and freedom to mess around with it.

State-Changed-From-To: open->feedback
State-Changed-By: maxv@NetBSD.org
State-Changed-When: Sun, 24 Mar 2019 12:12:29 +0000
State-Changed-Why:
You run 32bit binaries on your system, right? Are these NetBSD 32bit binaries
or Linux 32bit binaries? (Ie, is it compat_netbsd32 or compat_linux32)


From: Maxime Villard <max@m00nbsd.net>
To: gnats-bugs@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
 fstd.lkml@gmail.com
Cc: 
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Sun, 24 Mar 2019 13:50:27 +0100

 Le 24/03/2019 à 13:12, maxv@NetBSD.org a écrit :
 > Synopsis: panic on wrmsr in cpu_switchto
 > 
 > State-Changed-From-To: open->feedback
 > State-Changed-By: maxv@NetBSD.org
 > State-Changed-When: Sun, 24 Mar 2019 12:12:29 +0000
 > State-Changed-Why:
 > You run 32bit binaries on your system, right? Are these NetBSD 32bit binaries
 > or Linux 32bit binaries? (Ie, is it compat_netbsd32 or compat_linux32)

 ok, it's a race in setregs()

From: "Maxime Villard" <maxv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53993 CVS commit: src/sys
Date: Sun, 24 Mar 2019 13:15:43 +0000

 Module Name:	src
 Committed By:	maxv
 Date:		Sun Mar 24 13:15:43 UTC 2019

 Modified Files:
 	src/sys/arch/amd64/amd64: machdep.c
 	src/sys/compat/linux/arch/amd64: linux_machdep.c

 Log Message:
 Fix a tiny race in setregs and linux_setregs. Between the moment we set
 pcb_flags to zero, and the moment cpu_segregs64_zero resets pcb_gs, we may
 be preempted.

 If this happens, and if the calling LWP was a 32bit thread, when switching
 back to that LWP, the context switcher sees that PCB_COMPAT32 is not set in
 pcb_flags and tries to perform a 64bit context switch; but pcb_gs contains
 a 32bit GDT descriptor, and not a 64bit GS.base value. The wrmsr therefore
 faults because the value is non-canonical, and this fault is fatal.

 Rearrange the code so that the update of pcb_flags and pcb_gs/pcb_fs is non
 interruptible. This fixes the problem, tested with a reproducer (which
 therefore doesn't work anymore).

 Likely fixes PR/53993.


 To generate a diff of this commit:
 cvs rdiff -u -r1.327 -r1.328 src/sys/arch/amd64/amd64/machdep.c
 cvs rdiff -u -r1.56 -r1.57 src/sys/compat/linux/arch/amd64/linux_machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: fstd <fstd.lkml@gmail.com>
To: gnats-bugs@netbsd.org
Cc: maxv@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Sun, 24 Mar 2019 15:37:57 +0100

 > You run 32bit binaries on your system, right? Are these NetBSD 32bit binaries
 > or Linux 32bit binaries? (Ie, is it compat_netbsd32 or compat_linux32)
 Yes, indeed!  All my packages are still 32 bit as I migrated from
 an i386 board recently.

 I'll try a new kernel with your latest commit and report back.

 Thank you so much!
 Timo

From: Maxime Villard <max@m00nbsd.net>
To: fstd <fstd.lkml@gmail.com>, gnats-bugs@netbsd.org
Cc: netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Sun, 24 Mar 2019 15:52:31 +0100

 Le 24/03/2019 à 15:37, fstd a écrit :
 >> You run 32bit binaries on your system, right? Are these NetBSD 32bit binaries
 >> or Linux 32bit binaries? (Ie, is it compat_netbsd32 or compat_linux32)
 > Yes, indeed!  All my packages are still 32 bit as I migrated from
 > an i386 board recently.
 > 
 > I'll try a new kernel with your latest commit and report back.
 > 
 > Thank you so much!
 > Timo

 The patch has been applied to NetBSD-current, but not yet to NetBSD-8. I am
 currently setting up a NetBSD-8 VM to test...

From: fstd <fstd.lkml@gmail.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
	Maxime Villard <max@m00nbsd.net>
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Sun, 24 Mar 2019 18:21:22 +0100

 > Can you test this patch [1]? I've tested on my side, it's fine.
 > [1] https://m00nbsd.net/garbage/x86/wrmsr8.diff
 I'm running this patch now.  Looking at my Munin graphs, the highest
 uptime I've seen in the last months was 9 days (but usually much less
 than that), so I guess it'll take a while to verify whether this
 actually fixed the panic.

 I'll report back when I reach 20 days (or if it crashes).

 Again, thanks a ton for your effort!

 Timo

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53993 CVS commit: [netbsd-8] src/sys
Date: Fri, 5 Apr 2019 07:48:05 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Fri Apr  5 07:48:05 UTC 2019

 Modified Files:
 	src/sys/arch/amd64/amd64 [netbsd-8]: machdep.c netbsd32_machdep.c
 	src/sys/compat/linux/arch/amd64 [netbsd-8]: linux_machdep.c
 	src/sys/compat/linux32/arch/amd64 [netbsd-8]: linux32_machdep.c

 Log Message:
 Pull up following revision(s) (requested by maxv):

 	sys/arch/amd64/amd64/netbsd32_machdep.c: revision 1.120
 	sys/compat/linux/arch/amd64/linux_machdep.c: revision 1.57
 	sys/compat/linux32/arch/amd64/linux32_machdep.c: revision 1.44
 	sys/arch/amd64/amd64/machdep.c: revision 1.328
 	sys/arch/amd64/amd64/machdep.c: revision 1.329

 Fix a tiny race in setregs and linux_setregs. Between the moment we set
 pcb_flags to zero, and the moment cpu_segregs64_zero resets pcb_gs, we may
 be preempted.

 If this happens, and if the calling LWP was a 32bit thread, when switching
 back to that LWP, the context switcher sees that PCB_COMPAT32 is not set in
 pcb_flags and tries to perform a 64bit context switch; but pcb_gs contains
 a 32bit GDT descriptor, and not a 64bit GS.base value. The wrmsr therefore
 faults because the value is non-canonical, and this fault is fatal.

 Rearrange the code so that the update of pcb_flags and pcb_gs/pcb_fs is non
 interruptible. This fixes the problem, tested with a reproducer (which
 therefore doesn't work anymore).

 Likely fixes PR/53993.

 Disable preemption when setting PCB_COMPAT32, to prevent a context switch
 before cpu_fsgs_reload() finishes, otherwise we write garbage in the GDT.

 On NetBSD-current it is harmless, however in NetBSD-8 it might cause
 panics, because NetBSD-8 uses the old SegRegs model and under this model
 we reload %fs and %gs during switches.


 To generate a diff of this commit:
 cvs rdiff -u -r1.255.6.8 -r1.255.6.9 src/sys/arch/amd64/amd64/machdep.c
 cvs rdiff -u -r1.105.2.2 -r1.105.2.3 \
     src/sys/arch/amd64/amd64/netbsd32_machdep.c
 cvs rdiff -u -r1.51.6.1 -r1.51.6.2 \
     src/sys/compat/linux/arch/amd64/linux_machdep.c
 cvs rdiff -u -r1.38.6.1 -r1.38.6.2 \
     src/sys/compat/linux32/arch/amd64/linux32_machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Timo Buhrmester <fstd.lkml@gmail.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
	Maxime Villard <max@m00nbsd.net>
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Fri, 12 Apr 2019 19:39:41 +0200

 > I'll report back when I reach 20 days (or if it crashes).
 20 days and 0 panics later, I'm happy to confirm this fixed the issue!

 Thank you so much, maxv
 Timo

State-Changed-From-To: feedback->closed
State-Changed-By: maxv@NetBSD.org
State-Changed-When: Sat, 13 Apr 2019 06:23:18 +0000
State-Changed-Why:
Closing this PR as fixed. Thanks for the report.


>Unformatted:

Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.