NetBSD Problem Report #53993
From www@NetBSD.org Tue Feb 19 17:13:29 2019
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id F0A787A177
for <gnats-bugs@gnats.NetBSD.org>; Tue, 19 Feb 2019 17:13:28 +0000 (UTC)
Message-Id: <20190219171328.081DF7A1D7@mollari.NetBSD.org>
Date: Tue, 19 Feb 2019 17:13:28 +0000 (UTC)
From: fstd.lkml@gmail.com
Reply-To: fstd.lkml@gmail.com
To: gnats-bugs@NetBSD.org
Subject: panic on wrmsr in cpu_switchto
X-Send-Pr-Version: www-1.0
>Number: 53993
>Category: port-amd64
>Synopsis: panic on wrmsr in cpu_switchto
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: maxv
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Feb 19 17:15:00 +0000 2019
>Closed-Date: Sat Apr 13 06:23:18 +0000 2019
>Last-Modified: Sat Apr 13 06:23:18 +0000 2019
>Originator: Timo Buhrmester
>Release: 8-stable
>Organization:
>Environment:
NetBSD lemon.pr0.tips 8.0_STABLE NetBSD 8.0_STABLE (LEMONKERND) #0: Mon Feb 18 00:16:18 CET 2019 build@kiwi.pr0.tips:/stor/netbsd/foreign/lemon-apu/obj/sys/arch/amd64/compile/LEMONKERND amd64
>Description:
I'm running 8-stable on this (mostly open) hardware: https://www.pcengines.ch/apu2c4.htm.
Every few days, I get this panic:
| fatal protection fault in supervisor mode
| trap type 4 code 0 rip 0xffffffff80208653 cs 0x8 rflags 0x10246 cr2 0x7f7fff97c328 ilevel 0x8 rsp 0xffff800068235a20
| curlwp 0xfffffe810e4df8c0 pid 13063.1 lowest kstack 0xffff8000682322c0
| kernel: protection fault trap, code=0
| Stopped in pid 13063.1 (grep) at netbsd:cpu_switchto+0x153: wrmsr
| db{0}> bt
| cpu_switchto() at netbsd:cpu_switchto+0x153
| kpreempt() at netbsd:kpreempt+0xe0
| DDB lost frame for netbsd:Xresume_preempt+0x28, trying 0xffff800068235af0
| Xresume_preempt() at netbsd:Xresume_preempt+0x28
| --- interrupt ---
| ?() at 206
| allproc() at ffffffff814c4240
| ?() at fffffe81079b92d0
| Bad frame pointer: 0xfffffe8107dcf2d8
| db{0}> show registers
| ds 92d0
| es babb
| fs 59f0
| gs babb
| rdi fffffe81079b92d0
| rsi fffffe810e4df8c0
| rbp ffff800068235ab0
| rbx fffffe81079ae1e0
| rdx f8cff33d
| rcx c0000102
| rax 9044ffff
| r8 ffffffff81400600 cpu_info_primary
| r9 fffffe811ea1a044
| r10 400
| r11 fffffe8107401048
| r12 fffffe810e4df8c0
| r13 fffffe81079ae1e0
| r14 ffff800068232000
| r15 fffffe811ea1ef40
| rip ffffffff80208653 cpu_switchto+0x153
| cs 8
| rflags 10246
| rsp ffff800068235a20
| ss 0
| netbsd:cpu_switchto+0x153: wrmsr
Does this ring a bell for anybody? Does it look like it might rather be a hardware issue?
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
From: fstd <fstd.lkml@gmail.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/53993: panic on wrmsr in cpu_switchto
Date: Sun, 3 Mar 2019 11:52:21 +0100
After getting the same panic with another, identical machine, I think I can now rule out the hardware being faulty.
Responsible-Changed-From-To: kern-bug-people->maxv
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Sun, 03 Mar 2019 11:08:13 +0000
Responsible-Changed-Why:
Maxime, could you please have a look?
From: fstd <fstd.lkml@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: maxv@NetBSD.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
gnats-admin@netbsd.org, martin@NetBSD.org
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Sun, 3 Mar 2019 15:41:39 +0100
> Maxime, could you please have a look?
To Maxime, since I currently have two of these boards: If it helps, I could set you up with SSH access to a machine that's hooked up to the serial console of one of these boards, and freedom to mess around with it.
State-Changed-From-To: open->feedback
State-Changed-By: maxv@NetBSD.org
State-Changed-When: Sun, 24 Mar 2019 12:12:29 +0000
State-Changed-Why:
You run 32bit binaries on your system, right? Are these NetBSD 32bit binaries
or Linux 32bit binaries? (Ie, is it compat_netbsd32 or compat_linux32)
From: Maxime Villard <max@m00nbsd.net>
To: gnats-bugs@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
fstd.lkml@gmail.com
Cc:
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Sun, 24 Mar 2019 13:50:27 +0100
On 24/03/2019 at 13:12, maxv@NetBSD.org wrote:
> Synopsis: panic on wrmsr in cpu_switchto
>
> State-Changed-From-To: open->feedback
> State-Changed-By: maxv@NetBSD.org
> State-Changed-When: Sun, 24 Mar 2019 12:12:29 +0000
> State-Changed-Why:
> You run 32bit binaries on your system, right? Are these NetBSD 32bit binaries
> or Linux 32bit binaries? (Ie, is it compat_netbsd32 or compat_linux32)
ok, it's a race in setregs()
From: "Maxime Villard" <maxv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/53993 CVS commit: src/sys
Date: Sun, 24 Mar 2019 13:15:43 +0000
Module Name: src
Committed By: maxv
Date: Sun Mar 24 13:15:43 UTC 2019
Modified Files:
src/sys/arch/amd64/amd64: machdep.c
src/sys/compat/linux/arch/amd64: linux_machdep.c
Log Message:
Fix a tiny race in setregs and linux_setregs. Between the moment we set
pcb_flags to zero, and the moment cpu_segregs64_zero resets pcb_gs, we may
be preempted.
If this happens, and if the calling LWP was a 32bit thread, when switching
back to that LWP, the context switcher sees that PCB_COMPAT32 is not set in
pcb_flags and tries to perform a 64bit context switch; but pcb_gs contains
a 32bit GDT descriptor, and not a 64bit GS.base value. The wrmsr therefore
faults because the value is non-canonical, and this fault is fatal.
Rearrange the code so that the update of pcb_flags and pcb_gs/pcb_fs is
non-interruptible. This fixes the problem, tested with a reproducer (which
therefore doesn't work anymore).
Likely fixes PR/53993.
To generate a diff of this commit:
cvs rdiff -u -r1.327 -r1.328 src/sys/arch/amd64/amd64/machdep.c
cvs rdiff -u -r1.56 -r1.57 src/sys/compat/linux/arch/amd64/linux_machdep.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
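The explanation in the log message can be checked against the register dump in the original report. The following is a minimal sketch, not kernel code: it assumes 48-bit virtual addresses, takes rcx/rdx/rax from the "show registers" output, and the helper names (wrmsr_value, is_canonical, descriptor_base) are made up for illustration. Reading the value as a legacy segment descriptor is an interpretation of the dump, not something stated in the thread.

```python
# Architectural MSR number for the kernel GS base on x86-64; matches rcx in the dump.
MSR_KERNELGSBASE = 0xC0000102


def wrmsr_value(rdx, rax):
    """WRMSR writes EDX:EAX (high 32 bits : low 32 bits) to the MSR named by ECX."""
    return ((rdx & 0xFFFFFFFF) << 32) | (rax & 0xFFFFFFFF)


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) are all equal (assumes 48-bit VAs by default)."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def descriptor_base(desc):
    """Extract the 32-bit base from a legacy 8-byte segment descriptor."""
    base_low = (desc >> 16) & 0xFFFF    # base bits 15..0
    base_mid = (desc >> 32) & 0xFF      # base bits 23..16
    base_high = (desc >> 56) & 0xFF     # base bits 31..24
    return (base_high << 24) | (base_mid << 16) | base_low


# Values copied from the DDB output in the report.
rcx, rdx, rax = 0xC0000102, 0xF8CFF33D, 0x9044FFFF
value = wrmsr_value(rdx, rax)

print("msr      :", hex(rcx))
print("value    :", hex(value))
print("canonical:", is_canonical(value))
print("as descriptor, base =", hex(descriptor_base(value)))
```

Under these assumptions the value being written is not canonical, and its layout (limit 0xffff, a 32-bit base) is what one would expect from a 32-bit descriptor rather than a 64-bit GS.base.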
From: fstd <fstd.lkml@gmail.com>
To: gnats-bugs@netbsd.org
Cc: maxv@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Sun, 24 Mar 2019 15:37:57 +0100
> You run 32bit binaries on your system, right? Are these NetBSD 32bit binaries
> or Linux 32bit binaries? (Ie, is it compat_netbsd32 or compat_linux32)
Yes, indeed! All my packages are still 32 bit as I migrated from
an i386 board recently.
I'll try a new kernel with your latest commit and report back.
Thank you so much!
Timo
From: Maxime Villard <max@m00nbsd.net>
To: fstd <fstd.lkml@gmail.com>, gnats-bugs@netbsd.org
Cc: netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Sun, 24 Mar 2019 15:52:31 +0100
On 24/03/2019 at 15:37, fstd wrote:
>> You run 32bit binaries on your system, right? Are these NetBSD 32bit binaries
>> or Linux 32bit binaries? (Ie, is it compat_netbsd32 or compat_linux32)
> Yes, indeed! All my packages are still 32 bit as I migrated from
> an i386 board recently.
>
> I'll try a new kernel with your latest commit and report back.
>
> Thank you so much!
> Timo
The patch has been applied to NetBSD-current, but not yet to NetBSD-8. I am
currently setting up a NetBSD-8 VM to test...
From: fstd <fstd.lkml@gmail.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
Maxime Villard <max@m00nbsd.net>
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Sun, 24 Mar 2019 18:21:22 +0100
> Can you test this patch [1]? I've tested on my side, it's fine.
> [1] https://m00nbsd.net/garbage/x86/wrmsr8.diff
I'm running this patch now. Looking at my Munin graphs, the highest
uptime I've seen in the last months was 9 days (but usually much less
than that), so I guess it'll take a while to verify whether this
actually fixed the panic.
I'll report back when I reach 20 days (or if it crashes).
Again, thanks a ton for your effort!
Timo
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/53993 CVS commit: [netbsd-8] src/sys
Date: Fri, 5 Apr 2019 07:48:05 +0000
Module Name: src
Committed By: martin
Date: Fri Apr 5 07:48:05 UTC 2019
Modified Files:
src/sys/arch/amd64/amd64 [netbsd-8]: machdep.c netbsd32_machdep.c
src/sys/compat/linux/arch/amd64 [netbsd-8]: linux_machdep.c
src/sys/compat/linux32/arch/amd64 [netbsd-8]: linux32_machdep.c
Log Message:
Pull up following revision(s) (requested by maxv):
sys/arch/amd64/amd64/netbsd32_machdep.c: revision 1.120
sys/compat/linux/arch/amd64/linux_machdep.c: revision 1.57
sys/compat/linux32/arch/amd64/linux32_machdep.c: revision 1.44
sys/arch/amd64/amd64/machdep.c: revision 1.328
sys/arch/amd64/amd64/machdep.c: revision 1.329
Fix a tiny race in setregs and linux_setregs. Between the moment we set
pcb_flags to zero, and the moment cpu_segregs64_zero resets pcb_gs, we may
be preempted.
If this happens, and if the calling LWP was a 32bit thread, when switching
back to that LWP, the context switcher sees that PCB_COMPAT32 is not set in
pcb_flags and tries to perform a 64bit context switch; but pcb_gs contains
a 32bit GDT descriptor, and not a 64bit GS.base value. The wrmsr therefore
faults because the value is non-canonical, and this fault is fatal.
Rearrange the code so that the update of pcb_flags and pcb_gs/pcb_fs is
non-interruptible. This fixes the problem, tested with a reproducer (which
therefore doesn't work anymore).
Likely fixes PR/53993.
Disable preemption when setting PCB_COMPAT32, to prevent a context switch
before cpu_fsgs_reload() finishes, otherwise we write garbage in the GDT.
On NetBSD-current it is harmless, however in NetBSD-8 it might cause
panics, because NetBSD-8 uses the old SegRegs model and under this model
we reload %fs and %gs during switches.
To generate a diff of this commit:
cvs rdiff -u -r1.255.6.8 -r1.255.6.9 src/sys/arch/amd64/amd64/machdep.c
cvs rdiff -u -r1.105.2.2 -r1.105.2.3 \
src/sys/arch/amd64/amd64/netbsd32_machdep.c
cvs rdiff -u -r1.51.6.1 -r1.51.6.2 \
src/sys/compat/linux/arch/amd64/linux_machdep.c
cvs rdiff -u -r1.38.6.1 -r1.38.6.2 \
src/sys/compat/linux32/arch/amd64/linux32_machdep.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
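The ordering problem described in the log message can be modelled in user space. This is only a toy sketch under stated assumptions: the Pcb class, the PCB_COMPAT32 value, and the preempt callback are hypothetical stand-ins, and the real diff is not shown in this thread. The callback marks the point where a context switch could happen; the "fixed" variant models disabling preemption by not exposing any such point until both fields are consistent.

```python
PCB_COMPAT32 = 0x01  # hypothetical flag value, for this model only


class Pcb:
    """Toy stand-in for the per-LWP PCB fields named in the log message."""

    def __init__(self, flags, gs):
        self.flags = flags
        self.gs = gs


def is_canonical(addr, va_bits=48):
    """True if the upper bits are a sign extension (assumes 48-bit VAs)."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


def switch_would_fault(pcb):
    """Model of the 64-bit switch path: it writes pcb.gs to a base MSR,
    which faults if the value is not canonical."""
    return not (pcb.flags & PCB_COMPAT32) and not is_canonical(pcb.gs)


def setregs_racy(pcb, preempt):
    pcb.flags = 0      # PCB_COMPAT32 cleared ...
    preempt(pcb)       # ... but a switch here still sees the old 32-bit pcb.gs
    pcb.gs = 0


def setregs_fixed(pcb, preempt):
    # Model of a preemption-disabled section: both fields are updated
    # before any switch can observe the PCB.
    pcb.flags = 0
    pcb.gs = 0
    preempt(pcb)


def run(setregs):
    """Run one variant on a 32-bit LWP and record what a switch would see."""
    observed = []
    pcb = Pcb(PCB_COMPAT32, 0xF8CFF33D9044FFFF)  # gs value from the report
    setregs(pcb, lambda p: observed.append(switch_would_fault(p)))
    return observed


print("racy :", run(setregs_racy))
print("fixed:", run(setregs_fixed))
```

In the model, only the racy ordering lets a switch observe cleared flags together with a stale, non-canonical pcb.gs.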
From: Timo Buhrmester <fstd.lkml@gmail.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
Maxime Villard <max@m00nbsd.net>
Subject: Re: port-amd64/53993 (panic on wrmsr in cpu_switchto)
Date: Fri, 12 Apr 2019 19:39:41 +0200
> I'll report back when I reach 20 days (or if it crashes).
20 days and 0 panics later, I'm happy to confirm this fixed the issue!
Thank you so much, maxv
Timo
State-Changed-From-To: feedback->closed
State-Changed-By: maxv@NetBSD.org
State-Changed-When: Sat, 13 Apr 2019 06:23:18 +0000
State-Changed-Why:
Close this PR, fixed. Thanks for the report.
>Unformatted: