NetBSD Problem Report #53017

From kivinen@vielako.iki.fi  Mon Feb 12 18:26:27 2018
Return-Path: <kivinen@vielako.iki.fi>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id DA1E57A1EA
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 12 Feb 2018 18:26:27 +0000 (UTC)
Message-Id: <201802121826.w1CIQODW004667@vielako.iki.fi>
Date: Mon, 12 Feb 2018 20:26:24 +0200 (EET)
From: kivinen@iki.fi
Reply-To: kivinen@iki.fi
To: gnats-bugs@NetBSD.org
Subject: Kernel panic with "fpusave_lwp: did not" message
X-Send-Pr-Version: 3.95

>Number:         53017
>Category:       kern
>Synopsis:       Kernel panics every now and then with "fpusave_lwp: did not" message.
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Feb 12 18:30:00 +0000 2018
>Closed-Date:    Fri Dec 27 21:08:30 +0000 2019
>Last-Modified:  Fri Dec 27 21:08:30 +0000 2019
>Originator:     Tero Kivinen
>Release:        NetBSD 8.0_BETA
>Organization:
IKI ry.
>Environment:
System: NetBSD vielako.iki.fi 8.0_BETA NetBSD 8.0_BETA (GENERIC) #0: Wed Nov 8 23:20:26 EET 2017 kivinen@vielako.iki.fi:/usr/obj/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64

Supermicro X8SIU (0123456789)
cpu0 at mainbus0 apid 0
cpu0: Intel(R) Xeon(R) CPU           X3430  @ 2.40GHz, id 0x106e5
cpu0: package 0, core 0, smt 0
cpu1 at mainbus0 apid 2
cpu1: Intel(R) Xeon(R) CPU           X3430  @ 2.40GHz, id 0x106e5
cpu1: package 0, core 1, smt 0
cpu2 at mainbus0 apid 4
cpu2: Intel(R) Xeon(R) CPU           X3430  @ 2.40GHz, id 0x106e5
cpu2: package 0, core 2, smt 0
cpu3 at mainbus0 apid 6
cpu3: Intel(R) Xeon(R) CPU           X3430  @ 2.40GHz, id 0x106e5
cpu3: package 0, core 3, smt 0

>Description:

Every now and then the machine crashes with the following panic:

"fpusave_lwp: did not"

I have crash dumps available for all of the crashes:

-rw-------  1 root  wheel  1966774296 Dec 23 01:08 /var/crash/netbsd.0.core
-rw-------  1 root  wheel  1927200792 Jan 12 00:08 /var/crash/netbsd.1.core
-rw-------  1 root  wheel  1903505432 Jan 15 11:08 /var/crash/netbsd.2.core
-rw-------  1 root  wheel  1947403800 Jan 24 15:09 /var/crash/netbsd.3.core
-rw-------  1 root  wheel  1917088792 Jan 30 01:24 /var/crash/netbsd.4.core
-rw-------  1 root  wheel  1929837080 Feb 12 13:09 /var/crash/netbsd.5.core

Looking at sys/arch/x86/x86/fpu.c, the code appears to spin while
checking hardclock_ticks until the value changes, and keeps a spin
count to make sure it does not stay there forever. This panic is
triggered once it has spun more than 100 million times.
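
As a rough illustration of the pattern described above (not the actual
code in fpu.c), here is a minimal user-space sketch in C of a spin loop
that waits for a tick counter to change and gives up after a fixed
number of spins. The names spin_until_tick_changes, tick_reader,
fake_clock and fake_read_ticks are made up for this example; the point
where it returns false corresponds to where the kernel would panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit, mirroring the "100 million" figure in the report. */
#define SPIN_LIMIT 100000000L

/* Hypothetical callback type: returns the current tick counter value. */
typedef int (*tick_reader)(void *ctx);

/* Simulated tick source, so the loop can be exercised without a real clock. */
struct fake_clock {
	int ticks;          /* current tick value */
	long reads;         /* number of times the counter has been read */
	long advance_after; /* advance once reads exceeds this; < 0 means never */
};

static int
fake_read_ticks(void *ctx)
{
	struct fake_clock *c = ctx;

	c->reads++;
	if (c->advance_after >= 0 && c->reads > c->advance_after)
		c->ticks = 1;
	return c->ticks;
}

/*
 * Spin until the tick counter differs from its starting value, or until
 * more than `limit` spins have been made. Returns true if the counter
 * changed, false if the limit was exceeded. The number of spins is
 * stored in *spins_out either way.
 */
static bool
spin_until_tick_changes(tick_reader read_ticks, void *ctx, long limit,
    long *spins_out)
{
	assert(limit > 0 && spins_out != NULL);

	int start = read_ticks(ctx);
	long spins = 0;

	while (read_ticks(ctx) == start) {
		if (++spins > limit) {
			*spins_out = spins;
			return false;	/* the kernel would panic here */
		}
	}
	*spins_out = spins;
	return true;
}
```

With a fake_clock whose advance_after is 5, the call returns true after
4 spins. With advance_after set to -1 and a limit of 1000, it returns
false, which is the situation the panic reports.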

This panic usually happens a few minutes past the hour, because we run
our configuration update script every hour. The script takes a few
minutes to run, and during that time it does some floating-point math
when generating graphics etc. The rest of the time the machine just
runs Apache and a wiki, so no real floating-point calculations are
done at all.

The bad thing was that because it crashed during the config file
update, some of the config files had not been written to disk when it
crashed, so they had lots of NULs at the end. I.e., the size of the
file was correct, but the last few hundred kB of it was just zeros.
We worked around this by adding sync commands between generating each
file and doing the rest of the processing.
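
The workaround described above used sync commands from the script. For
a program that writes such files itself, the same idea might look
roughly like the following C sketch, which assumes a POSIX system and
uses fsync(2) so the data reaches the disk before processing
continues. The function name write_file_durably is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write len bytes from buf to path and flush them to stable storage
 * before returning. Returns 0 on success, -1 on error.
 */
static int
write_file_durably(const char *path, const char *buf, size_t len)
{
	assert(path != NULL && buf != NULL);

	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd == -1)
		return -1;

	/* write(2) may write less than requested, so loop until done. */
	size_t off = 0;
	while (off < len) {
		ssize_t n = write(fd, buf + off, len - off);
		if (n == -1) {
			close(fd);
			return -1;
		}
		off += (size_t)n;
	}

	/* Force the file contents to disk before continuing. */
	if (fsync(fd) == -1) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

Only after this returns 0 would the caller go on to the next
processing step, so a crash later in the run should not leave the file
with the right size but zero-filled contents.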

I have crash dumps available, and if more information is needed from
them I can try to dig things out.

>How-To-Repeat:

Seems to repeat itself every now and then on our hardware.

This might be related to kern/53016, as it is running on the same
hardware and the clock drift might be related to this failure too. Or
the two might be completely unrelated.

>Fix:

No fix known.

>Release-Note:

>Audit-Trail:
From: Tero Kivinen <kivinen@iki.fi>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53017: Kernel panic with "fpusave_lwp: did not" message
Date: Fri, 9 Mar 2018 06:40:04 +0200

 gnats-admin@netbsd.org writes:
 > Thank you very much for your problem report.
 > It has the internal identification `kern/53017'.
 > The individual assigned to look at your
 > report is: kern-bug-people. 
 > 
 > >Category:       kern
 > >Responsible:    kern-bug-people
 > >Synopsis:       Kernel panics every now and then with "fpusave_lwp: did not" message.
 > >Arrival-Date:   Mon Feb 12 18:30:00 +0000 2018

 I updated the kernel to a newer version:

 NetBSD vielako.iki.fi 8.0_BETA NetBSD 8.0_BETA (GENERIC) #0: Tue Feb 27 03:12:27 EET 2018
 kivinen@vielako.iki.fi:/usr/obj/sys/arch/amd64/compile/GENERIC amd64

 but the bug still seems to be there, i.e., the machine stayed up for 9
 days and has now crashed with exactly the same panic string.
 -- 
 kivinen@iki.fi

From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/53017: Kernel panics every now and then with "fpusave_lwp:
 did not" message.
Date: Thu, 20 Dec 2018 01:14:52 +0000

 The only thing I can think of that might have solved the other bug is
 the 'lazy FPU' change (eager FPU switching), now enabled by
 machdep.fpu_eager=1.

State-Changed-From-To: open->closed
State-Changed-By: maya@NetBSD.org
State-Changed-When: Fri, 27 Dec 2019 21:08:30 +0000
State-Changed-Why:
Close this bug. We don't know what fixed it exactly (I am putting my money on eagerFPU), but the code in question has seen significant changes (https://v4.freshbsd.org/commit/netbsd/src/qgKnez1sYNR8LxGB), so having this bug remain open is not useful.


>Unformatted:

 NetBSD 8.0_BETA GENERIC from 2017-11-08.
