NetBSD Problem Report #54761
From abs@forsaken.absd.org Fri Dec 13 18:24:19 2019
Return-Path: <abs@forsaken.absd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 2B6427A14F
for <gnats-bugs@gnats.NetBSD.org>; Fri, 13 Dec 2019 18:24:19 +0000 (UTC)
Message-Id: <20191213155523.1A85D1F5038@forsaken.absd.org>
Date: Fri, 13 Dec 2019 15:55:23 +0000 (GMT)
From: abs@absd.org
Reply-To: abs@absd.org
To: gnats-bugs@NetBSD.org
Subject: nvme corruption on GENERIC without DIAGNOSTIC
X-Send-Pr-Version: 3.95
>Number: 54761
>Category: kern
>Synopsis: nvme corruption on GENERIC without DIAGNOSTIC
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Dec 13 18:25:00 +0000 2019
>Last-Modified: Thu Dec 19 10:50:01 +0000 2019
>Originator: abs@absd.org
>Release: NetBSD 9.0_RC1
>Organization:
>Environment:
System: NetBSD forsaken.absd.org 9.0_RC1 NetBSD 9.0_RC1 (GENERIC_DIAGNOSTIC) #0: Thu Dec 5 11:44:08 GMT 2019 abs@iris.absd.org:/opt/netbsd/9/sys/arch/amd64/compile/GENERIC_DIAGNOSTIC amd64
Architecture: x86_64
Machine: amd64
>Description:
Significant filesystem corruption seen when updating a netbsd-9 machine from releng 2019-11-19 to 2019-12-03
System is running /home as wapbl on cgd0 on ld0 at nvme0 (samsung MZVLB512HAJQ-000L7)
System "seemed" stuttery in Firefox and IDEA, a gradle build hung, and after a clean reboot significant
corruption was found (lost the local copies of two git repos)
Tracking releng builds of netbsd-9:
2019-11-19 OK
2019-12-03 BAD
2019-12-05 BAD
2019-12-05 Locally built GENERIC+DIAGNOSTIC OK
>How-To-Repeat:
Boot a recent netbsd-9 kernel without DIAGNOSTIC using wapbl on cgd0 on ld0 at nvme0, or
(presumably) the appropriate subset thereof
>Fix:
Obviously a workaround is to include DIAGNOSTIC
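A minimal sketch of such a kernel configuration, assuming a custom config file named GENERIC_DIAGNOSTIC (the name shown in the Environment field) that layers the option on top of the stock amd64 GENERIC. The file location is an assumption and is not taken from the report:

```
# sys/arch/amd64/conf/GENERIC_DIAGNOSTIC  (assumed location)
include "arch/amd64/conf/GENERIC"	# start from the stock kernel config

options 	DIAGNOSTIC		# enable the extra kernel consistency checks
```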
>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Fri, 13 Dec 2019 21:44:28 +0100
Can you please be a bit more specific: what kind of corruption do you see?
What workload does trigger it? How can we reproduce it?
Martin
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Fri, 13 Dec 2019 21:51:40 +0100
Since this might be something specific to your machine it would also
be helpful if you could eliminate single components from the possible
candidates, e.g. try a plain ffs on ld on nvme, then retry with option
log.
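As a rough illustration of those two test steps, the /etc/fstab entries might look like the sketch below. The partition (ld0e) and mount point (/mnt/test) are hypothetical placeholders, and only one of the two entries would be active at a time:

```
# Step 1: plain FFS directly on the NVMe-backed ld device (no cgd, no journal)
/dev/ld0e	/mnt/test	ffs	rw	0 0

# Step 2: the same filesystem with WAPBL journaling enabled via the log option
#/dev/ld0e	/mnt/test	ffs	rw,log	0 0
```

If corruption shows up only in step 2, that would point toward wapbl; if it shows up in neither, cgd becomes the more likely candidate.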
I know this is asking for much.
Martin
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Mon, 16 Dec 2019 14:34:46 +0100
I tried on a VirtualBox installation with 16 CPUs and two emulated NVMe
devices, used for pkg obj during massively parallel builds and also running
bonnie and bonnie++ on the ld drives.
I could not reproduce any issues with this setup when using ld0/ld1
with plain ffs and option log.
Martin
From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, David Brownlee <abs@absd.org>
Cc: msaitoh@execsw.org
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Tue, 17 Dec 2019 16:09:33 +0900
> Obviously a workaround is to include DIAGNOSTIC
>
sys/dev/ic/nvme.c uses bus_space_barrier().
x86's bus_space_barrier() change will be pulled up by the
following ticket:
http://releng.netbsd.org/cgi-bin/req-9.cgi?show=566
Could you test with this change?
--
-----------------------------------------------
SAITOH Masanobu (msaitoh@execsw.org
msaitoh@netbsd.org)
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Wed, 18 Dec 2019 08:49:31 +0000
On Fri, Dec 13, 2019 at 06:25:01PM +0000, abs@absd.org wrote:
> Significant filesystem corruption seen when updating a netbsd-9 machine from releng 2019-11-19 to 2019-12-03
>
> System is running /home as wapbl on cgd0 on ld0 at nvme0 (samsung MZVLB512HAJQ-000L7)
This week I ran into fs corruption on wapbl on a raid1 on a pair of
SATA SSDs, with a current kernel from 2019-11-14 that included
DIAGNOSTIC.
Circumstances are confusing and there were no operational symptoms
until the nightly find / tripped on a corrupted directory entry, so it
might not be in any way related; but I figured I'd mention it.
Regardless of that, it also is kinda weird that DIAGNOSTIC would
affect the problem.
--
David A. Holland
dholland@netbsd.org
From: David Brownlee <abs@absd.org>
To: Masanobu SAITOH <msaitoh@execsw.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Thu, 19 Dec 2019 01:05:55 +0000
On Tue, 17 Dec 2019 at 07:09, Masanobu SAITOH <msaitoh@execsw.org> wrote:
>
> > Obviously a workaround is to include DIAGNOSTIC
>
> sys/dev/ic/nvme.c uses bus_space_barrier().
> x86's bus_space_barrier() change will be pulled up by the
> following ticket:
>
> http://releng.netbsd.org/cgi-bin/req-9.cgi?show=566
>
> Could you test with this change?
I've switched from using cgd, and I've not seen corruption, but I have
seen lockups (still trying to see if I can get into ddb) with the
previous (problem) kernel.
Testing with Tue Dec 17 16:14:25 UTC 2019 from releng, which includes
bus_space.c,v 1.41.4.1, and I saw the same type of lockup (stupidly
did not have ddb.fromconsole=1).
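For reference, a sketch of how that setting could be made persistent, assuming the standard /etc/sysctl.conf mechanism is in use on this machine (the same value can be set on a running system with `sysctl -w ddb.fromconsole=1`):

```
# /etc/sysctl.conf
# Allow entering ddb from the console (break on serial, key sequence on wscons)
ddb.fromconsole=1
```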
This is with some xterms, firefox, and network over iwm0. I'll leave
it running overnight to see if it locks up again and if so if I can
get into ddb...
Thanks
David
From: David Brownlee <abs@absd.org>
To: Masanobu SAITOH <msaitoh@execsw.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Thu, 19 Dec 2019 10:46:29 +0000
On Thu, 19 Dec 2019 at 01:05, David Brownlee <abs@absd.org> wrote:
>
> On Tue, 17 Dec 2019 at 07:09, Masanobu SAITOH <msaitoh@execsw.org> wrote:
> >
> > > Obviously a workaround is to include DIAGNOSTIC
> >
> > sys/dev/ic/nvme.c uses bus_space_barrier().
> > x86's bus_space_barrier() change will be pulled up by the
> > following ticket:
> >
> > http://releng.netbsd.org/cgi-bin/req-9.cgi?show=566
> >
> > Could you test with this change?
>
> I've switched from using cgd, and I've not seen corruption, but I have
> seen lockups (still trying to see if I can get into ddb) with the
> previous (problem) kernel.
>
> Testing with Tue Dec 17 16:14:25 UTC 2019 from releng, which includes
> bus_space.c,v 1.41.4.1, and I saw the same type of lockup (stupidly
> did not have ddb.fromconsole=1)
>
> This is with some xterms, firefox, and network over iwm0. I'll leave
> it running overnight to see if it locks up again and if so if I can
> get into ddb...
Interesting unrelated side effect with Tue Dec 17 16:14:25 UTC 2019
from releng - now typing text into firefox briefly shows garbage under
each new character...