NetBSD Problem Report #54761

From abs@forsaken.absd.org  Fri Dec 13 18:24:19 2019
Return-Path: <abs@forsaken.absd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 2B6427A14F
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 13 Dec 2019 18:24:19 +0000 (UTC)
Message-Id: <20191213155523.1A85D1F5038@forsaken.absd.org>
Date: Fri, 13 Dec 2019 15:55:23 +0000 (GMT)
From: abs@absd.org
Reply-To: abs@absd.org
To: gnats-bugs@NetBSD.org
Subject: nvme corruption on GENERIC without DIAGNOSTIC
X-Send-Pr-Version: 3.95

>Number:         54761
>Category:       kern
>Synopsis:       nvme corruption on GENERIC without DIAGNOSTIC
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Dec 13 18:25:00 +0000 2019
>Last-Modified:  Thu Dec 19 10:50:01 +0000 2019
>Originator:     abs@absd.org
>Release:        NetBSD 9.0_RC1
>Organization:

>Environment:
System: NetBSD forsaken.absd.org 9.0_RC1 NetBSD 9.0_RC1 (GENERIC_DIAGNOSTIC) #0: Thu Dec 5 11:44:08 GMT 2019 abs@iris.absd.org:/opt/netbsd/9/sys/arch/amd64/compile/GENERIC_DIAGNOSTIC amd64
Architecture: x86_64
Machine: amd64
>Description:
	Significant filesystem corruption seen when updating a netbsd-9 machine from releng 2019-11-19 to 2019-12-03

	System is running /home as wapbl on cgd0 on ld0 at nvme0 (samsung MZVLB512HAJQ-000L7)

	System "seemed" stuttery in Firefox and IDEA, a gradle build hung, and after a clean reboot
	significant corruption was found (lost the local copies of two git repos)

	Tracking releng builds of netbsd-9
          2019-11-19 OK
          2019-12-03 BAD
          2019-12-05 BAD
          2019-12-05 Locally built GENERIC+DIAGNOSTIC OK

>How-To-Repeat:
        Boot a recent netbsd-9 kernel without DIAGNOSTIC using wapbl on cgd0 on ld0 at nvme0, or
        (presumably) the appropriate subset thereof
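
        The following is a minimal sketch, not part of the original report, of one way
        to check for the corruption after running such a kernel. It assumes the affected
        filesystem is mounted somewhere writable; the function names, file names and the
        MANIFEST.json location are all hypothetical. It writes files of deterministic
        pseudo-random data, records their SHA-256 digests, and later (for example after
        a reboot) lists any file that is missing or no longer matches.

```python
import hashlib
import json
import os
import random

MANIFEST = "MANIFEST.json"  # hypothetical name for the digest list


def write_corpus(root, count=64, size=1 << 20, seed=0):
    """Write `count` files of `size` pseudo-random bytes; return {name: sha256}."""
    os.makedirs(root, exist_ok=True)
    rng = random.Random(seed)
    manifest = {}
    for i in range(count):
        data = rng.getrandbits(8 * size).to_bytes(size, "little")
        name = f"blob{i:04d}.bin"
        with open(os.path.join(root, name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data to the device
        manifest[name] = hashlib.sha256(data).hexdigest()
    with open(os.path.join(root, MANIFEST), "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())
    return manifest


def verify_corpus(root, manifest=None):
    """Return the sorted names of files that are missing or whose digest differs."""
    if manifest is None:
        with open(os.path.join(root, MANIFEST)) as f:
            manifest = json.load(f)
    bad = []
    for name, digest in manifest.items():
        try:
            with open(os.path.join(root, name), "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            bad.append(name)  # missing or unreadable counts as damaged
            continue
        if actual != digest:
            bad.append(name)
    return sorted(bad)
```

        Calling write_corpus() on the suspect filesystem before the workload and
        verify_corpus() on the same directory after a reboot would give a list of damaged
        files. Keeping a copy of the manifest on a different disk is advisable, since the
        manifest itself could be affected.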
>Fix:
        Obviously a workaround is to include DIAGNOSTIC

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Fri, 13 Dec 2019 21:44:28 +0100

 Can you please be a bit more specific: what kind of corruption do you see?
 What workload triggers it? How can we reproduce it?

 Martin

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Fri, 13 Dec 2019 21:51:40 +0100

 Since this might be something specific to your machine it would also
 be helpful if you could eliminate single components from the possible
 candidates, e.g. try a plain ffs on ld on nvme, then retry with option
 log.

 I know this is asking a lot.

 Martin

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Mon, 16 Dec 2019 14:34:46 +0100

 I tried on a VirtualBox installation with 16 CPUs and two emulated NVMe
 devices, used for pkg obj during massive parallel builds and also running
 bonnie and bonnie++ on the ld drives.

 I could not reproduce any issues with this setup when using ld0/ld1
 with plain ffs and option log.

 Martin

From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, David Brownlee <abs@absd.org>
Cc: msaitoh@execsw.org
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Tue, 17 Dec 2019 16:09:33 +0900

 >         Obviously a workaround is to include DIAGNOSTIC
 > 

 sys/dev/ic/nvme.c uses bus_space_barrier().
 x86's bus_space_barrier() change will be pulled up by the
 following ticket:

 	http://releng.netbsd.org/cgi-bin/req-9.cgi?show=566

  Could you test with this change?

 -- 
 -----------------------------------------------
                 SAITOH Masanobu (msaitoh@execsw.org
                                  msaitoh@netbsd.org)

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Wed, 18 Dec 2019 08:49:31 +0000

 On Fri, Dec 13, 2019 at 06:25:01PM +0000, abs@absd.org wrote:
  > 	Significant filesystem corruption seen when updating a netbsd-9 machine from releng 2019-11-19 to 2019-12-03
  > 
  > 	System is running /home as wapbl on cgd0 on ld0 at nvme0 (samsung MZVLB512HAJQ-000L7)

 This week I ran into fs corruption on wapbl on a raid1 on a pair of
 SATA SSDs, with a current kernel from 2019-11-14 that included
 DIAGNOSTIC.

 Circumstances are confusing and there were no operational symptoms
 until the nightly find / tripped on a corrupted directory entry, so it
 might not be in any way related; but I figured I'd mention it.

 Regardless of that, it also is kinda weird that DIAGNOSTIC would
 affect the problem.

 -- 
 David A. Holland
 dholland@netbsd.org

From: David Brownlee <abs@absd.org>
To: Masanobu SAITOH <msaitoh@execsw.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Thu, 19 Dec 2019 01:05:55 +0000

 On Tue, 17 Dec 2019 at 07:09, Masanobu SAITOH <msaitoh@execsw.org> wrote:
 >
 > >         Obviously a workaround is to include DIAGNOSTIC
 >
 > sys/dev/ic/nvme.c uses bus_space_barrier().
 > x86's bus_space_barrier() change will be pulled up by the
 > following ticket:
 >
 >         http://releng.netbsd.org/cgi-bin/req-9.cgi?show=566
 >
 >  Could you test with this change?

 I've switched from using cgd, and I've not seen corruption, but I have
 seen lockups (still trying to see if I can get into ddb) with the
 previous (problem) kernel.

 Testing with the Tue Dec 17 16:14:25 UTC 2019 build from releng, which
 includes bus_space.c,v 1.41.4.1, I saw the same type of lockup (stupidly
 did not have ddb.fromconsole=1)

 This is with some xterms, firefox, and network over iwm0. I'll leave
 it running overnight to see if it locks up again and if so if I can
 get into ddb...

 Thanks

 David

From: David Brownlee <abs@absd.org>
To: Masanobu SAITOH <msaitoh@execsw.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org
Subject: Re: kern/54761: nvme corruption on GENERIC without DIAGNOSTIC
Date: Thu, 19 Dec 2019 10:46:29 +0000

 On Thu, 19 Dec 2019 at 01:05, David Brownlee <abs@absd.org> wrote:
 >
 > On Tue, 17 Dec 2019 at 07:09, Masanobu SAITOH <msaitoh@execsw.org> wrote:
 > >
 > > >         Obviously a workaround is to include DIAGNOSTIC
 > >
 > > sys/dev/ic/nvme.c uses bus_space_barrier().
 > > x86's bus_space_barrier() change will be pulled up by the
 > > following ticket:
 > >
 > >         http://releng.netbsd.org/cgi-bin/req-9.cgi?show=566
 > >
 > >  Could you test with this change?
 >
 > I've switched from using cgd, and I've not seen corruption, but I have
 > seen lockups (still trying to see if I can get into ddb) with the
 > previous (problem) kernel.
 >
 > Testing with the Tue Dec 17 16:14:25 UTC 2019 build from releng, which
 > includes bus_space.c,v 1.41.4.1, I saw the same type of lockup (stupidly
 > did not have ddb.fromconsole=1)
 >
 > This is with some xterms, firefox, and network over iwm0. I'll leave
 > it running overnight to see if it locks up again and if so if I can
 > get into ddb...

 Interesting unrelated side effect with Tue Dec 17 16:14:25 UTC 2019
 from releng - now typing text into firefox briefly shows garbage under
 each new character...
