NetBSD Problem Report #55189

From gson@gson.org  Mon Apr 20 15:20:56 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 3E03D1A924D
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 20 Apr 2020 15:20:56 +0000 (UTC)
Message-Id: <20200420152051.8FA1F253F35@guava.gson.org>
Date: Mon, 20 Apr 2020 18:20:51 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: NVMe SSD reports unsafe shutdowns
X-Send-Pr-Version: 3.95

>Number:         55189
>Category:       kern
>Synopsis:       NVMe SSD reports unsafe shutdowns
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 20 15:25:01 +0000 2020
>Last-Modified:  Wed Apr 22 15:20:01 +0000 2020
>Originator:     Andreas Gustafsson
>Release:        NetBSD 9.0
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

I'm running NetBSD 9.0/amd64 on a server with an NVMe SSD as a
non-root disk, holding an FFS that is mounted from fstab and actively
used.  The dmesg identifies it as:

   [     1.114589] nvme0 at pci9 dev 0 function 0: vendor 144d product a804 (rev. 0x00)
   [     1.114589] nvme0: NVMe 1.2
   [     1.114589] nvme0: for admin queue interrupting at msix0 vec 0
   [     1.114589] nvme0: Samsung SSD 960 EVO 250GB, firmware 3B7QCXE7, serial S3ESNX1K203084E
   [     1.114589] nvme0: for io queue 1 interrupting at msix0 vec 1 affinity to cpu0
   [     1.114589] nvme0: for io queue 2 interrupting at msix0 vec 2 affinity to cpu1
   [     1.114589] nvme0: for io queue 3 interrupting at msix0 vec 3 affinity to cpu2
   [     1.114589] nvme0: for io queue 4 interrupting at msix0 vec 4 affinity to cpu3
   [     1.114589] nvme0: for io queue 5 interrupting at msix0 vec 5 affinity to cpu4
   [     1.114589] nvme0: for io queue 6 interrupting at msix0 vec 6 affinity to cpu5
   [     1.114589] nvme0: for io queue 7 interrupting at msix0 vec 7 affinity to cpu6
   [     1.114589] ld0 at nvme0 nsid 1
   [     1.114589] ld0: 232 GB, 30401 cyl, 255 head, 63 sec, 512 bytes/sect x 488397168 sectors

I recently figured out the command line for gettings SMART information
from the drive, namely

  sudo smartctl -d nvme,0x1 -a /dev/nvme0

The output was mostly as expected, but one item stood out:

  Unsafe Shutdowns:                   68

This is only two less than the reported number of power cycles,
and appears to increase by one for each power cycle, even after
applying the patch from PR 54969.  The machine is typically 
powered off by running "halt -p".

There are also some items in the SMART error log, but I believe those
are harmless and resulted from me running smartctl with the wrong
options while trying to figure out the right ones.

I see no detach messages for ld0 nor nvme0 on the console when halting
the machine, even though such messages are printed for other disks
(including sd0 since applying the patch from PR 54969).  Are such
messages expected?

So far, I have not experienced any data loss nor noticed any
unexpected fsck activity on boot.

The full smartctl output follows.

smartctl 7.0 2018-12-30 r4883 [NetBSD 9.0 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 250GB
Serial Number:                      S3ESNX1K203084E
Firmware Version:                   3B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            223,633,428,480 [223 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5281b2e8cb
Local Time is:                      Mon Apr 20 17:02:41 2020 EEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    5%
Data Units Read:                    14,061,093 [7.19 TB]
Data Units Written:                 57,169,239 [29.2 TB]
Host Read Commands:                 443,435,057
Host Write Commands:                1,627,039,827
Controller Busy Time:               1,256
Power Cycles:                       70
Power On Hours:                     11,767
Unsafe Shutdowns:                   68
Media and Data Integrity Errors:    0
Error Information Log Entries:      12
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               41 Celsius
Temperature Sensor 2:               54 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         12     0  0x0000  0x4016      -            0 65535     -
  1         11     0  0x0000  0x4016  0x004            0 65535     -
  2         10     0  0x0000  0x4016      -            0 65535     -
  3          9     0  0x0000  0x4016  0x004            0 65535     -
  4          8     0  0x0000  0x4016  0x004            0 65535     -
  5          7     0  0x0000  0x4016  0x004            0 65535     -
  6          6     0  0x0000  0x4016  0x004            0 65535     -
  7          5     0  0x0000  0x4016  0x004            0 65535     -
  8          4     0  0x0000  0x4016  0x004            0 65535     -
  9          3     0  0x0000  0x4016  0x004            0 65535     -
 10          2     0  0x0000  0x4016  0x004            0 65535     -
 11          1     0  0x0000  0x4212  0x028            0 65535     -

>How-To-Repeat:

>Fix:

>Audit-Trail:
From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Mon, 20 Apr 2020 10:16:03 -0700 (PDT)

 >  sudo smartctl -d nvme,0x1 -a /dev/nvme0
 >
 > The output was mostly as expected, but one item stood out:
 >
 >  Unsafe Shutdowns:                   68
 >
 > This is only two less than the reported number of power cycles,
 > and appears to increase by one for each power cycle, even after
 > applying the patch from PR 54969.  The machine is typically
 > powered off by running "halt -p".

 I've got similar results on my "Samsung SSD 960 PRO 512GB":

     ...
     Power Cycles:                       32
     Power On Hours:                     16,890
     Unsafe Shutdowns:                   30
     ...


 +--------------------+--------------------------+-----------------------+
 | Paul Goyette       | PGP Key fingerprint:     | E-mail addresses:     |
 | (Retired)          | FA29 0E3B 35AF E8AE 6651 | paul@whooppee.com     |
 | Software Developer | 0786 F758 55DE 53BA 7731 | pgoyette@netbsd.org   |
 +--------------------+--------------------------+-----------------------+

From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Tue, 21 Apr 2020 06:53:07 +1000

 we don't do this for wd(4) as well, on some of my systems
 atactl can report eg:

 174 100    0     no  online  positive    Unexpected Power Loss Count 27

 that goes up when i have an otherwise successful "poweroff".


 .mrg.

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Tue, 21 Apr 2020 09:30:49 +0300

 matthew green wrote:
 >  we don't do this for wd(4) as well, on some of my systems
 >  atactl can report eg:
 >  
 >  174 100    0     no  online  positive    Unexpected Power Loss Count 27
 >  
 >  that goes up when i have an otherwise successful "poweroff".

 That could also be PR 54969, which was just fixed in -current.  What
 version are those systems running?
 -- 
 Andreas Gustafsson, gson@gson.org

From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org, gson@gson.org (Andreas Gustafsson)
Subject: re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Wed, 22 Apr 2020 03:51:04 +1000

 >  matthew green wrote:
 >  >  we don't do this for wd(4) as well, on some of my systems
 >  >  atactl can report eg:
 >  >  
 >  >  174 100    0     no  online  positive    Unexpected Power Loss Count 27
 >  >  
 >  >  that goes up when i have an otherwise successful "poweroff".
 >  
 >  That could also be PR 54969, which was just fixed in -current.  What
 >  version are those systems running?

 ah, none of the are -current, but the problem i saw has
 been around for a long time (netbsd-6 at least?) so i'm
 not sure it was broken by the same change, or perhaps
 fixed by the same fixed.

 it may be -- looking at a test machine, i see that wd0
 does not detach at reboot, and simple reboot didn't make
 that count go up (but i'm not near the machine with a
 long history of this...)


 i'll try to setup a test for this.


 .mrg.

From: Andreas Gustafsson <gson@gson.org>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@netbsd.org
Subject: re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Tue, 21 Apr 2020 21:23:48 +0300

 matthew green wrote:
 > >  That could also be PR 54969, which was just fixed in -current.  What
 > >  version are those systems running?
 > 
 > ah, none of the are -current, but the problem i saw has
 > been around for a long time (netbsd-6 at least?) so i'm
 > not sure it was broken by the same change, or perhaps
 > fixed by the same fixed.

 If your issue has been around since -6, it's not PR 54969 because that
 one started only in 2017 (2017.08.21.09.00.21 to be exact), and the
 only release branch affected by it is -9.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@netbsd.org
Cc: gson@gson.org (Andreas Gustafsson)
Subject: Re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Wed, 22 Apr 2020 18:17:14 +0300

 FWIW, the lack of detach messages for nvme0 and ld0 can also be
 reproduced in qemu without any physical NVMe hardware.  Just create
 a disk image:

   dd if=/dev/zero of=nvme.img bs=1k count=1024

 and add the following to the qemu-system-x86_64 command line:

   -drive file=nvme.img,if=none,id=a -device nvme,drive=a,serial=1234

 The console will show attach messages like these at boot time:

   [   1.0071408] nvme0 at pci0 dev 4 function 0: Intel product 5845 (rev. 0x02)
   [   1.0071408] nvme0: NVMe 1.2
   [   1.0071408] nvme0: for admin queue interrupting at msix0 vec 0
   [   1.0071408] nvme0: QEMU NVMe Ctrl, firmware 1.0, serial 1234
   [   1.0071408] nvme0: for io queue 1 interrupting at msix0 vec 1 affinity to cpu0
   [   1.0071408] ld0 at nvme0 nsid 1
   [   1.0071408] ld0: 1024 KB, 2 cyl, 16 head, 63 sec, 512 bytes/sect x 2048 sectors

 but no corresponding detach messages at shutdown time.
 -- 
 Andreas Gustafsson, gson@gson.org
Home
PR Database Search
(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.