NetBSD Problem Report #55189
From gson@gson.org Mon Apr 20 15:20:56 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 3E03D1A924D
for <gnats-bugs@gnats.NetBSD.org>; Mon, 20 Apr 2020 15:20:56 +0000 (UTC)
Message-Id: <20200420152051.8FA1F253F35@guava.gson.org>
Date: Mon, 20 Apr 2020 18:20:51 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: NVMe SSD reports unsafe shutdowns
X-Send-Pr-Version: 3.95
>Number: 55189
>Category: kern
>Synopsis: NVMe SSD reports unsafe shutdowns
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Apr 20 15:25:01 +0000 2020
>Last-Modified: Wed Apr 22 15:20:01 +0000 2020
>Originator: Andreas Gustafsson
>Release: NetBSD 9.0
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
I'm running NetBSD 9.0/amd64 on a server with an NVMe SSD as a
non-root disk, holding an FFS that is mounted from fstab and actively
used. The dmesg identifies it as:
[ 1.114589] nvme0 at pci9 dev 0 function 0: vendor 144d product a804 (rev. 0x00)
[ 1.114589] nvme0: NVMe 1.2
[ 1.114589] nvme0: for admin queue interrupting at msix0 vec 0
[ 1.114589] nvme0: Samsung SSD 960 EVO 250GB, firmware 3B7QCXE7, serial S3ESNX1K203084E
[ 1.114589] nvme0: for io queue 1 interrupting at msix0 vec 1 affinity to cpu0
[ 1.114589] nvme0: for io queue 2 interrupting at msix0 vec 2 affinity to cpu1
[ 1.114589] nvme0: for io queue 3 interrupting at msix0 vec 3 affinity to cpu2
[ 1.114589] nvme0: for io queue 4 interrupting at msix0 vec 4 affinity to cpu3
[ 1.114589] nvme0: for io queue 5 interrupting at msix0 vec 5 affinity to cpu4
[ 1.114589] nvme0: for io queue 6 interrupting at msix0 vec 6 affinity to cpu5
[ 1.114589] nvme0: for io queue 7 interrupting at msix0 vec 7 affinity to cpu6
[ 1.114589] ld0 at nvme0 nsid 1
[ 1.114589] ld0: 232 GB, 30401 cyl, 255 head, 63 sec, 512 bytes/sect x 488397168 sectors
I recently figured out the command line for getting SMART information
from the drive, namely
sudo smartctl -d nvme,0x1 -a /dev/nvme0
The output was mostly as expected, but one item stood out:
Unsafe Shutdowns: 68
This is only two less than the reported number of power cycles,
and appears to increase by one for each power cycle, even after
applying the patch from PR 54969. The machine is typically
powered off by running "halt -p".
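The comparison between the two counters can be sketched as a small script. This is only an illustration, assuming the text output of the smartctl command above has been captured into a string; the helper names `smart_counters` and `clean_shutdowns` are hypothetical and not part of smartmontools.

```python
import re

def smart_counters(text):
    """Collect the Power Cycles and Unsafe Shutdowns values from
    smartctl -a text output (values may contain thousands separators)."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*(Power Cycles|Unsafe Shutdowns):\s*([\d,]+)\s*$", line)
        if m:
            counters[m.group(1)] = int(m.group(2).replace(",", ""))
    return counters

def clean_shutdowns(text):
    """Number of power cycles the drive did not count as unsafe."""
    c = smart_counters(text)
    return c["Power Cycles"] - c["Unsafe Shutdowns"]

# Excerpt of the values reported in this PR.
sample = """\
Power Cycles: 70
Power On Hours: 11,767
Unsafe Shutdowns: 68
"""

if __name__ == "__main__":
    print(clean_shutdowns(sample))  # 70 - 68 = 2
```

Running the same check before and after a "halt -p" cycle would show whether the difference stays constant, which is what is observed here.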
There are also some items in the SMART error log, but I believe those
are harmless and resulted from me running smartctl with the wrong
options while trying to figure out the right ones.
I see no detach messages for ld0 nor nvme0 on the console when halting
the machine, even though such messages are printed for other disks
(including sd0 since applying the patch from PR 54969). Are such
messages expected?
So far, I have not experienced any data loss nor noticed any
unexpected fsck activity on boot.
The full smartctl output follows.
smartctl 7.0 2018-12-30 r4883 [NetBSD 9.0 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 960 EVO 250GB
Serial Number: S3ESNX1K203084E
Firmware Version: 3B7QCXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 250,059,350,016 [250 GB]
Unallocated NVM Capacity: 0
Controller ID: 2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 250,059,350,016 [250 GB]
Namespace 1 Utilization: 223,633,428,480 [223 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5281b2e8cb
Local Time is: Mon Apr 20 17:02:41 2020 EEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0007): Security Format Frmw_DL
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 77 Celsius
Critical Comp. Temp. Threshold: 79 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 41 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 5%
Data Units Read: 14,061,093 [7.19 TB]
Data Units Written: 57,169,239 [29.2 TB]
Host Read Commands: 443,435,057
Host Write Commands: 1,627,039,827
Controller Busy Time: 1,256
Power Cycles: 70
Power On Hours: 11,767
Unsafe Shutdowns: 68
Media and Data Integrity Errors: 0
Error Information Log Entries: 12
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 41 Celsius
Temperature Sensor 2: 54 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         12     0  0x0000  0x4016      -            0 65535     -
  1         11     0  0x0000  0x4016  0x004            0 65535     -
  2         10     0  0x0000  0x4016      -            0 65535     -
  3          9     0  0x0000  0x4016  0x004            0 65535     -
  4          8     0  0x0000  0x4016  0x004            0 65535     -
  5          7     0  0x0000  0x4016  0x004            0 65535     -
  6          6     0  0x0000  0x4016  0x004            0 65535     -
  7          5     0  0x0000  0x4016  0x004            0 65535     -
  8          4     0  0x0000  0x4016  0x004            0 65535     -
  9          3     0  0x0000  0x4016  0x004            0 65535     -
 10          2     0  0x0000  0x4016  0x004            0 65535     -
 11          1     0  0x0000  0x4212  0x028            0 65535     -
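The Status values in the error log above can be decoded to check the belief that the entries come from malformed smartctl requests. The sketch below rests on two assumptions that are not stated in this report: that smartctl 7.0 prints the raw 16-bit status field with the phase tag in bit 0, and that the remaining bits follow the NVMe completion status layout (status code in bits 7:0, status code type in bits 10:8, More in bit 13, Do Not Retry in bit 14). The function name and the two-entry name table are hypothetical and cover only the values seen here.

```python
# Names for the (status code type, status code) pairs in this log only,
# taken from the NVMe base specification status code tables.
STATUS_NAMES = {
    (0x0, 0x0B): "Invalid Namespace or Format",
    (0x1, 0x09): "Invalid Log Page",
}

def decode_status(raw):
    """Split a raw error-log status field into its parts.
    Assumes bit 0 is the phase tag and is not part of the status."""
    status = raw >> 1
    sct = (status >> 8) & 0x7
    sc = status & 0xFF
    return {
        "sct": sct,
        "sc": sc,
        "more": bool(status & (1 << 13)),
        "dnr": bool(status & (1 << 14)),
        "name": STATUS_NAMES.get((sct, sc), "unknown"),
    }

if __name__ == "__main__":
    for raw in (0x4016, 0x4212):
        print(hex(raw), decode_status(raw))
```

Under those assumptions, 0x4016 decodes to "Invalid Namespace or Format" and 0x4212 to "Invalid Log Page", both of which are command rejections rather than media errors.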
>How-To-Repeat:
>Fix:
>Audit-Trail:
From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Mon, 20 Apr 2020 10:16:03 -0700 (PDT)
> sudo smartctl -d nvme,0x1 -a /dev/nvme0
>
> The output was mostly as expected, but one item stood out:
>
> Unsafe Shutdowns: 68
>
> This is only two less than the reported number of power cycles,
> and appears to increase by one for each power cycle, even after
> applying the patch from PR 54969. The machine is typically
> powered off by running "halt -p".
I've got similar results on my "Samsung SSD 960 PRO 512GB":
...
Power Cycles: 32
Power On Hours: 16,890
Unsafe Shutdowns: 30
...
+--------------------+--------------------------+-----------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul@whooppee.com |
| Software Developer | 0786 F758 55DE 53BA 7731 | pgoyette@netbsd.org |
+--------------------+--------------------------+-----------------------+
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Tue, 21 Apr 2020 06:53:07 +1000
we don't do this for wd(4) as well, on some of my systems
atactl can report eg:
174 100 0 no online positive Unexpected Power Loss Count 27
that goes up when i have an otherwise successful "poweroff".
.mrg.
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Tue, 21 Apr 2020 09:30:49 +0300
matthew green wrote:
> we don't do this for wd(4) as well, on some of my systems
> atactl can report eg:
>
> 174 100 0 no online positive Unexpected Power Loss Count 27
>
> that goes up when i have an otherwise successful "poweroff".
That could also be PR 54969, which was just fixed in -current. What
version are those systems running?
--
Andreas Gustafsson, gson@gson.org
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, gson@gson.org (Andreas Gustafsson)
Subject: re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Wed, 22 Apr 2020 03:51:04 +1000
> matthew green wrote:
> > we don't do this for wd(4) as well, on some of my systems
> > atactl can report eg:
> >
> > 174 100 0 no online positive Unexpected Power Loss Count 27
> >
> > that goes up when i have an otherwise successful "poweroff".
>
> That could also be PR 54969, which was just fixed in -current. What
> version are those systems running?
ah, none of them are -current, but the problem i saw has
been around for a long time (netbsd-6 at least?) so i'm
not sure it was broken by the same change, or perhaps
fixed by the same fix.
it may be -- looking at a test machine, i see that wd0
does not detach at reboot, and simple reboot didn't make
that count go up (but i'm not near the machine with a
long history of this...)
i'll try to setup a test for this.
.mrg.
From: Andreas Gustafsson <gson@gson.org>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@netbsd.org
Subject: re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Tue, 21 Apr 2020 21:23:48 +0300
matthew green wrote:
> > That could also be PR 54969, which was just fixed in -current. What
> > version are those systems running?
>
> ah, none of them are -current, but the problem i saw has
> been around for a long time (netbsd-6 at least?) so i'm
> not sure it was broken by the same change, or perhaps
> fixed by the same fix.
If your issue has been around since -6, it's not PR 54969 because that
one started only in 2017 (2017.08.21.09.00.21 to be exact), and the
only release branch affected by it is -9.
--
Andreas Gustafsson, gson@gson.org
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@netbsd.org
Cc: gson@gson.org (Andreas Gustafsson)
Subject: Re: kern/55189: NVMe SSD reports unsafe shutdowns
Date: Wed, 22 Apr 2020 18:17:14 +0300
FWIW, the lack of detach messages for nvme0 and ld0 can also be
reproduced in qemu without any physical NVMe hardware. Just create
a disk image:
dd if=/dev/zero of=nvme.img bs=1k count=1024
and add the following to the qemu-system-x86_64 command line:
-drive file=nvme.img,if=none,id=a -device nvme,drive=a,serial=1234
The console will show attach messages like these at boot time:
[ 1.0071408] nvme0 at pci0 dev 4 function 0: Intel product 5845 (rev. 0x02)
[ 1.0071408] nvme0: NVMe 1.2
[ 1.0071408] nvme0: for admin queue interrupting at msix0 vec 0
[ 1.0071408] nvme0: QEMU NVMe Ctrl, firmware 1.0, serial 1234
[ 1.0071408] nvme0: for io queue 1 interrupting at msix0 vec 1 affinity to cpu0
[ 1.0071408] ld0 at nvme0 nsid 1
[ 1.0071408] ld0: 1024 KB, 2 cyl, 16 head, 63 sec, 512 bytes/sect x 2048 sectors
but no corresponding detach messages at shutdown time.
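The qemu reproduction steps above can be scripted along these lines. This is a minimal sketch: the helper names `make_image` and `nvme_args` are hypothetical, the image is written directly instead of calling dd, and the rest of the qemu-system-x86_64 command line (boot disk, memory and so on) is left to the caller.

```python
import os
import tempfile

def make_image(path, size_kib=1024):
    """Create a zero-filled image of size_kib KiB, equivalent to
    dd if=/dev/zero of=path bs=1k count=size_kib."""
    with open(path, "wb") as f:
        f.write(b"\0" * (size_kib * 1024))
    return path

def nvme_args(image, drive_id="a", serial="1234"):
    """Return the extra qemu arguments that attach image as an NVMe device."""
    return [
        "-drive", f"file={image},if=none,id={drive_id}",
        "-device", f"nvme,drive={drive_id},serial={serial}",
    ]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        img = make_image(os.path.join(d, "nvme.img"))
        print(os.path.getsize(img) // 512, "sectors")
        print(" ".join(nvme_args("nvme.img")))
```

The default size yields 2048 sectors of 512 bytes, matching the ld0 attach message shown above.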
--
Andreas Gustafsson, gson@gson.org