NetBSD Problem Report #55576

From spz@netbsd.org  Fri Aug 14 18:51:13 2020
Return-Path: <spz@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id BA0671A9239
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 14 Aug 2020 18:51:13 +0000 (UTC)
Message-Id: <20200814185112.425C066BBF@babylon5.netbsd.org>
Date: Fri, 14 Aug 2020 18:51:12 +0000 (UTC)
From: spz@NetBSD.org
Reply-To: spz@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: mpt clean volume degrades without clear reason, then panics
X-Send-Pr-Version: 3.95

>Number:         55576
>Category:       kern
>Synopsis:       mpt volume degradation and panic
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Aug 14 18:55:00 +0000 2020
>Originator:     S.P.Zeidler <spz@NetBSD.org>
>Release:        NetBSD 9.0_STABLE 20200807
>Organization:
	The NetBSD Foundation
>Environment:
System: NetBSD babylon5.netbsd.org 9.0_STABLE NetBSD 9.0_STABLE (BABYLON5) #1: Fri Aug 7 10:27:09 UTC 2020 spz@franklin.NetBSD.org:/home/netbsd/9/amd64/obj/sys/arch/amd64/compile/BABYLON5 amd64
Architecture: x86_64
Machine: amd64
>Description:

The panic is nothing new; it happens whenever there is a RAID to mend:
commands time out and the command pool dries up. PR 35071 is probably
related, but it dates from 5 major versions back.

[ 271468.2819715] mpt0: Phy 0: Link Rate 3.0 Gbps
[ 271468.2919772] mpt0: Phy 1: Link Rate 3.0 Gbps
[ 271468.5121031] mpt0: Unknown async event: 0xb
[ 271468.5421203] mpt0: Phy 2: Link Rate 3.0 Gbps
[ 271468.5521261] mpt0: Unknown async event: 0x13
[ 271468.7722520] mpt0: Unknown async event: 0xb
[ 271468.7722520] mpt0: Unknown async event: 0x15
[ 271468.7722520] mpt0: Unknown async event: 0x21
[ 271469.0023837] mpt0: Unknown async event: 0x15
[ 271469.0123894] mpt0: Unknown async event: 0x21
[ 271472.2742556] mpt0: Unknown async event: 0x21
[ 271472.2742556] mpt0: Unknown async event: 0x21
[ 271478.6679137] mpt0: restart succeeded
[ 271486.8225792] mpt0: read_cfg_header timed out
[ 271486.8325852] uvm_fault(0xffff888f9d870a38, 0x0, 1) -> e
[ 271486.8325852] fatal page fault in supervisor mode
[ 271486.8325852] trap type 6 code 0 rip 0xffffffff803bc2c7 cs 0x8 rflags 0x10246 cr2 0x10 ilevel 0x6 rsp 0xffffc186d3932bb0
[ 271486.8325852] curlwp 0xffff888f54660720 pid 21093.1 lowest kstack 0xffffc186d39302c0
kernel: page fault trap, code=0
Stopped in pid 21093.1 (bioctl) at      netbsd:mpt_read_cfg_header+0x27:        movq    10(%rax),%rax
db{2}> bt
mpt_read_cfg_header() at netbsd:mpt_read_cfg_header+0x27
mpt_get_cfg_page_ioc2() at netbsd:mpt_get_cfg_page_ioc2+0x23
mpt_bio_ioctl() at netbsd:mpt_bio_ioctl+0x17d
bioioctl() at netbsd:bioioctl+0x289
VOP_IOCTL() at netbsd:VOP_IOCTL+0x54
vn_ioctl() at netbsd:vn_ioctl+0xa5
sys_ioctl() at netbsd:sys_ioctl+0x5ab
syscall() at netbsd:syscall+0x157
--- syscall (number 54) ---
7af49936824a:
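
The faulting instruction loads through %rax with a small displacement,
and cr2 is 0x10, which is consistent with %rax being NULL at that point.
One possible reading, not verified against the driver source, is that the
request pool was empty and the caller used the NULL result without
checking it. A minimal sketch of that pattern and of a defensive check
follows; every sketch_* name and the structure layout are hypothetical
stand-ins, not the real mpt(4) definitions:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a driver request structure. */
struct sketch_request {
	void *link;	/* first pointer-sized field */
	void *state;	/* second pointer-sized field */
	void *req_vbuf;	/* third field: offset 0x10 on LP64, so reading it
			 * through a NULL request faults at address 0x10 */
};

/* Hypothetical fixed-size pool of free requests. */
struct sketch_pool {
	struct sketch_request *free[4];
	size_t nfree;
};

/* Returns a free request, or NULL when the pool has dried up. */
static struct sketch_request *
sketch_get_request(struct sketch_pool *p)
{
	if (p->nfree == 0)
		return NULL;
	return p->free[--p->nfree];
}

/*
 * Defensive variant of a config-header read: report EAGAIN instead of
 * dereferencing a NULL request when no request is available.
 */
static int
sketch_read_cfg_header(struct sketch_pool *p, void **vbuf_out)
{
	struct sketch_request *req = sketch_get_request(p);

	if (req == NULL)
		return EAGAIN;
	*vbuf_out = req->req_vbuf;
	p->free[p->nfree++] = req;	/* hand the request back */
	return 0;
}
```

With a check like this, a bioctl call issued while the pool is exhausted
would get an error back instead of taking the page fault shown above.
Whether EAGAIN is the right error to return is an assumption of the
sketch.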

What's new:
the volume goes to Degraded without a clear reason (such as a bad disk).
Log:
# bioctl mpt0 show
Volume Status       Size         Device/Label    Level Stripe
=============================================================
     0 Online       1.8T sd0 LSILOGIC Logical Volume 3000   RAID 1    N/A 
   0:0 Online       1.8T         0:2.0 noencl <ATA ST2000NM0011 SN02>
   0:1 Online       1.8T         0:1.0 noencl <ATA ST2000NM0011 SN02>
Machine is up by: Wed Aug 12 20:31:14 UTC 2020

Starting after Fri Aug 14 16:00:00 2020 console log says:
[ 213428.7823240] mpt0: mpt_done: IOC SCSI task terminated!
[ 213428.7823240] mpt0: mpt_done: IOC fatal error: restarting...
[ 213459.0088964] mpt0: soft reset failed: ack timeout
[ 213459.0088964] mpt0: soft reset failed
[repeat the messages in order >200 times until]
[ 216179.3003868] mpt0: mpt_done: IOC SCSI task terminated!
[ 216179.3003868] mpt0: mpt_done: IOC fatal error: restarting...
[ 216179.4404636] mpt0: re-queued 254 requests
[ 216180.9312810] mpt0: Phy 0: Link Rate 3.0 Gbps
[ 216180.9312810] mpt0: Phy 1: Link Rate 3.0 Gbps
[ 216181.1413962] mpt0: Unknown async event: 0xb
[ 216181.1413962] mpt0: Unknown async event: 0x15
[ 216181.1413962] mpt0: Unknown async event: 0x21
[ 216181.1614072] mpt0: Unknown async event: 0xb
[ 216181.1814182] mpt0: Unknown async event: 0xb
[ 216181.1814182] mpt0: Phy 2: Link Rate 3.0 Gbps
[ 216181.2014292] mpt0: Unknown async event: 0x13
[ 216181.4415608] mpt0: Unknown async event: 0xb
[ 216181.4415608] mpt0: Unknown async event: 0x15
[ 216181.4415608] mpt0: Unknown async event: 0x21
[ 216181.6616815] mpt0: Unknown async event: 0x15
[ 216181.6616815] mpt0: Unknown async event: 0x21
[ 216187.2047206] mpt0: Unknown async event: 0x21
[ 216187.2047206] mpt0: Unknown async event: 0x21
[ 216187.2047206] mpt0: Unknown async event: 0xb
[ 216191.3069698] mpt0: restart succeeded
[ 216202.1829332] mpt0: Phy 2: Link Status Unknown
[ 216205.1845792] mpt0: Unknown async event: 0xb
[ 216205.1845792] mpt0: Unknown async event: 0x15
[ 216205.1845792] mpt0: Unknown async event: 0x21
[ 216205.1945845] mpt0: Unknown async event: 0xb
[ 216205.2145956] mpt0: Unknown async event: 0x21
[ 216205.2246010] mpt0: Unknown async event: 0x21
[ 216205.2246010] mpt0: Unknown async event: 0xb
[ 216205.2246010] mpt0: Unknown async event: 0x15
[ 216205.2246010] mpt0: Unknown async event: 0x21
[ 216205.2346065] mpt0: Unknown async event: 0x15
[ 216205.2346065] mpt0: Unknown async event: 0x21
[ 216238.4328103] mpt0: Phy 2: Link Rate 3.0 Gbps
[ 216238.4928423] mpt0: Unknown async event: 0xb
[ 216238.4928423] mpt0: Unknown async event: 0x15
[ 216238.5028481] mpt0: Unknown async event: 0x21
[ 216238.8830568] mpt0: Unknown async event: 0x15
[ 216238.8830568] mpt0: Unknown async event: 0x21
[ 216238.9731065] mpt0: Unknown async event: 0x21
[ 216238.9731065] mpt0: Unknown async event: 0x21
[ 216239.0031229] mpt0: Unknown async event: 0xb
[ 216502.1974219] mpt0: Unknown async event: 0xb
[ 216502.1974219] mpt0: Unknown async event: 0x21
[ 217343.5687086] mpt0: Unknown async event: 0x14
[ 218445.7129594] mpt0: Unknown async event: 0x14
[ 219507.2649365] mpt0: Unknown async event: 0x14
[ 220563.8741977] mpt0: Unknown async event: 0x14
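
The log above contains five distinct unknown event codes (0xb, 0x13,
0x14, 0x15 and 0x21), with 0x14 recurring roughly every 1050 to 1100
seconds after the restart. For anyone who wants to tally them from a
saved console log, here is a small sketch of a line parser; the function
name parse_async_event is made up for illustration and the code is not
taken from any existing tool:

```c
#include <stdio.h>
#include <string.h>

/*
 * Extracts the event code from one console line such as
 * "[ 216181.1413962] mpt0: Unknown async event: 0x15".
 * Returns 1 and stores the code in *code on success,
 * 0 if the line is not an "Unknown async event" line.
 */
static int
parse_async_event(const char *line, unsigned *code)
{
	static const char tag[] = "Unknown async event: ";
	const char *p = strstr(line, tag);

	if (p == NULL)
		return 0;
	/* %x accepts the optional "0x" prefix used in the log. */
	return sscanf(p + sizeof(tag) - 1, "%x", code) == 1;
}
```

Calling this for each line and counting per code and per timestamp range
should make it easier to see which events cluster around the restart and
which keep arriving afterwards.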

# bioctl mpt0 show
Volume Status       Size         Device/Label    Level Stripe
=============================================================
     0 Degraded     1.8T sd0 LSILOGIC Logical Volume 3000   RAID 1    N/A 
   0:0 Online       1.8T         0:2.0 noencl <ATA ST2000NM0011 SN02>
   0:1 Online       1.8T         0:1.0 noencl <ATA ST2000NM0011 SN02>

Until Aug 4th the system ran an 8_STABLE kernel and occasionally had
3 or 4 mpt_done timeouts. With the netbsd-9 20200802 and 20200807
kernels, the situation has become significantly less stable.

>How-To-Repeat:

>Fix:
