NetBSD Problem Report #57848

From www@netbsd.org  Sat Jan 13 21:57:20 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id D7C2E1A9238
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 13 Jan 2024 21:57:20 +0000 (UTC)
Message-Id: <20240113215718.EE7551A9239@mollari.NetBSD.org>
Date: Sat, 13 Jan 2024 21:57:18 +0000 (UTC)
From: als@thangorodrim.ch
Reply-To: als@thangorodrim.ch
To: gnats-bugs@NetBSD.org
Subject: NetBSD 9.3/sparc64 crash reboots under high I/O load
X-Send-Pr-Version: www-1.0

>Number:         57848
>Category:       port-sparc64
>Synopsis:       NetBSD 9.3/sparc64 crash reboots under high I/O load
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    port-sparc64-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 13 22:00:00 +0000 2024
>Last-Modified:  Thu Jan 25 23:50:01 +0000 2024
>Originator:     Alexander Schreiber
>Release:        NetBSD 9.3 (release)
>Organization:
>Environment:
NetBSD laurelin.angband.thangorodrim.de 9.3 NetBSD 9.3 (TELPERION) #0: Sun Nov 27 15:17:20 CET 2022  root@telperion.angband.thangorodrim.de:/usr/obj/sys/arch/sparc64/compile/TELPERION sparc64

>Description:
Let me preface this with: I _suspect_ that something is actually wonky with the machine itself.

machine background:
 - this is a Sun Fire V100 with a SUNW,UltraSPARC-IIe @ 548 MHz CPU and
   maximum memory loadout (2G)
 - it has been frankensteined a bit (which would probably give a Sun field
   support engineer apoplexy if they were still around):
   - system board shelled and moved into MicroATX case
   - powered by 100W NanoPSU (which I suspect might be relevant)
   - wd0 is a 64G PATA SSD
   - wd1 and wd2 are 2 TB SATA SSDs, each behind a SATA-PATA converter
   - CPU cooling fan replaced with a quieter one, keeping the cooler in place
     but removing the airguide
   - NetBSD 9.3 was installed on wd0 from the netbooted installer ISO
   - system was rebuilt to enable ZFS
   - wd1 & wd2 were set up as a ZFS mirror-1 for data storage
 - in this configuration, the machine ran for at least 1y just fine
 - a few weeks ago, I accidentally power killed the machine (pulled the plug)
 - on next power up, it refused to boot from wd0, claiming there was nothing
   there ... also, probe-ide-all only showed the PATA SSD
 - netbooted NetBSD 9.3 installer again, reinstalled to wd0
 - system still refused to boot from wd0
 - copied install from wd0 to NFS host, switched to NFS root
 - rebuilt system to enable ZFS and use custom kernel config
 - two crash reboots mid system rebuild
 - imported ZFS pool
 - started copying (via rsync-over-ssh) a 75G data set to ZFS
 - this triggered crash reboots several times, sometimes after as little as 1h
 - updating a 210G git repo to ZFS sometimes also triggered that

It always seems to crash (and then reboot) at the same place. Crash message
copied from console (with ddb.onpanic=1):

[ 54990.7869753] data error type 32 sfsr=0 sfva=40bdf810 afsr=84000000 afva=1fe02004000 tf=0x1782cd850
[ 54990.9170412] data fault: pc=1083004 addr=40bdf810 sfsr=0x0<ASI=0x0>
[ 54990.9170412] kernel trap 32: data access error
Stopped in pid 0.29 (system) at netbsd:alipm_smb_exec+0x184:    andcc           %g1, 0xe4, %g0
db{0}> bt
iic_exec(101da35a8, 1, 18, 1782cdbfd, 1, 1782cdbff) at netbsd:iic_exec+0x1ac
admtemp_refresh(101dfc148, 10251f5b8, e0047ed0, 0, 103b1a0, 10251f588) at netbsd:admtemp_refresh+0x48
sysmon_envsys_refresh_sensor(101dfc148, 10251f5b8, 186d800, 1672000, 102553960, 102553a90) at netbsd:sysmon_envsys_refresh_sensor+0x1c
sme_events_worker(101dfc218, 101dfc148, 102553960, 10251f5b8, 101dfc148, 101dc3248) at netbsd:sme_events_worker+0x130
workqueue_worker(10254b0c0, 10254b120, 10254b130, 10254b108, 101dc3188, 10254b100) at netbsd:workqueue_worker+0xf0
lwp_trampoline(f0061134, 116000, 113a30, 1, fffc5c88, 0) at netbsd:lwp_trampoline+0x8
db{0}>


contents of /etc/mk.conf:

# ===================
ACCEPTABLE_LICENSES+= vim-license
ACCEPTABLE_LICENSES+= gnu-agpl-v3
PKG_OPTIONS.python27=-x11
PKG_OPTIONS.python37=-x11
PKG_OPTIONS.ghostscript=-x11
ALLOW_VULNERABLE_PACKAGES=yes

UPDATE_TARGET=package-install
WRKOBJDIR = /usr/pkgobj
# WRKOBJDIR = /zfs/pkgobj
# WRKOBJDIR = /backup/1/pkgobj
PACKAGES=${PKGSRCDIR}/packages/${LOWER_OPSYS}-${OS_VERSION}-${MACHINE_ARCH}
USE_FORT=yes
USE_SSP=yes
# PKG_DBDIR=/var/db/pkg

MKZFS=yes
#=======================

The changes between GENERIC and the custom kernel mostly comment out
support for hardware the machine doesn't have and network support I
don't need. In addition:

-#options       BLINK           # blink the system LED
+options        BLINK           # blink the system LED

-#options       NFS_BOOT_BOOTP
+options        NFS_BOOT_BOOTP

-#options       DIAGNOSTIC      # extra kernel sanity checking
+options        DIAGNOSTIC      # extra kernel sanity checking

I can provide the full config if needed, of course. Interestingly, the kernel
compiled on this machine uses the _same_ config I have on another V100, with
only the config name changed, yet the kernel built on this machine is 120
bytes larger.

I have another, almost identical Sun Fire V100, differences are
 - 3x SATA HDDs behind SATA-PATA adapters
 - full ATX PSU
This one exhibits _none_ of these issues. The machine only started
misbehaving after unexpectedly losing power. I suspect that since that
NanoPSU has no useful amount of capacitor energy storage, all the power
rails dropped at once, which might not be what the hardware expects and
may have therefore poked something in unexpected ways - blind guess, though.
>How-To-Repeat:
Borrow my (presumably slightly wonky) machine?
>Fix:
none

>Audit-Trail:
From: Alexander Schreiber <als@thangorodrim.ch>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-sparc64/57848: NetBSD 9.3/sparc64 crash reboots under high
 I/O load
Date: Sun, 14 Jan 2024 15:55:21 +0100

 And the machine crashed again, this time overnight while doing a git pull
 that by the time I went to bed had already written 12+ GB to ZFS.

 console log:

 [ 70718.1188914] data error type 32 sfsr=0 sfva=46d5e000 afsr=84000000 afva=1fe02004000 tf=0x1782cd850
 [ 70718.2489537] data fault: pc=1083004 addr=46d5e000 sfsr=0x0<ASI=0x0>
 [ 70718.2489537] kernel trap 32: data access error
 Stopped in pid 0.29 (system) at netbsd:alipm_smb_exec+0x184:    andcc           %g1, 0xe4, %g0
 db{0}> bt
 iic_exec(101da35a8, 1, 18, 1782cdbfd, 1, 1782cdbff) at netbsd:iic_exec+0x1ac
 admtemp_refresh(101dfc148, 10251f6e0, c8, 0, 103b1a0, 10251f588) at netbsd:admtemp_refresh+0x48
 sysmon_envsys_refresh_sensor(101dfc148, 10251f6e0, 102553960, 166f400, 10251f5b8, 162e648) at netbsd:sysmon_envsys_refresh_sensor+0x1c
 sme_events_worker(101dfc218, 101dfc148, 102553960, 10251f6e0, 101dfc148, 101dc3188) at netbsd:sme_events_worker+0x130
 workqueue_worker(10254b0c0, 10254b120, 10254b130, 10254b108, 0, 10254b100) at netbsd:workqueue_worker+0xf0
 lwp_trampoline(f0061134, 116000, 113a30, 1, fffc5c88, 0) at netbsd:lwp_trampoline+0x8

 I'm currently building a kernel that is just GENERIC, but with:
  - ZFS enabled
  - DIAGNOSTIC enabled

 to see if that blows up again.
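
 To back up the "same place every time" observation, the trap header lines
 from the two console logs in this report can be compared field by field.
 The following is only a rough sketch: the helper names (parse_trap,
 split_fields) are made up for illustration, and it assumes every key=value
 pair on the line is a hex number, which holds for the two lines quoted here
 but is not checked against the kernel source.

```python
import re

# Trap header lines copied verbatim from the two console logs in this report.
TRAPS = [
    "[ 54990.7869753] data error type 32 sfsr=0 sfva=40bdf810 "
    "afsr=84000000 afva=1fe02004000 tf=0x1782cd850",
    "[ 70718.1188914] data error type 32 sfsr=0 sfva=46d5e000 "
    "afsr=84000000 afva=1fe02004000 tf=0x1782cd850",
]

# key=value pairs; values are treated as hex, with an optional 0x prefix
# (an assumption based on the two lines above).
FIELD = re.compile(r"(\w+)=(?:0x)?([0-9a-fA-F]+)")


def parse_trap(line):
    """Return {field: int} for every key=hexvalue pair on a trap line."""
    return {key: int(value, 16) for key, value in FIELD.findall(line)}


def split_fields(lines):
    """Split fields into those identical on all lines and those that vary."""
    parsed = [parse_trap(line) for line in lines]
    first = parsed[0]
    same = {
        key: value
        for key, value in first.items()
        if all(p.get(key) == value for p in parsed)
    }
    differ = sorted(key for key in first if key not in same)
    return same, differ


same, differ = split_fields(TRAPS)
for key in sorted(same):
    print(f"same   {key} = {same[key]:#x}")
print("differ", differ)
```

 For these two lines only sfva differs; sfsr, afsr, afva and tf are
 identical, and both backtraces stop at alipm_smb_exec+0x184.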

From: Alexander Schreiber <als@thangorodrim.ch>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-sparc64/57848: NetBSD 9.3/sparc64 crash reboots under high
 I/O load
Date: Sun, 14 Jan 2024 21:34:29 +0100

 short update:

 during the "./build.sh tools" stage of cooking a new kernel the machine
 dropped back to the firmware prompt with this on the console:

 ------------- cut here -----

 Watchdog Reset
 Externally Initiated Reset
 ok

 ------------- cut here -----

 I take this as another indicator that it is the machine, not the NetBSD
 kernel.

From: Alexander Schreiber <als@thangorodrim.ch>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-sparc64/57848: NetBSD 9.3/sparc64 crash reboots under high
 I/O load
Date: Fri, 26 Jan 2024 00:44:34 +0100

 And one more update: I've rebuilt the kernel with the standard 9.3
 GENERIC config with two deviations: enabling DIAGNOSTIC in the kernel
 config and "MKZFS=yes" in /etc/mk.conf. Rebooted into it and after writing
 about 100G to ZFS, the machine crash-rebooted again, in the same manner
 as seen before. Again, I've seen no crashes on the other Sun V100 I have
 so I still strongly suspect some kind of hardware wonkiness.