NetBSD Problem Report #56850

From www@netbsd.org  Mon May 23 01:55:40 2022
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 53DEC1A9242
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 23 May 2022 01:55:40 +0000 (UTC)
Message-Id: <20220523015538.8FD681A9243@mollari.NetBSD.org>
Date: Mon, 23 May 2022 01:55:38 +0000 (UTC)
From: rokuyama.rk@gmail.com
Reply-To: rokuyama.rk@gmail.com
To: gnats-bugs@NetBSD.org
Subject: system locks up with NFS root & swap on mvgbe(4)
X-Send-Pr-Version: www-1.0

>Number:         56850
>Category:       port-arm
>Synopsis:       system locks up with NFS root for Kirkwood and Orion
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon May 23 02:00:01 +0000 2022
>Last-Modified:  Fri Oct 13 08:50:01 +0000 2023
>Originator:     Rin Okuyama
>Release:        9.99.96
>Organization:
Department of Physics, Meiji University
>Environment:
NetBSD obsa6 9.99.96 NetBSD 9.99.96 (OBSA6_BE) #23: Mon May 23 00:06:35 JST 2022  rin@latipes:/build/src/sys/arch/evbarm/compile/OBSA6_BE evbarm
>Description:
* Summary

The system eventually locks up with NFS root/swap on mvgbe(4).

This is probably due to software or hardware bugs in mvgbe(4), but
at the same time, I suspect that our NFS client may be fragile against
packet loss or other NIC problems.

* Details

The failure occurs on ARM9E-based Marvell SoCs:

- KUROBOX_PRO: https://dmesgd.nycbug.org/index.cgi?do=view&id=6594
- OPENBLOCKS_A6: https://dmesgd.nycbug.org/index.cgi?do=view&id=6595

both in little- and big-endian mode.

With NFS root/swap on mvgbe(4), the system eventually locks up under
heavy I/O while building some pkgsrc packages. Once the failure occurs,
the system does not respond to anything but input from the serial console.

At that point, I observe that many processes are sleeping on "nfsrecv":

https://gist.github.com/rokuyama/228f7afe67ffa8fe8024eb10bc2f14a1

Using UDP seems to mitigate the problem significantly, but not
completely; the failure occurs roughly every few hours with TCP,
and roughly once a day with UDP.

On an armv5-based machine of a similar generation, but with wm(4):

- HDL_G: https://dmesgd.nycbug.org/index.cgi?do=view&id=6139

I've never observed a similar failure.

Therefore, the cause is likely a bug in mvgbe(4) or a hardware problem.

However, at the same time, I wonder whether we can improve the NFS or
socket layers in the kernel; even if some packets are unexpectedly lost,
NFS routines should not sleep forever.
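
To illustrate the last point: a minimal userland sketch, not the kernel
NFS client code, assuming a plain UDP socket. The helper names
open_loopback_udp() and recv_bounded() are hypothetical. It bounds the
wait for a reply with SO_RCVTIMEO, so a lost packet surfaces as
EAGAIN/EWOULDBLOCK instead of an indefinite sleep:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a UDP socket bound to an ephemeral
 * port on the loopback address.  Returns the fd, or -1 on error.
 */
int
open_loopback_udp(void)
{
	struct sockaddr_in sin;
	int fd;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd == -1)
		return -1;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = 0;	/* let the kernel pick a port */

	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Hypothetical helper: receive one datagram, but give up after
 * timeout_ms milliseconds.  timeout_ms must be > 0; a zero timeval
 * would disable the timeout and block indefinitely.
 * On timeout, returns -1 with errno set to EAGAIN/EWOULDBLOCK.
 */
ssize_t
recv_bounded(int fd, void *buf, size_t len, int timeout_ms)
{
	struct timeval tv;

	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == -1)
		return -1;

	return recv(fd, buf, len, 0);
}
```

A caller would treat the -1/EAGAIN return as a cue to retransmit the
request or report an error, rather than waiting on the socket
indefinitely.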
>How-To-Repeat:
Build some pkgsrc packages with NFS root/swap on mvgbe(4).
>Fix:
N/A

>Release-Note:

>Audit-Trail:
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56850: system locks up with NFS root & swap on mvgbe(4)
Date: Thu, 26 May 2022 02:35:28 +0000

 Not sent to gnats.

    ------

 From: Rin Okuyama <rokuyama.rk@gmail.com>
 To: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
 Subject: Re: kern/56850: system locks up with NFS root & swap on mvgbe(4)
 Date: Mon, 23 May 2022 21:41:40 +0900

 Similar failures were observed with axe(4) and axen(4) for OPENBLOCKS_A6.

From: Rin Okuyama <rokuyama.rk@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: port-arm/56850 (system locks up with NFS root for Kirkwood and Orion)
Date: Thu, 12 Oct 2023 16:16:21 +0900

 Category and title have been updated.

 As reported earlier, this lockup also occurs with USB NICs. Therefore, the
 problem is most likely in the machine-dependent (MD) parts of
 {,evb}arm/marvell.

 KURO-BOX/PRO (Orion) has a PCIe slot:
 https://dmesgd.nycbug.org/index.cgi?do=view&id=6594

 Even with wm(4) variants in this slot, the system locked up within about a day.

 I will revisit this problem soon. LOCKDEBUG may or may not be helpful...

 Thanks,
 rin

From: "Jonathan A. Kollasch" <jakllsch@kollasch.net>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56850: system locks up with NFS root & swap on mvgbe(4)
Date: Thu, 12 Oct 2023 15:05:46 -0500

 I have a vague recollection this might be related or similar to what I
 tried to fix in r1.14 src/sys/dev/marvell/if_mvgbe.c

From: Rin Okuyama <rokuyama.rk@gmail.com>
To: "Jonathan A. Kollasch" <jakllsch@kollasch.net>, "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/56850: system locks up with NFS root & swap on mvgbe(4)
Date: Fri, 13 Oct 2023 17:45:28 +0900

 On Fri, Oct 13, 2023 at 5:10 AM Jonathan A. Kollasch
 <jakllsch@kollasch.net> wrote:
 >  I have a vague recollection this might be related or similar to what I
 >  tried to fix in r1.14 src/sys/dev/marvell/if_mvgbe.c

 Thanks for the hints! I will examine documents and/or how other OSes handle
 DMAC.

 rin

>Unformatted:
