NetBSD Problem Report #55737

From rmtodd@servalan.servalan.com  Tue Oct 20 03:30:07 2020
Return-Path: <rmtodd@servalan.servalan.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id DB7E21A9217
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 20 Oct 2020 03:30:07 +0000 (UTC)
Message-Id: <20201020031740.GA27771@ichotolot.servalan.com>
Date: Mon, 19 Oct 2020 22:17:40 -0500
From: Richard Todd <rmtodd@servalan.servalan.com>
Reply-To: rmtodd@servalan.servalan.com
To: gnats-bugs@netbsd.org
Subject: Apparent bug in evbarm64 DMA code causes filesystem corruption
X-Send-Pr-Version: 3.95

>Number:         55737
>Category:       port-arm
>Synopsis:       Apparent bug in evbarm64 DMA code causes filesystem corruption
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    port-arm-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Oct 20 03:35:00 +0000 2020
>Closed-Date:    Sun May 15 20:28:51 +0000 2022
>Last-Modified:  Sun May 15 20:28:51 +0000 2022
>Originator:     Richard Todd
>Release:        NetBSD 9.99.72
>Organization:

>Environment:
System: NetBSD netbsdarm64test.servalan.com 9.99.72 NetBSD 9.99.72 (RMTGENERIC_VM_a15d1b7af98f) #3: Thu Oct 8 00:00:34 CDT 2020 sysbuild@materia.servalan.com:/amd/ichotolot/u2/netbsd-vmexport/sysbuild/Build/evbarm64/obj/amd/ichotolot/u2/netbsd-vmexport/sysbuild/src/sys/arch/evbarm/compile/RMTGENERIC_VM evbarm
Architecture: aarch64
Machine: evbarm
>Description:
  As per recent discussion on port-arm
http://mail-index.netbsd.org/port-arm/2020/10/13/msg006997.html
various people, including me, have been seeing filesystem corruption with
recent builds of NetBSD-current on evbarm64.  With a good deal of work, I
managed to get a reproducible test case and narrow the apparent culprit down
to this recent commit:

changeset:   936632:a15d1b7af98f
branch:      trunk
user:        skrll <skrll@NetBSD.org>
date:        Tue Sep 08 10:30:17 2020 +0000
files:       sys/arch/arm/arm32/bus_dma.c
extra:       branch=trunk
description:
A few bus_dmatag_subregion fixes

- return EOPNOTSUPP if min_addr isn't less than max_addr
- fix the subset check to ensure that all the ranges in the parent tag are
  within the {min,max}_addr range.  If so we can just continue to use the
  parent tag.
- when building the new ranges read the parent tag range rather than un-
  initialised memory.
- remove the max_addr != 0xffffffff check - the overflow should be handled
  by the unsigned arithmetic for arm32.
- add a KASSERT
- add comments

I've got a console log here ( https://pastebin.com/eHRG8jGy ) exhibiting an
example reproduction of the problem in my usual qemu testing VM.  The first
part has qemu booting with two virtual disks attached, one the "usual"
NetBSD-current disk image I've been using with that VM, and the second a
copy of said image which was going to get its / newfsed and recreated by
copying from the master.  (This was done to mitigate the possiblity that
these crashes were caused by preexisting corruption on the filesystem,
which was the previous theory on port-arm mailing list regarding the source
of these crashes).  The second part is qemu running off of the "copy" disk
image, and going thru a typical attempt to upgrade to an already-built
NetBSD-current release (built elsewhere, stashed on NFS on another machine)
-- first sysupgrade fetches everything, installs the new kernel, then we
boot the new kernel.  Next we try to use sysupgrade to install world, but
we hit a

 panic: ffs_blkfree: bad size: dev = 0x5c20, bno = 2432893633 bsize = 16384, size = 16384, fs = /

panic in the middle of extracting the tests tar file, and then fail
automatic fsck on reboot.


>How-To-Repeat:
      Install a kernel built from -current after the a15d1b7af98f commit,
      reboot, and attempt to install world with sysupgrade.  
>Fix:
      Revert to a kernel built from -current prior to that commit.
      Alternatively, I just tested a build of -current sources from
      Saturday (rev 1b3515410312, date: Sat Oct 17 23:23:06 2020 +0000) with
      reversing the suspect bus_dma.c patch and the later uvm_bio.c patch of
      October 5 mentioned recently on netbsd-current list (and which has
      since been reverted in -current source).  With those sources, we can do
      the install of kernel and world just fine. 


>Release-Note:

>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org, skrll@netbsd.org,
    rmtodd@servalan.servalan.com
Cc: port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: port-arm/55737: Apparent bug in evbarm64 DMA code causes filesystem corruption
Date: Tue, 20 Oct 2020 16:34:12 +1100

 > - fix the subset check to ensure that all the ranges in the parent tag are
 >   within the {min,max}_addr range.  If so we can just continue to use the
 >   parent tag.

 this part of the patch i think is problematic.

 previously, subset started as false, and was transitioned
 to true, and never back.

 now, it's as false, but it transitions back and forth to
 have the last entry's value.

 i think both are wrong.  it should start true, and if any
 entry is outside the parent range, set to false.


 .mrg.

From: matthew green <mrg@eterna.com.au>
To: port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org, gnats-bugs@netbsd.org, skrll@netbsd.org,
    rmtodd@servalan.servalan.com
Cc: 
Subject: re: port-arm/55737: Apparent bug in evbarm64 DMA code causes filesystem corruption
Date: Tue, 20 Oct 2020 17:31:48 +1100

 matthew green writes:
 > i think both are wrong.  it should start true, and if any
 > entry is outside the parent range, set to false.

 i meant to include this untested patch.  not meant to be
 commited as-is, but should confirm or deny my guess with
 a repro attempt.


 .mrg.


 Index: bus_dma.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/arm/arm32/bus_dma.c,v
 retrieving revision 1.123
 diff -p -u -r1.123 bus_dma.c
 --- bus_dma.c	8 Sep 2020 10:30:17 -0000	1.123
 +++ bus_dma.c	20 Oct 2020 06:29:52 -0000
 @@ -1816,7 +1816,7 @@ _bus_dmatag_subregion(bus_dma_tag_t tag,

  #ifdef _ARM32_NEED_BUS_DMA_BOUNCE
  	struct arm32_dma_range *dr;
 -	bool subset = false;
 +	bool subset = true;
  	size_t nranges = 0;
  	size_t i;
  	for (i = 0, dr = tag->_ranges; i < tag->_nranges; i++, dr++) {
 @@ -1826,7 +1826,7 @@ _bus_dmatag_subregion(bus_dma_tag_t tag,
  		 */
  		if (dr->dr_sysbase >= min_addr
  		    && dr->dr_sysbase + dr->dr_len - 1 <= max_addr) {
 -			subset = true;
 +			//subset = true;
  		} else {
  			subset = false;
  		}

From: Nick Hudson <nick.hudson@gmx.co.uk>
To: matthew green <mrg@eterna.com.au>, gnats-bugs@netbsd.org,
 rmtodd@servalan.servalan.com
Cc: port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Subject: Re: port-arm/55737: Apparent bug in evbarm64 DMA code causes
 filesystem corruption
Date: Tue, 20 Oct 2020 09:03:57 +0100

 This is a multi-part message in MIME format.
 --------------64740B43E995EA09E8BB3D94
 Content-Type: text/plain; charset=utf-8; format=flowed
 Content-Transfer-Encoding: quoted-printable


 On 20/10/2020 06:34, matthew green wrote:
 >> - fix the subset check to ensure that all the ranges in the parent tag =
 are
 >>    within the {min,max}_addr range.  If so we can just continue to use =
 the
 >>    parent tag.
 > this part of the patch i think is problematic.
 >
 > previously, subset started as false, and was transitioned
 > to true, and never back.
 >
 > now, it's as false, but it transitions back and forth to
 > have the last entry's value.
 >
 > i think both are wrong.  it should start true, and if any
 > entry is outside the parent range, set to false.

 You're right the code is flawed.

 I think this patch is correct and covers the case that your patch
 doesn't where there are no ranges in the parent tag.

 I think the virtio patch is still valid, however.

 Thanks,

 Nick


 --------------64740B43E995EA09E8BB3D94
 Content-Type: text/x-patch; charset=UTF-8;
  name="bus_dma.c.diff"
 Content-Transfer-Encoding: quoted-printable
 Content-Disposition: attachment;
  filename="bus_dma.c.diff"

 Index: sys/arch/arm/arm32/bus_dma.c
 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
 RCS file: /cvsroot/src/sys/arch/arm/arm32/bus_dma.c,v
 retrieving revision 1.123
 diff -u -p -r1.123 bus_dma.c
 =2D-- sys/arch/arm/arm32/bus_dma.c	8 Sep 2020 10:30:17 -0000	1.123
 +++ sys/arch/arm/arm32/bus_dma.c	20 Oct 2020 08:00:34 -0000
 @@ -1816,18 +1816,18 @@ _bus_dmatag_subregion(bus_dma_tag_t tag,

  #ifdef _ARM32_NEED_BUS_DMA_BOUNCE
  	struct arm32_dma_range *dr;
 -	bool subset =3D false;
 +	bool subset =3D true;
  	size_t nranges =3D 0;
  	size_t i;
  	for (i =3D 0, dr =3D tag->_ranges; i < tag->_nranges; i++, dr++) {
  		/*
 -		 * Are all the ranges in the parent tag a subset of the new
 -		 * range? If yes, we can continue to use the parent tag.
 +		 * If the new {min,max}_addr are narrower than any of the
 +		 * ranges in the parent tag then we need a new tag;
 +		 * otherwise the parent tag is a subset of the new
 +		 * range and can continue to be used.
  		 */
 -		if (dr->dr_sysbase >=3D min_addr
 -		    && dr->dr_sysbase + dr->dr_len - 1 <=3D max_addr) {
 -			subset =3D true;
 -		} else {
 +		if (min_addr > dr->dr_sysbase
 +		    || max_addr < dr->dr_sysbase + dr->dr_len - 1) {
  			subset =3D false;
  		}
  		if (min_addr <=3D dr->dr_sysbase + dr->dr_len
 @@ -1835,6 +1835,10 @@ _bus_dmatag_subregion(bus_dma_tag_t tag,
  			nranges++;
  		}
  	}
 +	if (nranges =3D=3D 0) {
 +		nranges =3D 1;
 +		subset =3D false;
 +	}
  	if (subset) {
  		*newtag =3D tag;
  		/* if the tag must be freed, add a reference */
 @@ -1842,9 +1846,6 @@ _bus_dmatag_subregion(bus_dma_tag_t tag,
  			(tag->_tag_needs_free)++;
  		return 0;
  	}
 -	if (nranges =3D=3D 0) {
 -		nranges =3D 1;
 -	}

  	const size_t tagsize =3D sizeof(*tag) + nranges * sizeof(*dr);
  	if ((*newtag =3D kmem_intr_zalloc(tagsize,

 --------------64740B43E995EA09E8BB3D94--

From: Richard Todd <rmtodd@servalan.servalan.com>
To: Nick Hudson <nick.hudson@gmx.co.uk>
Cc: matthew green <mrg@eterna.com.au>,  gnats-bugs@netbsd.org,
  port-arm-maintainer@netbsd.org,  gnats-admin@netbsd.org,
  netbsd-bugs@netbsd.org
Subject: Re: port-arm/55737: Apparent bug in evbarm64 DMA code causes
 filesystem corruption
Date: Fri, 23 Oct 2020 02:04:29 -0500

 Nick Hudson <nick.hudson@gmx.co.uk> writes:
 >
 > I think this patch is correct and covers the case that your patch
 > doesn't where there are no ranges in the parent tag.
 >
 > I think the virtio patch is still valid, however.

 Okay, here's the (somewhat delayed) results from my testing that I
 promised earlier. 

 1) The virtio patch you provided earlier (here 
 http://mail-index.netbsd.org/port-arm/2020/10/19/msg007017.html)
 seems to do the trick -- applying that to a recentish -current tree
 (21a63fef3aab date:        Mon Oct 19 09:07:29 2020 +0000, if it matters)
 resulted in a kernel that could withstand three rounds of
 sysupgrade installing kernel and world without a single panic.  So that
 patch looks good.

 2) The patch you just gave for bus_dma.c, applied by itself on top of
 the same -current, does *not* seem to fix the problem; we get what looks
 like the same panic at about the same place.


From: "Nick Hudson" <skrll@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55737 CVS commit: src/sys/dev/acpi
Date: Sat, 24 Oct 2020 07:21:01 +0000

 Module Name:	src
 Committed By:	skrll
 Date:		Sat Oct 24 07:21:01 UTC 2020

 Modified Files:
 	src/sys/dev/acpi: virtio_acpi.c

 Log Message:
 Use the 64bit DMA tag if its valid.

 There appears to be a bug in virtio / ld_virtio and bounce buffers that
 was triggered when the bus_dmatag_subregion code on arm64 was fixed to
 correctly create a new tag for the 32bit tag vs the system (64bit) tag.
 This change avoids the bug.

 PR/55737: Apparent bug in evbarm64 DMA code causes filesystem corruption


 To generate a diff of this commit:
 cvs rdiff -u -r1.2 -r1.3 src/sys/dev/acpi/virtio_acpi.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: skrll@NetBSD.org
State-Changed-When: Sun, 15 May 2022 20:28:51 +0000
State-Changed-Why:
the virtio bug was fixed and the bounce bbuffer bug 
was evry likely fixed in

sys/arch/arm/arm32/bus_dma.c:1.135
if we have to bounce a buffer, clear our anything already setup in
the map before trying again.

riastradh@ noticed that a map had two types of mbuf types, and this
seems to avoid it as seen on eqos(4) on quartz64.

add a couple more event counts for bus dma events.

ok @skrll @raistradh @jmcneill


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.