NetBSD Problem Report #55737
From rmtodd@servalan.servalan.com Tue Oct 20 03:30:07 2020
Return-Path: <rmtodd@servalan.servalan.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id DB7E21A9217
for <gnats-bugs@gnats.NetBSD.org>; Tue, 20 Oct 2020 03:30:07 +0000 (UTC)
Message-Id: <20201020031740.GA27771@ichotolot.servalan.com>
Date: Mon, 19 Oct 2020 22:17:40 -0500
From: Richard Todd <rmtodd@servalan.servalan.com>
Reply-To: rmtodd@servalan.servalan.com
To: gnats-bugs@netbsd.org
Subject: Apparent bug in evbarm64 DMA code causes filesystem corruption
X-Send-Pr-Version: 3.95
>Number: 55737
>Category: port-arm
>Synopsis: Apparent bug in evbarm64 DMA code causes filesystem corruption
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: port-arm-maintainer
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Oct 20 03:35:00 +0000 2020
>Closed-Date: Sun May 15 20:28:51 +0000 2022
>Last-Modified: Sun May 15 20:28:51 +0000 2022
>Originator: Richard Todd
>Release: NetBSD 9.99.72
>Organization:
>Environment:
System: NetBSD netbsdarm64test.servalan.com 9.99.72 NetBSD 9.99.72 (RMTGENERIC_VM_a15d1b7af98f) #3: Thu Oct 8 00:00:34 CDT 2020 sysbuild@materia.servalan.com:/amd/ichotolot/u2/netbsd-vmexport/sysbuild/Build/evbarm64/obj/amd/ichotolot/u2/netbsd-vmexport/sysbuild/src/sys/arch/evbarm/compile/RMTGENERIC_VM evbarm
Architecture: aarch64
Machine: evbarm
>Description:
As per recent discussion on port-arm
http://mail-index.netbsd.org/port-arm/2020/10/13/msg006997.html
various people, including me, have been seeing filesystem corruption with
recent builds of NetBSD-current on evbarm64. With a good deal of work, I
managed to get a reproducible test case and narrow the apparent culprit down
to this recent commit:
changeset: 936632:a15d1b7af98f
branch: trunk
user: skrll <skrll@NetBSD.org>
date: Tue Sep 08 10:30:17 2020 +0000
files: sys/arch/arm/arm32/bus_dma.c
extra: branch=trunk
description:
A few bus_dmatag_subregion fixes
- return EOPNOTSUPP if min_addr isn't less than max_addr
- fix the subset check to ensure that all the ranges in the parent tag are
within the {min,max}_addr range. If so we can just continue to use the
parent tag.
- when building the new ranges read the parent tag range rather than un-
initialised memory.
- remove the max_addr != 0xffffffff check - the overflow should be handled
by the unsigned arithmetic for arm32.
- add a KASSERT
- add comments
I've got a console log here ( https://pastebin.com/eHRG8jGy ) exhibiting an
example reproduction of the problem in my usual qemu testing VM. The first
part has qemu booting with two virtual disks attached, one the "usual"
NetBSD-current disk image I've been using with that VM, and the second a
copy of said image which was going to get its / newfsed and recreated by
copying from the master. (This was done to mitigate the possiblity that
these crashes were caused by preexisting corruption on the filesystem,
which was the previous theory on port-arm mailing list regarding the source
of these crashes). The second part is qemu running off of the "copy" disk
image, and going thru a typical attempt to upgrade to an already-built
NetBSD-current release (built elsewhere, stashed on NFS on another machine)
-- first sysupgrade fetches everything, installs the new kernel, then we
boot the new kernel. Next we try to use sysupgrade to install world, but
we hit a
panic: ffs_blkfree: bad size: dev = 0x5c20, bno = 2432893633 bsize = 16384, size = 16384, fs = /
panic in the middle of extracting the tests tar file, and then fail
automatic fsck on reboot.
>How-To-Repeat:
Install a kernel built from -current after the a15d1b7af98f commit,
reboot, and attempt to install world with sysupgrade.
>Fix:
Revert to a kernel built from -current prior to that commit.
Alternatively, I just tested a build of -current sources from
Saturday (rev 1b3515410312, date: Sat Oct 17 23:23:06 2020 +0000) with
reversing the suspect bus_dma.c patch and the later uvm_bio.c patch of
October 5 mentioned recently on netbsd-current list (and which has
since been reverted in -current source). With those sources, we can do
the install of kernel and world just fine.
>Release-Note:
>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org, skrll@netbsd.org,
rmtodd@servalan.servalan.com
Cc: port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: re: port-arm/55737: Apparent bug in evbarm64 DMA code causes filesystem corruption
Date: Tue, 20 Oct 2020 16:34:12 +1100
> - fix the subset check to ensure that all the ranges in the parent tag are
> within the {min,max}_addr range. If so we can just continue to use the
> parent tag.
this part of the patch i think is problematic.
previously, subset started as false, and was transitioned
to true, and never back.
now, it's as false, but it transitions back and forth to
have the last entry's value.
i think both are wrong. it should start true, and if any
entry is outside the parent range, set to false.
.mrg.
From: matthew green <mrg@eterna.com.au>
To: port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, gnats-bugs@netbsd.org, skrll@netbsd.org,
rmtodd@servalan.servalan.com
Cc:
Subject: re: port-arm/55737: Apparent bug in evbarm64 DMA code causes filesystem corruption
Date: Tue, 20 Oct 2020 17:31:48 +1100
matthew green writes:
> i think both are wrong. it should start true, and if any
> entry is outside the parent range, set to false.
i meant to include this untested patch. not meant to be
commited as-is, but should confirm or deny my guess with
a repro attempt.
.mrg.
Index: bus_dma.c
===================================================================
RCS file: /cvsroot/src/sys/arch/arm/arm32/bus_dma.c,v
retrieving revision 1.123
diff -p -u -r1.123 bus_dma.c
--- bus_dma.c 8 Sep 2020 10:30:17 -0000 1.123
+++ bus_dma.c 20 Oct 2020 06:29:52 -0000
@@ -1816,7 +1816,7 @@ _bus_dmatag_subregion(bus_dma_tag_t tag,
#ifdef _ARM32_NEED_BUS_DMA_BOUNCE
struct arm32_dma_range *dr;
- bool subset = false;
+ bool subset = true;
size_t nranges = 0;
size_t i;
for (i = 0, dr = tag->_ranges; i < tag->_nranges; i++, dr++) {
@@ -1826,7 +1826,7 @@ _bus_dmatag_subregion(bus_dma_tag_t tag,
*/
if (dr->dr_sysbase >= min_addr
&& dr->dr_sysbase + dr->dr_len - 1 <= max_addr) {
- subset = true;
+ //subset = true;
} else {
subset = false;
}
From: Nick Hudson <nick.hudson@gmx.co.uk>
To: matthew green <mrg@eterna.com.au>, gnats-bugs@netbsd.org,
rmtodd@servalan.servalan.com
Cc: port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: port-arm/55737: Apparent bug in evbarm64 DMA code causes
filesystem corruption
Date: Tue, 20 Oct 2020 09:03:57 +0100
This is a multi-part message in MIME format.
--------------64740B43E995EA09E8BB3D94
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable
On 20/10/2020 06:34, matthew green wrote:
>> - fix the subset check to ensure that all the ranges in the parent tag =
are
>> within the {min,max}_addr range. If so we can just continue to use =
the
>> parent tag.
> this part of the patch i think is problematic.
>
> previously, subset started as false, and was transitioned
> to true, and never back.
>
> now, it's as false, but it transitions back and forth to
> have the last entry's value.
>
> i think both are wrong. it should start true, and if any
> entry is outside the parent range, set to false.
You're right the code is flawed.
I think this patch is correct and covers the case that your patch
doesn't where there are no ranges in the parent tag.
I think the virtio patch is still valid, however.
Thanks,
Nick
--------------64740B43E995EA09E8BB3D94
Content-Type: text/x-patch; charset=UTF-8;
name="bus_dma.c.diff"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
filename="bus_dma.c.diff"
Index: sys/arch/arm/arm32/bus_dma.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
RCS file: /cvsroot/src/sys/arch/arm/arm32/bus_dma.c,v
retrieving revision 1.123
diff -u -p -r1.123 bus_dma.c
=2D-- sys/arch/arm/arm32/bus_dma.c 8 Sep 2020 10:30:17 -0000 1.123
+++ sys/arch/arm/arm32/bus_dma.c 20 Oct 2020 08:00:34 -0000
@@ -1816,18 +1816,18 @@ _bus_dmatag_subregion(bus_dma_tag_t tag,
#ifdef _ARM32_NEED_BUS_DMA_BOUNCE
struct arm32_dma_range *dr;
- bool subset =3D false;
+ bool subset =3D true;
size_t nranges =3D 0;
size_t i;
for (i =3D 0, dr =3D tag->_ranges; i < tag->_nranges; i++, dr++) {
/*
- * Are all the ranges in the parent tag a subset of the new
- * range? If yes, we can continue to use the parent tag.
+ * If the new {min,max}_addr are narrower than any of the
+ * ranges in the parent tag then we need a new tag;
+ * otherwise the parent tag is a subset of the new
+ * range and can continue to be used.
*/
- if (dr->dr_sysbase >=3D min_addr
- && dr->dr_sysbase + dr->dr_len - 1 <=3D max_addr) {
- subset =3D true;
- } else {
+ if (min_addr > dr->dr_sysbase
+ || max_addr < dr->dr_sysbase + dr->dr_len - 1) {
subset =3D false;
}
if (min_addr <=3D dr->dr_sysbase + dr->dr_len
@@ -1835,6 +1835,10 @@ _bus_dmatag_subregion(bus_dma_tag_t tag,
nranges++;
}
}
+ if (nranges =3D=3D 0) {
+ nranges =3D 1;
+ subset =3D false;
+ }
if (subset) {
*newtag =3D tag;
/* if the tag must be freed, add a reference */
@@ -1842,9 +1846,6 @@ _bus_dmatag_subregion(bus_dma_tag_t tag,
(tag->_tag_needs_free)++;
return 0;
}
- if (nranges =3D=3D 0) {
- nranges =3D 1;
- }
const size_t tagsize =3D sizeof(*tag) + nranges * sizeof(*dr);
if ((*newtag =3D kmem_intr_zalloc(tagsize,
--------------64740B43E995EA09E8BB3D94--
From: Richard Todd <rmtodd@servalan.servalan.com>
To: Nick Hudson <nick.hudson@gmx.co.uk>
Cc: matthew green <mrg@eterna.com.au>, gnats-bugs@netbsd.org,
port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: port-arm/55737: Apparent bug in evbarm64 DMA code causes
filesystem corruption
Date: Fri, 23 Oct 2020 02:04:29 -0500
Nick Hudson <nick.hudson@gmx.co.uk> writes:
>
> I think this patch is correct and covers the case that your patch
> doesn't where there are no ranges in the parent tag.
>
> I think the virtio patch is still valid, however.
Okay, here's the (somewhat delayed) results from my testing that I
promised earlier.
1) The virtio patch you provided earlier (here
http://mail-index.netbsd.org/port-arm/2020/10/19/msg007017.html)
seems to do the trick -- applying that to a recentish -current tree
(21a63fef3aab date: Mon Oct 19 09:07:29 2020 +0000, if it matters)
resulted in a kernel that could withstand three rounds of
sysupgrade installing kernel and world without a single panic. So that
patch looks good.
2) The patch you just gave for bus_dma.c, applied by itself on top of
the same -current, does *not* seem to fix the problem; we get what looks
like the same panic at about the same place.
From: "Nick Hudson" <skrll@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55737 CVS commit: src/sys/dev/acpi
Date: Sat, 24 Oct 2020 07:21:01 +0000
Module Name: src
Committed By: skrll
Date: Sat Oct 24 07:21:01 UTC 2020
Modified Files:
src/sys/dev/acpi: virtio_acpi.c
Log Message:
Use the 64bit DMA tag if its valid.
There appears to be a bug in virtio / ld_virtio and bounce buffers that
was triggered when the bus_dmatag_subregion code on arm64 was fixed to
correctly create a new tag for the 32bit tag vs the system (64bit) tag.
This change avoids the bug.
PR/55737: Apparent bug in evbarm64 DMA code causes filesystem corruption
To generate a diff of this commit:
cvs rdiff -u -r1.2 -r1.3 src/sys/dev/acpi/virtio_acpi.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: skrll@NetBSD.org
State-Changed-When: Sun, 15 May 2022 20:28:51 +0000
State-Changed-Why:
the virtio bug was fixed and the bounce bbuffer bug
was evry likely fixed in
sys/arch/arm/arm32/bus_dma.c:1.135
if we have to bounce a buffer, clear our anything already setup in
the map before trying again.
riastradh@ noticed that a map had two types of mbuf types, and this
seems to avoid it as seen on eqos(4) on quartz64.
add a couple more event counts for bus dma events.
ok @skrll @raistradh @jmcneill
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.