NetBSD Problem Report #54810
From gson@gson.org Mon Dec 30 08:12:51 2019
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id D826E7A158
for <gnats-bugs@gnats.NetBSD.org>; Mon, 30 Dec 2019 08:12:50 +0000 (UTC)
Message-Id: <20191230081246.20724253F3D@guava.gson.org>
Date: Mon, 30 Dec 2019 10:12:46 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: sparc64 pool_redzone_check errors during install
X-Send-Pr-Version: 3.95
>Number: 54810
>Category: port-sparc64
>Synopsis: sparc64 pool_redzone_check errors during install
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: port-sparc64-maintainer
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Dec 30 08:15:00 +0000 2019
>Closed-Date: Wed Oct 28 13:00:09 +0000 2020
>Last-Modified: Wed Oct 28 13:00:09 +0000 2020
>Originator: Andreas Gustafsson
>Release: NetBSD-current
>Organization:
>Environment:
System: NetBSD
Architecture: sparc64
Machine: sparc64
>Description:
The sparc64 port is printing messages like
pool_redzone_check: 0x125760c20: 0x00 != 0x66
during the set extraction phase of the install. On the TNF testbed,
they appeared some time between source dates 2019.11.16.04.10.33 and
2019.11.28.14.21.25, but were only warnings until they were changed
into a panic in
2019.12.27.15.49.20 maxv src/sys/kern/subr_pool.c 1.264
Log output from the latest such panic is at
http://releng.netbsd.org/b5reports/sparc64/2019/2019.12.29.09.17.51/install.log
>How-To-Repeat:
Using versions of qemu and glib2 that don't suffer from PR 54310, e.g.,
qemu-4.1.0nb2 with glib2-2.60.6.
>Fix:
>Release-Note:
>Audit-Trail:
From: Harold Gutch <logix@foobar.franken.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sat, 24 Oct 2020 09:57:41 +0200
--hxkXGo8AKqTJ+9QI
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Hi,
can you try this diff? NTFS got added implicitly when all file
systems were moved from GENERIC to conf/filesystems.conf in r1.216,
and that is when for me these messages started to show up.
I just went through 3 rounds of installing sparc64 with my change in
Qemu (albeit with a source tree from a few months ago) without any
issues. After going back to the vanilla GENERIC, I got the panic
again when extracting base.
In the long run it would of course still be nice if the actual issue
in NTFS could be fixed, but either way, I don't see it being of much
use in sparc64 GENERIC, so IMHO it might as well just be disabled
there.
Harold
--hxkXGo8AKqTJ+9QI
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff
--- src/sys/arch/sparc64/conf/GENERIC.orig 2020-09-27 15:48:54.000000000 +0200
+++ src/sys/arch/sparc64/conf/GENERIC 2020-10-23 23:43:34.907070000 +0200
@@ -148,6 +148,7 @@
## File systems.
include "conf/filesystems.config"
+no file-system NTFS
## File system options.
options NFSSERVER # Sun NFS-compatible filesystem server
--hxkXGo8AKqTJ+9QI--
From: Andreas Gustafsson <gson@gson.org>
To: Harold Gutch <logix@foobar.franken.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sat, 24 Oct 2020 15:31:01 +0300
Harold Gutch wrote:
> can you try this diff? NTFS got added implicitly when all file
> systems were moved from GENERIC to conf/filesystems.conf in r1.216,
> and that is when for me these messages started to show up.
With the patch, the install completed sucessfully for me. I tested
using sources from 2020.10.17.17.23.22 because today's current doesn't
build.
Does anyone have an idea how the mere presence of ntfs in the kernel
can trigger this bug, without an ntfs actually being mounted?
--
Andreas Gustafsson, gson@gson.org
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: port-sparc64-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, gson@gson.org (Andreas Gustafsson)
Subject: re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sun, 25 Oct 2020 10:35:16 +1100
> Does anyone have an idea how the mere presence of ntfs in the kernel
> can trigger this bug, without an ntfs actually being mounted?
there are probe-mounts for root, it may have be called to try
to mountroot before ffs was. does the log have eg:
boot device: re0
mountroot: trying ffs...
mountroot: trying ext2fs...
mountroot: trying nfs...
nfs_boot: trying DHCP/BOOTP
where it says ntfs somewhere?
that's the only thing i can think of.
.mrg.
From: Andreas Gustafsson <gson@gson.org>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@netbsd.org, port-sparc64-maintainer@netbsd.org
Subject: re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sun, 25 Oct 2020 09:24:05 +0200
matthew green wrote:
> > Does anyone have an idea how the mere presence of ntfs in the kernel
> > can trigger this bug, without an ntfs actually being mounted?
>
> there are probe-mounts for root, it may have be called to try
> to mountroot before ffs was. does the log have eg:
>
> boot device: re0
> mountroot: trying ffs...
> mountroot: trying ext2fs...
> mountroot: trying nfs...
> nfs_boot: trying DHCP/BOOTP
>
> where it says ntfs somewhere?
No, the string "ntfs" does not appear in the console output.
I tried adding a printf("ntfs_init\n") to ntfs_init(), and it was
called early in boot:
[ 1.0000000] entropy: no seed from bootloader
[ 1.0000000] ntfs_init
[ 1.0000000] mainbus0 (root): OpenBiosTeam,OpenBIOS: hostid 80123456
I then tried commenting out all the other code in ntfs_init(), leaving
only the printf, but that changed nothing; I still got the panic in
pool_redzone_check during set extraction.
--
Andreas Gustafsson, gson@gson.org
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sun, 25 Oct 2020 07:48:33 -0000 (UTC)
gson@gson.org (Andreas Gustafsson) writes:
> I tried adding a printf("ntfs_init\n") to ntfs_init(), and it was
> called early in boot:
>
> [ 1.0000000] entropy: no seed from bootloader
> [ 1.0000000] ntfs_init
> [ 1.0000000] mainbus0 (root): OpenBiosTeam,OpenBIOS: hostid 80123456
>
> I then tried commenting out all the other code in ntfs_init(), leaving
> only the printf, but that changed nothing;
Init happens when the the filesystem code is attached. This allocates
and initializes a few data structures. Maybe there is an overflow.
There might be mount attempts, but the code stops at the first
filesystem that succeeds and ntfs is rather late in the list. See
vfs.generic.fstypes.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: Harold Gutch <logix@foobar.franken.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sun, 25 Oct 2020 16:29:01 +0100
The does not appear to be in ntfs_init(), but instead in
ntfs_mountroot() (or something called by that function). If I add a
"return -1" at the top of the function, the installation also succeeds
for me.
FWIW, on an unpatched system ntfs_mountroot() we land in the
ntfs_mountfs() != 0 case in ntfs_vfsops.c:117, so it is either
something allocated by vfs_rootmountalloc() or ntfs_mountfs().
Harold
From: Andreas Gustafsson <gson@gson.org>
To: Harold Gutch <logix@foobar.franken.de>
Cc: gnats-bugs@netbsd.org, martin@NetBSD.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 09:25:02 +0200
Harold Gutch wrote:
> The does not appear to be in ntfs_init(), but instead in
> ntfs_mountroot() (or something called by that function). If I add a
> "return -1" at the top of the function, the installation also succeeds
> for me.
>
> FWIW, on an unpatched system ntfs_mountroot() we land in the
> ntfs_mountfs() != 0 case in ntfs_vfsops.c:117, so it is either
> something allocated by vfs_rootmountalloc() or ntfs_mountfs().
Agreed. I also did some debugging by disabling selecting parts of the
code and seeing if it fixes the bug, and unless I have made a mistake
somewhere, I have narrowed it down to this stretch of code in
ntfs_mountfs():
error = bread(devvp, BBLOCK, BBSIZE, 0, &bp);
if (error)
goto out;
or possibly the cleanup code following "out". Since BBSIZE is 1024, this
also seems consistent with the reported corruption in a "[buf1k]" pool.
Maybe some lower layer ends up reading an entire 2048-byte CD-ROM
block into a 1024-byte buffer?
--
Andreas Gustafsson, gson@gson.org
From: Harold Gutch <logix@foobar.franken.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org, martin@NetBSD.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 10:11:35 +0100
On Mon, Oct 26, 2020 at 09:25:02AM +0200, Andreas Gustafsson wrote:
> Harold Gutch wrote:
> > The does not appear to be in ntfs_init(), but instead in
> > ntfs_mountroot() (or something called by that function). If I add a
> > "return -1" at the top of the function, the installation also succeeds
> > for me.
> >
> > FWIW, on an unpatched system ntfs_mountroot() we land in the
> > ntfs_mountfs() != 0 case in ntfs_vfsops.c:117, so it is either
> > something allocated by vfs_rootmountalloc() or ntfs_mountfs().
>
> Agreed. I also did some debugging by disabling selecting parts of the
> code and seeing if it fixes the bug, and unless I have made a mistake
> somewhere, I have narrowed it down to this stretch of code in
> ntfs_mountfs():
>
> error = bread(devvp, BBLOCK, BBSIZE, 0, &bp);
> if (error)
> goto out;
Ah, that's good, I did the same yesterday and I come to exactly the
same conclusion, just that I did not "goto out" but instead called
"brelse(bp,0); return -1;", i.e., I skipped the
spec_node_setmountedfs() call in the cleanup (so we can rule that out
too).
> or possibly the cleanup code following "out". Since BBSIZE is 1024, this
> also seems consistent with the reported corruption in a "[buf1k]" pool.
>
> Maybe some lower layer ends up reading an entire 2048-byte CD-ROM
> block into a 1024-byte buffer?
That was also my guess, I was planning on following up some more in
that direction today. I first wanted to have a look what the other
file system drivers do there, because in order for this to make sense,
no other driver should read a 1K block in *_mountroot() or *_init().
Harold
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 09:25:28 -0000 (UTC)
gson@gson.org (Andreas Gustafsson) writes:
> Maybe some lower layer ends up reading an entire 2048-byte CD-ROM
> block into a 1024-byte buffer?
There is magic code for sun hardware and the cd driver. The sun code
fakes a block size of 512 bytes in the disklabel and the cd driver
emulates such accesses if the hardware reports a different block size
(an original Sun CD-ROM can be configured to 512 byte blocks, most
others cannot).
That's the only regular case where disklabel and driver disagree on
block size.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during
install
Date: Mon, 26 Oct 2020 11:00:47 +0100
On Mon, Oct 26, 2020 at 09:15:01AM +0000, Harold Gutch wrote:
> > error = bread(devvp, BBLOCK, BBSIZE, 0, &bp);
> > if (error)
> > goto out;
>
> Ah, that's good, I did the same yesterday and I come to exactly the
> same conclusion, just that I did not "goto out" but instead called
> "brelse(bp,0); return -1;", i.e., I skipped the
> spec_node_setmountedfs() call in the cleanup (so we can rule that out
> too).
>
>
> > or possibly the cleanup code following "out". Since BBSIZE is 1024, this
> > also seems consistent with the reported corruption in a "[buf1k]" pool.
> >
> > Maybe some lower layer ends up reading an entire 2048-byte CD-ROM
> > block into a 1024-byte buffer?
>
> That was also my guess, I was planning on following up some more in
> that direction today. I first wanted to have a look what the other
> file system drivers do there, because in order for this to make sense,
> no other driver should read a 1K block in *_mountroot() or *_init().
This is clearly a bug. The medium has 2k sectors, we are not allowed to
use the buffer cache with blocks smaller than that (or with varying block
sizes).
We could make ntfs_mountfs() do something like:
uint64_t psize;
unsigned secsize;
int error;
error = getdisksize(devvp, &psize, &secsize);
if (error || secsize > BBSIZE)
return ENODEV;
Or we could go through the code and replace all BBSIZE reads with proper
code to deal with bigger blocks - or check if FreeBSD did that and update
from their version.
Martin
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during
install
Date: Mon, 26 Oct 2020 11:05:27 +0100
Anyone have a (external?) disk that advertizes 2k or bigger sectors
where they could create an NTFS on?
I bet a KASAN kernel trying to mount such a file system would die
too...
Martin
From: Harold Gutch <logix@foobar.franken.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org, martin@NetBSD.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 11:25:53 +0100
On Mon, Oct 26, 2020 at 10:11:35AM +0100, Harold Gutch wrote:
> On Mon, Oct 26, 2020 at 09:25:02AM +0200, Andreas Gustafsson wrote:
> > or possibly the cleanup code following "out". Since BBSIZE is 1024, this
> > also seems consistent with the reported corruption in a "[buf1k]" pool.
> >
> > Maybe some lower layer ends up reading an entire 2048-byte CD-ROM
> > block into a 1024-byte buffer?
>
> That was also my guess, I was planning on following up some more in
> that direction today. I first wanted to have a look what the other
> file system drivers do there, because in order for this to make sense,
> no other driver should read a 1K block in *_mountroot() or *_init().
I think we're all clear on the issue here, but just for additional
verification, if I modify the bread() to instead read an entire 2048
byte chunk, the installation also succeeds (well, it succeeded in the
one test run that I did).
Harold
From: Andreas Gustafsson <gson@gson.org>
To: Martin Husemann <martin@duskware.de>
Cc: Harold Gutch <logix@foobar.franken.de>, gnats-bugs@netbsd.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 12:32:18 +0200
Martin Husemann wrote:
> This is clearly a bug. The medium has 2k sectors, we are not allowed to
> use the buffer cache with blocks smaller than that (or with varying block
> sizes).
Regardless of which way we choose to fix this, it would also be
helpful to have an assertion somehwere to catch this error when it
happens again.
--
Andreas Gustafsson, gson@gson.org
From: Martin Husemann <martin@duskware.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: Harold Gutch <logix@foobar.franken.de>, gnats-bugs@netbsd.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during
install
Date: Mon, 26 Oct 2020 11:35:13 +0100
On Mon, Oct 26, 2020 at 12:32:18PM +0200, Andreas Gustafsson wrote:
> Regardless of which way we choose to fix this, it would also be
> helpful to have an assertion somehwere to catch this error when it
> happens again.
Yes, please!
From: "Michael van Elst" <mlelstv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54810 CVS commit: src/sys/dev/scsipi
Date: Mon, 26 Oct 2020 11:39:48 +0000
Module Name: src
Committed By: mlelstv
Date: Mon Oct 26 11:39:48 UTC 2020
Modified Files:
src/sys/dev/scsipi: cd.c
Log Message:
Avoid buffer overflow when copying from bounce buffer.
Fixes PR 54810
Don't use uninitialized pointer in split bounce buffer case and
free a partially allocated bounce buffer on error.
To generate a diff of this commit:
cvs rdiff -u -r1.348 -r1.349 src/sys/dev/scsipi/cd.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Harold Gutch <logix@foobar.franken.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: Martin Husemann <martin@duskware.de>, gnats-bugs@netbsd.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 12:41:20 +0100
On Mon, Oct 26, 2020 at 12:32:18PM +0200, Andreas Gustafsson wrote:
> Martin Husemann wrote:
> > This is clearly a bug. The medium has 2k sectors, we are not allowed to
> > use the buffer cache with blocks smaller than that (or with varying block
> > sizes).
>
> Regardless of which way we choose to fix this, it would also be
> helpful to have an assertion somehwere to catch this error when it
> happens again.
Michael supplied me off-list with a patch that handles this in
sys/dev/scsipi/cd.c and I successfully went two "minimal" installs
with it (and all my other edits reverted). He was going to commit it
too.
And I agree, having an assertion somewhere would be a good idea.
Harold
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54810 CVS commit: src/sys/arch/sparc64/conf
Date: Mon, 26 Oct 2020 11:49:45 +0000
Module Name: src
Committed By: martin
Date: Mon Oct 26 11:49:45 UTC 2020
Modified Files:
src/sys/arch/sparc64/conf: GENERIC
Log Message:
Backout previous, PR 54810 has been fixed.
To generate a diff of this commit:
cvs rdiff -u -r1.231 -r1.232 src/sys/arch/sparc64/conf/GENERIC
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 12:08:25 -0000 (UTC)
martin@duskware.de (Martin Husemann) writes:
> This is clearly a bug. The medium has 2k sectors, we are not allowed to
> use the buffer cache with blocks smaller than that (or with varying block
> sizes).
But that's not what happened.
If you do regular I/O on a device that is not a multiple of sector size,
the driver will just fail that request with EINVAL.
cd(4) is magic, as it needs to support 512 byte UFS access on CD-ROMs
for compatibility with old Sun hardware.
CD-ROM drives that support 512 byte blocks will just be configured to
handle 512 byte read commands. For CD-ROMs that do not (the majority)
the driver provides the bounce buffer.
The magic is triggered by the special sun*/sparc* disklabel code that
returns a default sector size of 512 for CD-ROMs.
When I refactored the code in 2016, I missed the case of a read
smaller than a single sector. Apparently nobody does this but
our NTFS code, so nobody noticed the bug until NTFS was recently
added to the sparc64 kernel.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Wed, 28 Oct 2020 13:00:09 +0000
State-Changed-Why:
Fixed in src/sys/dev/scsipi/cd.c 1.349, thanks.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.