NetBSD Problem Report #54810

From gson@gson.org  Mon Dec 30 08:12:51 2019
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id D826E7A158
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 30 Dec 2019 08:12:50 +0000 (UTC)
Message-Id: <20191230081246.20724253F3D@guava.gson.org>
Date: Mon, 30 Dec 2019 10:12:46 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: sparc64 pool_redzone_check errors during install
X-Send-Pr-Version: 3.95

>Number:         54810
>Category:       port-sparc64
>Synopsis:       sparc64 pool_redzone_check errors during install
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    port-sparc64-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Dec 30 08:15:00 +0000 2019
>Closed-Date:    Wed Oct 28 13:00:09 +0000 2020
>Last-Modified:  Wed Oct 28 13:00:09 +0000 2020
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current
>Organization:

>Environment:
System: NetBSD
Architecture: sparc64
Machine: sparc64
>Description:

The sparc64 port is printing messages like

  pool_redzone_check: 0x125760c20: 0x00 != 0x66

during the set extraction phase of the install.  On the TNF testbed,
they appeared some time between source dates 2019.11.16.04.10.33 and
2019.11.28.14.21.25, but were only warnings until they were changed
into a panic in

  2019.12.27.15.49.20 maxv src/sys/kern/subr_pool.c 1.264

Log output from the latest such panic is at

  http://releng.netbsd.org/b5reports/sparc64/2019/2019.12.29.09.17.51/install.log

>How-To-Repeat:

Using versions of qemu and glib2 that don't suffer from PR 54310, e.g.,
qemu-4.1.0nb2 with glib2-2.60.6.

>Fix:

>Release-Note:

>Audit-Trail:
From: Harold Gutch <logix@foobar.franken.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sat, 24 Oct 2020 09:57:41 +0200

 --hxkXGo8AKqTJ+9QI
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 Hi,

 can you try this diff?  NTFS got added implicitly when all file
 systems were moved from GENERIC to conf/filesystems.conf in r1.216,
 and that is when for me these messages started to show up.

 I just went through 3 rounds of installing sparc64 with my change in
 Qemu (albeit with a source tree from a few months ago) without any
 issues.  After going back to the vanilla GENERIC, I got the panic
 again when extracting base.

 In the long run it would of course still be nice if the actual issue
 in NTFS could be fixed, but either way, I don't see it being of much
 use in sparc64 GENERIC, so IMHO it might as well just be disabled
 there.


   Harold

 --hxkXGo8AKqTJ+9QI
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff

 --- src/sys/arch/sparc64/conf/GENERIC.orig	2020-09-27 15:48:54.000000000 +0200
 +++ src/sys/arch/sparc64/conf/GENERIC	2020-10-23 23:43:34.907070000 +0200
 @@ -148,6 +148,7 @@

  ## File systems.
  include "conf/filesystems.config"
 +no file-system NTFS

  ## File system options.
  options 	NFSSERVER	# Sun NFS-compatible filesystem server

 --hxkXGo8AKqTJ+9QI--

From: Andreas Gustafsson <gson@gson.org>
To: Harold Gutch <logix@foobar.franken.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sat, 24 Oct 2020 15:31:01 +0300

 Harold Gutch wrote:
 >  can you try this diff?  NTFS got added implicitly when all file
 >  systems were moved from GENERIC to conf/filesystems.conf in r1.216,
 >  and that is when for me these messages started to show up.

 With the patch, the install completed sucessfully for me.  I tested
 using sources from 2020.10.17.17.23.22 because today's current doesn't
 build.

 Does anyone have an idea how the mere presence of ntfs in the kernel
 can trigger this bug, without an ntfs actually being mounted?
 -- 
 Andreas Gustafsson, gson@gson.org

From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: port-sparc64-maintainer@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org, gson@gson.org (Andreas Gustafsson)
Subject: re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sun, 25 Oct 2020 10:35:16 +1100

 > Does anyone have an idea how the mere presence of ntfs in the kernel
 > can trigger this bug, without an ntfs actually being mounted?

 there are probe-mounts for root, it may have be called to try
 to mountroot before ffs was.  does the log have eg:

    boot device: re0
    mountroot: trying ffs...
    mountroot: trying ext2fs...
    mountroot: trying nfs...
    nfs_boot: trying DHCP/BOOTP

 where it says ntfs somewhere?

 that's the only thing i can think of.


 .mrg.

From: Andreas Gustafsson <gson@gson.org>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@netbsd.org, port-sparc64-maintainer@netbsd.org
Subject: re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sun, 25 Oct 2020 09:24:05 +0200

 matthew green wrote:
 > > Does anyone have an idea how the mere presence of ntfs in the kernel
 > > can trigger this bug, without an ntfs actually being mounted?
 > 
 > there are probe-mounts for root, it may have be called to try
 > to mountroot before ffs was.  does the log have eg:
 > 
 >    boot device: re0
 >    mountroot: trying ffs...
 >    mountroot: trying ext2fs...
 >    mountroot: trying nfs...
 >    nfs_boot: trying DHCP/BOOTP
 > 
 > where it says ntfs somewhere?

 No, the string "ntfs" does not appear in the console output.

 I tried adding a printf("ntfs_init\n") to ntfs_init(), and it was
 called early in boot:

 [   1.0000000] entropy: no seed from bootloader
 [   1.0000000] ntfs_init
 [   1.0000000] mainbus0 (root): OpenBiosTeam,OpenBIOS: hostid 80123456

 I then tried commenting out all the other code in ntfs_init(), leaving
 only the printf, but that changed nothing; I still got the panic in
 pool_redzone_check during set extraction.
 -- 
 Andreas Gustafsson, gson@gson.org

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sun, 25 Oct 2020 07:48:33 -0000 (UTC)

 gson@gson.org (Andreas Gustafsson) writes:

 > I tried adding a printf("ntfs_init\n") to ntfs_init(), and it was
 > called early in boot:
 > 
 > [   1.0000000] entropy: no seed from bootloader
 > [   1.0000000] ntfs_init
 > [   1.0000000] mainbus0 (root): OpenBiosTeam,OpenBIOS: hostid 80123456
 > 
 > I then tried commenting out all the other code in ntfs_init(), leaving
 > only the printf, but that changed nothing;

 Init happens when the the filesystem code is attached. This allocates
 and initializes a few data structures. Maybe there is an overflow.

 There might be mount attempts, but the code stops at the first
 filesystem that succeeds and ntfs is rather late in the list. See
 vfs.generic.fstypes.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Harold Gutch <logix@foobar.franken.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Sun, 25 Oct 2020 16:29:01 +0100

 The does not appear to be in ntfs_init(), but instead in
 ntfs_mountroot() (or something called by that function).  If I add a
 "return -1" at the top of the function, the installation also succeeds
 for me.

 FWIW, on an unpatched system ntfs_mountroot() we land in the
 ntfs_mountfs() != 0 case in ntfs_vfsops.c:117, so it is either
 something allocated by vfs_rootmountalloc() or ntfs_mountfs().


   Harold

From: Andreas Gustafsson <gson@gson.org>
To: Harold Gutch <logix@foobar.franken.de>
Cc: gnats-bugs@netbsd.org, martin@NetBSD.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 09:25:02 +0200

 Harold Gutch wrote:
 >  The does not appear to be in ntfs_init(), but instead in
 >  ntfs_mountroot() (or something called by that function).  If I add a
 >  "return -1" at the top of the function, the installation also succeeds
 >  for me.
 >
 >  FWIW, on an unpatched system ntfs_mountroot() we land in the
 >  ntfs_mountfs() != 0 case in ntfs_vfsops.c:117, so it is either
 >  something allocated by vfs_rootmountalloc() or ntfs_mountfs().

 Agreed. I also did some debugging by disabling selecting parts of the
 code and seeing if it fixes the bug, and unless I have made a mistake
 somewhere, I have narrowed it down to this stretch of code in
 ntfs_mountfs():

         error = bread(devvp, BBLOCK, BBSIZE, 0, &bp);
         if (error)
                 goto out;

 or possibly the cleanup code following "out".  Since BBSIZE is 1024, this
 also seems consistent with the reported corruption in a "[buf1k]" pool.

 Maybe some lower layer ends up reading an entire 2048-byte CD-ROM
 block into a 1024-byte buffer?
 -- 
 Andreas Gustafsson, gson@gson.org

From: Harold Gutch <logix@foobar.franken.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org, martin@NetBSD.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 10:11:35 +0100

 On Mon, Oct 26, 2020 at 09:25:02AM +0200, Andreas Gustafsson wrote:
 > Harold Gutch wrote:
 > >  The does not appear to be in ntfs_init(), but instead in
 > >  ntfs_mountroot() (or something called by that function).  If I add a
 > >  "return -1" at the top of the function, the installation also succeeds
 > >  for me.
 > >
 > >  FWIW, on an unpatched system ntfs_mountroot() we land in the
 > >  ntfs_mountfs() != 0 case in ntfs_vfsops.c:117, so it is either
 > >  something allocated by vfs_rootmountalloc() or ntfs_mountfs().
 > 
 > Agreed. I also did some debugging by disabling selecting parts of the
 > code and seeing if it fixes the bug, and unless I have made a mistake
 > somewhere, I have narrowed it down to this stretch of code in
 > ntfs_mountfs():
 > 
 >         error = bread(devvp, BBLOCK, BBSIZE, 0, &bp);
 >         if (error)
 >                 goto out;

 Ah, that's good, I did the same yesterday and I come to exactly the
 same conclusion, just that I did not "goto out" but instead called
 "brelse(bp,0); return -1;", i.e., I skipped the
 spec_node_setmountedfs() call in the cleanup (so we can rule that out
 too).


 > or possibly the cleanup code following "out".  Since BBSIZE is 1024, this
 > also seems consistent with the reported corruption in a "[buf1k]" pool.
 > 
 > Maybe some lower layer ends up reading an entire 2048-byte CD-ROM
 > block into a 1024-byte buffer?

 That was also my guess, I was planning on following up some more in
 that direction today.  I first wanted to have a look what the other
 file system drivers do there, because in order for this to make sense,
 no other driver should read a 1K block in *_mountroot() or *_init().


   Harold

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 09:25:28 -0000 (UTC)

 gson@gson.org (Andreas Gustafsson) writes:

 > Maybe some lower layer ends up reading an entire 2048-byte CD-ROM
 > block into a 1024-byte buffer?

 There is magic code for sun hardware and the cd driver. The sun code
 fakes a block size of 512 bytes in the disklabel and the cd driver
 emulates such accesses if the hardware reports a different block size
 (an original Sun CD-ROM can be configured to 512 byte blocks, most
 others cannot).

 That's the only regular case where disklabel and driver disagree on
 block size.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during
 install
Date: Mon, 26 Oct 2020 11:00:47 +0100

 On Mon, Oct 26, 2020 at 09:15:01AM +0000, Harold Gutch wrote:
 >  >         error = bread(devvp, BBLOCK, BBSIZE, 0, &bp);
 >  >         if (error)
 >  >                 goto out;
 >  
 >  Ah, that's good, I did the same yesterday and I come to exactly the
 >  same conclusion, just that I did not "goto out" but instead called
 >  "brelse(bp,0); return -1;", i.e., I skipped the
 >  spec_node_setmountedfs() call in the cleanup (so we can rule that out
 >  too).
 >  
 >  
 >  > or possibly the cleanup code following "out".  Since BBSIZE is 1024, this
 >  > also seems consistent with the reported corruption in a "[buf1k]" pool.
 >  > 
 >  > Maybe some lower layer ends up reading an entire 2048-byte CD-ROM
 >  > block into a 1024-byte buffer?
 >  
 >  That was also my guess, I was planning on following up some more in
 >  that direction today.  I first wanted to have a look what the other
 >  file system drivers do there, because in order for this to make sense,
 >  no other driver should read a 1K block in *_mountroot() or *_init().

 This is clearly a bug. The medium has 2k sectors, we are not allowed to
 use the buffer cache with blocks smaller than that (or with varying block
 sizes).

 We could make ntfs_mountfs() do something like:

 	uint64_t psize;
 	unsigned secsize;
 	int error;

 	error = getdisksize(devvp, &psize, &secsize);

 	if (error || secsize > BBSIZE)
 		return ENODEV;

 Or we could go through the code and replace all BBSIZE reads with proper
 code to deal with bigger blocks - or check if FreeBSD did that and update
 from their version.

 Martin

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during
 install
Date: Mon, 26 Oct 2020 11:05:27 +0100

 Anyone have a (external?) disk that advertizes 2k or bigger sectors
 where they could create an NTFS on?

 I bet a KASAN kernel trying to mount such a file system would die
 too...

 Martin

From: Harold Gutch <logix@foobar.franken.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org, martin@NetBSD.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 11:25:53 +0100

 On Mon, Oct 26, 2020 at 10:11:35AM +0100, Harold Gutch wrote:
 > On Mon, Oct 26, 2020 at 09:25:02AM +0200, Andreas Gustafsson wrote:
 > > or possibly the cleanup code following "out".  Since BBSIZE is 1024, this
 > > also seems consistent with the reported corruption in a "[buf1k]" pool.
 > > 
 > > Maybe some lower layer ends up reading an entire 2048-byte CD-ROM
 > > block into a 1024-byte buffer?
 > 
 > That was also my guess, I was planning on following up some more in
 > that direction today.  I first wanted to have a look what the other
 > file system drivers do there, because in order for this to make sense,
 > no other driver should read a 1K block in *_mountroot() or *_init().

 I think we're all clear on the issue here, but just for additional
 verification, if I modify the bread() to instead read an entire 2048
 byte chunk, the installation also succeeds (well, it succeeded in the
 one test run that I did).


   Harold

From: Andreas Gustafsson <gson@gson.org>
To: Martin Husemann <martin@duskware.de>
Cc: Harold Gutch <logix@foobar.franken.de>, gnats-bugs@netbsd.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 12:32:18 +0200

 Martin Husemann wrote:
 >  This is clearly a bug. The medium has 2k sectors, we are not allowed to
 >  use the buffer cache with blocks smaller than that (or with varying block
 >  sizes).

 Regardless of which way we choose to fix this, it would also be
 helpful to have an assertion somehwere to catch this error when it
 happens again.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Martin Husemann <martin@duskware.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: Harold Gutch <logix@foobar.franken.de>, gnats-bugs@netbsd.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during
 install
Date: Mon, 26 Oct 2020 11:35:13 +0100

 On Mon, Oct 26, 2020 at 12:32:18PM +0200, Andreas Gustafsson wrote:
 > Regardless of which way we choose to fix this, it would also be
 > helpful to have an assertion somehwere to catch this error when it
 > happens again.

 Yes, please!

From: "Michael van Elst" <mlelstv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54810 CVS commit: src/sys/dev/scsipi
Date: Mon, 26 Oct 2020 11:39:48 +0000

 Module Name:	src
 Committed By:	mlelstv
 Date:		Mon Oct 26 11:39:48 UTC 2020

 Modified Files:
 	src/sys/dev/scsipi: cd.c

 Log Message:
 Avoid buffer overflow when copying from bounce buffer.
 Fixes PR 54810

 Don't use uninitialized pointer in split bounce buffer case and
 free a partially allocated bounce buffer on error.


 To generate a diff of this commit:
 cvs rdiff -u -r1.348 -r1.349 src/sys/dev/scsipi/cd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Harold Gutch <logix@foobar.franken.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: Martin Husemann <martin@duskware.de>, gnats-bugs@netbsd.org
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 12:41:20 +0100

 On Mon, Oct 26, 2020 at 12:32:18PM +0200, Andreas Gustafsson wrote:
 > Martin Husemann wrote:
 > >  This is clearly a bug. The medium has 2k sectors, we are not allowed to
 > >  use the buffer cache with blocks smaller than that (or with varying block
 > >  sizes).
 > 
 > Regardless of which way we choose to fix this, it would also be
 > helpful to have an assertion somehwere to catch this error when it
 > happens again.

 Michael supplied me off-list with a patch that handles this in
 sys/dev/scsipi/cd.c and I successfully went two "minimal" installs
 with it (and all my other edits reverted).  He was going to commit it
 too.

 And I agree, having an assertion somewhere would be a good idea.


   Harold

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54810 CVS commit: src/sys/arch/sparc64/conf
Date: Mon, 26 Oct 2020 11:49:45 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Mon Oct 26 11:49:45 UTC 2020

 Modified Files:
 	src/sys/arch/sparc64/conf: GENERIC

 Log Message:
 Backout previous, PR 54810 has been fixed.


 To generate a diff of this commit:
 cvs rdiff -u -r1.231 -r1.232 src/sys/arch/sparc64/conf/GENERIC

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-sparc64/54810: sparc64 pool_redzone_check errors during install
Date: Mon, 26 Oct 2020 12:08:25 -0000 (UTC)

 martin@duskware.de (Martin Husemann) writes:

 > This is clearly a bug. The medium has 2k sectors, we are not allowed to
 > use the buffer cache with blocks smaller than that (or with varying block
 > sizes).

 But that's not what happened.

 If you do regular I/O on a device that is not a multiple of sector size,
 the driver will just fail that request with EINVAL.

 cd(4) is magic, as it needs to support 512 byte UFS access on CD-ROMs
 for compatibility with old Sun hardware.

 CD-ROM drives that support 512 byte blocks will just be configured to
 handle 512 byte read commands. For CD-ROMs that do not (the majority)
 the driver provides the bounce buffer.

 The magic is triggered by the special sun*/sparc* disklabel code that
 returns a default sector size of 512 for CD-ROMs.

 When I refactored the code in 2016, I missed the case of a read
 smaller than a single sector. Apparently nobody does this but
 our NTFS code, so nobody noticed the bug until NTFS was recently
 added to the sparc64 kernel.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Wed, 28 Oct 2020 13:00:09 +0000
State-Changed-Why:
Fixed in src/sys/dev/scsipi/cd.c 1.349, thanks.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.