NetBSD Problem Report #56332
From www@netbsd.org Wed Jul 28 01:59:32 2021
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id EEE6C1A921F
for <gnats-bugs@gnats.NetBSD.org>; Wed, 28 Jul 2021 01:59:31 +0000 (UTC)
Message-Id: <20210728015931.05FEB1A9245@mollari.NetBSD.org>
Date: Wed, 28 Jul 2021 01:59:30 +0000 (UTC)
From: tnn@nygren.pp.se
Reply-To: tnn@nygren.pp.se
To: gnats-bugs@NetBSD.org
Subject: swap on 4k sector device uses only 1/8 of the configured capacity
X-Send-Pr-Version: www-1.0
>Number: 56332
>Category: kern
>Synopsis: swap on 4k sector device uses only 1/8 of the configured capacity
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jul 28 02:00:00 +0000 2021
>Last-Modified: Wed Aug 04 16:10:01 +0000 2021
>Originator: Tobias Nygren
>Release: 9.99.87
>Organization:
>Environment:
>Description:
swap assumes the block size is 512 bytes.
>How-To-Repeat:
Set up a 32 GiB swap area on GPT on a 4k-sector device:
ld4 at nvme0 nsid 1
ld4: 238 GB, 7752 cyl, 128 head, 63 sec, 4096 bytes/sect x 62514774 sectors
dk3 at ld4: 8388608 blocks at 50397440, type: swap
... but only 4 GiB is available.
# swapctl -lm
Device      1M-blocks     Used    Avail Capacity  Priority
/dev/dk3         4096        0     4096     0%    0
>Fix:
The kernel should query the device for the correct block size when configuring swap.
>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56332: swap on 4k sector device uses only 1/8 of the configured capacity
Date: Wed, 28 Jul 2021 05:31:57 -0000 (UTC)
tnn@nygren.pp.se writes:
>ld4 at nvme0 nsid 1
>ld4: 238 GB, 7752 cyl, 128 head, 63 sec, 4096 bytes/sect x 62514774 sectors
>dk3 at ld4: 8388608 blocks at 50397440, type: swap
>... but only 4 GiB is available.
That looks like a bug in the dk driver.
swap/dump are somewhat magic. Drivers have their own entry points to
handle a swap (and dump) partition:
DEVsize -> return the number of blocks
DEVdump -> dump that many bytes to a given block number.
Users of the driver will use DEV_BSIZE blocks, like for regular I/O.
But dksize() returns sc_size, and dkdump() checks block numbers against
sc_size and offsets them by sc_offset. Both values are in units of the
physical sector size (sc_size is 8388608, sc_offset is 50397440).
So it's not only reporting the wrong size, but also writes to the
wrong position on the disk if the physical sector size is not DEV_BSIZE.
The regular I/O code does the right translation. So maybe (untested):
Index: dk.c
===================================================================
RCS file: /cvsroot/src/sys/dev/dkwedge/dk.c,v
retrieving revision 1.105
diff -p -u -r1.105 dk.c
--- dk.c 2 Jun 2021 17:56:40 -0000 1.105
+++ dk.c 28 Jul 2021 05:31:14 -0000
@@ -1639,6 +1639,7 @@ static int
dksize(dev_t dev)
{
struct dkwedge_softc *sc = dkwedge_lookup(dev);
+ uint64_t p_size;
int rv = -1;
if (sc == NULL)
@@ -1651,12 +1652,13 @@ dksize(dev_t dev)
/* Our content type is static, no need to open the device. */
+ p_size = sc->sc_size << sc->sc_parent->dk_blkshift;
if (strcmp(sc->sc_ptype, DKW_PTYPE_SWAP) == 0) {
/* Saturate if we are larger than INT_MAX. */
- if (sc->sc_size > INT_MAX)
+ if (p_size > INT_MAX)
rv = INT_MAX;
else
- rv = (int) sc->sc_size;
+ rv = (int) p_size;
}
mutex_exit(&sc->sc_parent->dk_rawlock);
@@ -1675,6 +1677,7 @@ dkdump(dev_t dev, daddr_t blkno, void *v
{
struct dkwedge_softc *sc = dkwedge_lookup(dev);
const struct bdevsw *bdev;
+ uint64_t p_size, p_offset;
int rv = 0;
if (sc == NULL)
@@ -1697,16 +1700,20 @@ dkdump(dev_t dev, daddr_t blkno, void *v
rv = EINVAL;
goto out;
}
- if (blkno < 0 || blkno + size / DEV_BSIZE > sc->sc_size) {
+
+ p_offset = sc->sc_offset << sc->sc_parent->dk_blkshift;
+ p_size = sc->sc_size << sc->sc_parent->dk_blkshift;
+
+ if (blkno < 0 || blkno + size / DEV_BSIZE > p_size) {
printf("%s: blkno (%" PRIu64 ") + size / DEV_BSIZE (%zu) > "
- "sc->sc_size (%" PRIu64 ")\n", __func__, blkno,
- size / DEV_BSIZE, sc->sc_size);
+ "p_size (%" PRIu64 ")\n", __func__, blkno,
+ size / DEV_BSIZE, p_size);
rv = EINVAL;
goto out;
}
bdev = bdevsw_lookup(sc->sc_pdev);
- rv = (*bdev->d_dump)(sc->sc_pdev, blkno + sc->sc_offset, va, size);
+ rv = (*bdev->d_dump)(sc->sc_pdev, blkno + p_offset, va, size);
out:
mutex_exit(&sc->sc_parent->dk_rawlock);
From: Tobias Nygren <tnn@nygren.pp.se>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56332: swap on 4k sector device uses only 1/8 of the
configured capacity
Date: Wed, 28 Jul 2021 13:27:10 +0200
On Wed, 28 Jul 2021 05:35:01 +0000 (UTC)
Michael van Elst <mlelstv@serpens.de> wrote:
> So it's not only reporting the wrong size, but also writes to the
> wrong position on the disk if the physical sector size is not DEV_BSIZE.
Uff, that sounds bad. I will deconfigure swap ...
> The regular I/O code does the right translation. So maybe (untested):
... and set up another machine to test this fix.
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56332: swap on 4k sector device uses only 1/8 of the configured capacity
Date: Wed, 28 Jul 2021 20:33:40 -0000 (UTC)
tnn@nygren.pp.se (Tobias Nygren) writes:
> On Wed, 28 Jul 2021 05:35:01 +0000 (UTC)
> Michael van Elst <mlelstv@serpens.de> wrote:
>
> > So it's not only reporting the wrong size, but also writes to the
> > wrong position on the disk if the physical sector size is not DEV_BSIZE.
>
> Uff, that sounds bad. I will deconfigure swap ...
Only a crash dump will write to the wrong offset; swap uses the regular I/O
routines, which are correct.
From: Tobias Nygren <tnn@nygren.pp.se>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/56332: swap on 4k sector device uses only 1/8 of the
configured capacity
Date: Wed, 4 Aug 2021 18:05:20 +0200
It seems there are some issues with the patch.
I created 8 GiB of swap on a 4k-sector NVMe device on an aarch64 machine with 4 GiB of RAM.
swapctl reports the correct size now.
Then I started untarring multiple copies of pkgsrc to tmpfs.
Eventually the system grinds to near halt and top shows this:
Memory: 28K Inact, 4K Wired, 256K Exec, 28K File, 1380K Free
Swap: 8192M Total, 3652M Used, 4540M Free
tmpfs 8.2G 3.8G 4.4G 46% /tmp
There is clearly more swap available but all of the system
RAM pages have leaked somewhere and are no longer accounted for.
Unmounting /tmp released all of the used swap, but
only 50 MiB of system memory came back to the free list.
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.